ArcheoVLM
Datasets & Data Sources
Detailed breakdown of each dataset required for the project, what it contains, and why it is essential
Dataset Name
"LiDAR Surveys over Selected Forest Research Sites, Brazilian Amazon, 2008-2018"
What It Is
High-resolution raw point cloud files in .laz format. Each tile covers 1 km² with millions of 3D points (X, Y, Z coordinates).
Why It's Needed
Core source for archaeological discoveries. Some LiDAR pulses pass through gaps in the forest canopy and reach the ground, revealing subtle human-made topographic features that are invisible in standard imagery.
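To illustrate how ground structure can be recovered from a point cloud, here is a minimal sketch that bins points into square cells and keeps the lowest elevation in each cell. It assumes the points have already been read from a .laz tile into (x, y, z) tuples (the reader is not shown), and the function name, cell size, and sample values are hypothetical. Taking the lowest return per cell is only a crude stand-in for proper ground classification, not the project's actual method.

```python
def min_elevation_grid(points, cell_size=1.0):
    """Return {(col, row): lowest z} for points binned into square cells."""
    grid = {}
    for x, y, z in points:
        # Integer cell index along each axis.
        key = (int(x // cell_size), int(y // cell_size))
        # Keep the lowest return: it is the most likely ground hit.
        if key not in grid or z < grid[key]:
            grid[key] = z
    return grid


# Hypothetical returns: canopy hits sit far above the ground hits.
points = [
    (0.2, 0.3, 131.5),  # canopy
    (0.7, 0.4, 101.2),  # ground
    (1.4, 0.6, 129.8),  # canopy
    (1.6, 0.1, 101.9),  # ground, slightly raised
]
dtm = min_elevation_grid(points, cell_size=1.0)
print(dtm)
```

The resulting grid is a rough bare-earth surface; small elevation differences between neighbouring cells are the kind of signal an earthwork would produce.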
Dataset Name
cms_brazil_lidar_tile_inventory.csv (provided with ORNL DAAC dataset)
What It Is
Metadata manifest containing filename, acquisition date, and geographic footprint for each of the 3,154 LiDAR tiles.
Why It's Needed
Essential for spatial indexing and triage. Enables spatial joins and cross-referencing with other geographic datasets to identify areas of interest.
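A spatial lookup against the inventory could be sketched as follows. The column names (filename, min_lon, min_lat, max_lon, max_lat) and the sample rows are assumptions made for illustration; the real CSV header may differ and should be checked before use.

```python
import csv
import io

# Hypothetical inventory rows; real column names and values may differ.
SAMPLE_INVENTORY = """filename,min_lon,min_lat,max_lon,max_lat
tile_a.laz,-67.50,-10.00,-67.49,-9.99
tile_b.laz,-55.00,-3.00,-54.99,-2.99
"""


def boxes_intersect(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def tiles_in_area(csv_file, area):
    """Return filenames of tiles whose footprint overlaps the given box."""
    keys = ("min_lon", "min_lat", "max_lon", "max_lat")
    hits = []
    for row in csv.DictReader(csv_file):
        box = tuple(float(row[k]) for k in keys)
        if boxes_intersect(box, area):
            hits.append(row["filename"])
    return hits


area_of_interest = (-68.0, -11.0, -67.0, -9.0)
matches = tiles_in_area(io.StringIO(SAMPLE_INVENTORY), area_of_interest)
print(matches)
```

For the full inventory of 3,154 tiles a linear scan like this is fast enough; a spatial index only becomes worthwhile when the query is repeated many times.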
Dataset Name
"AmazonGeoArchDB: Amazon Archaeology GIS"
What It Is
Curated geospatial data with shapefiles delineating locations of discovered archaeological sites (geoglyphs, settlements, causeways).
Why It's Needed
Filtering: identifies tiles that cover known sites so they are not reported as new discoveries. Prioritization: 15 km buffer zones around known sites are used to prioritize nearby unexplored areas.
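The filtering and prioritization logic can be sketched with great-circle distances. This is a simplified illustration that treats each tile and each site as a single point; the 0.5 km "known site" radius, the label names, and the sample coordinates are assumptions, and only the 15 km buffer comes from the description above.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def classify_tile(tile_center, known_sites, site_radius_km=0.5, buffer_km=15.0):
    """Label a tile by its distance to the nearest known site."""
    nearest = min(haversine_km(*tile_center, *site) for site in known_sites)
    if nearest <= site_radius_km:
        return "known"       # already documented, skip
    if nearest <= buffer_km:
        return "priority"    # inside the 15 km buffer
    return "background"


known_sites = [(-10.0, -67.5)]  # hypothetical (lat, lon)
labels = [
    classify_tile((-10.0, -67.5), known_sites),
    classify_tile((-10.05, -67.5), known_sites),
    classify_tile((-11.0, -67.5), known_sites),
]
print(labels)
```

A production version would buffer the site polygons from the shapefiles rather than points, but the three-way split is the same.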
Dataset Name
PRODES Deforestation Layer from Brazil's National Institute for Space Research
What It Is
Shapefile mapping historical and recent deforestation in the Amazon, monitored since the 1980s.
Why It's Needed
Deforestation correlates strongly with archaeological discovery: forest clearing exposes features for the first time, which increases the probability of detection.
Dataset Name
OpenStreetMap (OSM), Brazilian Institute of Geography and Statistics (IBGE), Global Human Settlement Layer (GHSL)
What It Is
Shapefiles delineating boundaries of cities, towns, and significant infrastructure like roads and industrial areas.
Why It's Needed
Critical for eliminating false positives. Modern activities create ground disturbances that could be mistaken for archaeological features.
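As a sketch of the false-positive filter, candidate detections that fall inside a modern-footprint polygon can be dropped with a ray-casting point-in-polygon test. The polygon, candidate records, and field names below are hypothetical; real boundaries would come from the OSM, IBGE, or GHSL shapefiles.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical town boundary as (lon, lat) vertices.
town = [(-67.6, -10.1), (-67.4, -10.1), (-67.4, -9.9), (-67.6, -9.9)]
modern_areas = [town]

candidates = [
    {"id": "c1", "lon": -67.5, "lat": -10.0},  # inside the town
    {"id": "c2", "lon": -67.0, "lat": -10.0},  # outside
]

kept = [
    c for c in candidates
    if not any(point_in_polygon(c["lon"], c["lat"], poly) for poly in modern_areas)
]
print([c["id"] for c in kept])
```

Linear features such as roads would need a distance-to-line test instead, which this sketch does not cover.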
Dataset Name
Sentinel-2 Multispectral Instrument (MSI) data via Google Earth Engine or Copernicus Open Access Hub
What It Is
Global satellite imagery with up to 10-meter resolution (10 m, 20 m, or 60 m depending on the band), a 5-day revisit time, and 13 spectral bands from visible to short-wave infrared.
Why It's Needed
Primary verification tool. Used to produce true-color composites and NDVI products that reveal vegetation anomalies corroborating buried earthworks.
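NDVI is computed per pixel as (NIR - Red) / (NIR + Red); for Sentinel-2 these are bands B8 and B4. The sketch below computes NDVI for a handful of pixels and flags those that are unusually low relative to the rest. The reflectance values and the 1.5 standard deviation threshold are illustrative assumptions, and flagging only low values is a simplification, since an earthwork can also raise NDVI.

```python
import statistics


def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    return 0.0 if total == 0 else (nir - red) / total


def low_ndvi_outliers(values, k=1.5):
    """Indices of values more than k standard deviations below the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if (v - mean) / sd < -k]


# Hypothetical (NIR, Red) surface reflectances; pixel 2 is stressed vegetation.
pixels = [(0.50, 0.05), (0.50, 0.05), (0.30, 0.10),
          (0.50, 0.05), (0.50, 0.05), (0.50, 0.05)]
values = [ndvi(nir, red) for nir, red in pixels]
flagged = low_ndvi_outliers(values)
print(flagged)
```

In practice the comparison would be made against a local neighbourhood rather than the whole scene, so that regional vegetation differences do not mask small anomalies.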
Dataset Name
Custom corpus from Internet Archive, Project Gutenberg, university archives, and museum collections
What It Is
Digitized historical documents including explorer journals (Percy Fawcett, Alfred Russel Wallace), missionary reports, and colonial records.
Why It's Needed
Provides narrative context through NLP geoparsing, which extracts place names from text and maps them to coordinates. Historical accounts describe settlements and populations, adding supporting evidence to candidate discoveries.
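At its simplest, geoparsing is a lookup of place names against a gazetteer. The sketch below is a deliberately small illustration: the two-entry gazetteer uses approximate coordinates, the sample sentence is invented rather than quoted from any journal, and a real pipeline would use a named-entity recognizer plus a full gazetteer to handle historical spellings and ambiguous names.

```python
import re

# Tiny illustrative gazetteer: name -> approximate (lat, lon).
GAZETTEER = {
    "Manaus": (-3.12, -60.02),
    "Iquitos": (-3.75, -73.25),
}


def geoparse(text, gazetteer):
    """Return (name, lat, lon) for each gazetteer name found, in text order."""
    found = []
    for name, (lat, lon) in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            found.append((match.start(), name, lat, lon))
    return [(name, lat, lon) for _, name, lat, lon in sorted(found)]


# Invented sentence in the style of an expedition journal.
text = ("Several days upriver from Iquitos we passed old earthen mounds; "
        "the party later returned to Manaus.")
mentions = geoparse(text, GAZETTEER)
print(mentions)
```

Each resolved mention can then be compared with tile footprints in the same way as any other point dataset.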
Spatial Analysis Pipeline
Verification Pipeline
Dataset Synergy
The power of this approach lies in the integration of multiple independent data sources, each providing different types of evidence that together create a comprehensive archaeological analysis framework.