ArcheoVLM

Datasets & Data Sources

A breakdown of each dataset the project requires: what it contains and why it is needed.

Phase 1 & 2: Data Ingestion, Triage, and Prioritization
Core datasets for initial processing and spatial analysis
1. Primary LiDAR Data
Core archaeological discovery dataset
ORNL DAAC

Dataset Name

"LiDAR Surveys over Selected Forest Research Sites, Brazilian Amazon, 2008-2018"

3,154 files

What It Is

High-resolution raw point clouds stored as .laz (compressed LAS) files. Each tile covers 1 km² and contains millions of 3D points (X, Y, Z coordinates).

Essential

Why It's Needed

The core source for archaeological discovery. LiDAR pulses pass through gaps in the forest canopy and return from the ground, revealing subtle human-made topographic features that are invisible in standard imagery.
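One simple way to see how ground features emerge from a point cloud is minimum-Z gridding: within each grid cell, keep the lowest return and treat it as a rough bare-earth elevation. The sketch below is an illustration only, not the project's method. It assumes the points have already been read from a .laz tile into (x, y, z) tuples, and the function name `min_z_grid` and the sample values are hypothetical.

```python
def min_z_grid(points, x0, y0, cell, ncols, nrows):
    """Keep the lowest Z per grid cell as a crude bare-earth estimate.

    points: iterable of (x, y, z) tuples in a projected coordinate system.
    x0, y0: lower-left corner of the grid; cell: cell size in the same units.
    Cells that receive no points stay None.
    """
    grid = [[None] * ncols for _ in range(nrows)]
    for x, y, z in points:
        col = int((x - x0) // cell)
        row = int((y - y0) // cell)
        if 0 <= col < ncols and 0 <= row < nrows:
            current = grid[row][col]
            if current is None or z < current:
                grid[row][col] = z
    return grid


# Made-up points: each cell mixes a ground return (~100 m) with a canopy return.
points = [
    (0.5, 0.5, 100.2), (0.6, 0.4, 125.0),
    (1.5, 0.5, 100.9), (1.4, 0.6, 130.0),
    (0.5, 1.5, 100.1),
]
grid = min_z_grid(points, x0=0.0, y0=0.0, cell=1.0, ncols=2, nrows=2)
print(grid)
```

Production workflows usually rely on a dedicated ground-classification step rather than a raw minimum, since a single low outlier can distort a cell.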

2. LiDAR Tile Inventory
Spatial metadata and file management
CSV Format

Dataset Name

cms_brazil_lidar_tile_inventory.csv (provided with ORNL DAAC dataset)

Metadata

What It Is

Metadata manifest containing filename, acquisition date, and geographic footprint for each of the 3,154 LiDAR tiles.

Critical

Why It's Needed

Essential for spatial indexing and triage. Enables spatial joins and cross-referencing with other geographic datasets to identify areas of interest.
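A minimal sketch of inventory-based triage, using only the standard library. The column names (`filename`, `min_lon`, `min_lat`, `max_lon`, `max_lat`) are assumptions for illustration; the actual header of cms_brazil_lidar_tile_inventory.csv should be checked before adapting this. The sample rows are made up.

```python
import csv
import io
from typing import NamedTuple


class Tile(NamedTuple):
    filename: str
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float


def load_tiles(csv_text: str) -> list[Tile]:
    """Parse inventory rows into Tile records (column names are assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        Tile(
            row["filename"],
            float(row["min_lon"]),
            float(row["min_lat"]),
            float(row["max_lon"]),
            float(row["max_lat"]),
        )
        for row in reader
    ]


def tiles_intersecting(tiles, min_lon, min_lat, max_lon, max_lat):
    """Return tiles whose bounding box overlaps the query box."""
    return [
        t for t in tiles
        if t.min_lon <= max_lon and t.max_lon >= min_lon
        and t.min_lat <= max_lat and t.max_lat >= min_lat
    ]


# Two made-up rows standing in for the real inventory.
SAMPLE = """filename,min_lon,min_lat,max_lon,max_lat
tile_a.laz,-67.50,-10.10,-67.49,-10.09
tile_b.laz,-60.00,-3.00,-59.99,-2.99
"""

tiles = load_tiles(SAMPLE)
hits = tiles_intersecting(tiles, -68.0, -11.0, -67.0, -10.0)
print([t.filename for t in hits])
```

A bounding-box scan like this is adequate for a few thousand tiles; a spatial index would only matter at much larger scale.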

3. Known Archaeological Sites
Reference data for "known to unknown" strategy
Kaggle

Dataset Name

"AmazonGeoArchDB: Amazon Archaeology GIS"

Shapefiles

What It Is

Curated geospatial data with shapefiles delineating locations of discovered archaeological sites (geoglyphs, settlements, causeways).

Fundamental

Why It's Needed

Filtering: identifies tiles that cover known sites so they are not "rediscovered". Prioritization: builds 15 km buffer zones around known sites so that nearby unexplored tiles are examined first.
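The filtering and buffering logic can be sketched as follows. This is a simplification under stated assumptions: known sites are reduced to single lon/lat points (the real shapefiles may hold polygons or lines), tiles are represented by their bounding boxes, and distance is measured from the tile centre. The function names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
BUFFER_KM = 15.0             # buffer distance named in the text


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def classify_tile(bbox, sites, buffer_km=BUFFER_KM):
    """Label a tile 'known', 'priority', or 'background'.

    bbox is (min_lon, min_lat, max_lon, max_lat); sites is a list of (lon, lat).
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    # A tile that contains a known site is filtered out of the search.
    if any(min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
           for lon, lat in sites):
        return "known"
    # Otherwise, tiles whose centre lies inside the buffer are prioritized.
    c_lon, c_lat = (min_lon + max_lon) / 2, (min_lat + max_lat) / 2
    if any(haversine_km(c_lon, c_lat, lon, lat) <= buffer_km
           for lon, lat in sites):
        return "priority"
    return "background"


sites = [(-67.495, -10.095)]  # made-up site location
labels = {
    "covers_site": classify_tile((-67.50, -10.10, -67.49, -10.09), sites),
    "near_site": classify_tile((-67.45, -10.10, -67.44, -10.09), sites),
    "far_away": classify_tile((-60.00, -3.00, -59.99, -2.99), sites),
}
print(labels)
```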

4. Deforestation Data
Forest clearing correlation analysis
INPE/PRODES

Dataset Name

PRODES Deforestation Layer from Brazil's National Institute for Space Research

Temporal

What It Is

Shapefile mapping historical and recent clear-cut deforestation in the Brazilian Amazon, which PRODES has monitored annually since 1988.

Correlation

Why It's Needed

Deforestation correlates strongly with archaeological discovery: clearing exposes features that the canopy had hidden, so cleared areas have a higher detection probability.

5. Modern Populated Areas
Contemporary development exclusion zones
Multiple Sources

Dataset Name

OpenStreetMap (OSM), Brazilian Institute of Geography and Statistics (IBGE), Global Human Settlement Layer (GHSL)

Vector Data

What It Is

Boundaries of cities, towns, and significant infrastructure such as roads and industrial areas, mostly as vector layers (GHSL is distributed as raster grids).

Filtering

Why It's Needed

Critical for reducing false positives. Modern activity creates ground disturbances that could be mistaken for archaeological features.
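A sketch of the exclusion step, assuming candidate features are lon/lat points and each modern area is a simple polygon ring. It uses the standard ray-casting point-in-polygon test; the function names and coordinates are hypothetical, and a real pipeline would more likely use a GIS library for this.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the polygon ring.

    ring is a list of (lon, lat) vertices; the closing edge is implied.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def drop_excluded(candidates, exclusion_rings):
    """Keep only candidates that fall outside every exclusion polygon."""
    return [
        (lon, lat) for lon, lat in candidates
        if not any(point_in_polygon(lon, lat, ring) for ring in exclusion_rings)
    ]


# Made-up square standing in for a town boundary.
town = [(-60.05, -3.05), (-59.95, -3.05), (-59.95, -2.95), (-60.05, -2.95)]
candidates = [(-60.00, -3.00), (-61.00, -4.00)]
kept = drop_excluded(candidates, [town])
print(kept)
```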

Phase 6: Automated Verification & Contextualization
Multi-modal verification and historical context datasets
6. High-Resolution Satellite Imagery
Independent verification through optical analysis
ESA/Sentinel-2

Dataset Name

Sentinel-2 MultiSpectral Instrument (MSI) data via Google Earth Engine or the Copernicus Data Space Ecosystem (which replaced the Copernicus Open Access Hub)

10 m Resolution

What It Is

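Global satellite imagery in 13 spectral bands from visible to short-wave infrared, at 10 m, 20 m, or 60 m resolution depending on the band (the visible and near-infrared bands are 10 m), with a 5-day revisit time for the two-satellite constellation.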

Verification

Why It's Needed

Primary verification tool. True-color composites and NDVI products are generated to detect vegetation anomalies that can corroborate buried earthworks.
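NDVI is computed per pixel as (NIR - Red) / (NIR + Red); for Sentinel-2 these are bands B8 and B4, both at 10 m. The sketch below computes NDVI on small nested lists and flags pixels that deviate strongly from the patch mean. The z-score rule and its threshold are assumptions chosen for illustration, not the project's stated anomaly criterion, and the reflectance values are made up.

```python
import statistics


def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); returns 0.0 when both bands are zero."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def ndvi_anomalies(nir_band, red_band, z_threshold=2.0):
    """Return the NDVI grid and the (row, col) pixels with |z-score| > threshold."""
    values = [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]
    flat = [v for row in values for v in row]
    mean = statistics.fmean(flat)
    spread = statistics.pstdev(flat)
    if spread == 0:
        return values, []  # perfectly uniform patch: nothing to flag
    flagged = [
        (i, j)
        for i, row in enumerate(values)
        for j, v in enumerate(row)
        if abs(v - mean) / spread > z_threshold
    ]
    return values, flagged


# Made-up 3x3 reflectance patch: dense vegetation with one stressed centre pixel.
nir_band = [[0.5, 0.5, 0.5], [0.5, 0.3, 0.5], [0.5, 0.5, 0.5]]
red_band = [[0.05, 0.05, 0.05], [0.05, 0.2, 0.05], [0.05, 0.05, 0.05]]
values, anomalies = ndvi_anomalies(nir_band, red_band)
print(anomalies)
```

On real imagery the same arithmetic would normally run on arrays, with cloud-masked pixels excluded before the statistics are taken.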

7. Digitized Historical Texts
Contextual evidence from historical records
Compiled Corpus

Dataset Name

Custom corpus from Internet Archive, Project Gutenberg, university archives, and museum collections

Text Files

What It Is

Digitized historical documents including explorer journals (Percy Fawcett, Alfred Russel Wallace), missionary reports, and colonial records.

Contextual

Why It's Needed

Provides narrative context through NLP geoparsing. Historical accounts describe settlements and populations, lending supporting evidence to candidate discoveries.
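As a rough illustration of geoparsing, the sketch below matches place names from a small gazetteer against a passage and returns each mention with its surrounding context. Everything here is an assumption: the gazetteer coordinates are approximate placeholders, the passage is an invented sentence rather than a quotation, and a real system would use a trained NLP model and a full gazetteer instead of exact string matching.

```python
import re

# Hypothetical gazetteer: place name -> rough (lon, lat), for illustration only.
GAZETTEER = {
    "Rio Negro": (-61.0, -1.5),
    "Xingu": (-52.5, -5.0),
}


def geoparse(text, gazetteer, window=40):
    """Find gazetteer names in text; return place, coordinates, and context."""
    mentions = []
    for name, lonlat in gazetteer.items():
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            mentions.append({
                "place": name,
                "lonlat": lonlat,
                "context": text[start:end].strip(),
            })
    return mentions


# Invented sentence, not taken from any historical source.
passage = "We passed a large village on the upper Xingu, with raised roads leading inland."
mentions = geoparse(passage, GAZETTEER)
print([m["place"] for m in mentions])
```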

Data Integration Strategy
How datasets work together in the pipeline

Spatial Analysis Pipeline

1. LiDAR Inventory → Spatial indexing
2. Known Sites → Reference filtering
3. Deforestation → Priority weighting
4. Modern Areas → Exclusion zones
5. LiDAR Data → Feature detection
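The ordering above can be read as a filter-then-rank step that decides which tiles reach feature detection first. The sketch below assumes each tile already carries the flags produced by the earlier steps. The `TileFlags` fields, the weights, and the scoring formula are hypothetical choices for illustration; the source does not specify how priorities are combined.

```python
from dataclasses import dataclass


@dataclass
class TileFlags:
    filename: str
    covers_known_site: bool     # step 2: reference filtering
    near_known_site: bool       # step 2: inside the 15 km buffer
    deforested_fraction: float  # step 3: share of tile overlapping PRODES, 0.0 to 1.0
    in_modern_area: bool        # step 4: exclusion zone


def priority(tile, w_near=1.0, w_defor=0.5):
    """Return a score, or None if the tile is filtered out (weights are assumed)."""
    if tile.covers_known_site or tile.in_modern_area:
        return None
    return w_near * tile.near_known_site + w_defor * tile.deforested_fraction


def triage(tiles):
    """Drop filtered tiles and return filenames ordered by descending score."""
    scored = [(priority(t), t.filename) for t in tiles]
    kept = [(score, name) for score, name in scored if score is not None]
    return [name for score, name in sorted(kept, key=lambda p: p[0], reverse=True)]


# Made-up tiles exercising each branch.
tiles = [
    TileFlags("a.laz", True, True, 0.2, False),    # covers a known site: dropped
    TileFlags("b.laz", False, True, 0.4, False),   # score 1.2
    TileFlags("c.laz", False, False, 0.9, False),  # score 0.45
    TileFlags("d.laz", False, True, 0.8, True),    # modern area: dropped
    TileFlags("e.laz", False, True, 0.0, False),   # score 1.0
]
queue = triage(tiles)
print(queue)
```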

Verification Pipeline

1. Detected Features → Coordinate extraction
2. Satellite Imagery → Visual verification
3. Historical Texts → Contextual support
4. Multi-modal Package → Expert review

Dataset Synergy

The approach draws its strength from integrating independent data sources: each contributes a different type of evidence (topographic, spatial, spectral, and textual), and together they support a more robust archaeological assessment than any single source could.