✅ Data Validation in Accelerator

The Accelerator platform includes a built-in data validation system that checks datasets against expected formats, standards, and quality rules before they are used in computational workflows.


🧩 Supported Data Types

1. CSV Timeseries

  • Tabular data representing values over time
  • Required columns: time, variable, value

2. Regional Timeseries

  • Tabular data with spatial breakdown
  • Required columns: region, variable, value, time

3. Raster Timeseries

  • Spatial datasets (e.g., GeoTIFF) representing time-indexed grids
  • One file per timestep, with appropriate metadata (CRS, nodata)
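
As an illustration of the column requirements for the two tabular types, here is a minimal sketch using only Python's standard `csv` module. The type labels (`csv_timeseries`, `regional_timeseries`) and the function name are assumptions made for this example and are not the platform's API.

```python
import csv
import io

# Required columns per data type, as listed above.
# The dictionary keys are hypothetical labels chosen for this sketch.
REQUIRED_COLUMNS = {
    "csv_timeseries": {"time", "variable", "value"},
    "regional_timeseries": {"region", "variable", "value", "time"},
}


def missing_columns(csv_text: str, data_type: str) -> set[str]:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return REQUIRED_COLUMNS[data_type] - present


sample = "time,variable,value\n2020,temperature,14.2\n"
print(missing_columns(sample, "csv_timeseries"))       # set()
print(missing_columns(sample, "regional_timeseries"))  # {'region'}
```

Returning the set of missing names, rather than a boolean, lets the caller report exactly which column needs to be added.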

Note: Vector datasets (e.g., polygons) are not directly validated. They are typically used as supporting spatial layers (e.g., via GeoJSON or PMTiles) and integrated within routines that consume regional timeseries.


🔍 Validation Layers

🧱 Type Validation (Built-in)

Each data type includes a set of core validation rules:

  • File type and format checks
  • Structural column requirements (for CSV)
  • Metadata validation (for raster)
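
The raster metadata check could look roughly like the sketch below. It works on a plain record per timestep instead of reading real GeoTIFF files, so `RasterInfo` and `check_raster_series` are hypothetical names. The rule that all timesteps share one CRS is an assumption of this example, not something the platform is documented to enforce.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RasterInfo:
    """Hypothetical stand-in for metadata read from one GeoTIFF timestep."""
    path: str
    crs: Optional[str]       # e.g. "EPSG:4326"; None if the file has no CRS
    nodata: Optional[float]  # None if no nodata value is declared


def check_raster_series(files: list[RasterInfo]) -> list[str]:
    """Collect every problem found instead of stopping at the first one."""
    problems = []
    for info in files:
        if not info.path.lower().endswith((".tif", ".tiff")):
            problems.append(f"{info.path}: not a GeoTIFF file name")
        if info.crs is None:
            problems.append(f"{info.path}: missing CRS")
        if info.nodata is None:
            problems.append(f"{info.path}: missing nodata value")
    # Assumption for this sketch: all timesteps should share a single CRS.
    distinct_crs = {info.crs for info in files if info.crs is not None}
    if len(distinct_crs) > 1:
        problems.append(f"inconsistent CRS across timesteps: {sorted(distinct_crs)}")
    return problems


series = [
    RasterInfo("t2020.tif", "EPSG:4326", -9999.0),
    RasterInfo("t2021.tif", None, -9999.0),
]
print(check_raster_series(series))  # ['t2021.tif: missing CRS']
```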

⚙️ Custom Validation Rules

Users can define additional JSON-based rule sets to validate content-specific expectations:

  • Required or allowed variable names
  • Value ranges (e.g., temperatures between -50 and 60)
  • Allowed units or categories
  • Logical checks (e.g., monotonic time, no missing values)

These rules enhance quality control and enforce domain-specific standards.
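
To make the rule categories above concrete, here is a minimal sketch of a JSON rule set and a function that applies it to timeseries rows. The rule keys (`allowed_variables`, `value_ranges`, `monotonic_time`, `no_missing_values`) are invented for this example, and the platform's actual rule syntax may differ. The sketch also assumes that "monotonic time" means strictly increasing time per variable.

```python
import json

# Hypothetical rule set; the platform's real rule syntax may differ.
RULES_JSON = """
{
  "allowed_variables": ["temperature", "precipitation"],
  "value_ranges": {"temperature": {"min": -50, "max": 60}},
  "monotonic_time": true,
  "no_missing_values": true
}
"""


def apply_rules(rows: list[dict], rules: dict) -> list[str]:
    """Check rows (dicts with time, variable, value) against a rule set."""
    errors = []
    allowed = set(rules.get("allowed_variables", []))
    ranges = rules.get("value_ranges", {})
    last_time = {}  # last time seen per variable, for the monotonic check
    for i, row in enumerate(rows):
        var, value, time = row.get("variable"), row.get("value"), row.get("time")
        if rules.get("no_missing_values") and None in (var, value, time):
            errors.append(f"row {i}: missing value")
            continue
        if allowed and var not in allowed:
            errors.append(f"row {i}: variable {var!r} not allowed")
        bounds = ranges.get(var)
        if bounds and value is not None and not (bounds["min"] <= value <= bounds["max"]):
            errors.append(
                f"row {i}: {var} value {value} outside [{bounds['min']}, {bounds['max']}]"
            )
        if rules.get("monotonic_time") and time is not None:
            # Assumption: time must strictly increase within each variable.
            if var in last_time and time <= last_time[var]:
                errors.append(f"row {i}: time {time} not increasing for {var}")
            last_time[var] = time
    return errors


rules = json.loads(RULES_JSON)
rows = [
    {"time": 2020, "variable": "temperature", "value": 14.2},
    {"time": 2021, "variable": "temperature", "value": 75.0},
    {"time": 2021, "variable": "humidity", "value": 0.4},
]
for message in apply_rules(rows, rules):
    print(message)
```

With these sample rows, the out-of-range temperature and the unlisted `humidity` variable are each reported once.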


📦 Validation Schemas

Validation rules can be bundled into schemas — reusable JSON templates registered on the platform.

  • Each schema has a unique identifier
  • Can be referenced in:
    • Routines
    • Pipelines
    • Manual dataset validation processes
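
The idea of registering a schema under a unique identifier and referring to it by that identifier can be sketched as follows. `SchemaRegistry` is an in-memory stand-in written for this example. The schema id, the schema fields, and the `input_schema` key are all hypothetical.

```python
import json


class SchemaRegistry:
    """In-memory stand-in for the platform's schema store (hypothetical)."""

    def __init__(self) -> None:
        self._schemas: dict[str, dict] = {}

    def register(self, schema_id: str, schema_json: str) -> None:
        # Identifiers must be unique, so re-registering an id is an error.
        if schema_id in self._schemas:
            raise ValueError(f"schema id already registered: {schema_id}")
        self._schemas[schema_id] = json.loads(schema_json)

    def get(self, schema_id: str) -> dict:
        if schema_id not in self._schemas:
            raise KeyError(f"unknown schema id: {schema_id}")
        return self._schemas[schema_id]


registry = SchemaRegistry()
registry.register(
    "climate-timeseries-v1",
    '{"data_type": "csv_timeseries", "allowed_variables": ["temperature"]}',
)

# A routine, a pipeline, or a manual check can all refer to the same id.
routine_config = {"name": "aggregate", "input_schema": "climate-timeseries-v1"}
schema = registry.get(routine_config["input_schema"])
print(schema["allowed_variables"])  # ['temperature']
```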

Benefits

  • Reuse: Apply the same validation across multiple datasets
  • Share: Collaborate with teams using common standards
  • Enforce: Automate checks before workflows consume data

💡 Use Cases

🔁 Harmonizing Datasets

  • Apply validation schemas to datasets from different sources
  • Standardize structure before ingestion
  • Improve interoperability across workflows

🔄 Reusable Computational Modules

  • Declare validation schema requirements in a routine
  • Ensure routines only accept datasets with the expected shape
  • Avoid hidden data assumptions and simplify reuse

🔗 Pluggable Workflows

  • Define dataset requirements as part of routine metadata
  • Allow upstream producers to align to schema
  • Enable modular, composable data pipelines
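
One way to picture schema requirements in routine metadata is a pre-flight check that compares what each routine produces with what the next one requires. The sketch below is illustrative only. `RoutineMeta`, its fields, and the schema ids are hypothetical, and matching by identical schema id is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoutineMeta:
    """Hypothetical routine metadata: the schemas it consumes and produces."""
    name: str
    input_schema: Optional[str]   # schema id required of the incoming dataset
    output_schema: Optional[str]  # schema id the routine declares for its output


def check_pipeline(routines: list[RoutineMeta]) -> list[str]:
    """Report adjacent routines whose declared schemas do not line up."""
    mismatches = []
    for producer, consumer in zip(routines, routines[1:]):
        required = consumer.input_schema
        # Assumption: schemas are compatible only when their ids are identical.
        if required is not None and producer.output_schema != required:
            mismatches.append(
                f"{producer.name} -> {consumer.name}: "
                f"produces {producer.output_schema!r}, requires {required!r}"
            )
    return mismatches


pipeline = [
    RoutineMeta("load", None, "regional-timeseries-v1"),
    RoutineMeta("aggregate", "regional-timeseries-v1", "regional-summary-v1"),
    RoutineMeta("plot", "raster-timeseries-v1", None),
]
for problem in check_pipeline(pipeline):
    print(problem)
```

Because the requirement lives in metadata, a mismatch such as the `aggregate` to `plot` step here can be caught before any data is processed.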

🛠️ Integration

  • Validation can be triggered as part of:
    • Routine execution
    • Manual validation tool
    • Dataset registration step
  • Routines like Regional Timeseries Validator use this feature automatically.


🧭 Summary

| Feature | Description |
| --- | --- |
| Built-in Type Checks | Validate CSV and raster structure |
| User-Defined Rules | Create schemas with custom constraints |
| Schema Identifiers | Reuse validation rules across datasets and workflows |
| Integrated with Routines | Compatible with data loading and validation flows |
| Optional Vector Use | Vector data is not validated directly but is used to interpret region fields |

Data validation is not just about integrity; it is the foundation for **trustworthy, reproducible, and modular workflows** on the Accelerator platform.