
Accelerator Guide: Scientific Workflow Framework for Data to Indicators

Objective of This Guide

This guide introduces a structured approach to executing scientific workflows on the Accelerator platform, with a particular focus on environmental, agricultural, and biodiversity-related data pipelines. It is intended for technical practitioners (data scientists, modelers, and research engineers) who want to:

  • Convert public raw datasets into model-ready formats
  • Validate and standardize model outputs
  • Derive and inspect domain-specific indicators (e.g., biodiversity indicators)
  • Version, reproduce, and share end-to-end simulations and transformations

Rather than covering tool-specific instructions, this guide articulates the intent and architecture of such workflows, setting the stage for deeper technical execution in companion documents.


What This Guide Covers

We highlight three representative use cases that collectively demonstrate the lifecycle of scientific data usage on Accelerator:

1. From Raw FAOSTAT Data to Model Input

This case focuses on how open-access data (e.g., FAOSTAT) can be downloaded, processed, and converted into formats such as GDX, suitable for models like GLOBIOM and CAPRI. It reflects early-stage data ingestion and transformation before simulation.

Objective: Standardize and prepare external data for scientific modeling pipelines.

Detailed implementation covered in: "Use Case 1: FAOSTAT to Model Input"
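
The ingestion step described above can be sketched in plain Python. This is a minimal illustration, not the Accelerator routine: the column names follow the usual layout of FAOSTAT CSV exports (Area, Item, Element, Year, Unit, Value), the sample rows are made up for the example, and `to_model_input` is a hypothetical helper. Writing the result to GDX would need the GAMS tooling and is left out.

```python
import csv
import io

# Illustrative rows only; these are not real FAOSTAT values.
SAMPLE_CSV = """Area,Item,Element,Year,Unit,Value
Austria,Wheat,Production,2020,t,100
Austria,Wheat,Area harvested,2020,ha,20
Austria,Maize,Production,2020,t,
Kenya,Wheat,Production,2020,t,50
"""


def to_model_input(csv_text, element):
    """Return {(area, item, year): value} for one element, skipping blank values."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Keep only the requested element and drop rows with no reported value.
        if row["Element"] != element or row["Value"].strip() == "":
            continue
        key = (row["Area"], row["Item"], int(row["Year"]))
        table[key] = float(row["Value"])
    return table


production = to_model_input(SAMPLE_CSV, "Production")
for key, value in sorted(production.items()):
    print(key, value)
```

Keying the table by (area, item, year) mirrors the set-indexed parameters that equation-based models typically consume, which is why the long CSV layout is collapsed into a single lookup here.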


2. From Model Output to Validated Indicators

After the models have run, this case shows how their outputs can be harmonized, validated, and passed through indicator-generating modules, with biodiversity indicators as a case in point. Outputs are checked for structural integrity and consistency.

Objective: Enhance model outputs with value-added derived metrics that are validated and interpretable.

Detailed implementation covered in: "Use Case 2: Harmonization and Indicators"
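
A rough sketch of what a structural check followed by an indicator calculation could look like is given below. Everything here is an assumption made for illustration: the record schema (`region`, `land_class`, `area_ha`), the `validate` helper, and the choice of the Shannon diversity index as a stand-in indicator. The guide does not say which indicators or checks Accelerator applies.

```python
import math

REQUIRED_FIELDS = ("region", "land_class", "area_ha")  # hypothetical schema


def validate(records):
    """Return a list of problems; an empty list means the records passed."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            problems.append(f"row {i}: missing {missing}")
            continue
        if not isinstance(rec["area_ha"], (int, float)) or rec["area_ha"] < 0:
            problems.append(f"row {i}: area_ha must be a non-negative number")
        key = (rec["region"], rec["land_class"])
        if key in seen:
            problems.append(f"row {i}: duplicate key {key}")
        seen.add(key)
    return problems


def shannon_index(records, region):
    """Shannon diversity H = -sum(p * ln p) over land-class area shares in a region."""
    areas = [r["area_ha"] for r in records if r["region"] == region and r["area_ha"] > 0]
    total = sum(areas)
    return -sum((a / total) * math.log(a / total) for a in areas)


model_output = [
    {"region": "R1", "land_class": "forest", "area_ha": 50.0},
    {"region": "R1", "land_class": "cropland", "area_ha": 50.0},
]
if not validate(model_output):
    print(shannon_index(model_output, "R1"))
```

Running validation before the indicator step means a malformed output fails with a readable message instead of producing a silently wrong metric.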


3. Full Pipeline as Jobflow and Shared Model

Finally, we illustrate how the entire process, from ingestion to indicator, can be assembled into a single reproducible pipeline using Accelerator's jobflow orchestration. The jobflow can be hosted, parameterized, and shared with collaborators or the public, so that anyone with access can repeat the run.

Objective: Deploy the pipeline as a hosted scientific service with access controls and user-configurable parameters.

Detailed implementation covered in: "Use Case 3: Hosted Reproducible Jobflow"
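
The idea of chaining stages under one set of user-configurable parameters can be shown with a small, generic sketch. It does not use or imitate Accelerator's jobflow API: the step functions, the `run_pipeline` helper, and the parameter names `min_value` and `scale` are all hypothetical.

```python
def ingest(data, params):
    """Drop values below a user-supplied threshold."""
    return [x for x in data if x >= params["min_value"]]


def simulate(data, params):
    """Stand-in for a model run: scale every value."""
    return [x * params["scale"] for x in data]


def indicator(data, params):
    """Stand-in indicator: the mean of the simulated values."""
    return sum(data) / len(data) if data else 0.0


def run_pipeline(steps, data, params):
    """Run steps in order, feeding each output to the next, and record the order."""
    trace = []
    for step in steps:
        data = step(data, params)
        trace.append(step.__name__)
    return data, trace


result, trace = run_pipeline(
    [ingest, simulate, indicator],
    [1, 2, 3, 4],
    {"min_value": 2, "scale": 2.0},
)
print(result, trace)
```

Because every step receives the same `params` mapping, rerunning with different parameters changes the result without touching the pipeline definition, which is the property a hosted, parameterized jobflow relies on.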




Versioning for Trust and Transparency

A core principle in scientific data work is reproducibility. Accelerator supports this by enabling complete versioning of both code and data using integrated technologies:

  • DVC-powered routines let you push datasets to and pull them from remote storage, keeping versions synchronized
  • Git-integrated routines preserve the exact code version used in each job, including dependencies and configuration
  • Each run is traceable, with a record of its:
    • Code hash
    • Dataset mapping state
    • Parameters used
    • Runtime logs and duration

This ensures outputs can be audited, re-used, and trusted across time, people, and institutions.
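
To make the traceability items above concrete, the sketch below builds a run record from content hashes. It is an illustration under stated assumptions, not Accelerator's record format: `build_run_record`, `record_fingerprint`, and the field names are hypothetical, and SHA-256 is simply a convenient choice of hash.

```python
import hashlib
import json


def sha256_hex(payload: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def build_run_record(code, datasets, params, started, finished):
    """Collect code hash, per-dataset hashes, parameters, and duration for one run."""
    return {
        "code_hash": sha256_hex(code),
        "dataset_hashes": {name: sha256_hex(blob) for name, blob in sorted(datasets.items())},
        "params": params,
        "duration_s": round(finished - started, 3),
    }


def record_fingerprint(record):
    """Stable fingerprint: hash of the record serialized with sorted keys."""
    return sha256_hex(json.dumps(record, sort_keys=True).encode("utf-8"))


record = build_run_record(
    code=b"print('model')",
    datasets={"inputs.csv": b"a,b\n1,2\n"},
    params={"scale": 2.0},
    started=10.0,
    finished=12.5,
)
print(record_fingerprint(record))
```

Two runs with identical code, data, parameters, and duration produce the same fingerprint, so a mismatch signals that something in the run's inputs changed.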


Summary

This guide introduces the intent and scope of scientific workflow support in the Accelerator platform. Through three connected cases, we demonstrate how a modeler or researcher can:

  • Operate across raw data ingestion, simulation, enrichment, and visualization
  • Apply validation checkpoints at multiple stages
  • Convert workflows into hosted and parameterized tools
  • Enable reproducible science through built-in versioning

Each use case is elaborated further in the dedicated document named in its section above.