Accelerator Guide: Scientific Workflow Framework for Data to Indicators
Objective of This Guide
This guide introduces a structured approach to executing scientific workflows using the Accelerator platform, with a particular focus on environmental, agricultural, and biodiversity-related data pipelines. The document is intended for technical practitioners (data scientists, modelers, and research engineers) who are looking to:
- Integrate public raw datasets into model-ready formats
- Validate and standardize model outputs
- Derive and inspect domain-specific indicators (e.g., biodiversity indicators)
- Version, reproduce, and share end-to-end simulations and transformations
Rather than covering tool-specific instructions, this guide articulates the intent and architecture of such workflows, setting the stage for deeper technical execution in companion documents.
What This Guide Covers
We highlight three representative use cases that collectively demonstrate the lifecycle of scientific data usage on Accelerator:
1. From Raw FAOSTAT Data to Model Input
This case focuses on how open-access data (e.g., FAOSTAT) can be downloaded, processed, and converted into formats such as GDX, suitable for models like GLOBIOM and CAPRI. It reflects early-stage data ingestion and transformation before simulation.
Objective: Standardize and prepare external data for scientific modeling pipelines.
Detailed implementation covered in: "Use Case 1: FAOSTAT to Model Input"
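As a rough illustration of the ingestion step, the sketch below parses a long-format CSV into keyed records that a later step could write to a model format. It assumes the column names `Area`, `Item`, `Element`, `Year`, `Unit`, and `Value`; the sample values and the `load_records` helper are hypothetical. Writing GDX itself is left out, because it depends on GAMS tooling that this guide does not describe.

```python
import csv
import io

# Hypothetical sample in a long, one-observation-per-row layout.
# The values are made up for illustration only.
SAMPLE_CSV = """Area,Item,Element,Year,Unit,Value
Austria,Wheat,Production,2020,t,100.5
Austria,Wheat,Production,2021,t,
Brazil,Maize,Production,2020,t,2500
"""


def load_records(text, element="Production"):
    """Parse long-format rows into {(area, item, year): value}."""
    records = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row["Element"] != element:
            continue
        raw = row["Value"].strip()
        if not raw:
            # Missing observations are skipped rather than filled with a guess.
            continue
        key = (row["Area"], row["Item"], int(row["Year"]))
        records[key] = float(raw)
    return records


records = load_records(SAMPLE_CSV)
for key, value in sorted(records.items()):
    print(key, value)
```

Keying records by (area, item, year) mirrors the set-indexed parameters that models in this family typically consume, which keeps the later export step a straightforward mapping.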
2. From Model Output to Validated Indicators
After model runs, we address how their outputs can be harmonized, validated, and passed through indicator-generating modules, with biodiversity indicators as a case in point. Outputs are checked for structural integrity and consistency.
Objective: Enhance model outputs with value-added derived metrics that are validated and interpretable.
Detailed implementation covered in: "Use Case 2: Harmonization and Indicators"
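A structural check of this kind could look like the following minimal sketch. The schema (`region`, `year`, `variable`, `value`), the sample rows, and the `validate_rows` helper are assumptions made for illustration; they are not the validation rules Accelerator applies.

```python
# Assumed schema for harmonized model output; hypothetical, not a platform standard.
REQUIRED_FIELDS = ("region", "year", "variable", "value")


def validate_rows(rows):
    """Return a list of problems; an empty list means the rows passed."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        missing = [f for f in REQUIRED_FIELDS if f not in row or row[f] is None]
        if missing:
            problems.append(f"row {i}: missing {', '.join(missing)}")
            continue
        if not isinstance(row["value"], (int, float)):
            problems.append(f"row {i}: value is not numeric")
        key = (row["region"], row["year"], row["variable"])
        if key in seen:
            # The same region/year/variable reported twice is inconsistent.
            problems.append(f"row {i}: duplicate key {key}")
        seen.add(key)
    return problems


rows = [
    {"region": "EU", "year": 2030, "variable": "cropland_area", "value": 120.0},
    {"region": "EU", "year": 2030, "variable": "cropland_area", "value": 121.0},
    {"region": "EU", "year": 2040, "variable": "cropland_area"},
]
problems = validate_rows(rows)
for problem in problems:
    print(problem)
```

Collecting every problem instead of stopping at the first one gives the modeler a complete report before indicators are computed.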
3. Full Pipeline as Jobflow and Shared Model
Finally, we illustrate how the entire process, from ingestion to indicator, can be assembled into a single reproducible pipeline using Accelerator's jobflow orchestration. The jobflow can be hosted, parameterized, and shared with collaborators or the public, ensuring repeatable science.
Objective: Deploy the pipeline as a hosted scientific service with access controls and user-configurable parameters.
Detailed implementation covered in: "Use Case 3: Hosted Reproducible Jobflow"
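The idea of a parameterized, ordered pipeline can be sketched in plain Python. This is not Accelerator's jobflow API: the step functions, the `scale` parameter, and the `run_pipeline` helper are hypothetical stand-ins that only show how one set of user-supplied parameters flows through every stage.

```python
def ingest(state, params):
    # Stand-in for data ingestion; the values are hypothetical.
    state["raw"] = [1.0, 2.0, 3.0]
    return state


def transform(state, params):
    # Stand-in for a model or transformation step driven by a user parameter.
    state["scaled"] = [v * params["scale"] for v in state["raw"]]
    return state


def indicator(state, params):
    # Stand-in for indicator derivation: here, a simple mean.
    state["indicator"] = sum(state["scaled"]) / len(state["scaled"])
    return state


STEPS = [ingest, transform, indicator]


def run_pipeline(params):
    """Run every step in order, passing the shared state and parameters along."""
    state = {}
    for step in STEPS:
        state = step(state, params)
    return state


result = run_pipeline({"scale": 2.0})
print(result["indicator"])
```

Because the steps and their order are fixed while the parameters are supplied per run, two collaborators running the same pipeline with the same parameters obtain the same result.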

Versioning for Trust and Transparency
A core principle in scientific data work is reproducibility. Accelerator supports this by enabling complete versioning of both code and data using integrated technologies:
- DVC-powered routines let you manage and push/pull datasets from remote storage, keeping versions synchronized
- Git-integrated routines preserve the exact code version used in each job, including dependencies and configuration
- Each run is traceable, including:
  - Code hash
  - Dataset mapping state
  - Parameters used
  - Runtime logs and duration
This ensures outputs can be audited, re-used, and trusted across time, people, and institutions.
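To make the traceability fields concrete, the sketch below assembles a run record from a code hash, a dataset state, parameters, and a duration. The field names, the `make_run_record` helper, and the use of SHA-256 are assumptions for illustration; they do not describe how Accelerator, Git, or DVC compute or store this information.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def make_run_record(code, dataset_versions, params, started, finished):
    """Build a hypothetical audit record for a single run."""
    # Sorting keys makes the dataset fingerprint independent of dict order.
    dataset_blob = json.dumps(dataset_versions, sort_keys=True).encode("utf-8")
    return {
        "code_hash": fingerprint(code),
        "dataset_state": fingerprint(dataset_blob),
        "parameters": params,
        "duration_seconds": round(finished - started, 3),
    }


# Illustrative inputs only; timestamps are fixed numbers, not measured times.
record = make_run_record(
    code=b"print('model run')",
    dataset_versions={"faostat.csv": "v1", "regions.csv": "v3"},
    params={"scale": 2.0},
    started=100.0,
    finished=101.5,
)
print(json.dumps(record, indent=2))
```

Two runs with identical code, dataset versions, and parameters produce identical hashes in such a record, which is what makes an output auditable after the fact.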
Summary
This guide introduces the intent and scope of scientific workflow support in the Accelerator platform. Through three connected cases, we demonstrate how a modeler or researcher can:
- Operate across raw data ingestion, simulation, enrichment, and visualization
- Apply validation checkpoints at multiple stages
- Convert workflows into hosted and parameterized tools
- Enable reproducible science through built-in versioning
Each section is elaborated further in dedicated documents linked above.