Getting Started / Prerequisites
- Getting Started with Accelerator
- Routine Introduction
- Routine Basics
- Routine Base Stacks
- Routine Data Mapping
- Common Inbuilt Routines
Use Case 3: Data Versioning with DVC and Git
This guide demonstrates how the Accelerator platform enables versioning of output data, from either a **single job** or an entire jobflow, using Git + DVC.
Versioning is critical to ensuring reproducibility. By versioning output data, you make it traceable, auditable, and reusable.
Objective
- Version outputs from jobs and jobflows
- Push datasets to Git + DVC repositories
- Enable reproducibility, audit trails, and collaboration
- Use either:
  - Your own Git + DVC + DVC remote storage
  - Future platform-managed DVC storage (coming soon)
Workflow Overview
| Scenario | How to Use DVC Push Routine |
|---|---|
| Single routine output versioning | Run DVC Push as a standalone routine |
| Full jobflow output versioning | Attach DVC Push as final step in a jobflow |
Core Mechanism
Accelerator provides an inbuilt routine: `DVC powered git push`.
This routine can be:
- Attached to any routine to capture the output of each job instantiated from that routine
- Added to the end of a jobflow as a child
- Run standalone
It works via standard data mapping:
- You map the outputs you want to version into the DVC Push routine's input path: `/code/workdir/newfiles`
The routine pushes data to:
- Git repo (metadata)
- DVC remote (typically S3 bucket)
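The split between Git (metadata) and the DVC remote (data) follows the usual DVC workflow. The sketch below lists that general command sequence using only standard `git` and `dvc` subcommands. It is not the routine's actual implementation: the helper name `versioning_commands` and its parameters are hypothetical, and nothing is executed.

```python
from pathlib import PurePosixPath


def versioning_commands(data_folder: str, commit_message: str, branch: str) -> list[list[str]]:
    """Return the git/dvc commands that would version `data_folder`.

    Hypothetical helper: it only builds the command lists, it runs nothing.
    """
    folder = PurePosixPath(data_folder)
    pointer = f"{folder}.dvc"                      # small pointer file written by `dvc add`
    gitignore = str(folder.parent / ".gitignore")  # `dvc add` keeps the data itself out of Git
    return [
        ["dvc", "add", str(folder)],               # track the data with DVC
        ["git", "add", pointer, gitignore],        # stage only the metadata in Git
        ["git", "commit", "-m", commit_message],
        ["dvc", "push"],                           # upload the data to the DVC remote
        ["git", "push", "origin", branch],         # publish the metadata
    ]


commands = versioning_commands(
    "data/model-outputs", "Versioned model output for exp123", "main"
)
for argv in commands:
    print(" ".join(argv))
```

Only the small `.dvc` pointer file and the `.gitignore` entry reach the Git repository. The data files themselves travel to the DVC remote with `dvc push`.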
Users can configure:
- Their own Git repo and DVC remote storage
- Credentials and secrets via standard secrets schema
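One way to keep credentials out of a `wkube.py` file is to read them from environment variables before passing them on as job secrets. The sketch below assumes the three secret names used in the wkube.py example in this guide. The helper `load_job_secrets` is hypothetical and is not part of `accli` or of the platform's secrets schema.

```python
import os

# Secret names as they appear in the wkube.py example of this guide.
REQUIRED_SECRETS = ("GIT_PAT", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def load_job_secrets(env=None) -> dict[str, str]:
    """Collect the required secrets from a mapping (hypothetical helper).

    Defaults to os.environ; raises KeyError if any secret is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing secrets: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SECRETS}


# Stand-in environment for illustration; real values would come from os.environ.
example_env = {
    "GIT_PAT": "example-pat",
    "AWS_ACCESS_KEY_ID": "example-key",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
}
job_secrets = load_job_secrets(example_env)
```

The resulting dictionary has the same shape as the `job_secrets` argument in the wkube.py example, so tokens never need to be committed alongside the task definition.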
In the future:
- Accelerator will offer an optional managed DVC remote storage, with no extra setup required
Users can also implement their own versioning adapters if needed.
Execution Patterns
A. Standalone DVC Push Routine
- Run the DVC Push routine, selected from the Routine List Page
- Select files or folders manually and map them to `/code/workdir/newfiles`
- Configure Git + DVC settings via the form
B. DVC Push as Last Step in Jobflow
- Add the `DVC powered git push` routine at the end of your jobflow
- Map the output of the previous step (usually `/mnt/pipe/` or an explicit output folder)
- The final jobflow output is then versioned automatically
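Both patterns come down to a mapping whose destination is the DVC Push input path. The sketch below checks that condition. It assumes the `source:destination` string format seen in this guide's example, and that a standalone run can use `selected_files` as the source. The helpers `split_mapping` and `targets_dvc_push` are hypothetical.

```python
DVC_PUSH_INPUT_PATH = "/code/workdir/newfiles"  # input path stated in this guide


def split_mapping(mapping: str) -> tuple[str, str]:
    """Split a 'source:destination' mapping string (format inferred from the examples)."""
    source, sep, destination = mapping.partition(":")
    if not sep or not source or not destination:
        raise ValueError(f"Expected 'source:destination', got {mapping!r}")
    return source, destination


def targets_dvc_push(mapping: str) -> bool:
    """Return True if the mapping delivers data to the DVC Push input path."""
    _, destination = split_mapping(mapping)
    return destination.rstrip("/") == DVC_PUSH_INPUT_PATH


standalone = "selected_files:/code/workdir/newfiles"  # pattern A (assumed source name)
jobflow = "/mnt/pipe/:/code/workdir/newfiles"          # pattern B, as in the example

for mapping in (standalone, jobflow):
    print(mapping, "->", targets_dvc_push(mapping))
```

A check like this can catch a mistyped destination before a job is dispatched, since a mapping that points elsewhere would leave the routine with nothing to version.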


Wkube.py Example: Jobflow with DVC Push
```python
from accli import WKubeTask

# Core processing routine (example)
core_task = WKubeTask(
    name="FAO Downloader",
    repo_url="[email protected]:ACT4CAP27/faodata.git",
    repo_branch="master",
    base_stack="R4_4",
    command="Rscript main.R",
    required_cores=1,
    required_ram=1024 * 1024 * 1024,
    required_storage_local=1024 * 1024 * 1024,
    required_storage_workflow=1024,
    timeout=3600,
    conf={
        "input_mappings": "selected_files:/code/inputs/",
        "output_mappings": "/code/outputs/:/mnt/pipe"
    }
)

# DVC Push routine
dvc_push = WKubeTask(
    name="DVC powered git push",
    repo_url="https://github.com/iiasa/accelerator-common-routines.git",
    repo_branch="master",
    docker_filename="git_dvc_push/Dockerfile",
    command="python main.py",
    required_cores=1,
    required_ram=1024 * 1024 * 1024,
    required_storage_local=1024 * 1024 * 1024,
    required_storage_workflow=1024,
    timeout=3600,
    conf={
        "input_mappings": "/mnt/pipe/:/code/workdir/newfiles",
        "GIT_REPO_URL_HTTP": "https://github.com/myorg/my-data-repo.git",
        "BRANCH_NAME": "main",
        "DVC_S3_ENDPOINT_URL": "https://s3.example.com",
        "DVC_S3_BUCKET": "my-dvc-bucket",
        "DVC_S3_PREFIX": "model-outputs/exp123",
        "REPO_DATA_FOLDER": "data/model-outputs",
        "COMMIT_MESSAGE": "Versioned model output for exp123"
    },
    job_secrets={
        "GIT_PAT": "<git personal access token>",
        "AWS_ACCESS_KEY_ID": "<aws access key>",
        "AWS_SECRET_ACCESS_KEY": "<aws secret key>"
    }
)

# Attach DVC Push to jobflow
core_task.add_child(dvc_push)
```
Current Capabilities
You can use:
- Your own Git repo (any Git provider)
- Your own DVC remote storage (for now only S3-compatible storage is supported)
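For an S3-compatible remote, the bucket, prefix, and endpoint settings from the example correspond naturally to a DVC remote URL and its `endpointurl` option. The sketch below shows one plausible translation using the standard `dvc remote add` and `dvc remote modify` commands. How the routine actually combines these settings is an assumption, and the helper `dvc_remote_commands` and the remote name `storage` are hypothetical.

```python
def dvc_remote_commands(
    bucket: str, prefix: str, endpoint_url: str, remote_name: str = "storage"
) -> list[list[str]]:
    """Build DVC commands for an S3-compatible remote (hypothetical helper).

    Only builds the command lists; nothing is executed.
    """
    clean_prefix = prefix.strip("/")
    url = f"s3://{bucket}/{clean_prefix}" if clean_prefix else f"s3://{bucket}"
    return [
        # Register the remote and make it the default (-d).
        ["dvc", "remote", "add", "-d", remote_name, url],
        # Point DVC at a non-AWS, S3-compatible endpoint.
        ["dvc", "remote", "modify", remote_name, "endpointurl", endpoint_url],
    ]


remote_commands = dvc_remote_commands(
    bucket="my-dvc-bucket",
    prefix="model-outputs/exp123",
    endpoint_url="https://s3.example.com",
)
for argv in remote_commands:
    print(" ".join(argv))
```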
You can implement your own versioning adapters if you need alternatives to DVC.
Roadmap
In the future, Accelerator will offer:
- A platform-managed storage bucket, with no need to set up external S3/DVC.
Summary
This use case demonstrates how to:
- Version outputs from jobs and jobflows
- Push data to Git + DVC-based storage
- Support reproducibility and transparency
- Flexibly use either single-job or jobflow-based execution
By providing the DVC Push routine, Accelerator enables versioning of both intermediate and final output data.