🗂️ Use Case 3: Data Versioning with DVC and Git

This guide demonstrates how the Accelerator platform versions output data, from either a **single job** or an entire jobflow, using Git + DVC.

Versioning is critical to ensuring reproducibility. By versioning output data, you make it traceable, auditable, and reusable.


🎯 Objective

  • Version outputs from jobs and jobflows
  • Push datasets to Git + DVC repositories
  • Enable reproducibility, audit trails, and collaboration
  • Use either:
    • Your own Git + DVC + DVC remote storage
    • Future platform-managed DVC storage (coming soon)

🗺️ Workflow Overview

| Scenario | How to Use DVC Push Routine |
| --- | --- |
| Single routine output versioning | Run DVC Push as a standalone routine |
| Full jobflow output versioning | Attach DVC Push as the final step in a jobflow |

⚙️ Core Mechanism

✅ Accelerator provides a built-in routine → DVC powered git push

✅ This routine can be:

  • Attached to any routine to capture the output of each job instantiated from that routine
  • Added to the end of a jobflow as a child
  • Run standalone

✅ It works via standard data mapping:

  • You map the outputs you want to version into the DVC Push routine's input path → /code/workdir/newfiles
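As a minimal sketch of what such a mapping looks like, the helper below builds a `source:destination` string in the same form as the `input_mappings` values used in this guide's Wkube.py example. The function name `build_mapping` and the constant `DVC_PUSH_INPUT_PATH` are illustrative and not part of the platform API.

```python
# Hypothetical helper: builds a "source:destination" mapping string.
DVC_PUSH_INPUT_PATH = "/code/workdir/newfiles"  # input path of the DVC Push routine


def build_mapping(source: str, destination: str) -> str:
    """Join a source and a destination into a 'source:destination' mapping."""
    if not source or not destination:
        raise ValueError("both source and destination are required")
    return f"{source}:{destination}"


# Map the shared pipe folder of a previous step into the DVC Push input path.
print(build_mapping("/mnt/pipe/", DVC_PUSH_INPUT_PATH))
# → /mnt/pipe/:/code/workdir/newfiles
```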

✅ The routine pushes data to:

  • Git repo (metadata)
  • DVC remote (typically S3 bucket)
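To illustrate the split between the two destinations, the sketch below computes the kind of small pointer record (path, content hash, size) that DVC typically tracks in Git, while the file contents themselves go to the DVC remote. This is a simplified illustration and not the routine's implementation: the function name `pointer_metadata` is hypothetical, and real `.dvc` files are YAML with additional fields.

```python
import hashlib
import tempfile
from pathlib import Path


def pointer_metadata(path: Path) -> dict:
    """Return a small record describing a data file by its content hash.

    Only this record would live in Git; the bytes themselves would be
    uploaded to the DVC remote (for example an S3 bucket).
    """
    data = path.read_bytes()
    return {
        "path": path.name,
        "md5": hashlib.md5(data).hexdigest(),
        "size": len(data),
    }


# Demonstrate on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "output.csv"
    sample.write_bytes(b"year,value\n2020,1\n")
    print(pointer_metadata(sample))
```

Because the record depends only on the file contents, two identical outputs produce the same hash, which is what makes versioned data traceable.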

✅ Users can configure:

  • Their own Git repo and DVC remote storage
  • Credentials and secrets via standard secrets schema
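Before submitting a job, it can help to check that the settings and secrets are all present. The sketch below uses the key names from the Wkube.py example in this guide. Which keys the routine strictly requires is an assumption here, and `missing_settings` is a hypothetical helper rather than a platform function.

```python
# Key names taken from this guide's Wkube.py example; treating all of them
# as mandatory is an assumption of this sketch.
REQUIRED_CONF_KEYS = (
    "GIT_REPO_URL_HTTP",
    "BRANCH_NAME",
    "DVC_S3_ENDPOINT_URL",
    "DVC_S3_BUCKET",
)
REQUIRED_SECRET_KEYS = ("GIT_PAT", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_settings(conf: dict, job_secrets: dict) -> list:
    """List every required setting or secret that is absent or empty."""
    missing = [f"conf.{key}" for key in REQUIRED_CONF_KEYS if not conf.get(key)]
    missing += [
        f"job_secrets.{key}" for key in REQUIRED_SECRET_KEYS if not job_secrets.get(key)
    ]
    return missing


conf = {"GIT_REPO_URL_HTTP": "https://github.com/myorg/my-data-repo.git"}
print(missing_settings(conf, job_secrets={}))
```

Returning the full list of gaps, rather than failing on the first one, lets a user fix the configuration in a single pass.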

✅ In the future:

  • Accelerator will offer an optional managed DVC remote storage → no extra setup required

✅ Users can also implement their own versioning adapters if needed.


🚀 Execution Patterns

A. Standalone DVC Push Routine

  • Select the DVC Push routine on the Routine List Page and run it
  • Select files or folders manually → map them to /code/workdir/newfiles
  • Configure the Git + DVC settings via the form

B. DVC Push as Last Step in Jobflow

  • Add the DVC powered git push routine at the end of your jobflow
  • Map the output of the previous step (usually /mnt/pipe/ or an explicit output folder) to /code/workdir/newfiles
  • The final jobflow output is then versioned automatically
[Figure: Adding DVC Push routine at the end of a jobflow]
[Figure: Configuring DVC Push routine]

🧑‍💻 Wkube.py Example: Jobflow with DVC Push

```python
from accli import WKubeTask

# Core processing routine (example)
core_task = WKubeTask(
    name="FAO Downloader",
    repo_url="[email protected]:ACT4CAP27/faodata.git",
    repo_branch="master",
    base_stack="R4_4",
    command="Rscript main.R",
    required_cores=1,
    required_ram=1024 * 1024 * 1024,
    required_storage_local=1024 * 1024 * 1024,
    required_storage_workflow=1024,
    timeout=3600,
    conf={
        "input_mappings": "selected_files:/code/inputs/",
        # Expose the routine's outputs to the next step via the shared pipe
        "output_mappings": "/code/outputs/:/mnt/pipe"
    }
)

# DVC Push routine
dvc_push = WKubeTask(
    name="DVC powered git push",
    repo_url="https://github.com/iiasa/accelerator-common-routines.git",
    repo_branch="master",
    docker_filename="git_dvc_push/Dockerfile",
    command="python main.py",
    required_cores=1,
    required_ram=1024 * 1024 * 1024,
    required_storage_local=1024 * 1024 * 1024,
    required_storage_workflow=1024,
    timeout=3600,
    conf={
        # Map the previous step's output into the DVC Push input path
        "input_mappings": "/mnt/pipe/:/code/workdir/newfiles",
        # Git repository that receives the versioning metadata
        "GIT_REPO_URL_HTTP": "https://github.com/myorg/my-data-repo.git",
        "BRANCH_NAME": "main",
        # S3-compatible DVC remote that receives the data
        "DVC_S3_ENDPOINT_URL": "https://s3.example.com",
        "DVC_S3_BUCKET": "my-dvc-bucket",
        "DVC_S3_PREFIX": "model-outputs/exp123",
        "REPO_DATA_FOLDER": "data/model-outputs",
        "COMMIT_MESSAGE": "Versioned model output for exp123"
    },
    # Credentials are passed as job secrets, not as plain configuration
    job_secrets={
        "GIT_PAT": "<git personal access token>",
        "AWS_ACCESS_KEY_ID": "<aws access key>",
        "AWS_SECRET_ACCESS_KEY": "<aws secret key>"
    }
)

# Attach DVC Push to jobflow
core_task.add_child(dvc_push)
```
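For orientation, the sketch below lists the manual Git and DVC commands that a configuration like the one above would roughly correspond to. The routine's internals are not described in this guide, so the exact sequence, the remote name `storage`, and the way the bucket and prefix are combined into an `s3://` URL are all assumptions. The function `manual_equivalent` is hypothetical and only prints the commands; it does not run them.

```python
import shlex


def manual_equivalent(conf: dict, remote_name: str = "storage") -> list:
    """Build an assumed manual command sequence for a DVC Push configuration."""
    # Assumption: bucket and prefix are joined into a single S3 remote URL.
    remote_url = f"s3://{conf['DVC_S3_BUCKET']}/{conf['DVC_S3_PREFIX'].strip('/')}"
    return [
        # Register the S3-compatible remote and point it at the endpoint.
        ["dvc", "remote", "add", "-d", remote_name, remote_url],
        ["dvc", "remote", "modify", remote_name, "endpointurl", conf["DVC_S3_ENDPOINT_URL"]],
        # Track the data folder with DVC; Git only sees the pointer files.
        ["dvc", "add", conf["REPO_DATA_FOLDER"]],
        ["git", "add", "--all"],
        ["git", "commit", "-m", conf["COMMIT_MESSAGE"]],
        # Upload the data to the DVC remote, then the metadata to Git.
        ["dvc", "push"],
        ["git", "push", "origin", conf["BRANCH_NAME"]],
    ]


example_conf = {
    "BRANCH_NAME": "main",
    "DVC_S3_ENDPOINT_URL": "https://s3.example.com",
    "DVC_S3_BUCKET": "my-dvc-bucket",
    "DVC_S3_PREFIX": "model-outputs/exp123",
    "REPO_DATA_FOLDER": "data/model-outputs",
    "COMMIT_MESSAGE": "Versioned model output for exp123",
}

for command in manual_equivalent(example_conf):
    print(shlex.join(command))
```

Reading the configuration this way makes the division of labor visible: the Git settings decide where the history lives, and the S3 settings decide where the data lives.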

🚦 Current Capabilities

✅ You can use:

  • Your own Git repo (any Git provider)
  • Your own DVC remote storage (currently only S3-compatible storage is supported)

✅ You can implement your own versioning adapters if you need alternatives to DVC.


🚀 Roadmap

🚧 In the future, Accelerator will offer:

  • A platform-managed storage bucket → no need to set up external S3/DVC.


✅ Summary

This use case demonstrates how to:

  • Version outputs from jobs and jobflows
  • Push data to Git + DVC-based storage
  • Support reproducibility and transparency
  • Flexibly use either single-job or jobflow-based execution

By providing the DVC Push routine, Accelerator enables data versioning of intermediate and output data.