Here are some "best practices" that we use at MINDIMENSIONS:
Version Control (Git) for DAGs, Plugins, Scripts, Tests and Configuration
We store all DAGs, plugins, custom operators, tests and Airflow configurations in a Git repository following the structured layout below:
airflow-project/
├── dags/ # DAG files
├── plugins/ # Custom operators/hooks
├── scripts/ # Deployment/helper scripts
├── tests/ # Tests
├── requirements.txt # Python dependencies
├── Dockerfile # For containerized deployments
└── airflow.cfg # (Optional) Configuration overrides
DAG Deployment Isolation from Core Airflow
We decouple DAGs from the core Airflow infrastructure, either by mounting them as volumes in Kubernetes/Docker or by syncing them from cloud storage (S3/GCS) to the VMs running Airflow.
Environment Separation
We maintain separate environments:
- Development: local/dev Airflow instances
- Staging: mirrors production, for validation
- Production: stable and monitored
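As an illustration of how a DAG can adapt to the environment it runs in, here is a minimal sketch; the AIRFLOW_ENV variable is a hypothetical convention set by the deployment (it is not an Airflow built-in), the DAG itself is made up, and the schedule argument assumes Airflow 2.4+:

import os

import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Hypothetical convention: the deployment (Helm values, docker-compose, etc.)
# sets AIRFLOW_ENV to "dev", "staging" or "prod".
ENV = os.environ.get("AIRFLOW_ENV", "dev")

with DAG(
    dag_id="example_env_aware_dag",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    # Only run on a schedule in production; elsewhere the DAG is triggered manually.
    schedule="@daily" if ENV == "prod" else None,
    catchup=False,
    tags=[ENV],
) as dag:
    EmptyOperator(task_id="placeholder")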
Automated Linting, Formatting and Testing
- We use ruff, black, isort and flake8 to lint and format Python files
- We use pytest to run the unit tests, mocking Airflow dependencies (see the example after this list)
- We test DAG execution in a staging environment with sample data
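For the unit tests, a minimal sanity check is to load every DAG file the way the scheduler would and fail on import errors. A sketch, assuming the dags/ layout shown above (the tag rule is a hypothetical house convention):

# tests/test_dag_integrity.py
from airflow.models import DagBag


def test_dags_import_without_errors():
    # Parse everything under dags/ exactly as the scheduler would.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import failures: {dag_bag.import_errors}"


def test_dags_have_tags():
    # Hypothetical house rule: every DAG must carry at least one tag.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tags, f"{dag_id} has no tags"

Running pytest in CI (as in the pipeline shown further down) catches broken imports before a DAG ever reaches the scheduler.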
Dependency Management
As with any other Python project, we use requirements.txt or pyproject.toml to manage Python dependencies, and we pin versions to avoid conflicts; Airflow is very sensitive to its dependencies.
Secrets and Configuration Management
We know that you know that you should avoid hardcoding secrets in DAGs 😉
We use Airflow Connections, stored in the metadata DB or in an external secrets backend.
We manage environment variables via Kubernetes Secrets, AWS Secrets Manager, or any other provider.
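For illustration, credentials can be fetched inside a task through a Connection or a Variable instead of being hardcoded; the connection id and variable name below are hypothetical:

from airflow.hooks.base import BaseHook
from airflow.models import Variable


def load_data():
    # "warehouse_db" is a hypothetical connection id, defined in the Airflow UI,
    # via the AIRFLOW_CONN_WAREHOUSE_DB environment variable, or in a secrets backend.
    conn = BaseHook.get_connection("warehouse_db")
    uri = f"postgresql://{conn.login}:{conn.password}@{conn.host}:{conn.port}/{conn.schema}"

    # Variables resolve through the same secrets backend chain.
    api_key = Variable.get("partner_api_key")  # hypothetical variable name
    ...

Keeping these lookups inside the task callable, rather than at module level, avoids hitting the metadata DB or secrets backend every time the scheduler parses the DAG file.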
Deployment Strategies
The choice depends on each project. Here are some typical options:
- Git-Sync: when using Kubernetes, we pull DAGs directly from a Git repo. Here is an example:
gitSync:
  enabled: true
  repo: https://your-server/your-repo/airflow-dags.git
  branch: main
- With Cloud Storage, we push DAGs to S3/GCS via CI (e.g., GitHub Actions, GitLab CI), then the Airflow instances read the DAGs from the bucket:
aws s3 sync ./dags s3://airflow-dags-bucket/
A Complete Example of a CI/CD Pipeline using GitHub Actions
name: Airflow CI/CD

on: push

jobs:
  format:
    name: Code format & Lint
    runs-on: ubuntu-latest
    steps:
      - name: Git checkout
        uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install ruff
        uses: astral-sh/ruff-action@v3
        with:
          version: "0.11.13"
      - name: Check the code formatting
        run: |
          ruff check --output-format=github .
          ruff format --check .

  test:
    name: Test
    needs: format
    runs-on: ubuntu-latest
    steps:
      - name: Git checkout
        uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - uses: actions/cache@v4
        with:
          path: ~/.cache/pip
          key: pip
          restore-keys: |
            pip
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run unit tests
        run: pytest

  deploy:
    name: Deploy
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Sync DAGs to S3
        run: aws s3 sync dags/ s3://airflow-dags-bucket/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY }}
Our Favorite CI/CD Pipeline That We Use the Most at MINDIMENSIONS
Check out the upcoming Part II for a detailed description. Stay tuned 😉