Using the Official Airflow Helm Chart for Production
At infarm, we use Apache Airflow heavily in our data engineering team for virtually all of our ETL tasks. For nearly two years we ran it on a single VM, but as the business grew, our Airflow deployment had to grow with it. Thankfully, the official Helm chart was released around the same time we began our migration to running Airflow on top of Kubernetes.
The first official release of the Apache Airflow Helm chart came out this past May. There are already resources to get you started locally on kind, and even a YouTube video by Marc Lamberti.
This post is meant to help you get from these basic tutorials to deploying Airflow on a production cluster.
It’s worth bookmarking the documentation page listing the Helm chart’s parameters, along with the actual values.yaml source code; these are the references you’ll reach for most while deploying Airflow with Helm.
Using KubernetesExecutor
If you plan on using the KubernetesExecutor1, you need to disable the components it doesn’t use (Redis and Flower only exist for the CeleryExecutor) in your values.yaml:
executor: KubernetesExecutor
redis:
  enabled: false
flower:
  enabled: false
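With the KubernetesExecutor, each task instance runs in its own pod, so it’s worth giving those pods explicit resource requests and limits early on. Here is a minimal sketch, assuming a chart version whose generated pod template picks up workers.resources; the numbers below are placeholders, not recommendations:

workers:
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi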
Using an External Database
Just like the official Airflow documentation recommends2, it’s a good idea to use an external database. You can probably get far with the included postgres subchart backed by a persistent volume, but eventually you’ll run into the following issue:
could not translate host name "airflow-postgresql.airflow" to address: Name or service not known
You can fix the issue above by first debugging DNS resolution in your cluster; most likely you’ll need to set up DNS Horizontal Autoscaling, as sketched below.
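For reference, DNS Horizontal Autoscaling boils down to running the cluster-proportional-autoscaler against kube-dns and tuning it through a ConfigMap. A rough sketch of what that ConfigMap can look like on a GKE-style cluster (the name and the linear parameters here are illustrative and depend on your cluster setup):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true}'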
We use a managed Postgres instance for our Airflow metadata database and put pgbouncer in front of it:
postgresql:
  enabled: false
pgbouncer:
  enabled: true
  maxClientConn: 48
  metadataPoolSize: 10
  resultBackendPoolSize: 5
Make sure to check the maximum number of connections allowed by your managed plan; otherwise you’ll find that Postgres itself blocks new connections in order to reserve the remaining connection slots for non-replication superuser connections. Roughly speaking, metadataPoolSize and resultBackendPoolSize bound the server-side connections each pgbouncer instance opens to Postgres (10 + 5 = 15 here), while maxClientConn caps the client connections pgbouncer itself will accept.
To point the chart at these database credentials, set the following in your values.yaml:
data:
  metadataSecretName: chart-secrets
chart-secrets is a Kubernetes Secret containing a connection key with the Postgres connection string as the value3. Since we use SealedSecrets, my secret.yaml looks something like this:
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: chart-secrets
  namespace: airflow
spec:
  encryptedData:
    connection: dGVzdGZvbwo=
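For context, the SealedSecret above is just the result of piping a plain Secret through kubeseal. The unsealed input looks roughly like the sketch below; the connection string is a placeholder, so substitute your own user, password, host, port and database name:

apiVersion: v1
kind: Secret
metadata:
  name: chart-secrets
  namespace: airflow
type: Opaque
stringData:
  connection: postgresql://airflow_user:airflow_password@my-db-host:5432/airflow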
Setting Up the Scheduler
The Airflow scheduler needs the appropriate Role and RoleBinding so that it can manage pods for tasks:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: airflow
  name: pod-manager
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs:
      - get
      - watch
      - list
      - create
      - update
      - patch
      - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-manager-binding
  namespace: airflow
subjects:
  - kind: ServiceAccount
    name: airflow-scheduler
    namespace: airflow
roleRef:
  kind: Role
  name: pod-manager
  apiGroup: rbac.authorization.k8s.io
LocalFilesystemBackend
As part of the migration, I exported our Airflow connections and variables into JSON files. This also made it much easier to have the connections and variables “stored” somewhere other than just in the UI. That somewhere, for us, was Kubernetes Secrets combined with the LocalFilesystemBackend. Essentially, the files are encrypted as SealedSecrets and mounted onto the filesystem as a volume.
scheduler:
  # ...
  extraVolumeMounts: &secretsMount
    - name: secrets
      mountPath: /opt/airflow/secrets
  extraVolumes: &secretsVolume
    - name: secrets
      secret:
        secretName: chart-secrets
config:
  secrets:
    backend: airflow.secrets.local_filesystem.LocalFilesystemBackend
    backend_kwargs: |
      {"variables_file_path": "/opt/airflow/secrets/variables.json", "connections_file_path": "/opt/airflow/secrets/connections.json"}
webserver: &workers
  # ...
  extraVolumeMounts: *secretsMount
  extraVolumes: *secretsVolume
workers: *workers
In my Kubernetes Secret manifest, I had variables.json and connections.json as keys under the same Secret name. If you exec into a pod, you can find the files in /opt/airflow/secrets.
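For illustration, the unsealed Secret behind those keys could look roughly like the sketch below (the entries are made-up examples, living alongside the connection key from earlier). The LocalFilesystemBackend accepts connections either as URI strings like this or as JSON objects of connection attributes, and variables as a flat key–value JSON object:

apiVersion: v1
kind: Secret
metadata:
  name: chart-secrets
  namespace: airflow
type: Opaque
stringData:
  variables.json: |
    {"environment": "production"}
  connections.json: |
    {"my_postgres": "postgres://user:password@db-host:5432/analytics"}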
Logging
At the time of this writing, there seems to be an issue with the logging parameters4. As a workaround to get it working with Google Stackdriver, I added the following configuration to my values.yaml:
env:
  - name: AIRFLOW__LOGGING__REMOTE_LOGGING
    value: 'True'
  - name: AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER
    value: 'stackdriver://airflow'
You can verify that logging is set up correctly by exec’ing into a pod and running airflow info. The task handler should be StackdriverTaskHandler (or the handler that corresponds to your remote logging solution).