Using the Official Airflow Helm Chart for Production
At infarm, we use Apache Airflow heavily in our data engineering team for virtually all of our ETL tasks. For nearly two years we ran it on a single VM, but as the business grew, our Airflow deployment had to grow with it. Thankfully, the official Helm chart was released around the same time we began our migration to running Airflow on top of Kubernetes.
The first official release of the Apache Airflow Helm chart came out this past May. There are already resources to get you started locally on kind, and even a YouTube video by Marc Lamberti.
This post is meant to help you get from these basic tutorials to deploying Airflow on a production cluster.
It’s worth bookmarking the documentation page listing the Helm chart’s parameters, along with the actual values.yaml source code; these are the references you’ll reach for most while deploying Airflow with Helm.
Using KubernetesExecutor
If you plan on using the KubernetesExecutor1, you need to disable the components it doesn’t use (Redis and Flower only exist for the CeleryExecutor) in your values.yaml:
executor: KubernetesExecutor
redis:
  enabled: false
flower:
  enabled: false
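With the KubernetesExecutor, each task instance runs in its own pod, so it’s worth giving those pods explicit resource requests and limits early on. Here is a minimal sketch, assuming a chart version whose generated pod template picks up workers.resources; the numbers below are placeholders, not recommendations:

workers:
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi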
Using an External Database
Just like the official Airflow documentation recommends2, it’s a good idea to use an external database. You can probably get far with the included postgres subchart backed by a persistent volume, but eventually you’ll run into the following issue:
could not translate host name "airflow-postgresql.airflow" to address: Name or service not known
You can fix the issue above by first debugging DNS resolution in your cluster; most likely you’ll need to set up DNS Horizontal Autoscaling, as sketched below.
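For reference, DNS Horizontal Autoscaling boils down to running the cluster-proportional-autoscaler against kube-dns and tuning it through a ConfigMap. A rough sketch of what that ConfigMap can look like on a GKE-style cluster (the name and the linear parameters here are illustrative and depend on your cluster setup):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true}'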
We use a managed Postgres instance for our Airflow metadata database and put pgbouncer in front of it:
postgresql:
  enabled: false
pgbouncer:
  enabled: true
  maxClientConn: 48
  metadataPoolSize: 10
  resultBackendPoolSize: 5
Make sure to check the maximum number of connections allowed by your managed plan; otherwise you’ll find that Postgres itself blocks new connections in order to reserve the remaining connection slots for non-replication superuser connections. Roughly speaking, metadataPoolSize and resultBackendPoolSize bound the server-side connections each pgbouncer instance opens to Postgres (10 + 5 = 15 here), while maxClientConn caps the client connections pgbouncer itself will accept.
To point the chart at these database credentials, set the following in your values.yaml:
data:
  metadataSecretName: chart-secrets
chart-secrets is a Kubernetes Secret containing a connection key with the Postgres connection string as the value3. Since we use SealedSecrets, my secret.yaml looks something like this:
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: chart-secrets
  namespace: airflow
spec:
  encryptedData:
    connection: dGVzdGZvbwo=
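For context, the SealedSecret above is just the result of piping a plain Secret through kubeseal. The unsealed input looks roughly like the sketch below; the connection string is a placeholder, so substitute your own user, password, host, port and database name:

apiVersion: v1
kind: Secret
metadata:
  name: chart-secrets
  namespace: airflow
type: Opaque
stringData:
  connection: postgresql://airflow_user:airflow_password@my-db-host:5432/airflow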
Setting Up the Scheduler
The Airflow scheduler needs the appropriate Role and RoleBinding so that it can manage pods for tasks:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: airflow
  name: pod-manager
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs:
      - get
      - watch
      - list
      - create
      - update
      - patch
      - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-manager-binding
  namespace: airflow
subjects:
  - kind: ServiceAccount
    name: airflow-scheduler
    namespace: airflow
roleRef:
  kind: Role
  name: pod-manager
  apiGroup: rbac.authorization.k8s.io
LocalFilesystemBackend
As part of the migration, I exported our Airflow connections and variables into JSON files. This also made it much easier to have the connections and variables “stored” somewhere other than just in the UI. That somewhere, for us, was Kubernetes Secrets combined with the LocalFilesystemBackend. Essentially, the files are encrypted as SealedSecrets and mounted onto the filesystem as a volume.
scheduler:
  # ...
  extraVolumeMounts: &secretsMount
    - name: secrets
      mountPath: /opt/airflow/secrets
  extraVolumes: &secretsVolume
    - name: secrets
      secret:
        secretName: chart-secrets
config:
  secrets:
    backend: airflow.secrets.local_filesystem.LocalFilesystemBackend
    backend_kwargs: |
      {"variables_file_path": "/opt/airflow/secrets/variables.json", "connections_file_path": "/opt/airflow/secrets/connections.json"}
webserver: &workers
  # ...
  extraVolumeMounts: *secretsMount
  extraVolumes: *secretsVolume
workers: *workers
In my Kubernetes Secret manifest, I had variables.json and connections.json as keys under the same Secret name. If you exec into a pod, you can find the files in /opt/airflow/secrets.
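For illustration, the unsealed Secret behind those keys could look roughly like the sketch below (the entries are made-up examples, living alongside the connection key from earlier). The LocalFilesystemBackend accepts connections either as URI strings like this or as JSON objects of connection attributes, and variables as a flat key–value JSON object:

apiVersion: v1
kind: Secret
metadata:
  name: chart-secrets
  namespace: airflow
type: Opaque
stringData:
  variables.json: |
    {"environment": "production"}
  connections.json: |
    {"my_postgres": "postgres://user:password@db-host:5432/analytics"}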
Logging
At the time of this writing, there seems to be an issue with the logging parameters4. As a workaround to get it working with Google Stackdriver, I added the following configuration to my values.yaml:
env:
  - name: AIRFLOW__LOGGING__REMOTE_LOGGING
    value: 'True'
  - name: AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER
    value: 'stackdriver://airflow'
You can verify that logging is set up correctly by exec’ing into a pod and running airflow info. The task handler should be StackdriverTaskHandler (or the handler that corresponds to your remote logging solution).