How To Run Airflow on Windows (with Docker)

Josh Holbrook



#dataengineering #etl #airflow

A problem I've noticed a lot of aspiring data engineers running into recently is trying to run Airflow on Windows. This is harder than it sounds.

For many (most?) Python codebases, running on Windows is reasonable enough. For data work, Anaconda even makes it easy: create an environment, install your library and go. Unfortunately, Airbnb handed us a pathologically non-portable codebase. I was flabbergasted to find that casually trying to run Airflow on Windows ran into a bad shim script, a really chintzy pathing bug, a symlinking issue* and an attempt to use the Unix-only passwords database.


So running Airflow natively on Windows is dead in the water, unless you want to spend a few months rewriting a bunch of the logic and arguing with the maintainers**. Luckily, there are two fairly sensible alternative approaches that will let you run Airflow on a Windows machine: WSL and Docker.

WSL stands for the "Windows Subsystem for Linux", and it's actually really cool: you get a real Ubuntu environment running inside Windows. Basically, the setup looks something like this:

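(A minimal sketch, assuming a recent Windows 10/11 build where the one-shot wsl --install command exists; on older builds you enable the WSL feature and install Ubuntu from the Microsoft Store instead.)

# install WSL plus an Ubuntu distribution, then reboot when prompted
wsl --install -d Ubuntu

# afterwards, this drops you into the Ubuntu shell
wsl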

I have WSL 2 installed, which is faster and better in a number of ways besides, but which (until recently? it's unclear) required an Insider build of Windows.

Given that this is a fully operational Ubuntu environment, any Airflow tutorial written for Ubuntu should work here too.

The alternative, and the one I'm going to demo in this post, is to use Docker.

Docker is a tool for managing Linux containers, which are a little like virtual machines without the virtualization: they act like self-contained machines but are much more lightweight than a full VM. Surprisingly, it works on Windows - casually, even.
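If you want to sanity-check your Docker Desktop install before going any further, a couple of throwaway commands will do it (hello-world is Docker's standard smoke-test image):

# confirm the CLI can reach the Docker daemon
docker version

# pull and run the tiny smoke-test image, deleting the container afterwards
docker run --rm hello-world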

Brief sidebar: Docker isn't a silver bullet, and honestly it's kind of a pain in the butt. I personally find it tough to debug, and its aggressive caching makes both cache busting and resource clearing difficult. Even so, the alternatives - such as Vagrant - are generally worse. Docker is also a pseudo-standard, and Kubernetes - the heinously confusing thing your DevOps team makes you deploy to - works with Docker images, so it's overall a useful tool to reach for, especially for problems like this one.

Docker containers can be run in two ways: either in a bespoke capacity via the command line, or using a tool called Docker Compose, which takes a YAML file specifying which containers to run and how, and then does what's needed. For a single container the command line is often the thing you want - and we use it later on - but for a collection of services that need to talk to each other, Docker Compose is what we need.
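To make the distinction concrete, here's a quick sketch of both styles. The one-off docker run below just asks the postgres image for its psql version and throws the container away, while docker-compose up brings up everything described in docker-compose.yml:

# a bespoke, one-off container straight from the CLI
docker run --rm postgres psql --version

# the whole stack of services defined in docker-compose.yml, wired together
docker-compose up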

So to get started, create a directory somewhere - mine's in ~\software\jfhbrook\airflow-docker-windows but yours can be anywhere - and create a docker-compose.yml file that looks like this:

version: '3.8'

services:
  metadb:
    image: postgres
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    networks:
      - airflow
    restart: unless-stopped
    volumes:
      - ./data:/var/lib/postgresql/data
  scheduler:
    image: apache/airflow
    command: scheduler
    depends_on:
      - metadb
    networks:
      - airflow
    restart: unless-stopped
    volumes:
      - ./airflow:/opt/airflow
  webserver:
    image: apache/airflow
    command: webserver
    depends_on:
      - metadb
    networks:
      - airflow
    ports:
      - 8080:8080
    restart: unless-stopped
    volumes:
      - ./airflow:/opt/airflow

networks:
  airflow:

There's a lot going on here. I'll try to go over the highlights, but I recommend referring to the file format reference docs.

First of all, we create three services: a metadb, a scheduler and a webserver. Architecturally, Airflow stores its state in a database (the metadb), the scheduler process connects to that database to figure out what to run when, and the webserver process puts a web UI in front of the whole thing. Individual jobs can connect to other databases, such as Redshift, to do actual ETL.

Docker containers are created based on Docker images, which hold the starting state for a container. We use two images here: apache/airflow, the official Airflow image, and postgres, the official PostgreSQL image.
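If you like, you can pull both images up front and see what ends up cached locally; something like this works (in a longer-lived setup it's also worth pinning specific tags, so you know exactly which Airflow and PostgreSQL versions you're getting):

# download the images ahead of time so the first docker-compose up is faster
docker pull apache/airflow
docker pull postgres

# list the images now cached on your machine
docker images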

Airflow also reads configuration, DAG files and so on out of a directory specified by an environment variable called AIRFLOW_HOME. The default if installed on your MacBook is ~/airflow, but in the Docker image it's set to /opt/airflow.

We use Docker's volumes functionality to mount the directory ./airflow under /opt/airflow. We'll revisit the contents of this directory before trying to start the cluster.
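Later, once the containers are up, you can sanity-check that the mount worked; docker-compose exec runs a command inside an already-running service:

# list the contents of AIRFLOW_HOME inside the running webserver container;
# you should see airflow.cfg plus whatever Airflow has written alongside it
docker-compose exec webserver ls /opt/airflow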

The metadb implementation is pluggable and supports most SQL databases via SQLAlchemy. Airflow uses SQLite by default, but in practice most people use either MySQL or PostgreSQL. I'm partial to the latter, so I chose to set it up here.

On the PostgreSQL side: you need to configure it to have a user and database that Airflow can connect to. The Docker image supports this via environment variables. There are many variables that are supported, but the ones I used are POSTGRES_USER, POSTGRES_PASSWORD and POSTGRES_DB. By setting all of these to airflow, I ensured that there was a superuser named airflow, with a password of airflow and a default database of airflow.

Note that you'll definitely want to think about this harder before you go to production. Database security is out of scope of this post, but you'll probably want to create a regular user for Airflow, set up secrets management with your deploy system, and possibly change the authentication backend. Your DevOps team, if you have one, can probably help you here.

PostgreSQL stores all of its data in a volume as well. The location in the container is at /var/lib/postgresql/data, and I put it in ./data on my machine.

Docker has containers connect over virtual networks. Practically speaking, this means that you have to make sure that any containers that need to talk to each other are all connected to the same network (named "airflow" in this example), and that any containers that you need to talk to from outside have their ports explicitly exposed. You'll definitely want to expose port 8080 of the webserver to your host so that you can visit the UI in your browser. You may want to expose PostgreSQL as well, though I haven't done that here.

Finally, by default Docker Compose won't bother to restart a container if it crashes. This may be desired behavior, but in my case I wanted the containers to restart unless I explicitly stopped them, so I set restart to unless-stopped.

As mentioned, a number of directories need to exist and be populated in order for Airflow to do something useful.

First, let's create the data directory, so that PostgreSQL has somewhere to put its data:

mkdir ./data

Next, let's create the airflow directory, which will contain the files inside Airflow's AIRFLOW_HOME:

mkdir ./airflow

When Airflow starts it looks for a file called airflow.cfg inside of the AIRFLOW_HOME directory, which is ini-formatted and which is used to configure Airflow. This file supports a number of options, but the only one we need for now is core.sql_alchemy_conn. This field contains a SQLAlchemy connection string for connecting to PostgreSQL.

Crack open ./airflow/airflow.cfg in your favorite text editor and make it look like this:

[core]
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@metadb:5432/airflow

Some highlights:

  • The protocol is "postgresql+psycopg2", which tells SQLAlchemy to use the psycopg2 library when making the connection
  • The username is airflow, the password is airflow, the port is 5432 and the database is airflow.
  • The hostname is metadb. This is unintuitive and tripped me up - what's important here is that when Docker Compose sets up all of the networking stuff, it sets each container's hostname to the name of its service as typed into the docker-compose.yml file. This service was called "metadb", so the hostname is likewise "metadb". (There's a quick way to see this in action below.)
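If you want to watch that hostname resolution work once the stack is running (we start it in the next step), you can throw a disposable container onto the same network and ping the database with pg_isready, which ships in the postgres image. A hedged sketch, assuming your project folder is named airflow-docker-windows, since Compose prefixes the network name with the folder name:

# run pg_isready from a throwaway postgres container attached to the Compose network;
# "metadb" resolves because the container joins the same virtual network
docker run --rm --network airflow-docker-windows_airflow postgres pg_isready -h metadb -U airflow

If everything is wired up correctly, it reports something like "metadb:5432 - accepting connections".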

Once you have those pieces together, you can let 'er rip:

docker-compose up
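If you'd rather get your terminal back, the same stack can also run in the background, with logs available on demand:

# start the services detached
docker-compose up -d

# tail the logs for all services (or name one, e.g. docker-compose logs -f webserver)
docker-compose logs -f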

However, you'll notice that the Airflow services start crash-looping immediately, complaining that various tables don't exist. (If it complains that the db isn't up, shrug, ctrl-c and try again. Computers amirite?)

This is because we need to initialize the metadb to have all of the tables that Airflow expects. Airflow ships with a CLI command that will do this - unfortunately, our compose file doesn't handle it.

Keep the Airflow containers crash-looping in the background; we can use the Docker CLI to connect to the PostgreSQL instance running in our compose setup and ninja in a fix.

Create a file called ./Invoke-Airflow.ps1 with the following contents:

$Network = "{0}_airflow" -f @(Split-Path $PSScriptRoot -Leaf)
docker run --rm --network $Network --volume "${PSScriptRoot}\airflow:/opt/airflow" apache/airflow @Args

The --rm flag removes the container after it's done running so it doesn't clutter things up. The --network flag tells Docker to connect to the virtual network you created in your docker-compose.yml file, and the --volume flag tells Docker how to mount your AIRFLOW_HOME. Finally, @Args uses a feature of PowerShell called splatting to pass the arguments given to your script through to Airflow.

Once that's saved, we can run initdb against our Airflow install:

.\Invoke-Airflow.ps1 initdb
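One caveat: this post was written against the 1.10.x line of the apache/airflow image, where the command is initdb. If the tag you pulled is Airflow 2.x, the CLI was reorganized and the web UI expects a login user; a hedged sketch of the 2.x equivalents (the admin/admin values and names here are placeholders, not recommendations):

# Airflow 2.x replaced "initdb" with "db init"
.\Invoke-Airflow.ps1 db init

# the 2.x web UI requires an account to log in with
.\Invoke-Airflow.ps1 users create --username admin --password admin --firstname Ada --lastname Lovelace --role Admin --email admin@example.com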

You should notice that Airflow is suddenly a lot happier. You should also be able to connect to Airflow by visiting localhost:8080 in your browser:

[Screenshot: the Airflow web UI at localhost:8080]

For bonus points, we can use the postgres image to connect to the database with the psql CLI, using a very similar trick. Put this in Invoke-Psql.ps1:

$Network = "{0}_airflow" -f @(Split-Path $PSScriptRoot -Leaf)
docker run -it --rm --network $Network postgres psql -h metadb -U airflow --db airflow @Args

and then run .\Invoke-Psql.ps1 in the terminal.

Now you should be able to run \dt at the psql prompt and see all of the tables that airflow initdb created:

psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

airflow=# \dt
               List of relations
 Schema |             Name              | Type  |  Owner
--------+-------------------------------+-------+---------
 public | alembic_version               | table | airflow
 public | chart                         | table | airflow
 public | connection                    | table | airflow
 public | dag                           | table | airflow
 public | dag_code                      | table | airflow
 public | dag_pickle                    | table | airflow
 public | dag_run                       | table | airflow
 public | dag_tag                       | table | airflow
 public | import_error                  | table | airflow
 public | job                           | table | airflow
 public | known_event                   | table | airflow
 public | known_event_type              | table | airflow
 public | kube_resource_version         | table | airflow
 public | kube_worker_uuid              | table | airflow
 public | log                           | table | airflow
 public | rendered_task_instance_fields | table | airflow
 public | serialized_dag                | table | airflow
 public | sla_miss                      | table | airflow
 public | slot_pool                     | table | airflow
 public | task_fail                     | table | airflow
 public | task_instance                 | table | airflow
 public | task_reschedule               | table | airflow
 public | users                         | table | airflow
 public | variable                      | table | airflow
 public | xcom                          | table | airflow
(25 rows)

Now we have a working Airflow install that we can mess with. You'll notice that I didn't really go into how to write a DAG - there are other tutorials for that, and they should now be followable; whenever they say to run the airflow CLI tool, run .\Invoke-Airflow.ps1 instead.
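For example, to see which DAGs Airflow currently knows about (likely just the bundled example DAGs, until you add your own under ./airflow/dags), you can run the corresponding CLI command through the wrapper; list_dags is the 1.10.x spelling, and dags list is the 2.x one:

# list registered DAGs via the wrapper script (Airflow 1.10.x syntax)
.\Invoke-Airflow.ps1 list_dags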

Using Docker, Docker Compose and a couple of wrapper PowerShell scripts, we were able to get Airflow running on Windows, a platform that's otherwise unsupported. In addition, we were able to run multiple services, including a PostgreSQL database, in a nice, self-contained way. Finally, a little PowerShell made these tools easy to use.

Cheers!

* Symbolic links in Windows are a very long story. Windows traditionally has had no support for them at all - however, recent versions of NTFS technically allow symlinks but require Administrator privileges to create them, and none of the tooling works with them.
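For instance, creating a symlink from PowerShell typically needs an elevated ("Run as Administrator") shell unless Developer Mode is enabled on newer Windows 10 builds; a minimal sketch with a made-up link name:

# create a symbolic link to the airflow directory; fails with a permissions error in a regular, non-elevated shell
New-Item -ItemType SymbolicLink -Path .\airflow-link -Target .\airflow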

** I'm not saying that the Airflow maintainers would be hostile towards Windows support - I don't know them, for one, and I have to assume they'd actually be stoked. However, I also have to assume that they would have opinions. Big changes require a lot of discussion.

