Running Airflow in Docker — Airflow Documentation (2024)

This quick-start guide will allow you to quickly get Airflow up and running with CeleryExecutor in Docker.

Caution

This procedure can be useful for learning and exploration. However, adapting it for use in real-world situations can be complicated. Making changes to this procedure will require specialized expertise in Docker & Docker Compose, and the Airflow community may not be able to help you.

For that reason, we recommend using Kubernetes with the Official Airflow Community Helm Chart when you are ready to run Airflow in production.

Before you begin

This procedure assumes familiarity with Docker and Docker Compose. If you haven’t worked with these tools before, you should take a moment to run through the Docker Quick Start (especially the section on Docker Compose) so you are familiar with how they work.

Follow these steps to install the necessary tools, if you have not already done so.

  1. Install Docker Community Edition (CE) on your workstation. Depending on your OS, you may need to configure Docker to use at least 4.00 GB of memory for the Airflow containers to run properly. Please refer to the Resources section in the Docker for Windows or Docker for Mac documentation for more information.

  2. Install Docker Compose v1.29.1 or newer on your workstation.

Older versions of docker-compose do not support all the features required by the Airflow docker-compose.yaml file, so double check that your version meets the minimum version requirements.

Tip

The default amount of memory available for Docker on macOS is often not enough to get Airflow up and running. If enough memory is not allocated, it might lead to the webserver continuously restarting. You should allocate at least 4GB memory for the Docker Engine (ideally 8GB).

You can check if you have enough memory by running this command:

docker run --rm "debian:bullseye-slim" bash -c 'numfmt --to iec $(echo $(($(getconf _PHYS_PAGES) * $(getconf PAGE_SIZE))))'

Fetching docker-compose.yaml

To deploy Airflow on Docker Compose, you should fetch docker-compose.yaml.

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.5.1/docker-compose.yaml'

This file contains several service definitions:

  • airflow-scheduler - The scheduler monitors all tasks and DAGs, then triggers the task instances once their dependencies are complete.

  • airflow-webserver - The webserver is available at http://localhost:8080.

  • airflow-worker - The worker that executes the tasks given by the scheduler.

  • airflow-init - The initialization service.

  • postgres - The database.

  • redis - The broker that forwards messages from the scheduler to the worker.

Optionally, you can enable flower by adding the --profile flower option, e.g. docker compose --profile flower up, or by specifying it explicitly on the command line, e.g. docker compose up flower.

  • flower - The flower app for monitoring the environment. It is available at http://localhost:5555.

All these services allow you to run Airflow with CeleryExecutor. For more information, see Architecture Overview.

Some directories in the container are mounted, which means that their contents are synchronized between your computer and the container.

  • ./dags - you can put your DAG files here.

  • ./logs - contains logs from task execution and scheduler.

  • ./plugins - you can put your custom plugins here.

This file uses the latest Airflow image (apache/airflow). If you need to install a new Python library or system library, you can build your image.

Initializing Environment

Before starting Airflow for the first time, you need to prepare your environment, i.e. create the necessary files and directories and initialize the database.

Setting the right Airflow user

On Linux, the quick-start needs to know your host user id and needs to have the group id set to 0. Otherwise the files created in dags, logs and plugins will be created with root user ownership. You have to make sure to configure them for docker-compose:

mkdir -p ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" > .env

See the Environment variables supported by Docker Compose section below.

For other operating systems, you may get a warning that AIRFLOW_UID is not set, but you can safely ignore it. You can also manually create an .env file in the same folder as docker-compose.yaml with this content to get rid of the warning:

AIRFLOW_UID=50000

Initialize the database

On all operating systems, you need to run database migrations and create the first user account. To do this, run:

docker compose up airflow-init

After initialization is complete, you should see a message like this:

airflow-init_1       | Upgrades done
airflow-init_1       | Admin user airflow created
airflow-init_1       | 2.5.1
start_airflow-init_1 exited with code 0

The account created has the login airflow and the password airflow.

Cleaning-up the environment

The docker-compose environment we have prepared is a “quick-start” one. It was not designed to be used in production and it has a number of caveats - one of them being that the best way to recover from any problem is to clean it up and restart from scratch.

The best way to do this is to:

  • Run the docker compose down --volumes --remove-orphans command in the directory where you downloaded the docker-compose.yaml file

  • Remove the entire directory where you downloaded the docker-compose.yaml file: rm -rf '<DIRECTORY>'

  • Run through this guide from the very beginning, starting by re-downloading the docker-compose.yaml file
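
If it helps, the three steps above can be chained into one sequence. This is only a sketch; it assumes the files live in a directory named airflow-docker, which is a made-up name - use whatever directory you actually created.

cd airflow-docker                       # directory holding docker-compose.yaml (hypothetical name)
docker compose down --volumes --remove-orphans
cd .. && rm -rf airflow-docker          # remove the whole directory
mkdir airflow-docker && cd airflow-docker
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.5.1/docker-compose.yaml'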

Running Airflow

Now you can start all services:

docker compose up

Note

docker-compose (with a hyphen) is the legacy Compose v1 syntax; current Docker installations ship Compose v2, which is invoked as docker compose. The commands in this guide use the newer docker compose form.

In a second terminal you can check the condition of the containers and make sure that no containers are in an unhealthy condition:

$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS                    PORTS                    NAMES
247ebe6cf87a   apache/airflow:2.5.1   "/usr/bin/dumb-init …"   3 minutes ago    Up 3 minutes (healthy)    8080/tcp                 compose_airflow-worker_1
ed9b09fc84b1   apache/airflow:2.5.1   "/usr/bin/dumb-init …"   3 minutes ago    Up 3 minutes (healthy)    8080/tcp                 compose_airflow-scheduler_1
7cb1fb603a98   apache/airflow:2.5.1   "/usr/bin/dumb-init …"   3 minutes ago    Up 3 minutes (healthy)    0.0.0.0:8080->8080/tcp   compose_airflow-webserver_1
74f3bbe506eb   postgres:13            "docker-entrypoint.s…"   18 minutes ago   Up 17 minutes (healthy)   5432/tcp                 compose_postgres_1
0bd6576d23cb   redis:latest           "docker-entrypoint.s…"   10 hours ago     Up 17 minutes (healthy)   0.0.0.0:6379->6379/tcp   compose_redis_1

Accessing the environment

After starting Airflow, you can interact with it in 3 ways:

  • by running CLI commands.

  • via a browser using the web interface.

  • using the REST API.

Running the CLI commands

You can also run CLI commands, but you have to do it in one of the defined airflow-* services. For example, to run airflow info, run the following command:

docker compose run airflow-worker airflow info
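
Any other Airflow CLI command can be run the same way. For instance, two illustrative examples (assuming the stack from this guide is up):

docker compose run airflow-worker airflow dags list
docker compose run airflow-worker airflow version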

If you have Linux or Mac OS, you can make your work easier and download an optional wrapper script that will allow you to run commands with a simpler syntax.

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.5.1/airflow.sh'
chmod +x airflow.sh

Now you can run commands more easily:

./airflow.sh info

You can also use bash as a parameter to enter an interactive bash shell in the container, or python to enter a Python shell.

./airflow.sh bash
./airflow.sh python

Accessing the web interface

Once the cluster has started up, you can log in to the web interface and begin experimenting with DAGs.

The webserver is available at: http://localhost:8080. The default account has the login airflow and the password airflow.
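
If you would like something of your own to experiment with, you can drop a minimal DAG file into the mounted ./dags folder; after the scheduler re-parses the folder it will show up in the UI. The file name and DAG id below are made up for illustration - this is a sketch, not part of the official quick-start.

cat > ./dags/hello_quickstart.py <<'EOF'
# minimal example DAG (hypothetical): a single task that echoes a message
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_quickstart",
    start_date=datetime(2023, 1, 1),
    schedule=None,        # trigger manually from the UI or CLI
    catchup=False,
) as dag:
    BashOperator(task_id="hello", bash_command="echo Hello from the quick-start")
EOF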

Sending requests to the REST API

Basic username and password authentication is currently supported for the REST API, which means you can use common tools to send requests to the API.

The webserver is available at: http://localhost:8080. The default account has the login airflow and the password airflow.

Here is a sample curl command, which sends a request to retrieve a pool list:

ENDPOINT_URL="http://localhost:8080/"
curl -X GET \
    --user "airflow:airflow" \
    "${ENDPOINT_URL}/api/v1/pools"
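
The rest of the stable REST API works the same way. For example, a request listing the DAGs known to the environment (the /api/v1/dags endpoint, shown only as an additional illustration):

curl -X GET \
    --user "airflow:airflow" \
    "${ENDPOINT_URL}/api/v1/dags"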

Cleaning up

To stop and delete containers, delete volumes with database data, and remove downloaded images, run:

docker compose down --volumes --rmi all

Using custom images

When you want to run Airflow locally, you might want to use an extended image, containing some additional dependencies - for example you might add new Python packages, or upgrade Airflow providers to a later version. This can be done very easily by specifying build: . in your docker-compose.yaml and placing a custom Dockerfile alongside your docker-compose.yaml. Then you can use the docker compose build command to build your image (you only need to do it once). You can also add the --build flag to your docker compose commands to rebuild the images on the fly when you run other docker compose commands.

Examples of how you can extend the image with custom providers, Python packages, apt packages and more can be found in Building the image.
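
As a rough sketch (the pinned package below is just the example used elsewhere on this page, not a recommendation), the workflow could look like this:

# write a Dockerfile next to docker-compose.yaml that extends the official image
cat > Dockerfile <<'EOF'
FROM apache/airflow:2.5.1
RUN pip install --no-cache-dir lxml==4.6.3
EOF

# then add (or uncomment) "build: ." in docker-compose.yaml and build once
docker compose build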

Networking

In general, if you want to use Airflow locally, your DAGs may try to connect to servers which are running on the host. In order to achieve that, extra configuration must be added in docker-compose.yaml. For example, on Linux the configuration must be placed in the services: airflow-worker section, adding extra_hosts: - "host.docker.internal:host-gateway"; then use host.docker.internal instead of localhost. This configuration varies between platforms. Please check the Docker documentation for Windows and Mac for further information.
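
One way to add that configuration without editing the downloaded file is a Compose override file, which docker compose merges automatically. A sketch for the Linux case described above (the service name is taken from the fetched docker-compose.yaml):

cat > docker-compose.override.yml <<'EOF'
services:
  airflow-worker:
    extra_hosts:
      - "host.docker.internal:host-gateway"
EOF

docker compose up -d    # the override file is picked up automatically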

FAQ: Frequently asked questions

ModuleNotFoundError: No module named 'XYZ'

The Docker Compose file uses the latest Airflow image (apache/airflow). If you need to install a new Python library or system library, you can customize and extend it.
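
For a quick, non-production test you could also install the missing package through the _PIP_ADDITIONAL_REQUIREMENTS variable described below (the package name here is just a placeholder); for anything more permanent, extend the image instead.

echo "_PIP_ADDITIONAL_REQUIREMENTS=lxml==4.6.3" >> .env
docker compose up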

What’s Next?

From this point, you can head to the Tutorials section for further examples or the How-to Guides section if you’re ready to get your hands dirty.

Environment variables supported by Docker Compose

Do not confuse the variable names here with the build arguments set when the image is built. The AIRFLOW_UID build arg defaults to 50000 when the image is built, so it is “baked” into the image. On the other hand, the environment variables below can be set when the container is running, using - for example - the result of the id -u command, which allows using the dynamic host runtime user id which is unknown at the time of building the image.

  • AIRFLOW_IMAGE_NAME - Airflow image to use. Default: apache/airflow:2.5.1

  • AIRFLOW_UID - UID of the user to run Airflow containers as. Override if you want to use a non-default Airflow UID (for example when you map folders from the host, it should be set to the result of the id -u call). When it is changed, a user with that UID is created with a default name inside the container and the home of the user is set to /airflow/home/ in order to share Python libraries installed there. This is in order to achieve OpenShift compatibility. See more in Arbitrary Docker User. Default: 50000

Note

Before Airflow 2.2, the Docker Compose also had an AIRFLOW_GID parameter, but it did not provide any additional functionality - it only added confusion - so it has been removed.

Those additional variables are useful in case you are trying out or testing an Airflow installation via Docker Compose. They are not intended to be used in production, but they make the environment faster to bootstrap for first-time users with the most common customizations.

  • _AIRFLOW_WWW_USER_USERNAME - Username for the administrator UI account. If this value is specified, the admin UI user gets created automatically. This is only useful when you want to run Airflow for a test drive and want to start a container with an embedded development database. Default: airflow

  • _AIRFLOW_WWW_USER_PASSWORD - Password for the administrator UI account. Only used when _AIRFLOW_WWW_USER_USERNAME is set. Default: airflow

  • _PIP_ADDITIONAL_REQUIREMENTS - If not empty, Airflow containers will attempt to install the requirements specified in the variable, for example: lxml==4.6.3 charset-normalizer==1.4.1. Available in Airflow image 2.1.1 and above.
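
Putting the pieces together, a sample .env for a local test drive might look like this (the values shown are simply the documented defaults; only set what you actually want to override):

cat > .env <<'EOF'
AIRFLOW_UID=50000
AIRFLOW_IMAGE_NAME=apache/airflow:2.5.1
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow
_PIP_ADDITIONAL_REQUIREMENTS=
EOF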


FAQs

How to run Airflow command in Docker? ›

How to Run Airflow Locally With Docker
  1. Step 1: Fetch docker-compose.yaml. ...
  2. Step 2: Create directories. ...
  3. Step 3: Setting the Airflow user. ...
  4. Step 4: Initialise the Airflow Database. ...
  5. Step 5: Start Airflow services. ...
  6. Step 6: Access Airflow UI. ...
  7. Step 7: Enter the Airflow Worker container. ...
  8. Step 8: Cleaning up the mess.

Can I run Airflow in Docker? ›

Yes. Depending on your OS, you may need to configure Docker to use at least 4.00 GB of memory for the Airflow containers to run properly. Please refer to the Resources section in the Docker for Windows or Docker for Mac documentation for more information. You will also need Docker Compose v1.29.1 or newer on your workstation.

How many DAGs can Airflow run? ›

parallelism controls how many task instances can run across the whole Airflow installation at once; its default value is 32. max_active_tasks_per_dag (formerly dag_concurrency) is the maximum number of tasks that can be scheduled at once, per DAG. Use this setting to prevent any one DAG from taking up too many of the available slots from parallelism or your pools.

How to set up Airflow in Docker? ›

How to Install Apache Airflow with Docker
  1. Check the installation of Docker. ...
  2. Retrieve the Apache Airflow Docker Recipe. ...
  3. Create the Appropriate folder/file structure. ...
  4. Remove Sample DAGs. ...
  5. Install and Start the Docker Image. ...
  6. Access to the Web Application. ...
  7. Add and Execute a DAG. ...
  8. Access to the CLI Application.

Why use Docker to run Airflow? ›

Running Airflow in Docker is much easier than running it on Windows without Docker, because Docker saves the time needed to install the dependencies required for running data pipelines.

Can you run Airflow locally? ›

Yes. Running Airflow locally helps developers create workflows and schedule and maintain tasks, and it allows them to test and create scalable applications using Python scripts.

Should you run DB in Docker? ›

In Conclusion

Docker is great for running databases in a development environment! You can even use it for databases of small, non-critical projects which run on a single server. Just make sure to have regular backups (as you should in any case), and you'll be fine.

Is it OK to run Docker in a VM? ›

In general, Docker recommends running Docker Desktop natively on either Mac, Linux, or Windows. However, Docker Desktop for Windows can run inside a virtual desktop provided the virtual desktop is properly configured.

Can we use Airflow as ETL? ›

Apache Airflow ETL is an open-source platform that creates, schedules, and monitors data workflows. It allows you to take data from different sources, transform it into meaningful information, and load it to destinations like data lakes or data warehouses.

What is the maximum active DAG runs in Airflow? ›

max_active_runs_per_dag determines the maximum number of active DAG runs (per DAG) that the Airflow scheduler can create at any given time. By default it is set to 16, and it can be changed in airflow.cfg or via an environment variable.

What is the maximum number of DAGs? ›

There is no hard limit on the number of DAGs in Airflow; it is a function of the resources (nodes, CPU, memory) available. Assuming resources are available, the Airflow configuration options are just limit settings that can become a bottleneck and may have to be adjusted.

How many tasks can you execute in parallel in Airflow? ›

Apache Airflow's capability to run parallel tasks, ensured by using Kubernetes and CeleryExecutor, allows you to save a lot of time. You can use it to execute even 1000 parallel tasks in only 5 minutes.

How do you use Airflow to build the data pipeline? ›

Steps to Build Data Pipelines with Apache Airflow
  1. Step 1: Install the Docker Files and UI for Apache Airflow.
  2. Step 2: Create a DAG file.
  3. Step 3: Extract Lines Containing Exceptions.
  4. Step 4: Extract the Required Fields.
  5. Step 5: Query the Table to Generate Error Records.

How do you run an Airflow DAG? ›

In order to start a DAG run, first turn the workflow on, then click the Trigger DAG button, and finally click on the Graph View to see the progress of the run. You can reload the graph view until both tasks reach the status Success.
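
The same thing can be done from the CLI; a sketch using the quick-start services from this page (replace <dag_id> with the id of your DAG):

docker compose run airflow-worker airflow dags unpause <dag_id>
docker compose run airflow-worker airflow dags trigger <dag_id>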

How to install Airflow on EC2 using Docker? ›

Write your first DAG
  1. Docker configuration for Airflow.
  2. Docker configuration for Airflow's extended image.
  3. Docker configuration for AWS.
  4. Executing docker image to create container.
  5. DAG and Tasks creation in Airflow.
  6. Executing DAG from Airflow UI.
  7. Accessing S3 bucket / objects using AWS CLI.

Why is Docker more efficient than VM? ›

Docker containers are process-isolated and don't require a hardware hypervisor. This means Docker containers are much smaller and require far fewer resources than a VM. Docker is fast.

Why does Airflow use DAG? ›

The DAG will make sure that operators run in the correct order; other than those dependencies, operators generally run independently. In fact, they may run on two completely different machines. Airflow provides operators for many common tasks, including: BashOperator - executes a bash command.

Why Docker containers are better than VM? ›

Container Pros:

Containers are more lightweight than VMs, as their images are measured in megabytes rather than gigabytes. Containers require fewer IT resources to deploy, run, and manage, and they spin up in milliseconds, since their order of magnitude is smaller.

When should you not use Airflow? ›

Use cases for which Airflow is a bad option
  1. if you need to share data between tasks.
  2. if you need versioning of your data pipelines → Airflow doesn't support that.
  3. if you would like to parallelize your Python code with Dask — Prefect supports Dask Distributed out of the box.

Can Airflow replace Jenkins? ›

Airflow vs Jenkins: Production and Testing

Since Airflow is not a DevOps tool, it does not support non-production tasks. This means that any job you load on Airflow will be processed in real-time. However, Jenkins is more suitable for testing builds.

Does Airflow require a database? ›

Airflow requires a database. If you're just experimenting and learning Airflow, you can stick with the default SQLite option. If you don't want to use SQLite, then take a look at Set up a Database Backend to setup a different database.

Which database is best to run in Docker? ›

Comparing database image size, initial memory usage in Docker and initial CPU usage, PostgreSQL comes out as the most lightweight option and SQL Server as the heaviest.

Should I run Docker in a VM or bare metal? ›

Running containers directly on an operating system installed on bare metal will always provide the best performance.

Should I run Docker or Kubernetes? ›

If you have few workloads running, don't mind managing your own infrastructure, or don't need a specific feature Kubernetes offers, then Docker Swarm may be a great choice. Kubernetes is more complex to set up in the beginning but offers greater flexibility and features.

Is Docker slower than a VM? ›

Docker containers are generally faster and less resource-intensive than virtual machines, but full VMware virtualization still has its unique core benefits—namely, security and isolation.

When should we not use Docker and Docker? ›

Docker is very useful for web applications running on a server or console-based software. But if your product is a standard desktop application, especially with a rich GUI, Docker may not be the best choice.

Is it OK to run Docker from inside Docker? ›

Yes, we can run Docker in Docker. We'll need to attach the Unix socket /var/run/docker.sock, on which the Docker daemon listens by default, as a volume to the parent Docker using -v /var/run/docker.sock:/var/run/docker.sock.
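
A minimal sketch of that pattern, using the official docker image as the inner client (the inner docker ps lists the host's containers, because both clients talk to the same daemon):

docker run --rm -it \
  -v /var/run/docker.sock:/var/run/docker.sock \
  docker:latest docker ps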

Is Dagster better than Airflow? ›

Dagster is rigid and opinionated, while Airflow is flexible and accommodating. Dagster is built for the modern data stack with its dbt models and Airbyte connectors in mind, while Airflow is built to orchestrate tasks within every stack that ever was and that ever will.

Is Airflow like SSIS? ›

Besides those advantages, the most unique feature of Airflow compared with traditional ETL tools like SSIS, Talend, and Pentaho is that Airflow is purely Python code, meaning it is the most developer friendly. It is much easier to do code reviews, write unit tests, set up a CI/CD pipeline for jobs, etc.

Do data engineers use Airflow? ›

Apache Airflow is an open-source workflow authoring, scheduling, and monitoring application. It's one of the most reliable systems for orchestrating processes or pipelines that Data Engineers employ.

What is the difference between parallelism and concurrency in Airflow? ›

parallelism: This variable controls the number of task instances that the Airflow worker can run simultaneously. Users can increase the parallelism variable in airflow.cfg. concurrency: The Airflow scheduler will run no more than $concurrency task instances for your DAG at any given time.
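
You can inspect the values your own installation uses with the airflow config CLI; for example, from the quick-start environment on this page (a sketch, not required):

docker compose run airflow-worker airflow config get-value core parallelism
docker compose run airflow-worker airflow config get-value core max_active_tasks_per_dag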

How do you trigger Airflow DAG automatically? ›

  1. In the Airflow webserver column, follow the Airflow link for your environment.
  2. Log in with the Google account that has the appropriate permissions.
  3. In the Airflow web interface, on the DAGs page, in the Links column for your DAG, click the Trigger Dag button.
  4. (Optional) Specify the DAG run configuration.
  5. Click Trigger.

What is the difference between backfill and catchup in Airflow? ›

Backfilling is the concept of running a DAG for a specified historical period. Unlike catchup, which triggers missed DAG runs from the DAG's start_date through the current data interval, backfill periods can be specified explicitly and can include periods prior to the DAG's start_date .
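
A sketch of an explicit backfill for a one-week historical window, run through the quick-start worker (the dates and <dag_id> are placeholders):

docker compose run airflow-worker \
  airflow dags backfill --start-date 2023-01-01 --end-date 2023-01-07 <dag_id>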

Can we run multiple DAGs in Airflow? ›

Airflow will execute the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks.

Can a DAG have multiple sources? ›

Directed Acyclic Graphs (DAGs)

A directed acyclic graph (or DAG) is a digraph that has no cycles. Theorem: every finite DAG has at least one source and at least one sink. In fact, given any vertex v, there is a path from some source to v, and a path from v to some sink.

Can a DAG have a loop? ›

The graph of tasks itself must stay acyclic (Airflow rejects task dependencies that form a cycle), but since a DAG is defined by Python code, there is no need for it to be purely declarative; you are free to use loops, functions, and more when defining your DAG.

How do I run a command in Docker? ›

Running Commands in an Alternate Directory in a Docker Container. To run a command in a certain directory of your container, use the --workdir flag to specify the directory: docker exec --workdir /tmp container-name pwd.

How do I run a git command in a Docker container? ›

To create a Docker image with git follow the below steps:
  1. Step 1: Create the Dockerfile.
  2. Step 2: Building the Image.
  3. Step 3: Verify whether Image build.
  4. Step 4: Run a Container associated with the Image.
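
A minimal sketch of such a Dockerfile, reusing the debian:bullseye-slim base already used on this page (the image and tag names are arbitrary):

cat > Dockerfile <<'EOF'
FROM debian:bullseye-slim
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*
EOF

docker build -t debian-git .
docker run --rm debian-git git --version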

How do I run terminal in Docker? ›

SSH into a Container
  1. Use docker ps to get the name of the existing container.
  2. Use the command docker exec -it <container name> /bin/bash to get a bash shell in the container.
  3. Generically, use docker exec -it <container name> <command> to execute whatever command you specify in the container.

How do I run code in Docker? ›

It will be up to the Docker environment to contain Python in order to execute your code.
  1. Install Docker on your machine. For Ubuntu: ...
  2. Create your project. ...
  3. Edit the Python file. ...
  4. Edit the Docker file. ...
  5. Create the Docker image. ...
  6. Run the Docker image.

How do I practice docker commands? ›

The following are the most used docker basic commands for beginners and experienced docker professionals.
  1. docker --version. This command is used to get the currently installed version of Docker. ...
  2. docker pull. Pull an image or a repository from a registry. ...
  3. docker run. ...
  4. docker ps. ...
  5. docker exec. ...
  6. docker stop. ...
  7. docker restart. ...
  8. docker kill.

How to execute two commands in docker? ›

To execute multiple commands in the docker run command, we can use the && operator to chain the commands together. The && operator executes the first command, and if it's successful, it executes the second command.
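
For example (debian:bullseye-slim is just the image already used elsewhere on this page; the second echo runs only if the first succeeds):

docker run --rm debian:bullseye-slim bash -c "echo first step && echo second step"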

What is CMD command in docker? ›

The main purpose of the CMD command is to launch the software required in a container. For example, the user may need to run an executable .exe file or a Bash terminal as soon as the container starts - the CMD command can be used to handle such requests.

What is the difference between Git and docker? ›

Git is the leading version control system for software development. The Dockerfile, on the other hand, contains all the commands to automatically build an image of our application. These two products are the perfect combination for anyone seeking to adopt DevOps.

Where should I execute docker commands? ›

By default, the Docker command line stores its configuration files in a directory called .docker within your $HOME directory.

How to pull code from GitHub to docker? ›

Go to https://dso.docker.com/ and sign in using your Docker ID.
  1. Open the Repositories tab.
  2. Select Connect to GitHub and follow the authorization flow. ...
  3. Install the app. ...
  4. In the repository selection menu, select what repositories you want Atomist to start watching. ...
  5. Select Save selection.

How do I run a docker pipeline? ›

Running build steps inside containers
  1. Automatically grab an agent and a workspace (no extra node block is required).
  2. Pull the requested image to the Docker server (if not already cached).
  3. Start a container running that image.
  4. Mount the Jenkins workspace as a "volume" inside the container, using the same file path.

How to check active containers in docker? ›

To check the container status and run IBM Workload Automation commands, you need to access the containers as described below:
  1. Obtain the container ID by running the following command: docker ps. ...
  2. Access the Docker container by running the following command: docker exec -it <container_id> /bin/bash.

How to check processes in docker container? ›

As mentioned, if you are already inside a container, just use the ps -eaf command to see the running processes. By the way, it is recommended to have one user application / process per container.
