Maniphest T300870
Assigned To
Antoine_Quhen
Authored By
Antoine_Quhen
2022-02-03 14:23:46 (UTC+0)
Tags
- Data-Engineering-Kanban (Done)
- Data-Engineering (Transform)
- Data Pipelines (Done)
Subscribers
- Aklapper
- Antoine_Quhen
- JAllemandou
- mforns
- mpopov
- Ottomata
@JAllemandou, @mforns, and I are proposing this set of rules as a starting point for Airflow.
At environment level (airflow.cfg):
parallelism: This is the maximum number of tasks that can run concurrently within a single Airflow environment. For example, if this setting is set to 32 then no more than 32 tasks can run at once across all DAGs. Think of this as "maximum active tasks anywhere." If you notice that tasks are stuck queued for extended periods of time, this is a value you may want to increase. By default, this is set to 32.
We may set it explicitly to 32.
max_active_runs_per_dag: This determines the maximum number of active DAG Runs (per DAG) that the Airflow Scheduler can create at any given time. In Airflow, a DAG Run represents an instantiation of a DAG in time, much like a task instance represents an instantiation of a task. This parameter is most relevant if Airflow has to catch up from missed DAG runs, also known as backfilling. Consider how you want to handle these scenarios when setting this parameter. By default, it's set to 16.
We should set it to 2, so that a single DAG cannot take all the resources.
max_active_tasks_per_dag: (formerly dag_concurrency) This determines the maximum number of tasks that can be scheduled at once, per DAG. Use this setting to prevent any one DAG from taking up too many of the available slots from parallelism or your pools, which helps DAGs be good neighbors to one another. By default, this is set to 16.
We should set it to 2, to make sure a single DAG can't take all the resources.
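As a rough illustration, the three environment-level values proposed above could be rendered as a `[core]` section of airflow.cfg. This is only a sketch built with Python's standard `configparser`; the helper name `render_core_limits` is hypothetical, and how the file is actually deployed is not shown here.

```python
import configparser
import io

# Values proposed in this task. All three keys belong to the [core]
# section of airflow.cfg.
PROPOSED_CORE_LIMITS = {
    "parallelism": 32,
    "max_active_runs_per_dag": 2,
    "max_active_tasks_per_dag": 2,
}


def render_core_limits(limits: dict) -> str:
    """Render the given limits as an airflow.cfg-style [core] fragment."""
    parser = configparser.ConfigParser()
    parser["core"] = {key: str(value) for key, value in limits.items()}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_core_limits(PROPOSED_CORE_LIMITS))
```

With these values, a single DAG could occupy at most 2 of the 32 environment-wide task slots at any time.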
At DAG level (in dag definition file):
max_active_runs_per_dag and max_active_tasks_per_dag can be overridden by:
- max_active_runs: This is the maximum number of active DAG Runs allowed for the DAG in question. Once this limit is hit, the Scheduler will not create new active DAG Runs. If this setting is not defined, the value of the environment-level setting max_active_runs_per_dag is assumed.
Could be set in each DAG definition.
- concurrency: This is the maximum number of task instances allowed to run concurrently across all active DAG runs for a given DAG. This allows you to set 1 DAG to be able to run 32 tasks at once, while another DAG might only be able to run 16 tasks at once. If this setting is not defined, the value of the environment-level setting max_active_tasks_per_dag is assumed.
Could be set in each DAG definition.
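The fallback described in the two items above (use the DAG-level value when it is defined, otherwise the environment-level one) can be modeled in a few lines. This is a plain-Python sketch of the rule, not Airflow's own code; `effective_limit` and the example values for the backfill DAG are hypothetical.

```python
from typing import Optional

# Environment-level defaults proposed in this task (airflow.cfg).
ENV_MAX_ACTIVE_RUNS_PER_DAG = 2
ENV_MAX_ACTIVE_TASKS_PER_DAG = 2


def effective_limit(dag_level: Optional[int], env_level: int) -> int:
    """Return the DAG-level value when defined, else the environment default."""
    return env_level if dag_level is None else dag_level


# A DAG that defines neither setting inherits both environment-level limits.
default_dag = (
    effective_limit(None, ENV_MAX_ACTIVE_RUNS_PER_DAG),
    effective_limit(None, ENV_MAX_ACTIVE_TASKS_PER_DAG),
)

# A hypothetical backfill DAG raises its own run limit only.
backfill_dag = (
    effective_limit(8, ENV_MAX_ACTIVE_RUNS_PER_DAG),
    effective_limit(None, ENV_MAX_ACTIVE_TASKS_PER_DAG),
)

if __name__ == "__main__":
    print("default DAG (runs, tasks):", default_dag)
    print("backfill DAG (runs, tasks):", backfill_dag)
```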
A clear explanation of these settings is here: https://www.astronomer.io/guides/airflow-scaling-workers
Subject | Repo | Branch | Lines +/-
---|---|---|---
airflow: change max_active_runs_per_dag back to 1 | operations/puppet | production | +1 -1
Set default Airflow concurrency limits for an- airflow instances | operations/puppet | production | +43 -0
Set default Airflow concurrency limits | operations/puppet | production | +36 -6
Mentions
Mentioned In
- T351388: Add a spark global config for better file commit strategy
Mentioned Here
- T347076: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist.
Event Timeline
Antoine_Quhen created this task. 2022-02-03 14:23:46 (UTC+0)
Restricted Application added a subscriber: Aklapper. 2022-02-03 14:23:47 (UTC+0)
mforns added a comment. 2022-02-03 14:49:38 (UTC+0)
This is great @Antoine_Quhen!
I agree with your default value choices.
I think in Airflow2 the concurrency parameter is deprecated in favor of max_active_tasks, see: https://airflow.apache.org/docs/apache-airflow/stable/faq.html#how-to-improve-dag-performance
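To make the rename mentioned in this comment concrete, here is a small sketch that rewrites a dictionary of DAG keyword arguments from the deprecated name to the new one. The helper `modernize_dag_kwargs` is hypothetical and is not part of Airflow; it only covers the `concurrency` to `max_active_tasks` rename discussed here.

```python
# Deprecated DAG argument name -> replacement, as mentioned in this comment.
RENAMED_DAG_ARGS = {"concurrency": "max_active_tasks"}


def modernize_dag_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with deprecated DAG argument names replaced."""
    updated = {}
    for name, value in kwargs.items():
        new_name = RENAMED_DAG_ARGS.get(name, name)
        if new_name in updated:
            # Both the old and the new name were supplied; refuse to guess.
            raise ValueError(f"{new_name!r} given twice (old and new name)")
        updated[new_name] = value
    return updated


if __name__ == "__main__":
    print(modernize_dag_kwargs({"concurrency": 16, "max_active_runs": 2}))
```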
Ottomata subscribed. 2022-02-03 14:57:42 (UTC+0)
I wonder if we should set the global max_active_runs_per_dag higher than 2. I could see cases where we explicitly want to do a big backfill in parallel. Since the work isn't actually on the airflow nodes, the resources taken up there will mostly be waiting for dag runs to finish. We should set the default dag level max_active_runs to 2 in our base default_args, to allow folks to override this if they need. Or, wait, are you saying that max_active_runs_per_dag is already overridable to a larger value by max_active_runs? If so, then I guess we can just do as you say! :)
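A back-of-the-envelope way to see the backfill concern: if at most `max_active_runs` DAG runs can be active at once, a backfill proceeds in sequential batches. The sketch below assumes, as a simplification, that all runs take about the same time; the function name and the 30-run example are hypothetical.

```python
import math


def backfill_batches(pending_runs: int, max_active_runs: int) -> int:
    """Number of sequential batches needed to work through pending DAG runs."""
    if max_active_runs < 1:
        raise ValueError("max_active_runs must be at least 1")
    return math.ceil(pending_runs / max_active_runs)


if __name__ == "__main__":
    # Backfilling 30 daily runs with different per-DAG run limits.
    for limit in (1, 2, 10):
        print(f"max_active_runs={limit}: {backfill_batches(30, limit)} batches")
```

Under this assumption, a limit of 2 needs 15 batches for 30 runs, while a DAG-level override of 10 needs 3.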
JAllemandou added a comment. 2022-02-03 14:58:37 (UTC+0)
Thanks a lot for the great summary @Antoine_Quhen! I assume that we wish to set the default values you suggested and, as much as possible, not use the available per-DAG config overrides.
My only question about the default values is whether the global parallelism could be made bigger, if we assume all tasks are low-resource-consumption for the Airflow machine itself (even more so if we use Skein). Ping @Ottomata on this :)
Ottomata added a comment. 2022-02-03 15:09:42 (UTC+0)
Ya I'd think we could!
odimitrijevic moved this task from Incoming (new tickets) to Transform on the Data-Engineering board. 2022-02-06 23:21:50 (UTC+0)
EChetty moved this task from Backlog to Discussed (Radar) on the Data Pipelines board. 2022-02-07 16:48:44 (UTC+0)
Antoine_Quhen reassigned this task from Antoine_Quhen to Ottomata. 2022-02-07 17:31:35 (UTC+0)
Ottomata added a comment. 2022-02-07 17:35:27 (UTC+0)
Tell me precisely what to change and I will change it!
Antoine_Quhen added a comment. 2022-02-07 17:35:43 (UTC+0)
+ Let's set retries to 3 by default. We may remove this line:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/24931081b7133e62849a9f54bad4e0ff555690e9/wmf_airflow_common/default_args.py#L7
And set it globally instead, with default_task_retries.
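For reference, `default_task_retries` is a `[core]` option in airflow.cfg, and Airflow also reads configuration options from environment variables named `AIRFLOW__{SECTION}__{KEY}`. The sketch below only derives those variable names; the helper `airflow_env_var` is hypothetical, and which mechanism is used on these instances is not stated in this task.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name Airflow reads for a config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Global settings discussed in this task, with their proposed values.
PROPOSED_SETTINGS = {
    "parallelism": 32,
    "max_active_runs_per_dag": 2,
    "max_active_tasks_per_dag": 2,
    "default_task_retries": 3,
}

if __name__ == "__main__":
    for key, value in PROPOSED_SETTINGS.items():
        print(f"{airflow_env_var('core', key)}={value}")
```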
Antoine_Quhen claimed this task. 2022-02-28 16:55:10 (UTC+0)
Antoine_Quhen moved this task from Discussed (Radar) to Estimated on the Data Pipelines board.
gerritbot added a comment. 2022-03-01 17:06:17 (UTC+0)
Change 767220 had a related patch set uploaded (by Aqu; author: Aqu): [operations/puppet@production] Set default Airflow concurrency limits https://gerrit.wikimedia.org/r/767220
gerritbot added a project: Patch-For-Review. 2022-03-01 17:06:18 (UTC+0)
Antoine_Quhen moved this task from Next Up to In Progress on the Data-Engineering-Kanban board. 2022-03-02 14:22:35 (UTC+0)
Antoine_Quhen moved this task from In Progress to In Code Review on the Data-Engineering-Kanban board.
Antoine_Quhen moved this task from Estimated to In Review on the Data Pipelines board.
Antoine_Quhen added a comment. 2022-03-02 14:24:50 (UTC+0)
Related to: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/29
gerritbot added a comment. 2022-03-02 14:57:42 (UTC+0)
Change 767220 merged by Ottomata: [operations/puppet@production] Set default Airflow concurrency limits https://gerrit.wikimedia.org/r/767220
gerritbot added a comment. 2022-03-02 15:06:59 (UTC+0)
Change 767527 had a related patch set uploaded (by Ottomata; author: Ottomata): [operations/puppet@production] Set default Airflow concurrency limits for an- airflow instances https://gerrit.wikimedia.org/r/767527
gerritbot added a comment. 2022-03-02 15:13:26 (UTC+0)
Change 767527 merged by Ottomata: [operations/puppet@production] Set default Airflow concurrency limits for an- airflow instances https://gerrit.wikimedia.org/r/767527
Maintenance_bot removed a project: Patch-For-Review. 2022-03-02 16:10:48 (UTC+0)
EChetty moved this task from In Review to Done on the Data Pipelines board. 2022-03-07 16:34:19 (UTC+0)
Antoine_Quhen moved this task from In Code Review to Done on the Data-Engineering-Kanban board. 2022-03-08 16:20:48 (UTC+0)
JArguello-WMF closed this task as Resolved. 2022-05-31 15:31:34 (UTC+0)
mpopov subscribed. 2023-11-13 20:40:42 (UTC+0)
There's a good chance this is responsible for T347076: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist.
mpopov mentioned this in T351388: Add a spark global config for better file commit strategy. 2023-11-17 12:53:22 (UTC+0)
gerritbot added a comment. 2023-11-22 12:58:03 (UTC+0)
Change 976700 had a related patch set uploaded (by Btullis; author: Btullis): [operations/puppet@production] airflow: change max_active_runs_per_dag back to 1 https://gerrit.wikimedia.org/r/976700
gerritbot added a project: Patch-For-Review. 2023-11-22 12:58:04 (UTC+0)
gerritbot added a comment. 2023-11-22 15:37:24 (UTC+0)
Change 976700 merged by Btullis: [operations/puppet@production] airflow: change max_active_runs_per_dag back to 1 https://gerrit.wikimedia.org/r/976700
Maintenance_bot removed a project: Patch-For-Review. 2023-11-22 16:10:39 (UTC+0)
Content licensed under Creative Commons Attribution-ShareAlike (CC BY-SA) 4.0 unless otherwise noted; code licensed under GNU General Public License (GPL) 2.0 or later and other open source licenses.