-
Apache Airflow version

2.10.4

If "Other Airflow 2 version" selected, which one?

Observed also on 2.10.2.

What happened?

One of the tests in our CI is creating a DagBag. After adding a DAG that's scheduled on a DatasetAlias, the bag cannot be created. Confusingly, this issue does not occur in other settings/environments (e.g. dag.test, local pytest, non-CI Airflow run in Docker).

Relevant stack trace
What you think should happen instead?

DagBag creation should not fail on a one-shot parse pass of the dags folder, regardless of order.

How to reproduce
```python
from pendulum import datetime

from airflow import DAG
from airflow.datasets import DatasetAlias

with DAG(
    dag_id="foo",
    start_date=datetime(2000, 1, 1),
    schedule=[
        DatasetAlias("bar"),
    ],
    catchup=False,
):
    pass
```

```python
from airflow.models import DagBag

DagBag(include_examples=False)
```

Operating System

Rocky Linux 9.3

Versions of Apache Airflow Providers

(Irrelevant)

Deployment

Other Docker-based deployment

Deployment details

Dockerfile-ci:
ci script:

```yaml
script:
  - git config --global --add safe.directory $PWD
  - pip install uv pre-commit-uv --upgrade
  - uv pip install -e .[dev] --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.10.2/constraints-3.9.txt"
  - uv tool install pre-commit --force --with pre-commit-uv --force-reinstall
  - export PIP_USER=false && pre-commit install --install-hooks
  - pre-commit run --all-files --show-diff-on-failure
```

Anything else?

Presumably, this issue isn't present in a "running" Airflow instance where there is at least one DAG that outputs a DatasetAlias, which causes the necessary tables to be created, and the second parse of the alias-scheduled DAG then succeeds. I believe this happens because the docker image doesn't come with an initialized sqlite database, an issue that surfaces when the DagBag is built. As a workaround, I tried adding an
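A minimal sketch of one possible workaround along these lines, assuming the CI environment is allowed to initialize its own metadata database before parsing (this is my own sketch, not the truncated workaround from the report):

```python
# Sketch: initialize the Airflow metadata DB so the dataset/dataset_alias
# tables exist before the DagBag parses alias-scheduled DAGs.
from airflow.utils import db

db.initdb()  # same effect as "airflow db init"; creates the tables in SQLite too

from airflow.models import DagBag

bag = DagBag(include_examples=False)
print(bag.import_errors)  # should be empty once the schema exists
```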
-
Looks like this was fixed in 2.10.3.
-
Reopening because this issue is still present on 2.10.4. The root cause seems to be the fact that the sqlite db that comes with the docker image is uninitialized and can't handle a DatasetAlias schedule in this state.
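To illustrate what an uninitialized SQLite database does here, a standalone sketch (my own illustration, not code from Airflow):

```python
# Querying a table that was never created raises OperationalError at the
# driver level, which is what SQLAlchemy surfaces during DagBag parsing.
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for an uninitialized airflow.db
try:
    conn.execute("SELECT name FROM dataset_alias")
except sqlite3.OperationalError as exc:
    print(exc)  # -> no such table: dataset_alias
```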
-
Possible solution: modify `expand_alias_to_datasets`:

```diff
@@ 26 @@
-from sqlalchemy import select
+from sqlalchemy import exc, select
@@ 142 @@
 @internal_api_call
 @provide_session
 def expand_alias_to_datasets(
     alias: str | DatasetAlias, *, session: Session = NEW_SESSION
 ) -> list[BaseDataset]:
     """Expand dataset alias to resolved datasets."""
     from airflow.models.dataset import DatasetAliasModel

     alias_name = alias.name if isinstance(alias, DatasetAlias) else alias
-    dataset_alias_obj = session.scalar(
-        select(DatasetAliasModel).where(DatasetAliasModel.name == alias_name).limit(1)
-    )
+    try:
+        dataset_alias_obj = session.scalar(
+            select(DatasetAliasModel).where(DatasetAliasModel.name == alias_name).limit(1)
+        )
+    except exc.OperationalError:
+        return []
     if dataset_alias_obj:
         return [Dataset(uri=dataset.uri, extra=dataset.extra) for dataset in dataset_alias_obj.datasets]
     return []
```
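If catching OperationalError per query turns out to be too blunt, an alternative sketch (my assumption, not part of the patch above) would be to detect a missing schema explicitly via SQLAlchemy's inspector:

```python
# Alternative sketch: check once whether the table exists instead of
# swallowing OperationalError on every lookup.
from sqlalchemy import inspect
from sqlalchemy.orm import Session

from airflow.models.dataset import DatasetAliasModel


def dataset_alias_table_exists(session: Session) -> bool:
    """Return True if the dataset_alias table exists in the bound database."""
    return inspect(session.get_bind()).has_table(DatasetAliasModel.__tablename__)
```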
-
Interesting 🤔 I'll try to repro and see whether I can do it on my end.
-
@Lee-W Thank you for taking a look! Even if I'm wrong about the root cause, this issue needs a better error message. When it throws an SQLAlchemy error about missing "variable" or "dataset-alias" tables when that isn't the actual problem, it doesn't really help with debugging. Btw, I didn't mention it, but on CI we're running unit tests using a pytest pre-commit hook. If you need any more details, I'm also available on Slack.
-
What is the exact error you are getting? Catching an OperationalError at runtime is generally the wrong thing to do. Even in a blank database, the database structure should still be there and not throw an OperationalError. If the official Docker image does, it may mean we are not creating the image correctly, and that should be fixed instead.
-
@uranusjr I included the stack trace in the issue - please see the collapsed section under "What happened?".
-
I just had another thought - maybe when running airflow and AIRFLOW_HOME cannot be determined, the newly created db is broken. So when running pytest, it somehow (perhaps by running from the wrong folder or missing some environment variables) misses the correct db and creates one that's empty or at least missing some tables.
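A quick way to check that hypothesis - a diagnostic sketch (mine, not from the thread) to run inside the same environment as the failing test:

```python
# Print where Airflow resolves its home directory and metadata DB
# when imported from the test environment.
import os

from airflow.configuration import conf

print("AIRFLOW_HOME =", os.environ.get("AIRFLOW_HOME", "<unset>"))
print("sql_alchemy_conn =", conf.get("database", "sql_alchemy_conn"))
```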
-
While I cannot answer your question for you, I might try to guide your reasoning. I think you need to describe what exactly you are doing, how you are initializing the Airflow database, and which tests you are talking about. If your tests access the DB, you have to make sure in your test setup that the database is created. This is what various fixtures usually do. Airflow's own DB tests in Airflow CI do this with an auto-use fixture that creates and initializes the DB: https://github.com/apache/airflow/blob/main/tests_common/pytest_plugin.py#L317 when it has not been initialized (which is generally the first time it runs in a clean environment - say, in a new docker container). When it does so, it creates a file ".airflow_db_initialised" in the home dir of airflow, so it does not attempt to do it again. This file usually does not survive a container restart, so the initialization happens, for example, every time the breeze container is started. This behaviour can be overridden.

But this is how the "airflow" test suite works - we have no idea what test suite you are talking about, how you run it, and what kind of assumptions your containers make (which files are preserved between runs - for example by being mounted - and which are not). This is all a question of how your CI and test suite are organized. Generally speaking, you have to make sure that your fixtures (if you use pytest) are doing the right thing and setting up the database for you.

One of the difficulties you might have is that this can also depend on the import sequence. Unfortunately, importing airflow does a lot of implicit things, including lazy loading of various components: we sort of try to initialize everything when we import airflow, but we also try to avoid that initialization and do some magic with lazy loading, sometimes not completing that initialization, to speed things up in some cases. This is a bit of a duality we have - we do "import airflow" with pretty much every possible command, but some commands, test cases, or direct imports should be way faster than completely importing and initializing everything airflow needs to import (configuration, settings, database, plugins, providers and so on). I hope we will do it differently in Airflow 3 - do much more explicit initialization of whatever we need, rather than that half-initialization and lazy-loading dance (which also causes occasional recursive import errors when modules import each other while not being fully imported, depending on the sequence of imports). But that's something we will likely discuss in the coming weeks when we discuss Airflow 3 packaging and initialization.

The consequence of that implicit loading is that we also attempt to make sure that all the models are imported before the database is reset. Most of this happens here: https://github.com/apache/airflow/blob/main/airflow/models/__init__.py - and there are basically three ways models get imported there.
So in general it's not too explicit, and there are a few places/paths where the creation of Airflow's database might go wrong. If you have any path where you run db init but do not import/load the models in memory, you might simply create an empty database; or maybe you have a similar "sentinel file" or some other guard that prevents the database from being initialized when needed, and that file is not removed in your CI? But I am only guessing. My guess is that your tests do not do proper initialization, or maybe delete/remove the created tables at some point in time, or some combination of these.

BTW, I am converting this into a discussion. This is not really an "issue" in Airflow - it's more troubleshooting of what your test suite does or does not do.
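For illustration, a stripped-down sketch of the "initialize once, guard with a sentinel file" pattern described above (an assumption-laden simplification; the real fixture in tests_common/pytest_plugin.py does considerably more):

```python
# Sketch of an auto-use session fixture that initializes the Airflow DB once,
# guarded by a sentinel file in AIRFLOW_HOME. Assumes pytest and a writable home.
import os
from pathlib import Path

import pytest


@pytest.fixture(autouse=True, scope="session")
def airflow_db():
    home = Path(os.environ.get("AIRFLOW_HOME", Path.home() / "airflow"))
    sentinel = home / ".airflow_db_initialised"
    if not sentinel.exists():
        from airflow.utils import db

        db.resetdb()  # drops and recreates all Airflow tables
        sentinel.touch()
    yield
```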
-
Also - I have just found (another issue) that we are likely missing some imports in "import_all_models" #45244 |