Better integration between datasets and data intervals #45187

Open

casperhart opened this issue Dec 23, 2024 · 6 comments
Labels
area:datasets (Issues related to the datasets feature), kind:feature (Feature Requests), needs-triage (label for new issues that we didn't triage yet)

Comments

@casperhart

Description

Currently one can trigger a DAG based on a dataset, a time schedule, or a DatasetOrTimeSchedule, but it would be good if the dataset itself (or the dataset event) could be associated with a schedule or logical_date. E.g. a monthly dataset, where an event is emitted by a DAG at most once for a given month, and where the catchup argument of a downstream DAG is respected.

For example, for a DAG with two dataset dependencies, if dataset 1 has been produced for month 1 and dataset 2 gets produced for month 2, the DAG will be triggered even though the two dataset events relate to separate intervals. I'd like to trigger the DAG only if the dataset events were emitted for the same interval.
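A minimal sketch of that scenario (the dataset URIs and DAG name are hypothetical, not from the issue): with the current behaviour the consumer is triggered once both datasets have an unconsumed event, regardless of which interval each event was produced for.

```python
import pendulum

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Hypothetical monthly datasets produced by two upstream DAGs.
sales = Dataset("s3://warehouse/sales/monthly")
costs = Dataset("s3://warehouse/costs/monthly")


@dag(
    schedule=[sales, costs],  # triggered once both datasets have a new event
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
)
def monthly_report():
    @task
    def build_report():
        # Runs even if the sales event was for 2024-01 and the costs event
        # was for 2024-02 -- there is no notion of "same interval" here.
        ...

    build_report()


monthly_report()
```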

I'm fairly new to using datasets, so apologies if my issue already has a solution or workaround.

Use case/motivation

I have a few issues with datasets that I'm having trouble solving:

  • A dataset producer DAG gets re-run, but we don't want downstream DAGs to be re-triggered for the same data interval.
  • Out-of-sync issues where a DAG is triggered based on a stale event in cases where multiple dataset triggers are defined: Dataset aware scheduling - is there a way to reset DAG? #36618
  • If a producer DAG is run with catchup=True and we don't want consumer DAGs to be backfilled, there is no way to restrict the backfill on the consumer DAGs.

Technically this could be accomplished with TriggerDagRunOperator/ExternalTaskSensor, but these have other issues that datasets solve quite nicely. The benefit of decoupling DAGs using datasets is huge; however, by using datasets, some of the benefits of time schedules are lost.
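One partial workaround I can imagine (a rough sketch only, assuming Airflow 2.10's dataset event extras and the triggering_dataset_events context variable; the dataset URI and DAG names are hypothetical) is to stamp each event with its data interval and have the consumer skip mismatched runs. It doesn't stop the consumer DAG run from being created, and it doesn't address the catchup/backfill points above.

```python
import pendulum

from airflow.datasets import Dataset
from airflow.decorators import dag, task
from airflow.exceptions import AirflowSkipException

sales = Dataset("s3://warehouse/sales/monthly")  # hypothetical URI


@dag(schedule="@monthly", start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=True)
def sales_producer():
    @task(outlets=[sales])
    def publish(*, data_interval_start=None, outlet_events=None):
        # Record which interval this event was produced for.
        outlet_events[sales].extra = {"interval_start": data_interval_start.isoformat()}

    publish()


@dag(schedule=[sales], start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def sales_consumer():
    @task
    def check_interval(*, triggering_dataset_events=None):
        # Skip the run unless every triggering event refers to the same interval.
        intervals = {
            event.extra.get("interval_start")
            for events in triggering_dataset_events.values()
            for event in events
        }
        if len(intervals) != 1:
            raise AirflowSkipException("Triggering events span different data intervals")

    check_interval()


sales_producer()
sales_consumer()
```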

Related issues

#36618

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@casperhart casperhart added the kind:feature and needs-triage labels on Dec 23, 2024

boring-cyborg bot commented Dec 23, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@dosubot dosubot bot added the area:datasets label on Dec 23, 2024
@tirkarthi
Contributor

cc: @Lee-W @uranusjr

@potiuk
Member

potiuk commented Dec 24, 2024

cc: @dstandish

@potiuk
Member

potiuk commented Dec 24, 2024

This is another case where I think "data interval" is such an established term that we should embrace it, not move away from it (re: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-83+amendment+to+support+classic+Airflow+authoring+style).

cc: @casperhart -> It would be great if you also incorporated your points into the discussion on that AIP-83 amendment; I think it's pretty relevant, and it would be valuable to hear from others as well about the cases they have in mind.

@casperhart
Author

After skimming through the doc I would say that I agree that logical_date can be a confusing term, especially because its meaning differs depending on whether a DAG is scheduled or manually triggered. With a manual trigger it defaults to the trigger date, but for scheduled DAGs it's the same as data_interval_start and so is redundant. But independent of the logical_date, data intervals are incredibly useful and IMHO Airflow's most powerful feature. I'll add comments to the doc a bit later.

But for this issue specifically, if datasets were better integrated with data intervals and backfill features, that would be very helpful.

@dstandish
Contributor

There's a section in that doc that lists some of the issues with data intervals. They are, as you say, mostly redundant, since they are generally derived from the logical date. They are also static, i.e. there's no way for the task to record what it actually did. Also, they are at the DAG run scope, so it's assumed all tasks are processing the same data. And there's no way, e.g., to backfill a wide range; rather, we are stuck in a partition-driven paradigm where we must create runs for every interval / partition.

There's no one-size-fits-all rule that would govern how to map data intervals from a triggering DAG to a triggered DAG with dataset triggers, so when this was being implemented I did not think we should do it. But we did: we take the min and max. I suspect most of the time this is not meaningful, and it is probably not used.
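For illustration only (this is not the scheduler's actual code, just the min/max rule described above expressed as a hypothetical helper):

```python
def derive_data_interval(source_runs):
    """Illustrative only: a dataset-triggered run's data interval taken as the
    min start / max end over the data intervals of the source DAG runs."""
    starts = [run.data_interval_start for run in source_runs]
    ends = [run.data_interval_end for run in source_runs]
    if not starts:
        return None
    return min(starts), max(ends)
```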

I think you could make a stronger argument for this kind of thing if, instead of listening for a dataset update (which does not have a data interval associated with it per se), we could actually listen for a DAG run event and schedule on that; then it could make more sense.

I think that you probably want to look to the work that @uranusjr is planning to do with assets to try to implement some of the functionality you are seeking. I'm also a little unclear on the use cases you are trying to explain, and I think it would help others' understanding if you could go into more detail on each one.
