add lineage workflow schedule support #1615

Closed
clojurians-org opened this issue Mar 28, 2020 · 7 comments

Comments

@clojurians-org
Contributor

clojurians-org commented Mar 28, 2020

Currently, DataHub has dataset lineage information. We can use it to generate scheduling information and then dispatch that to a workflow engine (such as Airflow or Azkaban).

As far as I know, some tools (such as DataWorks) already use lineage, obtained by parsing ETL job SQL, to help build the relationships between jobs in an intelligent-assistant way, and use annotations to customize the rule generation.

We can implement it in several stages.

  1. Assume every dataset corresponds to one ETL task, default to an ETL name generation rule, and write a script to generate the workflow engine's job-lineage template.
  2. Extend the dataset aspect with an [etl job name] metadata field for real-world cases.
  3. Add an ETL job entity, so that we can generate its input and output datasets to build the relationship connections.
  4. Generate the full ETL job specification for the workflow engine to schedule, and use an extra customization file to override the rules.
  5. Optionally, add event-lake support for real-time integration.
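
A minimal sketch of stage 1, under stated assumptions: the lineage is available as a plain mapping from each dataset to its upstream datasets, every dataset is produced by exactly one ETL task, and the default naming rule is `etl_<dataset>`. The dataset names, the `etl_` prefix, and the `overrides` mapping (standing in for the customization file in stage 4) are all hypothetical and not part of DataHub. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical lineage: dataset name -> names of its upstream datasets.
LINEAGE = {
    "raw_events": [],
    "raw_users": [],
    "clean_events": ["raw_events"],
    "daily_report": ["clean_events", "raw_users"],
}


def default_job_name(dataset):
    """Assumed default naming rule: one ETL job per dataset."""
    return f"etl_{dataset}"


def job_dependencies(lineage, name_rule=default_job_name, overrides=None):
    """Map each job to the jobs it depends on, derived from dataset lineage.

    `overrides` maps a dataset to an explicit job name and takes
    precedence over the naming rule.
    """
    overrides = overrides or {}

    def job_for(dataset):
        return overrides.get(dataset, name_rule(dataset))

    return {
        job_for(dataset): sorted(job_for(up) for up in upstreams)
        for dataset, upstreams in lineage.items()
    }


def schedule_order(job_deps):
    """Return the jobs in an order where every job follows its dependencies."""
    return list(TopologicalSorter(job_deps).static_order())


if __name__ == "__main__":
    deps = job_dependencies(LINEAGE)
    for job in schedule_order(deps):
        print(job, "<-", deps.get(job, []))
```

The resulting job-to-dependencies mapping is the kind of structure that could then be rendered into an engine-specific template; that rendering step is left out here because it depends on the target engine.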
@clojurians-org clojurians-org added the feature-request Request for a new feature to be added label Mar 28, 2020
@mars-lan
Contributor

Thanks for the suggestions. We are planning to add jobs & flows as first-class entities (see roadmap) and tie those to the lineage model. We will take these suggestions into consideration when implementing.

@mars-lan mars-lan added the accepted An Issue that is confirmed as a bug by the DataHub Maintainers. label Mar 29, 2020
@liangjun-jiang
Contributor

Thanks, @mars-lan. Is there any chance you can share the design details when you consider onboarding the jobs & flows entities, and how they tie to the lineage model? The reason I am asking is that internally, we are in the process of implementing this feature. We don't mind contributing this feature, but we want to know how you would design it.
Internally, we borrowed the idea from Apache Atlas's process entity definition:

Process: This type extends Asset. Conceptually, it can be used to represent any data transformation operation. For example, an ETL process that transforms a hive table with raw data to another hive table that stores some aggregate can be a specific type that extends the Process type. A Process type has two specific attributes, inputs and outputs. Both inputs and outputs are arrays of DataSet entities. Thus an instance of a Process type can use these inputs and outputs to capture how the lineage of a DataSet evolves.
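
To make the quoted definition concrete, here is a rough sketch of a process entity with `inputs` and `outputs`, and of how dataset-level lineage could be derived from a list of such processes. The `Process` dataclass, the `dataset_lineage` helper, and the table names are illustrative assumptions; this is neither the Apache Atlas API nor a DataHub model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """Hypothetical process entity: a transformation with inputs and outputs."""
    name: str
    inputs: tuple   # names of the datasets the process reads
    outputs: tuple  # names of the datasets the process writes


def dataset_lineage(processes):
    """Derive dataset -> set of upstream datasets from the processes."""
    upstream = {}
    for process in processes:
        for output in process.outputs:
            # Every input of the process is upstream of each of its outputs.
            upstream.setdefault(output, set()).update(process.inputs)
    return upstream


if __name__ == "__main__":
    etl = Process(
        name="aggregate_sales",
        inputs=("hive.raw_sales",),
        outputs=("hive.sales_daily",),
    )
    print(dataset_lineage([etl]))
```

Modeling the job as its own entity means lineage edges no longer have to be stored directly between datasets; they can be computed from the processes, which is the direction stage 3 of the original proposal points in.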

@hshahoss
Contributor

@loftyet would like to chat more on how you are using it internally. I'll ping you on slack.

@liangjun-jiang
Contributor

@hshahoss we are in the process of implementing it. I think it would be best to get design input from @mars-lan and others before implementing this job/flow entity.

@liangjun-jiang
Contributor

I also don't mind sharing the design review documentation here so everyone can take a look at it.

@hshahoss
Contributor

@loftyet yeah that would be a good start. We can take a look at it.

@mars-lan
Contributor

Let's concentrate the discussion in this issue: #1731
