Design Review: column level lineage feature #1731
Comments
Thanks for opening this feature request. One quick comment for the first issue you mentioned - We do not necessarily need to modify the existing |
Agreed. There are still quite a few areas I have not thought through. I will definitely need your input. |
[Continued] In a simplified
Interpreting this SQL statement, we understand
This analysis will help us understand the following
and, in similar fashion
In the next part, I will present my thoughts on modeling column-level lineage with Datahub [To be continued] |
1. Establish a relationship between a
|
Here are a few more notes:
|
In no particular order, some thoughts / questions.
So some alternate, specific proposals / brainstorms:
|
I made #column-level-lineage in Slack (this is an open channel), in case we want a more casual place to discuss. Important updates / proposals should still go here. |
Let me add more thoughts on this, having worked on the implementation over the past 48 hours.
|
To the above:
So our current focus is how we should model fields. Are they top-level entities? Can we leverage the schema models we have today? |
For point 5, I don't really know what the most common use cases are. From my data engineering background, I think working with database DDL is more common, and the most practical to automate. |
Thanks @liangjun-jiang for your thoughts and continued follow-ups on this. Internally we have been putting some thought into this as well. We are thinking of representing field-level lineage as a separate aspect called DatasetFieldLevelUpstreamLineage.
Aspect Models
One other implementation of UDF can be
As mentioned, the above is just a hypothetical assumption about the UDF, but we can come up with a more concrete interface for it.
What does this enable us to model
How to convert the above metadata into graph representation
(I can expand further on what the graph models might look like.) Please let us know what you think of this. |
@nagarjunakanamarlapudi this is great. It is definitely more thought through than what I have implemented. That being said, I will present my simplest implementation and reasoning in the next few comments. |
As mentioned earlier, I have proposed two new relationships:
2
Because
In other words, the validator looks at the source and target of each relationship:
I did the following two hacks
|
The next step is to create a new aspect for
I also created a
The
To compare the existing FineGrainUpstreamLineage --> UpStreamLineage |
The next step is to implement the graph builder with
The base
The final step is to register these two relationship builders with
|
This implementation supports features
In this Neo4j graph presentation,
|
Here are sample
|
So I think this issue has a lot of really great ideas in it, but it is starting to get a little large and hard to follow. Jumping right from here to a large PR isn't that easy either :) Can we maybe try the full RFC process here? i.e. a design doc? That should be easier to follow than this issue (the latest state of the RFC PR is the current proposal, no need to read a large back and forth discussion if you want to jump right in), and we can review that, and then after that is ok'd we can start code reviews. I would also strongly suggest multiple PRs; try to make them smaller. A good example is the first PR should probably be models only, no code changes. Then you can start adding code support. Let me know what you think, thanks! |
Absolutely, @jplaisted. This issue was created before the RFC process was adopted. Happy to convert this into an RFC for future reference. |
Assuming we can close this in favor of #1841. Feel free to reopen otherwise. |
Suppose I have a SQL statement in Snowflake like this: INSERT INTO EMPLOYEE How can I get lineage in Datahub like F_NAME+L_NAME ----> NAME? |
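To make the shape of that question concrete, here is a minimal sketch of a many-to-one field-level edge in plain Python. It is not a Datahub model or API: the `FieldEdge` class, its field names, and the `SOURCE_TABLE` placeholder (the source table is not shown in the question) are all assumptions for illustration, as is treating `NAME` as a column of `EMPLOYEE`.

```python
from dataclasses import dataclass
from typing import Tuple

# A (table, column) pair identifying one field.
Field = Tuple[str, str]


@dataclass(frozen=True)
class FieldEdge:
    """Hypothetical many-to-one lineage edge between fields."""
    upstream_fields: Tuple[Field, ...]
    downstream_field: Field
    transformation: str = ""  # free-text note on how the value is derived


# SOURCE_TABLE is a placeholder; the question does not name the source table.
name_edge = FieldEdge(
    upstream_fields=(("SOURCE_TABLE", "F_NAME"), ("SOURCE_TABLE", "L_NAME")),
    downstream_field=("EMPLOYEE", "NAME"),
    transformation="F_NAME + L_NAME",
)


def describe(edge: FieldEdge) -> str:
    """Render an edge in the 'A + B ----> C' notation used in the question."""
    upstream = " + ".join(column for _table, column in edge.upstream_fields)
    return f"{upstream} ----> {edge.downstream_field[1]}"
```

Calling `describe(name_edge)` renders the edge in the same arrow notation as the question. How such an edge would actually be emitted to Datahub depends on the model that comes out of the RFC.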
Is your feature request related to a problem? Please describe.
Yes. Column-level lineage support has been requested a few times in the past.
Describe the solution you'd like
This issue is meant to document how to design this feature.
Describe alternatives you've considered
n/a
Additional context
While Datahub currently supports table-level lineage as a dataset aspect, there is a strong need for column-level lineage.
A sample illustration of column-level lineage:
If we look at the right part of this screenshot, we notice that:
- `INSERT-SELECT-1` came from tables `orders` and `customers`
- the `oid`, `cid`, `ottl`, `sid` columns of `INSERT-SELECT-1` came from the `orders` table
- the `cl` and `cem` columns of `INSERT-SELECT-1` came from the `customers` table
- `small_orders`, `medium_orders`, `large_orders` and `special_orders` are derived from `INSERT-SELECT-1`
Below this `INSERT-SELECT-1`, there is another lineage case represented in a similar fashion.
Now we look at the left part of this screenshot. We notice how the SQL statement is used to generate the target table, and how the columns in the target table are derived from the source tables.
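The relationships read off the screenshot can be written down as data. The following is a minimal sketch in plain Python, not a Datahub model: the dictionary layout and function names are assumptions for illustration, and the source column names are assumed to match the target column names, since the description does not state them.

```python
from collections import defaultdict

# Target column of INSERT-SELECT-1 -> (source table, source column).
# Source column names are assumed to equal the target names.
COLUMN_LINEAGE = {
    "oid": ("orders", "oid"),
    "cid": ("orders", "cid"),
    "ottl": ("orders", "ottl"),
    "sid": ("orders", "sid"),
    "cl": ("customers", "cl"),
    "cem": ("customers", "cem"),
}


def table_level_lineage(column_lineage):
    """Collapse column-level edges into a sorted list of upstream tables."""
    return sorted({table for table, _column in column_lineage.values()})


def columns_by_source(column_lineage):
    """Group target columns by the source table they were derived from."""
    grouped = defaultdict(list)
    for target, (table, _column) in column_lineage.items():
        grouped[table].append(target)
    return dict(grouped)
```

The point of the sketch is that table-level lineage (`table_level_lineage`) can be derived from column-level lineage, but not the other way around, which is why the column-level information has to be captured explicitly.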
In this design review, I think we need to address two important issues:
`Upstream.pdl` to support column-level lineage. To make it easier to understand, the current `Upstream.pdl` looks like (code comments deleted for brevity)
`sql` statement easily, and ingest MCE messages so Datahub can pick them up.
To be continued