Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(metadata-io): in Neo4j service use proper algorithm to get lineage #8687

Conversation

lix-mms
Copy link
Contributor

@lix-mms lix-mms commented Aug 22, 2023

The existing shortestPath algorithm supporting the lineage queries has performance issue when the graph database has a high amount of dataset nodes (say >100K). According to Neo4j Cypher PROFILE, it looks like the shortestPath query tries to enumerate all the nodes in the graph database to compute a shortestPath from the start node, which is unnecessary since usually the reachable nodes from the start node are only a very small portion of the whole set of nodes.

Such a performance issue basically leads to that the lineage features in the UI of our DataHub do not work at all. (GraphQL queries searchAcrossLineage or getLineage always time out, but in the background the execution of the query continues until it finishes after a long time.)

After extensive investigations, we found that apoc.path.spanningTree is a better alternative to the existing algorithm. It is uses Breadth-First-Search to expand the lineage paths from the start node, efficiently returns collected paths and does not explore any non-reachable nodes.

With a fix via this PR, in our case, the query time given a node having >200 downstream lineages reduced from 90s to roughly 13s or even less, which is a substantial improvement.

In addition, I restructured how the query statements are formatted, which should lead to better readability and easier maintenance.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added product PR or Issue related to the DataHub UI/UX devops PR or Issue related to DataHub backend & deployment labels Aug 22, 2023
@ghf0128
Copy link

ghf0128 commented Aug 22, 2023

Nice job!

@lix-mms
Copy link
Contributor Author

lix-mms commented Aug 22, 2023

APOC needs to be enabled in Neo4j for this change to work. In particular, for deployment via Helm chart, this PR is needed:
acryldata/datahub-helm#353

The dependency should be communicated via change logs.

@lix-mms
Copy link
Contributor Author

lix-mms commented Aug 24, 2023

Trying to figure out why metadata-io:test is failing. Would be great to have additional support.

* this bypasses dependency issues with apoc:all:4.4.0.9
@lix-mms
Copy link
Contributor Author

lix-mms commented Aug 25, 2023

I made a fix and metadata-io:test runs successfully now.

(Needed to use a slightly higher version of neo4j apoc as a test dependency to avoid some dependency issue on snakeyaml.)

@maggiehays maggiehays added the community-contribution PR or Issue raised by member(s) of DataHub Community label Aug 28, 2023
@RyanHolstien
Copy link
Collaborator

Can you please make a note of this breaking change in updating-datahub.md? Thanks! Otherwise LGTM :)

@lix-mms
Copy link
Contributor Author

lix-mms commented Oct 9, 2023

@RyanHolstien Thanks! I'll update it as soon as acryldata/datahub-helm#353 is merged.

Btw, should we also add a breaking change tip line for the replacement of the Neo4j chart (PR: acryldata/datahub-helm#365)? It looks like the existing chart overrides will not automatically work with the new chart structure.

@lix-mms
Copy link
Contributor Author

lix-mms commented Oct 11, 2023

@RyanHolstien I added a breaking change note to cover this PR change and the change introduced in acryldata/datahub-helm#365. Please check 😋

@lix-mms
Copy link
Contributor Author

lix-mms commented Oct 12, 2023

Not sure why the build checks start to fail after merging master and the errors do not seem to be relevant to this PR.

Any help is appreciated!

@lix-mms
Copy link
Contributor Author

lix-mms commented Nov 9, 2023

Only test cypress_rest is failing.

@lix-mms
Copy link
Contributor Author

lix-mms commented Nov 10, 2023

Tests are passing. Mind a final check for merge? 😋 @RyanHolstien

@RyanHolstien RyanHolstien merged commit 179f103 into datahub-project:master Nov 10, 2023
37 checks passed
@lix-mms lix-mms deleted the fix/metadata-io/use-proper-algorithm-to-get-lineage branch November 13, 2023 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community devops PR or Issue related to DataHub backend & deployment product PR or Issue related to the DataHub UI/UX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants