Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: New neo4j csv publisher to improve performance using batched params #1957

Merged
merged 10 commits into from
Aug 16, 2022

Conversation

kristenarmes
Copy link
Contributor

@kristenarmes kristenarmes commented Aug 11, 2022

Summary of Changes

  • Use unwind clause in the cypher queries when merging nodes and relations to be able to batch params and improve performance
  • Updated to use managed transactions in line with Neo4j's recommended best practices
  • Added PUBLISH_REVERSE_RELATIONSHIPS config
    • True (default): keep the existing behavior of publishing two way relations between nodes
    • False: only publish a one way relationship
  • Added PRESERVE_ADHOC_UI_DATA config
    • True (default): stops the publisher from updating a node or relationship created via the UI, e.g. a description or owner added manually by an Amundsen user. Such nodes/relationships will not have a published_tag property that is set by databuilder.
    • False: nothing changes

Tests

Basic unit tests for the publisher

Documentation

N/A

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does

…able two or one way relations

Signed-off-by: Kristen Armes <karmes@lyft.com>
…ogic to reusable function

Signed-off-by: Kristen Armes <karmes@lyft.com>
@boring-cyborg boring-cyborg bot added the area:databuilder From databuilder folder label Aug 11, 2022
Signed-off-by: Kristen Armes <karmes@lyft.com>
…d lint

Signed-off-by: Kristen Armes <karmes@lyft.com>
Signed-off-by: Kristen Armes <karmes@lyft.com>
@kristenarmes kristenarmes marked this pull request as ready for review August 12, 2022 00:37
@kristenarmes kristenarmes requested a review from a team as a code owner August 12, 2022 00:37

# CSV HEADER
# A header with this suffix will be pass to Neo4j statement without quote
UNQUOTED_SUFFIX = ':UNQUOTED'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this i think does nothing and is no longer needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this, it is used to specify fields where the values are not strings, so like "order_pos:UNQUOTED" would be for numbers 1, 2, etc

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find a reference for this, could you share. I recall looking this up a year+ ago finding it hasn't been doing anything and wasn't really a thing since maybe the earliest versions of neo4j. I tried just now again, and didn't find anything concrete.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind I found it. Clearly I was dumb back then.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I just looked more. IIUC: in the publisher, all that's happening is this suffix is being removed so it doesn't end up ingested. At some point, it was doing something, but not anymore (please correct me if I'm wrong). Instead of continuing to support this, instead can we simply remove this behavior from the loader and then, likewise, from all publishers? I did a full code search and I'm quite sure this is safe! Maybe double check though.

def get_scope(self) -> str:
return 'publisher.neo4j'

def _create_indices(self, node_file: str) -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea: allow for list of pre and post-publish actions. and set this as a default pre-publish action

@ozandogrultan
Copy link
Contributor

What is the minimum Neo4j version required to be able to use this new publisher? I would suggest to document it if there is such requirement

@kristenarmes
Copy link
Contributor Author

kristenarmes commented Aug 12, 2022

Hey @ozandogrultan, the Neo4j driver was upgraded in databuilder v7.0.0 (#1938) so it should work for anyone past that version. But I can make a note of it in the description too 👍

…nsaction

Signed-off-by: Kristen Armes <karmes@lyft.com>
Signed-off-by: Kristen Armes <karmes@lyft.com>
Signed-off-by: Kristen Armes <karmes@lyft.com>
Copy link

@dkunitsk dkunitsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amazing. Minor comments raised. Very nearly done.

Signed-off-by: Kristen Armes <karmes@lyft.com>
Copy link

@dkunitsk dkunitsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😭 🥲 😭

tx.run(stmt, parameters=params)


def get_props_body_keys(record_keys: list,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so good

Signed-off-by: Kristen Armes <karmes@lyft.com>
@feng-tao
Copy link
Member

amazing! any perf number before vs after?

@kristenarmes
Copy link
Contributor Author

@feng-tao we've seen great improvements in a few tests I've ran, with having deleted the nodes/relations in between runs:

Task # 1: 67676 statements

Before: Successfully published. Elapsed: 203 seconds

After: Successfully published. Elapsed: 8 seconds

Task # 2: 237734 statements

Before: Successfully published. Elapsed: 876 seconds

After: Successfully published. Elapsed: 29 seconds

@feng-tao
Copy link
Member

nice, it would be great if you could share the change to the slack so that others could leverage :) (I still remembered the old days to take long time to update in the etl job)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:databuilder From databuilder folder
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants