
perf(ingest): Improve FileBackedDict iteration performance; minor refactoring #7689

Conversation

asikowitz
Collaborator

  • Refactors __iter__
  • Manually implements .items() and .values()
  • Adds sql_query_iterator and filtered_items
  • Renames connection -> shared_connection
  • Removes unnecessary flush during close if connection is not shared
  • Adds context manager dunder methods

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@asikowitz asikowitz requested a review from hsheth2 March 24, 2023 20:07
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 24, 2023
@codecov-commenter

codecov-commenter commented Mar 24, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -7.33 ⚠️

Comparison is base (301c861) 74.39% compared to head (11adbc4) 67.07%.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7689      +/-   ##
==========================================
- Coverage   74.39%   67.07%   -7.33%     
==========================================
  Files         353      353              
  Lines       35386    35395       +9     
==========================================
- Hits        26327    23740    -2587     
- Misses       9059    11655    +2596     
Flag                          Coverage Δ
pytest-testIntegration        ?
pytest-testIntegrationBatch1  36.47% <26.82%> (+<0.01%) ⬆️
pytest-testQuick              63.58% <100.00%> (+0.02%) ⬆️
pytest-testSlowIntegration    32.94% <26.82%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files                                         Coverage Δ
...n/src/datahub/utilities/file_backed_collections.py  100.00% <100.00%> (ø)

... and 82 files with indirect coverage changes


cursor = self._conn.execute(f"SELECT key, value FROM {self.tablename}")
for row in cursor:
    if row[0] not in cache_keys:
        yield row[0], self.deserializer(row[1])
Collaborator

this means that we're not tracking these objects in the active cache, which means that mutations made against items() won't get saved
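A minimal stand-in model of the concern above (hypothetical, not the real FileBackedDict implementation): rows yielded straight from SQLite are deserialized fresh and never registered in the write-back cache, so mutating the yielded object is silently lost.

```python
import json

# Simulated on-disk table and SQL-backed iteration. The yielded objects are
# deserialized fresh each time and nothing records them for write-back.
store = {"a": json.dumps([1, 2])}

def items():
    for key, blob in store.items():
        yield key, json.loads(blob)  # fresh object, not placed in any cache

for key, value in items():
    value.append(3)                  # this mutation is never persisted

assert json.loads(store["a"]) == [1, 2]  # table still holds the original
```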

for row in self._conn.execute(
    f"SELECT key, value FROM {self.tablename} WHERE {cond_sql}"
):
    yield row[0], self.deserializer(row[1])
Collaborator

same as above - are we requiring no mutations to values during iteration?
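For context, a self-contained toy model of the SQL-side filtering that filtered_items performs (the table name and schema here are hypothetical; the real class builds cond_sql from caller-supplied filters):

```python
import json
import sqlite3

# Push the predicate into SQLite instead of deserializing every row in Python.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", json.dumps(1)), ("b", json.dumps(2)), ("c", json.dumps(3))],
)

def filtered_items(cond_sql: str):
    # cond_sql must be trusted or parameterized in real code (SQL injection)
    for key, blob in conn.execute(f"SELECT key, value FROM t WHERE {cond_sql}"):
        yield key, json.loads(blob)

assert sorted(filtered_items("key != 'a'")) == [("b", 2), ("c", 3)]
```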

Collaborator Author
@asikowitz asikowitz Mar 24, 2023

Hmm, I thought we were requiring no mutations at all. Mutation during iteration is cool, but it feels dangerous to me -- it works for most cases, but it seems too easy to do something out of the ordinary and get burned, e.g.:

from dataclasses import dataclass
from typing import List

@dataclass
class MyObject:
    name: str
    value: int
    parents: List[str]

my_dict = FileBackedDict[MyObject]()
...
important_entry = my_dict["main"]
for k, obj in my_dict.items():
    if obj.name == important_entry.name:
        important_entry.parents.append(k)

Here, we'll catch some but not all of the important_entry mutations: once important_entry is evicted from the cache, later appends are lost.

Or less contrived...

for k, obj in my_dict.items():
    for parent in obj.parents:
        my_dict[parent].value += obj.value
    obj.parents = []

If len(obj.parents) > max_cache_size we won't catch the mutation in the last line

Collaborator Author
@asikowitz asikowitz Mar 24, 2023

Ok, I still don't think we should rely on mutation, but I realized iteration was so slow because I was converting cache_keys to a list instead of a set when materializing it... so I'll just revert these changes, because that's the real performance improvement.

EDIT: Or so I thought? Still doing some investigating
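The list-vs-set point is easy to demonstrate (numbers below are illustrative, not from this PR): membership tests are O(n) on a list but O(1) on a set, and the check runs once per row during iteration.

```python
import timeit

# Worst-case probe for the list: the key sits at the end, forcing a full scan.
keys_list = [f"k{i}" for i in range(50_000)]
keys_set = set(keys_list)
probe = "k49999"

list_time = timeit.timeit(lambda: probe in keys_list, number=200)
set_time = timeit.timeit(lambda: probe in keys_set, number=200)
assert set_time < list_time  # the set wins by orders of magnitude
```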

Collaborator

in that case, we should rename to items_no_mutation or something and then the constraint will be more clear in the calling code

I'm just trying to prevent incorrect usage of the interface because it'll become super confusing if bugs happen

Collaborator

fwiw I'm ok with requiring no mutations during iteration

Collaborator Author

I'd prefer not to do the renames -- I understand wanting to prevent incorrect usage, but I think we should be able to state a blanket rule, "you cannot store objects in a FileBackedDict and mutate them," without requiring different names. Otherwise, renaming to items_no_mutation isn't sufficient -- it's not just when iterating through items() that you can't mutate. You can't mutate during any iteration, so for key in my_file_backed_dict would also be unsafe, as would regular key access, e.g. if key in my_file_backed_dict: .... I don't think it's practical to rename all of these usages.

Collaborator Author

Ah, are you saying we won't catch changes to keys during iteration? e.g.

d = FileBackedDict()
d['a'] = 'a'
d['b'] = 'b'
for k, v in d.items():
    d['b'] = 'd'
    print(k, v)

And then if we iterate 'a' first, we should get the update in the next loop?

I do think the items() spec is supposed to pick up these changes, which we won't right now
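For reference, the built-in dict behaves as described here: overwriting an existing key mid-iteration is legal and the new value is visible through the live items() view, while changing the set of keys raises RuntimeError.

```python
# Overwriting an existing key during iteration is permitted, and the live
# items() view reflects the update when that key is reached.
d = {"a": "a", "b": "b"}
seen = []
for k, v in d.items():
    d["b"] = "d"            # overwrite existing key: allowed
    seen.append((k, v))
assert seen == [("a", "a"), ("b", "d")]

# Inserting (or deleting) keys mid-iteration invalidates the iterator.
raised = False
try:
    for k in d:
        d[k + "_new"] = 1   # size change during iteration
except RuntimeError:
    raised = True
assert raised
```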

@@ -307,6 +351,17 @@ def close(self) -> None:
    def __del__(self) -> None:
        self.close()

    def __enter__(self) -> "FileBackedDict":
Collaborator

if you inherit from our Closeable class, you don't need to manually define __enter__ and __exit__
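A sketch of what such a Closeable mixin presumably provides (hypothetical reconstruction; DataHub's actual class may differ): inherit it, implement close(), and the context-manager dunders come for free.

```python
from typing import Any

class Closeable:
    """Mixin: subclasses implement close(); __enter__/__exit__ are inherited."""

    def close(self) -> None:
        raise NotImplementedError

    def __enter__(self) -> "Closeable":
        return self

    def __exit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:
        self.close()

class Resource(Closeable):
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True

with Resource() as r:
    assert not r.closed
assert r.closed  # __exit__ called close() automatically
```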

@asikowitz
Collaborator Author

asikowitz commented Mar 27, 2023

Latest changes remove the custom .items() in favor of:

  • A cache with a dirty-bit implementation that doesn't write values back if they haven't changed. This is a general performance upgrade, especially around iteration of values or items, where previously we would persist the entire dictionary again even if no changes were made.
    • This further prevents the mutation of values; you can see this in the required change to the Counter test. In general I see this as a neutral-to-good thing, because we don't want to encourage mutation of values. Ideally mutation wouldn't work in any scenario, but that's difficult to guarantee given the cache. We discussed limiting value types to frozen dataclasses, but I think that's out of scope for these changes.
  • An items_snapshot method that skips per-key cache accesses for optimized performance. Merged this with filtered_items to keep the interface clean.
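The dirty-bit idea can be sketched as follows (a minimal toy, with hypothetical names, not the merged implementation): the cache remembers whether each entry was written through __setitem__, and a flush skips clean entries instead of re-serializing everything.

```python
import json
import sqlite3

class DirtyBitCacheDict:
    """Toy SQLite-backed dict whose cache tracks a dirty bit per entry."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE data (key TEXT PRIMARY KEY, value TEXT)")
        self._cache: dict = {}  # key -> (value, dirty)

    def __setitem__(self, key, value) -> None:
        self._cache[key] = (value, True)   # mark dirty on write

    def __getitem__(self, key):
        if key in self._cache:
            return self._cache[key][0]
        row = self._conn.execute(
            "SELECT value FROM data WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        value = json.loads(row[0])
        self._cache[key] = (value, False)  # loaded clean: no write-back needed
        return value

    def flush(self) -> int:
        written = 0
        for key, (value, dirty) in self._cache.items():
            if dirty:                      # skip clean entries entirely
                self._conn.execute(
                    "INSERT OR REPLACE INTO data VALUES (?, ?)",
                    (key, json.dumps(value)),
                )
                written += 1
        self._cache = {k: (v, False) for k, (v, _) in self._cache.items()}
        return written

d = DirtyBitCacheDict()
d["a"] = 1
assert d.flush() == 1  # dirty entry written once
assert d["a"] == 1     # cached read stays clean
assert d.flush() == 0  # nothing re-written on the second flush
```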

@asikowitz asikowitz merged commit c7d35ff into datahub-project:master Mar 27, 2023
@asikowitz asikowitz deleted the file-backed-collections-improvements branch March 27, 2023 21:20
yoonhyejin pushed a commit that referenced this pull request Apr 3, 2023
…actoring (#7689)

- Adds dirty bit to cache, only writes data if dirty
- Refactors __iter__
- Adds sql_query_iterator
- Adds items_snapshot, more performant `items()` that allows for filtering
- Renames connection -> shared_connection
- Removes unnecessary flush during close if connection is not shared
- Adds Closeable mixin