
perf(ingest): Improve FileBackedDict iteration performance; minor refactoring #7689

Conversation

asikowitz
Collaborator

  • Refactors __iter__
  • Manually implements .items() and .values()
  • Adds sql_query_iterator and filtered_items
  • Renames connection -> shared_connection
  • Removes unnecessary flush during close if connection is not shared
  • Adds context manager dunder methods

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@asikowitz asikowitz requested a review from hsheth2 March 24, 2023 20:07
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 24, 2023
@codecov-commenter

codecov-commenter commented Mar 24, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -7.33 ⚠️

Comparison is base (301c861) 74.39% compared to head (11adbc4) 67.07%.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7689      +/-   ##
==========================================
- Coverage   74.39%   67.07%   -7.33%     
==========================================
  Files         353      353              
  Lines       35386    35395       +9     
==========================================
- Hits        26327    23740    -2587     
- Misses       9059    11655    +2596     
Flag                          Coverage Δ
pytest-testIntegration        ?
pytest-testIntegrationBatch1  36.47% <26.82%> (+<0.01%) ⬆️
pytest-testQuick              63.58% <100.00%> (+0.02%) ⬆️
pytest-testSlowIntegration    32.94% <26.82%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files                                         Coverage Δ
...n/src/datahub/utilities/file_backed_collections.py  100.00% <100.00%> (ø)

... and 82 files with indirect coverage changes


cursor = self._conn.execute(f"SELECT key, value FROM {self.tablename}")
for row in cursor:
    if row[0] not in cache_keys:
        yield row[0], self.deserializer(row[1])
Collaborator

this means that we're not tracking these objects in the active cache, which means that mutations made against items() won't get saved
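A minimal stand-in model of the concern above (hypothetical, not the real FileBackedDict implementation): rows yielded straight from SQLite are deserialized fresh and never registered in the write-back cache, so mutating the yielded object is silently lost.

```python
import json

# Simulated on-disk table and SQL-backed iteration. The yielded objects are
# deserialized fresh each time and nothing records them for write-back.
store = {"a": json.dumps([1, 2])}

def items():
    for key, blob in store.items():
        yield key, json.loads(blob)  # fresh object, not placed in any cache

for key, value in items():
    value.append(3)                  # this mutation is never persisted

assert json.loads(store["a"]) == [1, 2]  # table still holds the original
```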

for row in self._conn.execute(
    f"SELECT key, value FROM {self.tablename} WHERE {cond_sql}"
):
    yield row[0], self.deserializer(row[1])
Collaborator

same as above - are we requiring no mutations to values during iteration?
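For context, a self-contained toy model of the SQL-side filtering that filtered_items performs (the table name and schema here are hypothetical; the real class builds cond_sql from caller-supplied filters):

```python
import json
import sqlite3

# Push the predicate into SQLite instead of deserializing every row in Python.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", json.dumps(1)), ("b", json.dumps(2)), ("c", json.dumps(3))],
)

def filtered_items(cond_sql: str):
    # cond_sql must be trusted or parameterized in real code (SQL injection)
    for key, blob in conn.execute(f"SELECT key, value FROM t WHERE {cond_sql}"):
        yield key, json.loads(blob)

assert sorted(filtered_items("key != 'a'")) == [("b", 2), ("c", 3)]
```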

Collaborator Author
@asikowitz asikowitz Mar 24, 2023

Hmm, I thought we were requiring no mutations at all. Mutation during iteration is cool, but it feels dangerous to me -- it works for most cases, but it seems too easy to do something out of the ordinary and get burned, e.g.:

from dataclasses import dataclass
from typing import List

@dataclass
class MyObject:
    name: str
    value: int
    parents: List[str]

my_dict = FileBackedDict[MyObject]()
...
important_entry = my_dict["main"]
for k, obj in my_dict.items():
    if obj.name == important_entry.name:
        important_entry.parents.append(k)

Here, we'll catch some but not all of the important_entry mutations: once important_entry is evicted from the cache, later appends are lost.

Or less contrived...

for k, obj in my_dict.items():
    for parent in obj.parents:
        my_dict[parent].value += obj.value
    obj.parents = []

If len(obj.parents) > max_cache_size we won't catch the mutation in the last line

Collaborator Author
@asikowitz asikowitz Mar 24, 2023

Ok, I still don't think we should rely on mutation, but I realized iteration was so slow because I was converting cache_keys to a list instead of a set when materializing it... so I'll just revert these changes, because that's the real performance improvement.

EDIT: Or so I thought? Still doing some investigating
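The list-vs-set point is easy to demonstrate (numbers below are illustrative, not from this PR): membership tests are O(n) on a list but O(1) on a set, and the check runs once per row during iteration.

```python
import timeit

# Worst-case probe for the list: the key sits at the end, forcing a full scan.
keys_list = [f"k{i}" for i in range(50_000)]
keys_set = set(keys_list)
probe = "k49999"

list_time = timeit.timeit(lambda: probe in keys_list, number=200)
set_time = timeit.timeit(lambda: probe in keys_set, number=200)
assert set_time < list_time  # the set wins by orders of magnitude
```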

Collaborator

in that case, we should rename to items_no_mutation or something and then the constraint will be more clear in the calling code

I'm just trying to prevent incorrect usage of the interface because it'll become super confusing if bugs happen

Collaborator

fwiw I'm ok with requiring no mutations during iteration

Collaborator Author

I'd prefer not to do the renames -- I understand wanting to prevent incorrect usage, but I think we should be able to state a blanket rule, "you cannot store objects in a FileBackedDict and mutate them," without requiring different names. Otherwise, renaming to items_no_mutation isn't sufficient -- it's not just when iterating through items() that you can't mutate. You can't mutate during any iteration, so for key in my_file_backed_dict would also be unsafe, as would regular key access, e.g. if key in my_file_backed_dict: .... I don't think it's practical to rename all of these usages.

Collaborator Author

Ah, are you saying we won't catch changes to keys during iteration? e.g.

d = FileBackedDict()
d['a'] = 'a'
d['b'] = 'b'
for k, v in d.items():
    d['b'] = 'd'
    print(k, v)

And then if we iterate 'a' first, we should get the update in the next loop?

I do think the items() spec is supposed to pick up these changes, which we won't right now
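For reference, the built-in dict behaves as described here: overwriting an existing key mid-iteration is legal and the new value is visible through the live items() view, while changing the set of keys raises RuntimeError.

```python
# Overwriting an existing key during iteration is permitted, and the live
# items() view reflects the update when that key is reached.
d = {"a": "a", "b": "b"}
seen = []
for k, v in d.items():
    d["b"] = "d"            # overwrite existing key: allowed
    seen.append((k, v))
assert seen == [("a", "a"), ("b", "d")]

# Inserting (or deleting) keys mid-iteration invalidates the iterator.
raised = False
try:
    for k in d:
        d[k + "_new"] = 1   # size change during iteration
except RuntimeError:
    raised = True
assert raised
```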

@@ -307,6 +351,17 @@ def close(self) -> None:
    def __del__(self) -> None:
        self.close()

    def __enter__(self) -> "FileBackedDict":
Collaborator

if you inherit from our Closeable class, you don't need to manually define __enter__ and __exit__
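A sketch of what such a Closeable mixin presumably provides (hypothetical reconstruction; DataHub's actual class may differ): inherit it, implement close(), and the context-manager dunders come for free.

```python
from typing import Any

class Closeable:
    """Mixin: subclasses implement close(); __enter__/__exit__ are inherited."""

    def close(self) -> None:
        raise NotImplementedError

    def __enter__(self) -> "Closeable":
        return self

    def __exit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:
        self.close()

class Resource(Closeable):
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True

with Resource() as r:
    assert not r.closed
assert r.closed  # __exit__ called close() automatically
```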

@asikowitz
Collaborator Author

asikowitz commented Mar 27, 2023

Latest changes remove the custom .items() in favor of:

  • A cache with a dirty-bit implementation that doesn't write values back if they haven't changed. This is a general performance upgrade, especially around iteration of values or items, where previously we would persist the entire dictionary again even if no changes were made.
    • This further prevents the mutation of values; you can see this in the required change to the Counter test. In general I see this as a neutral-to-good thing, because we don't want to encourage mutation of values. Ideally mutation wouldn't work in any scenario, but that's difficult to guarantee given the cache. We discussed limiting value types to frozen dataclasses, but I think that's out of scope for these changes.
  • An items_snapshot method that skips per-key cache accesses for optimized performance. Merged this with filtered_items to keep the interface clean.
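The dirty-bit idea can be sketched as follows (a minimal toy, with hypothetical names, not the merged implementation): the cache remembers whether each entry was written through __setitem__, and a flush skips clean entries instead of re-serializing everything.

```python
import json
import sqlite3

class DirtyBitCacheDict:
    """Toy SQLite-backed dict whose cache tracks a dirty bit per entry."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE data (key TEXT PRIMARY KEY, value TEXT)")
        self._cache: dict = {}  # key -> (value, dirty)

    def __setitem__(self, key, value) -> None:
        self._cache[key] = (value, True)   # mark dirty on write

    def __getitem__(self, key):
        if key in self._cache:
            return self._cache[key][0]
        row = self._conn.execute(
            "SELECT value FROM data WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        value = json.loads(row[0])
        self._cache[key] = (value, False)  # loaded clean: no write-back needed
        return value

    def flush(self) -> int:
        written = 0
        for key, (value, dirty) in self._cache.items():
            if dirty:                      # skip clean entries entirely
                self._conn.execute(
                    "INSERT OR REPLACE INTO data VALUES (?, ?)",
                    (key, json.dumps(value)),
                )
                written += 1
        self._cache = {k: (v, False) for k, (v, _) in self._cache.items()}
        return written

d = DirtyBitCacheDict()
d["a"] = 1
assert d.flush() == 1  # dirty entry written once
assert d["a"] == 1     # cached read stays clean
assert d.flush() == 0  # nothing re-written on the second flush
```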

@asikowitz asikowitz merged commit c7d35ff into datahub-project:master Mar 27, 2023
@asikowitz asikowitz deleted the file-backed-collections-improvements branch March 27, 2023 21:20
yoonhyejin pushed a commit that referenced this pull request Apr 3, 2023
…actoring (#7689)

- Adds dirty bit to cache, only writes data if dirty
- Refactors __iter__
- Adds sql_query_iterator
- Adds items_snapshot, more performant `items()` that allows for filtering
- Renames connection -> shared_connection
- Removes unnecessary flush during close if connection is not shared
- Adds Closeable mixin