perf(ingest): Improve FileBackedDict iteration performance; minor refactoring #7689
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #7689      +/-   ##
==========================================
- Coverage   74.39%   67.07%    -7.33%
==========================================
  Files         353      353
  Lines       35386    35395        +9
==========================================
- Hits        26327    23740     -2587
- Misses       9059    11655     +2596
```
... and 82 files with indirect coverage changes. View full report in Codecov by Sentry.
```python
cursor = self._conn.execute(f"SELECT key, value FROM {self.tablename}")
for row in cursor:
    if row[0] not in cache_keys:
        yield row[0], self.deserializer(row[1])
```
this means that we're not tracking these objects in the active cache, which means that mutations made against items() won't get saved
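The hazard described here can be sketched with a toy write-back-cache dict (a minimal sketch with hypothetical names, not the actual `FileBackedDict` implementation): rows deserialized straight from SQL during iteration are fresh copies the cache never tracks, so mutating them silently has no effect on the stored data.

```python
import json
import sqlite3

class SqlDict:
    """Toy dict backed by SQLite with an in-memory write-back cache."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE data (key TEXT PRIMARY KEY, value TEXT)")
        self.cache = {}  # dirty entries waiting to be flushed

    def __setitem__(self, key, value):
        self.cache[key] = value

    def flush(self):
        for k, v in self.cache.items():
            self.conn.execute(
                "INSERT OR REPLACE INTO data VALUES (?, ?)", (k, json.dumps(v))
            )
        self.cache.clear()

    def __getitem__(self, key):
        if key in self.cache:
            return self.cache[key]
        row = self.conn.execute(
            "SELECT value FROM data WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0])  # fresh copy, not tracked by the cache

    def items(self):
        yield from self.cache.items()
        for key, value in self.conn.execute("SELECT key, value FROM data"):
            if key not in self.cache:
                yield key, json.loads(value)  # untracked copy


d = SqlDict()
d["a"] = {"n": 1}
d.flush()          # "a" now lives only in SQLite
for _, obj in d.items():
    obj["n"] = 99  # mutates the untracked copy only
print(d["a"]["n"])  # still 1: the mutation was silently dropped
```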
```python
for row in self._conn.execute(
    f"SELECT key, value FROM {self.tablename} WHERE {cond_sql}"
):
    yield row[0], self.deserializer(row[1])
```
same as above - are we requiring no mutations to values during iteration?
Hmm, I thought we were requiring no mutations at all. Allowing mutations during iteration is cool, but it feels dangerous to me -- it works for most cases, but it seems too easy to do something out of the ordinary and get burned, e.g.:
```python
@dataclass
class MyObject:
    name: str
    value: int
    parents: List[str]

my_dict = FileBackedDict[MyObject]()
...
important_entry = my_dict["main"]
for k, obj in my_dict.items():
    if obj.name == important_entry.name:
        important_entry.parents.append(k)
```
Here, we'll catch some but not all of the `important_entry` mutations, once `important_entry` is cache busted.
Or less contrived...
```python
for k, obj in my_dict.items():
    for parent in obj.parents:
        my_dict[parent].value += obj.value
    obj.parents = []
```
If `len(obj.parents) > max_cache_size`, we won't catch the mutation in the last line.
Ok, still don't think we should rely on mutation, but I realize iteration was soooo slow because I convert `cache_keys` to a list instead of a set when materializing it... so I'll just revert these changes because that's the real performance improvement.

EDIT: Or so I thought? Still doing some investigating.
In that case, we should rename to `items_no_mutation` or something, so the constraint will be more clear in the calling code. I'm just trying to prevent incorrect usage of the interface, because it'll become super confusing if bugs happen.
fwiw I'm ok with requiring no mutations during iteration
I'd prefer not to do the renames -- I understand wanting to prevent incorrect usage, but I think we should hopefully be able to say a blanket "you cannot store objects in a FileBackedDict and mutate them" and not require different names. Otherwise, I don't think renaming to `items_no_mutation` is sufficient -- it's not just when iterating through `items` that you can't mutate. You can't mutate during any iteration, so you shouldn't be able to do `for key in my_file_backed_dict`, nor can you mutate during regular key access, e.g. `if key in my_file_backed_dict: ...`. I don't think it's practical to rename all of these usages.
Ah, are you saying we won't catch changes to keys during iteration? e.g.
```python
d = FileBackedDict()
d['a'] = 'a'
d['b'] = 'b'
for k, v in d.items():
    d['b'] = 'd'
    print(k, v)
```
And then if we iterate 'a' first, we should get the update in the next loop?
I do think the `items()` spec is supposed to pick up these changes, which we won't do right now.
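For reference, here is how a plain built-in dict's `items()` view behaves: updating an existing key's value mid-iteration is visible to the view (only adding or removing keys raises `RuntimeError`), which is the behavior a file-backed dict that snapshots rows up front cannot match without extra work.

```python
d = {"a": "a", "b": "b"}
seen = []
for k, v in d.items():
    d["b"] = "d"          # value update, allowed during iteration
    seen.append((k, v))
# "b" was updated before its turn, so the view yields the new value
print(seen)  # [('a', 'a'), ('b', 'd')]
```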
```diff
@@ -307,6 +351,17 @@ def close(self) -> None:
     def __del__(self) -> None:
         self.close()

+    def __enter__(self) -> "FileBackedDict":
```
If you inherit from our `Closeable` class, you don't need to manually define `__enter__` and `__exit__`.
Latest changes remove the custom `__enter__` and `__exit__`.
perf(ingest): Improve FileBackedDict iteration performance; minor refactoring (#7689)

- Adds dirty bit to cache, only writes data if dirty
- Refactors `__iter__`
- Adds `sql_query_iterator`
- Adds `items_snapshot`, a more performant `items()` that allows for filtering
- Renames `connection` -> `shared_connection`
- Removes unnecessary flush during `close` if connection is not shared
- Adds `Closeable` mixin
- Refactors `__iter__`, `.items()` and `.values()`
- Adds `sql_query_iterator` and `filtered_items`
- Renames `connection` -> `shared_connection`
- Removes unnecessary flush during `close` if connection is not shared
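The filtered snapshot iteration mentioned in the changelog could look roughly like this (a hedged sketch; the function name, `cond_sql` parameter, and table layout are illustrative assumptions, not the real API): after flushing the cache, rows are streamed straight from SQLite with an optional WHERE clause.

```python
import json
import sqlite3

def items_snapshot(conn, tablename, cond_sql="TRUE"):
    """Yield (key, value) pairs straight from SQLite, optionally filtered.

    Assumes all pending cache entries have already been flushed, so the
    table is an up-to-date snapshot; mutations to yielded values are NOT
    written back.
    """
    for key, value in conn.execute(
        f"SELECT key, value FROM {tablename} WHERE {cond_sql}"
    ):
        yield key, json.loads(value)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", json.dumps(1)), ("b", json.dumps(2))],
)
print(list(items_snapshot(conn, "t", "key = 'b'")))  # [('b', 2)]
```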