Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up client-side bulk-handling #890

Merged
merged 2 commits into from
Feb 12, 2020

Conversation

danielmitterdorfer
Copy link
Member

With this commit we speed up preparing bulk requests from data files by
implementing several optimizations:

  • Bulk data are passed as a string instead of a list to the runner. This
    avoids the cost of converting the raw list to a string in the Python
    Elasticsearch client.
  • Lines are read in bulk from the data source instead of line by line.
    This avoids many method calls.
  • We provide a special implementation for the common case (ids are
    autogenerated, no conflicts) to make the hot code path as simple as
    possible.

This commit also adds a microbenchmark that measures the speedup. The
following table shows a comparison of the throughput of the bulk reader
for various bulk sizes:

Bulk Size master [ops/s] This PR [ops/s] Speedup
100 14829 92395 6.23
1000 1448 10953 7.56
10000 148 1100 7.43
100000 15 107 7.13

All data have been measured using Python 3.8 on Linux.

With this commit we speed up preparing bulk requests from data files by
implementing several optimizations:

* Bulk data are passed as a string instead of a list to the runner. This
avoids the cost of converting the raw list to a string in the Python
Elasticsearch client.
* Lines are read in bulk from the data source instead of line by line.
This avoids many method calls.
* We provide a special implementation for the common case (ids are
autogenerated, no conflicts) to make the hot code path as simple as
possible.

This commit also adds a microbenchmark that measures the speedup. The
following table shows a comparison of the throughput of the bulk reader
for various bulk sizes:

| Bulk Size | master [ops/s] | This PR [ops/s] | Speedup |
|-----------|----------------|-----------------|---------|
| 100       | 14829          | 92395           | 6.23    |
| 1000      | 1448           | 10953           | 7.56    |
| 10000     | 148            | 1100            | 7.43    |
| 100000    | 15             | 107             | 7.13    |

All data have been measured using Python 3.8 on Linux.
@danielmitterdorfer danielmitterdorfer added enhancement Improves the status quo :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc. labels Feb 11, 2020
@danielmitterdorfer danielmitterdorfer added this to the 1.4.1 milestone Feb 11, 2020
@danielmitterdorfer danielmitterdorfer self-assigned this Feb 11, 2020
@dliappis dliappis added the highlight A substantial improvement that is worth mentioning separately in release notes label Feb 11, 2020
Copy link
Contributor

@dliappis dliappis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incredible performance improvement!

LGTM left a nit and a question.

current_bulk = []
for action_metadata_item, document in zip(self.action_metadata, self.file_source):
# hoist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly I had to look this up as my understanding this is mostly JS terminology whereas we in Python we talk about block scoping.

But what does this comment refer at example? Which variable, in which block, is getting hoisted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that term from JVM optimizations. In any case we ensure that the field access is turned into a local variable access because it is used in the loop on the hot code path and this is what I was referring to here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. For a person like me it doesn't add any additional info (seems obvious from the implementation) but if it's valuable to someone else, that's fine.

self.meta_data_index_no_id = '{"index": {"_index": "%s"}}' % index_name
self.meta_data_index_with_id = '{"index": {"_index": "%s", "_id": "%s"}}\n' % (index_name, "%s")
self.meta_data_update_with_id = '{"update": {"_index": "%s", "_id": "%s"}}\n' % (index_name, "%s")
self.meta_data_index_no_id = '{"index": {"_index": "%s"}}\n' % index_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this might become something to fix when we enable C4001 in pylintrc; should we # pylint: disable=invalid-string-quote right after __init__ (I found a ref here).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we deal with this when we enable C4001?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine.

@danielmitterdorfer danielmitterdorfer merged commit 4044259 into elastic:master Feb 12, 2020
@danielmitterdorfer danielmitterdorfer deleted the faster-bulks branch February 12, 2020 12:19
@danielmitterdorfer
Copy link
Member Author

Thanks for your review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improves the status quo highlight A substantial improvement that is worth mentioning separately in release notes :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants