Speed up client-side bulk-handling #890

danielmitterdorfer · 2020-02-11T06:48:49Z

With this commit we speed up preparing bulk requests from data files by
implementing several optimizations:

* Bulk data are passed as a string instead of a list to the runner. This
  avoids the cost of converting the raw list to a string in the Python
  Elasticsearch client.
* Lines are read in bulk from the data source instead of line by line.
  This avoids many method calls.
* We provide a special implementation for the common case (ids are
  autogenerated, no conflicts) to make the hot code path as simple as
  possible.
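
To illustrate the first two points, here is a minimal sketch and not the code from this PR. The function names, the `io.StringIO` stand-in for a data file and the index name `test` are assumptions made for the example. It contrasts reading one document per `readline()` call into a list with fetching a whole bulk in one step and handing over a single string.

```python
import io
from itertools import islice


def read_bulk_line_by_line(source, docs_per_bulk, action_line):
    """Baseline: one readline() call per document, bulk kept as a list."""
    bulk = []
    for _ in range(docs_per_bulk):
        doc = source.readline()
        if not doc:
            break  # data source exhausted
        bulk.append(action_line)
        bulk.append(doc)
    return bulk


def read_bulk_as_string(source, docs_per_bulk, action_line):
    """Fetch all documents of one bulk in a single step, return one string."""
    docs = list(islice(source, docs_per_bulk))
    # Every line already ends with "\n", so a plain join yields a valid
    # newline-delimited bulk body with no further conversion needed.
    return "".join(action_line + doc for doc in docs)


# Hypothetical data: an in-memory file with three documents.
data = '{"a": 1}\n{"a": 2}\n{"a": 3}\n'
action = '{"index": {"_index": "test"}}\n'

as_list = read_bulk_line_by_line(io.StringIO(data), 2, action)
as_string = read_bulk_as_string(io.StringIO(data), 2, action)
```

Both variants describe the same bulk, but the string variant can be sent as-is, whereas the list still has to be joined somewhere downstream.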

This commit also adds a microbenchmark that measures the speedup. The
following table shows a comparison of the throughput of the bulk reader
for various bulk sizes:

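| Bulk Size | master [ops/s] | This PR [ops/s] | Speedup |
|-----------|----------------|-----------------|---------|
| 100       | 14829          | 92395           | 6.23    |
| 1000      | 1448           | 10953           | 7.56    |
| 10000     | 148            | 1100            | 7.43    |
| 100000    | 15             | 107             | 7.13    |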

All data have been measured using Python 3.8 on Linux.
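
The speedup column is the ratio of the two throughput columns. The short check below recomputes it from the figures in the table; the variable and function names are chosen for this example only.

```python
# Throughput in ops/s, copied from the table: bulk size -> (master, this PR).
throughput = {
    100: (14829, 92395),
    1000: (1448, 10953),
    10000: (148, 1100),
    100000: (15, 107),
}


def speedup(before, after):
    """Ratio of new to old throughput, rounded to two decimals."""
    return round(after / before, 2)


speedups = {size: speedup(*pair) for size, pair in throughput.items()}

for size, factor in speedups.items():
    print(f"bulk size {size}: {factor}x")
```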


dliappis

This is an incredible performance improvement!

LGTM, left a nit and a question.

dliappis · 2020-02-12T11:14:51Z

esrally/track/params.py

        current_bulk = []
-        for action_metadata_item, document in zip(self.action_metadata, self.file_source):
+        # hoist


First, I had to look this up, as my understanding is that this is mostly JS terminology, whereas in Python we talk about block scoping.

But what does this comment refer to, exactly? Which variable, in which block, is getting hoisted?

I know that term from JVM optimizations. In any case we ensure that the field access is turned into a local variable access because it is used in the loop on the hot code path and this is what I was referring to here.

I see. For a person like me it doesn't add any additional info (seems obvious from the implementation) but if it's valuable to someone else, that's fine.
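
For readers unfamiliar with the term as used in this thread, here is a minimal sketch of the idea and not the code from `esrally/track/params.py`. The class and method names are hypothetical. The attribute is read once before the loop, so the loop body works with local variables only.

```python
class BulkReader:
    """Hypothetical reader used only to illustrate hoisting."""

    def __init__(self, docs, action_line):
        self.docs = docs
        self.action_line = action_line

    def read_with_attribute_lookups(self):
        out = []
        for doc in self.docs:
            # self.action_line and out.append are looked up on every iteration.
            out.append(self.action_line)
            out.append(doc)
        return "".join(out)

    def read_with_hoisted_locals(self):
        # Hoisted: resolve the attribute and the bound method once,
        # outside of the hot loop.
        action_line = self.action_line
        out = []
        append = out.append
        for doc in self.docs:
            append(action_line)
            append(doc)
        return "".join(out)


reader = BulkReader(['{"a": 1}\n', '{"a": 2}\n'], '{"index": {}}\n')
plain = reader.read_with_attribute_lookups()
hoisted = reader.read_with_hoisted_locals()
```

Both methods return the same string; only the number of attribute lookups inside the loop differs.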

dliappis · 2020-02-12T11:30:57Z

esrally/track/params.py

-            self.meta_data_index_no_id = '{"index": {"_index": "%s"}}' % index_name
+            self.meta_data_index_with_id = '{"index": {"_index": "%s", "_id": "%s"}}\n' % (index_name, "%s")
+            self.meta_data_update_with_id = '{"update": {"_index": "%s", "_id": "%s"}}\n' % (index_name, "%s")
+            self.meta_data_index_no_id = '{"index": {"_index": "%s"}}\n' % index_name


nit: this might become something to fix when we enable C4001 in pylintrc; should we add `# pylint: disable=invalid-string-quote` right after `__init__`? (I found a ref here.)

How about we deal with this when we enable C4001?

That's fine.
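
As a side note on the templates in the diff above: formatting with `(index_name, "%s")` fills in the index name once and leaves a `%s` placeholder for the id, which can then be filled in per document. A minimal sketch, where the index name `logs` and the id `42` are made up for the example:

```python
import json

index_name = "logs"  # hypothetical index name

# First pass: bake in the index name, keep a "%s" placeholder for the id.
meta_data_index_with_id = '{"index": {"_index": "%s", "_id": "%s"}}\n' % (index_name, "%s")

# Second pass, once per document: fill in the id.
action_line = meta_data_index_with_id % "42"

# The result is a single line of valid JSON followed by a newline.
parsed = json.loads(action_line)
```

One caveat of this two-pass approach is that an index name containing a literal `%` would interfere with the second formatting step.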

danielmitterdorfer · 2020-02-12T12:19:20Z

Thanks for your review!

danielmitterdorfer added the enhancement (Improves the status quo) and :Load Driver (Changes that affect the core of the load driver such as scheduling, the measurement approach etc.) labels Feb 11, 2020

danielmitterdorfer added this to the 1.4.1 milestone Feb 11, 2020

danielmitterdorfer requested a review from dliappis February 11, 2020 06:48

danielmitterdorfer self-assigned this Feb 11, 2020

Handle string bulk body with detailed stats

14c4ef9

dliappis added the highlight (A substantial improvement that is worth mentioning separately in release notes) label Feb 11, 2020

dliappis approved these changes Feb 12, 2020

danielmitterdorfer merged commit 4044259 into elastic:master Feb 12, 2020

danielmitterdorfer deleted the faster-bulks branch February 12, 2020 12:19