Add support for search_after and point-in-time #1190

DJRickyB · 2021-02-18T20:24:03Z

This change adds support for search_after usage and point-in-time interactions. Specifically:

Defines open-point-in-time and close-point-in-time operations, for use in composite contexts
Adds support to the query/search runner for search_after pagination, using a new operation type paginated-search
Defines scroll-search operation type for usability
Moves runner parsing logic to its own module for isolation
Adds tests and documentation for the above
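For readers unfamiliar with search_after, the pagination pattern behind the new paginated-search operation can be sketched roughly as follows. This is a minimal illustration against an in-memory stand-in for Elasticsearch, not the runner code from this PR; `fake_search`, `paginate`, and the sample documents are all hypothetical.

```python
# Hypothetical stand-in for an index; a real search would go to Elasticsearch.
DOCS = [{"timestamp": t, "tie_breaker_id": i} for i, t in enumerate([10, 10, 20, 30, 40])]


def sort_key(doc):
    # The tie breaker makes the sort order total, which search_after relies on.
    return (doc["timestamp"], doc["tie_breaker_id"])


def fake_search(body):
    """Mimics a sorted search: returns up to `size` hits located after `search_after`."""
    hits = sorted(DOCS, key=sort_key)
    after = body.get("search_after")
    if after is not None:
        hits = [d for d in hits if sort_key(d) > tuple(after)]
    page = hits[:body["size"]]
    return {"hits": {"hits": [{"_source": d, "sort": list(sort_key(d))} for d in page]}}


def paginate(body, max_pages):
    """Collects pages by feeding the last hit's sort values into the next request."""
    pages = []
    request = dict(body)
    for _ in range(max_pages):
        hits = fake_search(request)["hits"]["hits"]
        if not hits:
            break
        pages.append(hits)
        # The sort values of the last hit become the cursor for the next page.
        request["search_after"] = hits[-1]["sort"]
    return pages


pages = paginate({"size": 2, "sort": [{"timestamp": "asc"}, {"tie_breaker_id": "asc"}]}, max_pages=10)
```

When a point-in-time is used, the request body would also carry the PIT id, which the tests in this PR expect to be refreshed from each response.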

Bonus:

Refactors the existing query runner to reduce duplicated code
Adds a benchmark file for demonstrating and experimenting with response parsing, such as the parsing behind detailed-results and the properties required for search_after support
A minor documentation fix (corrected get-async-search to delete-async-search)
Style fixes in docs to remove extra whitespace

Closes #1141

.pylintrc

DJRickyB · 2021-02-18T21:48:47Z

docs/track.rst

@@ -883,8 +883,13 @@ Properties
 * ``body`` (mandatory): The query body.
 * ``response-compression-enabled`` (optional, defaults to ``true``): Allows to disable HTTP compression of responses. As these responses are sometimes large and decompression may be a bottleneck on the client, it is possible to turn off response compression.
 * ``detailed-results`` (optional, defaults to ``false``): Records more detailed meta-data about queries. As it analyzes the corresponding response in more detail, this might incur additional overhead which can skew measurement results. This flag is ineffective for scroll queries.
-* ``pages`` (optional): Number of pages to retrieve. If this parameter is present, a scroll query will be executed. If you want to retrieve all result pages, use the value "all".
-* ``results-per-page`` (optional):  Number of documents to retrieve per page for scroll queries.
+* ``results-per-page`` (optional): Number of results to retrieve per page.  This maps to the Search API's ``size`` parameter, and can be used for paginated and non-paginated searches.  Defaults to ``10``


Should this just be called size? There didn't seem to be a good reason to limit this to scroll/pagination, given that it maps to something consequential for a single request_body search.

To me the name results-per-page provides less room for interpretation than size. We could, however, introduce size as a parameter and allow results-per-page as an alias.

Can/should I deprecate results-per-page here?
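The alias idea discussed in this thread could look roughly like the sketch below. It is not code from this PR: `resolve_page_size` is an invented helper, and rejecting requests that set both names is an assumption. The default of 10 follows the documented default in the diff above.

```python
def resolve_page_size(params, default=10):
    """Returns the page size, accepting ``size`` with ``results-per-page`` as an alias."""
    if "size" in params and "results-per-page" in params:
        # Assumption: setting both names is treated as a configuration error.
        raise ValueError("Specify only one of 'size' and 'results-per-page'.")
    return params.get("size", params.get("results-per-page", default))
```

With this shape, existing tracks that use results-per-page would keep working while new tracks could use size.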

esrally/driver/runner.py

esrally/track/params.py

DJRickyB · 2021-02-18T22:00:51Z

esrally/track/params.py

+        target_name = params.get("index")
+        type_name = params.get("type")
+        if not target_name:
+            target_name = params.get("data-stream", default_target)


Is data-stream intentionally undocumented for search? I have tentatively left it undocumented here as well.

Looking at #1092 I see:

Assuming this PR is good, i will address this functionality next and then document together ie. i propose we don't document until #1054 is addressed.

This was intentional originally but I think we should document this property now.
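If the property gets documented, the fallback order in the snippet under discussion is worth spelling out. The sketch below isolates it in a hypothetical helper, `resolve_target`, which wraps the same three lines from the diff; the function name and the sample values are illustrative only.

```python
def resolve_target(params, default_target=None):
    """Prefers ``index``, then ``data-stream``, then the track's default target."""
    target_name = params.get("index")
    if not target_name:
        # Only consulted when no index is given (or it is empty).
        target_name = params.get("data-stream", default_target)
    return target_name
```

Note that an explicit index takes precedence over data-stream when both are present.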

danielmitterdorfer

Thanks for the PR! I did a first pass and left a couple of thoughts.

danielmitterdorfer · 2021-02-22T08:26:01Z

docs/track.rst

To me the name results-per-page provides less room for interpretation than size. We could, however, introduce size as a parameter and allow results-per-page as an alias.

docs/track.rst

esrally/track/params.py

danielmitterdorfer · 2021-02-22T11:53:53Z

esrally/track/params.py

Looking at #1092 I see:

Assuming this PR is good, i will address this functionality next and then document together ie. i propose we don't document until #1054 is addressed.

This was intentional originally but I think we should document this property now.

esrally/track/params.py

setup.py

tests/driver/parsing_test.py

danielmitterdorfer

Thanks for iterating. I did another pass through the code and left a couple of suggestions.

docs/track.rst

esrally/driver/runner.py

esrally/track/track.py

danielmitterdorfer

Thanks for iterating! I left some minor suggestions and corrections on the docs and some formatting. SearchAfterExtractor looks good but I think we can use one shared instance per runner.

danielmitterdorfer · 2021-03-24T13:55:13Z

docs/track.rst

+Meta-data
+"""""""""
+
+The following meta data are always returned:


Is this superfluous?

danielmitterdorfer · 2021-03-24T13:55:56Z

docs/track.rst

+
+The following meta data are always returned:
+
+* ``weight``: "weight" of an operation. Always 1 for regular queries and the number of retrieved pages for scroll queries.


This mentions regular queries but the section is about paginated queries so we can simplify the sentence.

danielmitterdorfer · 2021-03-24T13:56:00Z

docs/track.rst

+The following meta data are always returned:
+
+* ``weight``: "weight" of an operation. Always 1 for regular queries and the number of retrieved pages for scroll queries.
+* ``unit``: The unit in which to interpret ``weight``. Always "ops" for regular queries and "pages" for scroll queries.


This mentions regular queries but the section is about paginated queries so we can simplify the sentence.

danielmitterdorfer · 2021-03-24T13:56:37Z

docs/track.rst

+
+The following meta data are always returned:
+
+* ``weight``: "weight" of an operation. Always 1 for regular queries and the number of retrieved pages for scroll queries.


This mentions regular queries but the section is about scroll queries so we can simplify the sentence.

danielmitterdorfer · 2021-03-24T13:56:51Z

docs/track.rst

+The following meta data are always returned:
+
+* ``weight``: "weight" of an operation. Always 1 for regular queries and the number of retrieved pages for scroll queries.
+* ``unit``: The unit in which to interpret ``weight``. Always "ops" for regular queries and "pages" for scroll queries.


This mentions regular queries but the section is about scroll queries so we can simplify the sentence.

danielmitterdorfer · 2021-03-24T14:01:07Z

docs/track.rst

+
+**Example**
+
+In this example, a point-in-time is opened, used by a ``search_after``-based search operation, and closed::


... used by a paginated-search operation?

I was going for a "this is what we are doing to Elasticsearch" description rather than a "this is what we are doing in Rally" description, but I take this hesitation as sufficient evidence that it should be changed, so I changed it.

danielmitterdorfer · 2021-03-24T14:05:37Z

esrally/driver/runner.py

@@ -799,111 +821,156 @@ async def request_body_query(self, es, params):
        # disable eager response parsing - responses might be huge thus skewing results
        es.return_raw_response()

-        r = await self._raw_search(es, doc_type, index, body, request_params, headers=headers)
+        async def _search_after_query(es, params):
+            extract = SearchAfterExtractor()


This effectively means that the regex is recompiled on every request on the hot code path. As this class is immutable I propose that we instead create one instance of SearchAfterExtractor in the runner's constructor and reuse it.
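A minimal sketch of that proposal, under stated assumptions: the real SearchAfterExtractor is not shown in this thread, so the class below is an illustrative stand-in that compiles a regex in its constructor, and `Query`, `run`, and the pattern itself are hypothetical.

```python
import re


class SearchAfterExtractor:
    """Illustrative stand-in; assumes the real class compiles a regex in __init__."""

    def __init__(self):
        # Construction cost is paid once per instance, so instances should be long-lived.
        self.pit_id_pattern = re.compile(r'"pit_id"\s*:\s*"([^"]+)"')

    def __call__(self, raw_response):
        match = self.pit_id_pattern.search(raw_response)
        return match.group(1) if match else None


class Query:
    """Hypothetical runner that owns one shared, immutable extractor."""

    def __init__(self):
        # Created once in the constructor instead of once per request.
        self._extract = SearchAfterExtractor()

    def run(self, raw_response):
        return self._extract(raw_response)


runner = Query()
first = runner.run('{"pit_id": "0123456789abcdef", "hits": {}}')
second = runner.run('{"pit_id": "fedcba9876543211", "hits": {}}')
```

Because the extractor holds no per-request state, sharing it across requests in the same runner is safe.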

danielmitterdorfer · 2021-03-24T14:07:21Z

esrally/driver/runner.py

+                        r = await self._raw_search(es, doc_type, index, body, params, headers=headers)
+
+                        props = parse(r,
+                                              ["_scroll_id", "hits.total", "hits.total.value", "hits.total.relation",


The formatting of parameters looks funny here. Is this intentional?

danielmitterdorfer · 2021-03-24T14:58:36Z

tests/driver/runner_test.py

+            # make sure pit_id is updated afterward
+            self.assertEqual("fedcba9876543211", runner.CompositeContext.get(pit_op))
+
+        es.transport.perform_request.assert_has_calls([mock.call('GET', '/_search', params={},


Nit: Can you please check that we use double-quotes everywhere consistently?

danielmitterdorfer · 2021-03-24T15:04:17Z

tests/driver/runner_test.py

+            self.assertEqual("fedcba9876543211", runner.CompositeContext.get(pit_op))
+
+        es.transport.perform_request.assert_has_calls([mock.call('GET', '/_search', params={},
+                                                                 body={'query': {'match-all': {}},


Nit: Personally, I find the following formatting easier to read:

    body={
        'query': {
            'match-all': {}
        },
        'sort': [{
            'timestamp': 'asc',
            'tie_breaker_id': 'asc'
        }],
        'size': 2,
        'pit': {
            'id': '0123456789abcdef',
            'keep_alive': '1m'
        }
    },

I'm fine if we keep it as is but I want to offer it as a thought.

(and i'm all out of nits)

danielmitterdorfer

Thanks for iterating. Looks good now! :)

Rick Boyd added 15 commits December 30, 2020 08:12

update elasticsearch-py

3ad8562

supports point-in-time APIs and Query with search-after pagination in…

80b4390

… runner.py

add track support for search-after and point-in-time runners

0f13c1f

Merge remote-tracking branch 'upstream/master' into point-in-time

2744c76

fix tests

8744690

checkpoint

f135519

refactor query runner

9ee6e01

revert mandatory() changes

fdea4b9

update parameter sources for point-in-time api

e8669b0

refactor changes to query runner

169e3e2

bug fixes for point-in-time API runners

d6c6c10

achieve parity to other query types for search-after detail in results

f9deeba

remove pit_id from CompositeContext when PIT is closed

00e1e7a

fix tests'

49a4449

add documentation for point-in-time and search_after querying

f5c2e49

DJRickyB added the enhancement, :Track Management, and highlight labels Feb 18, 2021

DJRickyB added this to the 2.1.0 milestone Feb 18, 2021

DJRickyB requested a review from gingerwizard February 18, 2021 20:24

DJRickyB self-assigned this Feb 18, 2021

Merge remote-tracking branch 'upstream/master' into point-in-time

ab0a4f3