
Allow to manage Elasticsearch nodes separately from benchmarking #697

Closed
7 tasks done
danielmitterdorfer opened this issue May 23, 2019 · 0 comments · Fixed by #830

danielmitterdorfer commented May 23, 2019

Currently, Rally can be used both as a load generator and to manage Elasticsearch instances (building a distribution from source, installing and configuring Elasticsearch, and launching / stopping it). It is possible to use Rally as a standalone load generator (--pipeline=benchmark-only), but it is not possible to use Rally only to manage Elasticsearch nodes. To support that, we should expose the provisioning and launch phases as separate subcommands in Rally. We will only support a local mode though, i.e. it will not be possible to set up a whole cluster from the coordinator node; instead, users must invoke Rally on each of the target nodes. As a corollary, each command applies to only one node.

We will add the following subcommands:

  • provision: Provisions a single node. The artefact may either be built from source or be a downloaded distribution.
  • start: Starts an already provisioned node.
  • stop: Stops an already started node.

Implementation note: None of these commands should rely on the actor system and instead run the commands in the main Rally process.

Provisioning

This covers all steps from retrieving the binary (either downloading it or building it via Gradle) until a fully configured node is installed. Note that there are essentially no new steps; we merely expose them as a dedicated subcommand. One tricky aspect is that we are currently able to create a unique node name by incrementing a counter (rally-node-{N}), see also:

def nodes_by_host(ip_port_pairs):
    nodes = {}
    node_id = 0
    for ip_port in ip_port_pairs:
        if ip_port not in nodes:
            nodes[ip_port] = []
        nodes[ip_port].append(node_id)
        node_id += 1
    return nodes

This will no longer be possible with the provision subcommand as it does not have a global view of the cluster but only a per-node view.
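
Given the per-node view, one option is to generate the id independently on each node. This is only a hypothetical sketch (the issue does not prescribe a scheme); `generate_node_id` is an illustrative name:

```python
import uuid

def generate_node_id():
    # Hypothetical scheme: a globally unique id avoids the need for the
    # cluster-wide rally-node-{N} counter, since no coordination is required.
    return uuid.uuid4().hex

node_id = generate_node_id()
print(node_id)  # $ID is written to stdout so it can be passed to start / stop
```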

Post condition

After the provisioner has run successfully, the following conditions are met:

  • An Elasticsearch node is installed in a well-known directory ~/.rally/benchmarks/nodes/$ID/install where $ID is a globally unique id generated by the provisioner.
  • A meta-data file has been written to ~/.rally/benchmarks/nodes/$ID/provisioner.json which contains the necessary data that needs to be exchanged between the provisioner and the launcher (see NodeConfiguration for data that are exchanged between those two).
  • $ID is written to stdout. This id can then be used later on to start and stop this node.
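
The post conditions above can be sketched as follows. This is a simplified illustration, not Rally's implementation: `root` stands in for ~/.rally/benchmarks/nodes, the actual install step is elided, and the metadata keys are made up rather than taken from NodeConfiguration:

```python
import json
import os
import tempfile
import uuid

def provision(root):
    # Hypothetical sketch of the provisioner's post conditions.
    node_id = uuid.uuid4().hex
    install_dir = os.path.join(root, node_id, "install")
    os.makedirs(install_dir)
    # ... download or build Elasticsearch and configure it in install_dir ...
    node_config = {"node-id": node_id, "binary-path": install_dir}
    # metadata exchanged between provisioner and launcher
    with open(os.path.join(root, node_id, "provisioner.json"), "w") as f:
        json.dump(node_config, f)
    print(node_id)  # the id the user passes to start / stop later
    return node_id

node_id = provision(tempfile.mkdtemp())
```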

Note: It might make sense to specify all telemetry devices as part of the provisioning process, regardless of when they get attached (some are only attached on launch), and exchange the necessary information via the provisioner metadata.

Start

This will launch a single node. The launcher process will not be a parent process of Elasticsearch (this is a change from the previous behavior) and will instead terminate once the node has successfully started. As input it will use the provisioner id to read the provisioner metadata (NodeConfiguration) to start up the node. Depending on the launcher type we may need to persist the PID of the Elasticsearch process. For plain vanilla Elasticsearch we might just run it as a daemon. Any files should go to ~/.rally/benchmarks/races/$ID if needed.

Post condition

After the launcher has run successfully, the following conditions are met:

  • An Elasticsearch process is running
  • Its PID is recorded in a well-known location that can be retrieved again via $ID.
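
A minimal sketch of this flow, under the same assumptions as above (illustrative file names "provisioner.json" and "es.pid", and a made-up "start-command" key rather than Rally's actual NodeConfiguration format):

```python
import json
import os
import subprocess

def start(root, node_id):
    # Hypothetical sketch: read the metadata written by the provisioner
    # and launch the node detached from the launcher process.
    with open(os.path.join(root, node_id, "provisioner.json")) as f:
        node_config = json.load(f)
    proc = subprocess.Popen(node_config["start-command"])
    # record the PID in a well-known location so stop can find it via $ID
    with open(os.path.join(root, node_id, "es.pid"), "w") as f:
        f.write(str(proc.pid))
    # the launcher terminates here; it does not stay the parent of Elasticsearch
    return proc.pid
```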

Stop

This will stop a node that has previously been started. Unless --preserve-install=true is specified on the command line, the node's installation and data directories will be cleaned up. Logs and telemetry data should always be kept (none of this is a change from previous behavior).

Post condition

  • The corresponding Elasticsearch process is stopped
  • All files except any telemetry data and logs have been deleted (unless --preserve-install=true).
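
Sketched under the same illustrative assumptions as the start example (an "es.pid" file holding the recorded PID; removal of logs and telemetry data elided):

```python
import os
import shutil
import signal

def stop(root, node_id, preserve_install=False):
    # Hypothetical sketch of the stop post conditions.
    with open(os.path.join(root, node_id, "es.pid")) as f:
        pid = int(f.read())
    os.kill(pid, signal.SIGTERM)
    if not preserve_install:
        # logs and telemetry data would be kept; the installation
        # (and data) directory is removed
        shutil.rmtree(os.path.join(root, node_id, "install"),
                      ignore_errors=True)
```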

Preparatory work

As a preparation we should:

Tasks

Follow-up work

We will continue to support the actor-system-based approach for now, but we intend to remove this coordination layer at some point and let users choose freely how they want to manage coordination (e.g. via Ansible), as this also allows for more complex setups. Doing so will require additional preparation, so we will tackle it in a separate issue.

@danielmitterdorfer danielmitterdorfer added enhancement Improves the status quo meta A high-level issue of a larger topic which requires more fine-grained issues / PRs :Telemetry Telemetry Devices that gather additional metrics :Benchmark Candidate Management Anything affecting how Rally sets up Elasticsearch highlight A substantial improvement that is worth mentioning separately in release notes labels May 23, 2019
@danielmitterdorfer danielmitterdorfer added this to the 1.2.0 milestone May 23, 2019
@drawlerr drawlerr self-assigned this May 23, 2019
@danielmitterdorfer danielmitterdorfer modified the milestones: 1.2.0, 1.2.1 Jun 7, 2019
ebadyano added a commit to ebadyano/rally that referenced this issue Jul 3, 2019
Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to elastic#697
ebadyano added a commit that referenced this issue Jul 25, 2019
Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to #697
ebadyano added a commit to ebadyano/rally that referenced this issue Jul 31, 2019
By using ES_JAVA_OPTS we can provision a node, run a benchmark, and then
“dynamically” (i.e. without reprovisioning) start the node again with
telemetry attached.

Relates to elastic#697
Relates to elastic#711
ebadyano added a commit that referenced this issue Aug 20, 2019
By using ES_JAVA_OPTS we can provision a node, run a benchmark, and then
“dynamically” (i.e. without reprovisioning) start the node again with
telemetry attached.

Relates to #697
Relates to #711
@danielmitterdorfer danielmitterdorfer modified the milestones: 1.3.0, 1.4.0 Sep 4, 2019
danielmitterdorfer added a commit to danielmitterdorfer/rally-eventdata-track that referenced this issue Oct 1, 2019
With this commit we add a smoke test script that allows to run a
benchmark in test mode against (almost) all challenges in this track. A
few challenges have been excluded intentionally because they rely on
other challenges being run first. While it would be possible to make
this work with workarounds we should wait for a proper solution with
elastic/rally#697
danielmitterdorfer added a commit that referenced this issue Oct 2, 2019
With this commit we gather cluster-level metrics in the driver instead of the
mechanic. As these metrics are gathered via API calls there is no need to gather
them on the very same machine were an Elasticsearch node is running. Instead, it
defines clearer boundaries in between these two components.

Relates #697
Relates #779
danielmitterdorfer added a commit to elastic/rally-eventdata-track that referenced this issue Oct 7, 2019
With this commit we add a smoke test script that allows to run a
benchmark in test mode against (almost) all challenges in this track. A
few challenges have been excluded intentionally because they rely on
other challenges being run first. While it would be possible to make
this work with workarounds we should wait for a proper solution with
elastic/rally#697

Relates #47
@danielmitterdorfer danielmitterdorfer self-assigned this Oct 11, 2019
danielmitterdorfer added a commit that referenced this issue Dec 4, 2019
With this commit we introduce three new subcommands to Rally:

* `install`: To install a single Elasticsearch node locally
* `start`: To start an Elasticsearch node that has been previously installed
* `stop`: To stop a running Elasticsearch node

To run a benchmark, users first issue `install`, followed by `start` on all
nodes. Afterwards, the benchmark is run using the `benchmark-only` pipeline.
Finally, the `stop` command is invoked on all nodes to shutdown the cluster.

To ensure that system metrics are stored consistently (i.e. they contain the
same metadata like race id and race timestamp), we expose the race id as a
command line parameter and defer writing any system metrics until the `stop`
command is invoked. We attempt to read race metadata from the Elasticsearch
metrics store for that race id which have been written earlier by the benchmark
and merge the metadata when we write the system metrics.

The current implementation is considered a new experimental addition to the
existing mechanism to manage clusters with the intention to eventually replace
it. The command line interface is specific to Zen discovery and subject to
change as we learn more about its use.

Relates #830
Closes #697