
Allow to manage Elasticsearch nodes separately from benchmarking #697

Closed
7 tasks done
danielmitterdorfer opened this issue May 23, 2019 · 0 comments · Fixed by #830

danielmitterdorfer commented May 23, 2019

Currently, Rally can be used both as a load generator and to manage Elasticsearch instances (building a distribution from source, installing and configuring Elasticsearch, and launching / stopping it). It is possible to use Rally as a standalone load generator (--pipeline=benchmark-only), but it is not possible to use Rally only to manage Elasticsearch nodes. To support that, we should expose the provisioning and launch phases as separate subcommands in Rally. We will only support a local mode though, i.e. it will not be possible to set up a whole cluster from the coordinator node; instead, users must invoke Rally on each of the target nodes. As a corollary, each command applies to only one node.

We will add the following subcommands:

  • provision: Provisions a single node. The artefact may either be built from source or be a downloaded distribution.
  • start: Starts an already provisioned node.
  • stop: Stops an already started node.

Implementation note: None of these commands should rely on the actor system and instead run the commands in the main Rally process.

Provisioning

This covers all steps from retrieving the binary (either downloading it or building it via Gradle) until a fully configured node is installed. Note that there are essentially no new steps; we merely expose them as a dedicated subcommand. One tricky aspect is that we are currently able to create a unique node name by incrementing a counter (rally-node-{N}), see also:

def nodes_by_host(ip_port_pairs):
    nodes = {}
    node_id = 0
    for ip_port in ip_port_pairs:
        if ip_port not in nodes:
            nodes[ip_port] = []
        nodes[ip_port].append(node_id)
        node_id += 1
    return nodes

This will no longer be possible with the provision subcommand as it does not have a global view of the cluster but only a per-node view.
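
Given the per-node view, one option is to generate the id independently on each node. This is only a hypothetical sketch (the issue does not prescribe a scheme); `generate_node_id` is an illustrative name:

```python
import uuid

def generate_node_id():
    # Hypothetical scheme: a globally unique id avoids the need for the
    # cluster-wide rally-node-{N} counter, since no coordination is required.
    return uuid.uuid4().hex

node_id = generate_node_id()
print(node_id)  # $ID is written to stdout so it can be passed to start / stop
```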

Post condition

After the provisioner has run successfully, the following conditions are met:

  • An Elasticsearch node is installed in a well-known directory ~/.rally/benchmarks/nodes/$ID/install where $ID is a globally unique id generated by the provisioner.
  • A meta-data file has been written to ~/.rally/benchmarks/nodes/$ID/provisioner.json which contains the necessary data that needs to be exchanged between the provisioner and the launcher (see NodeConfiguration for data that are exchanged between those two).
  • $ID is written to stdout. This id can then be used later on to start and stop this node.
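
The post conditions above can be sketched as follows. This is a simplified illustration, not Rally's implementation: `root` stands in for ~/.rally/benchmarks/nodes, the actual install step is elided, and the metadata keys are made up rather than taken from NodeConfiguration:

```python
import json
import os
import tempfile
import uuid

def provision(root):
    # Hypothetical sketch of the provisioner's post conditions.
    node_id = uuid.uuid4().hex
    install_dir = os.path.join(root, node_id, "install")
    os.makedirs(install_dir)
    # ... download or build Elasticsearch and configure it in install_dir ...
    node_config = {"node-id": node_id, "binary-path": install_dir}
    # metadata exchanged between provisioner and launcher
    with open(os.path.join(root, node_id, "provisioner.json"), "w") as f:
        json.dump(node_config, f)
    print(node_id)  # the id the user passes to start / stop later
    return node_id

node_id = provision(tempfile.mkdtemp())
```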

Note: It might make sense to specify all telemetry devices as part of the provisioning process, regardless of when they get attached (some are only attached on launch), and exchange the necessary information via the provisioner metadata.

Start

This will launch a single node. The launcher process will not be a parent process of Elasticsearch (this is a change from the previous behavior) and will instead terminate once the node has successfully started. As input it will use the provisioner id to read the provisioner metadata (NodeConfiguration) to start up the node. Depending on the launcher type we may need to persist the PID of the Elasticsearch process. For plain vanilla Elasticsearch we might just run it as a daemon. Any files should go to ~/.rally/benchmarks/races/$ID if needed.

Post condition

After the launcher has run successfully, the following conditions are met:

  • An Elasticsearch process is running
  • Its PID is recorded in a well-known location that can be retrieved again via $ID.
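
A minimal sketch of this flow, under the same assumptions as above (illustrative file names "provisioner.json" and "es.pid", and a made-up "start-command" key rather than Rally's actual NodeConfiguration format):

```python
import json
import os
import subprocess

def start(root, node_id):
    # Hypothetical sketch: read the metadata written by the provisioner
    # and launch the node detached from the launcher process.
    with open(os.path.join(root, node_id, "provisioner.json")) as f:
        node_config = json.load(f)
    proc = subprocess.Popen(node_config["start-command"])
    # record the PID in a well-known location so stop can find it via $ID
    with open(os.path.join(root, node_id, "es.pid"), "w") as f:
        f.write(str(proc.pid))
    # the launcher terminates here; it does not stay the parent of Elasticsearch
    return proc.pid
```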

Stop

This will stop a node that has previously been started. Unless --preserve-install=true is specified on the command line, the node's installation and data directories will be cleaned up. Logs and telemetry data should always be kept (none of this is a change from previous behavior).

Post condition

  • The corresponding Elasticsearch process is stopped
  • All files except any telemetry data and logs have been deleted (unless --preserve-install=true).
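
Sketched under the same illustrative assumptions as the start example (an "es.pid" file holding the recorded PID; removal of logs and telemetry data elided):

```python
import os
import shutil
import signal

def stop(root, node_id, preserve_install=False):
    # Hypothetical sketch of the stop post conditions.
    with open(os.path.join(root, node_id, "es.pid")) as f:
        pid = int(f.read())
    os.kill(pid, signal.SIGTERM)
    if not preserve_install:
        # logs and telemetry data would be kept; the installation
        # (and data) directory is removed
        shutil.rmtree(os.path.join(root, node_id, "install"),
                      ignore_errors=True)
```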

Preparatory work

As a preparation we should:

Tasks

Follow-up work

We will continue to support the actor-system-based approach for now, but we intend to remove this coordination layer at some point and let users choose freely how they want to manage coordination (e.g. via Ansible), as this also allows for more complex setups. Doing so will require additional preparation, so we will tackle it in a separate issue.

@danielmitterdorfer danielmitterdorfer added enhancement Improves the status quo meta A high-level issue of a larger topic which requires more fine-grained issues / PRs :Telemetry Telemetry Devices that gather additional metrics :Benchmark Candidate Management Anything affecting how Rally sets up Elasticsearch highlight A substantial improvement that is worth mentioning separately in release notes labels May 23, 2019
@danielmitterdorfer danielmitterdorfer added this to the 1.2.0 milestone May 23, 2019
@drawlerr drawlerr self-assigned this May 23, 2019
@danielmitterdorfer danielmitterdorfer modified the milestones: 1.2.0, 1.2.1 Jun 7, 2019
ebadyano added a commit to ebadyano/rally that referenced this issue Jul 3, 2019
Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to elastic#697
ebadyano added a commit that referenced this issue Jul 25, 2019
Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to #697
ebadyano added a commit to ebadyano/rally that referenced this issue Jul 31, 2019
By using ES_JAVA_OPTS we can provision a node, run a benchmark, and then
“dynamically” (i.e. without reprovisioning) start the node again with
telemetry attached.

Relates to elastic#697
Relates to elastic#711
ebadyano added a commit that referenced this issue Aug 20, 2019
By using ES_JAVA_OPTS we can provision a node, run a benchmark, and then
“dynamically” (i.e. without reprovisioning) start the node again with
telemetry attached.

Relates to #697
Relates to #711
@danielmitterdorfer danielmitterdorfer modified the milestones: 1.3.0, 1.4.0 Sep 4, 2019
danielmitterdorfer added a commit to danielmitterdorfer/rally-eventdata-track that referenced this issue Oct 1, 2019
With this commit we add a smoke test script that allows to run a
benchmark in test mode against (almost) all challenges in this track. A
few challenges have been excluded intentionally because they rely on
other challenges being run first. While it would be possible to make
this work with workarounds we should wait for a proper solution with
elastic/rally#697
danielmitterdorfer added a commit that referenced this issue Oct 2, 2019
With this commit we gather cluster-level metrics in the driver instead of the
mechanic. As these metrics are gathered via API calls there is no need to gather
them on the very same machine were an Elasticsearch node is running. Instead, it
defines clearer boundaries in between these two components.

Relates #697
Relates #779
danielmitterdorfer added a commit to elastic/rally-eventdata-track that referenced this issue Oct 7, 2019
With this commit we add a smoke test script that allows to run a
benchmark in test mode against (almost) all challenges in this track. A
few challenges have been excluded intentionally because they rely on
other challenges being run first. While it would be possible to make
this work with workarounds we should wait for a proper solution with
elastic/rally#697

Relates #47
@danielmitterdorfer danielmitterdorfer self-assigned this Oct 11, 2019
danielmitterdorfer added a commit that referenced this issue Dec 4, 2019
With this commit we introduce three new subcommands to Rally:

* `install`: To install a single Elasticsearch node locally
* `start`: To start an Elasticsearch node that has been previously installed
* `stop`: To stop a running Elasticsearch node

To run a benchmark, users first issue `install`, followed by `start` on all
nodes. Afterwards, the benchmark is run using the `benchmark-only` pipeline.
Finally, the `stop` command is invoked on all nodes to shutdown the cluster.

To ensure that system metrics are stored consistently (i.e. they contain the
same metadata like race id and race timestamp), we expose the race id as a
command line parameter and defer writing any system metrics until the `stop`
command is invoked. We attempt to read race metadata from the Elasticsearch
metrics store for that race id which have been written earlier by the benchmark
and merge the metadata when we write the system metrics.

The current implementation is considered a new experimental addition to the
existing mechanism to manage clusters with the intention to eventually replace
it. The command line interface is specific to Zen discovery and subject to
change as we learn more about its use.

Relates #830
Closes #697