Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add container metric fields (from ECS) #282

Merged
merged 15 commits into from
Mar 27, 2024

Conversation

ChrsMark
Copy link
Member

@ChrsMark ChrsMark commented Aug 24, 2023

This PR adds container related metrics fields as part of #72.

Continues #87 on top of the project restructuring.

Based on the comparison between ECS and Otel collector the result is the following (see #282 (comment)).

Summary of added fields

  1. container.cpu.time (ok for the Collector implementation: Add container metric fields (from ECS) #282 (comment))
  2. container.memory.usage (more accurate and aligned with system metrics)
  3. container.disk.io which has as dimension the already existent disk.io.direction (Otel collector only provides the Linux specific metric and hence a generic one capable to cover Windows as well would be better)
  4. container.network.io which has as dimension the already existent network.io.direction (better alignment with disko.io and use of dimension).

All of those also refer to container.id as attribute as well.

@ChrsMark ChrsMark requested review from a team August 24, 2023 09:10
@ChrsMark ChrsMark force-pushed the container_metrics branch 2 times, most recently from b7d690a to 451f3b2 Compare August 24, 2023 12:31
Copy link
Contributor

@jsuereth jsuereth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a prototype/implementation of these metrics somewhere (e.g. in collector contrib?)

model/metrics/container.yaml Outdated Show resolved Hide resolved
@ChrsMark
Copy link
Member Author

ChrsMark commented Oct 4, 2023

Is there a prototype/implementation of these metrics somewhere (e.g. in collector contrib?)

The Collector implementation can be found at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/dockerstatsreceiver/documentation.md but the naming is more or less different. The respective Beats implementation that honors the ECS fields can be found as an example at https://github.com/elastic/beats/blob/89bcc33a9a90cbacceaa61fceec0e8d23073574c/metricbeat/module/docker/cpu/data.go#L65.

I will come with a comparison matrix soon to help us understand what are our alternatives/options here.

@ChrsMark
Copy link
Member Author

So I have checked what ECS provides in combination with the Otel Collector's implementation. Here is summary:

ECS Otel collector Notes
container.cpu.usage container.cpu.utilization In otelcollector container.cpu.usage is a namespace
container.memory.usage container.memory.percent container.memory.usage is a namespace
container.disk.read/write.bytes container.blockio.io_service_bytes_recursive Otel collector only retrieve the Linux specific metric.
container.network.ingress.bytes container.network.io.usage.rx_bytes
container.network.egress.bytes container.network.io.usage.tx_bytes

So based on the above I propose the following naming:

  1. container.cpu.utilization (does not break the Collector)
  2. container.memory.utilization (more accurate and aligned with system metrics and cpu.utilazation)
  3. container.disk.io.bytes and container.disk.io.direction (Otel collector only provides the Linux specific metric and hence a generic one capable to cover Windows as well would be better)
  4. container.network.io.bytes and container.network.io.direction (better alignment with disko.io and use of dimension).

@ChrsMark
Copy link
Member Author

Pushed latest changes based on #282 (comment).
@jsuereth @trask feel free to have another look.

@ChrsMark
Copy link
Member Author

Will need to use the network and disk attributes from #530

@ChrsMark
Copy link
Member Author

ChrsMark commented Jan 2, 2024

@rmfitzpatrick @jamesmoessis since you have been maintaining the dockerstatsreceiver could you provide your feedback here?

@open-telemetry/semconv-container-approvers please have a look as well.

@ChrsMark ChrsMark self-assigned this Jan 9, 2024
@dmitryax
Copy link
Member

dmitryax commented Jan 9, 2024

Why don't we start with usage metrics for memory and CPU instead of the utilization? The utilization seems less important because it can always be derived from usage divided by limit. container.cpu.utilization is also confusing without knowing what the limit is. There is a PR to migrate from in the kubelet receiver.

In otelcollector container.cpu.usage is a namespace

I see it's implemented in the docker receiver for metrics like container.cpu.usage.kernelmode, but I don't think it's the right approach. I think the metric should be container.cpu.usage with kernelmode as an attribute value.

Also, usage for CPU is an average aggregation over time, so it may be more important to define container.cpu.time in the first place as a cumulative sum. It'll be consistent with the system metric system.cpu.time

@ChrsMark
Copy link
Member Author

Thank's @dmitryax Ilike the idea of being aligned with kubeletstats as well.
If I understand this correctly your proposal is to add the following:

  1. container.cpu.time and container.cpu.state
  2. container.memory.usage

Also do you propose to skip the container.cpu/memory.utilization metrics completely for now, or we can add them additionally? I would lean towards adding them anyways even as optional?

@dmitryax
Copy link
Member

Right. But I'm not sure about container.cpu.state. The docker stats receiver reports the following metrics:

  container.cpu.usage.system
  container.cpu.usage.total
  container.cpu.usage.kernelmode
  container.cpu.usage.usermode
  container.cpu.usage.percpu

All these metrics correspond 1:1 to the info that the docker provides. We cannot get more granular data to provide metrics with state and logical number attributes as we do for system.cpu.time.

The kubelet, on the other hand, doesn't provide any data other than plain container time/usage values. So we don't have any attributes for the container.cpu.time and container.cou.usage metrics emitted by the kubelet receiver.

I'm curious how https://www.elastic.co/guide/en/ecs/current/ecs-container.html#field-container-cpu-usage is used in the existing implementations. Is it used to represent the docker stats?

Maybe we can just define container.cpu.time without any attributes defined in the semantic convention?

Another option is to change the docker receiver to emit container.cpu.time but without an option to have both state and logical number attributes: only one of them or none.

@ChrsMark
Copy link
Member Author

I'm curious how https://www.elastic.co/guide/en/ecs/current/ecs-container.html#field-container-cpu-usage is used in the existing implementations. Is it used to represent the docker stats?

Yes @dmitryax, this metric is used to represent the total CPU usage calculated from docker stats.
In Metricbeat, which implements these metrics, we have the following:

docker.cpu.kernel.*
docker.cpu.system.*
docker.cpu.user.*
docker.cpu.total.*
docker.cpu.core.*

The implementation can be found at https://github.com/elastic/beats/blob/13ec0b30f5502720ff5079853e6f437478348ec7/metricbeat/module/docker/cpu/data.go#L32-L61

A sample document can be found at https://github.com/elastic/beats/blob/main/metricbeat/module/docker/cpu/_meta/data.json.

From these the docker.cpu.total.norm.pct is also exposed as the container.cpu.usage ECS field as we can see at https://github.com/elastic/beats/blob/13ec0b30f5502720ff5079853e6f437478348ec7/metricbeat/module/docker/cpu/data.go#L65. The reason is that only this one was considered as interesting enough to be promoted as an ECS field/metric so far.

So I think that all implementations can emit this base metric called container.cpu.usage with container.cpu.state to be set to total as the default. With that, kubeletstats can use it without an issue as it would always emit container.cpu.state: total. Would that make sense?

@jamesmoessis
Copy link

I agree with what @dmitryax is saying. We can change the dockerstats receiver to emit container.cpu.time, in fact, this is how I would've wanted it to be implemented in the first place so it aligns with the system semconv, but didn't change it since it was there from older existing implementations. Since it's still in alpha, it wouldn't be a bad idea to change it sooner rather than later.

In terms of the state attribute, would it make sense to mark it as conditionally required if available? Then implementations can choose whether to add the attribute.

I think for CPU logical_number we can either :

  1. Have this in a separate metric to container.cpu.time. Example container.cpu.time_percpu. This is similar to how the docker stats receiver works currently. I'm aware this deviates somewhat to the system semconv.
  2. Have it as an optional attribute. In my experience I disable this because the cost of cardinality at scale is not worth the seldom useful granularity it provides.

Also do you propose to skip the container.cpu/memory.utilization metrics completely for now, or we can add them additionally

I propose to skip this for now. Reduce the scope of this PR and it will be closed out faster. This PR should have bare minimum then we can add more in later PRs.

Copy link
Member

@joaopgrassi joaopgrassi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving given my comment above, and also that IIUC collector folks are also on par with the changes.

docs/system/container-metrics.md Show resolved Hide resolved
docs/system/container-metrics.md Show resolved Hide resolved
model/metrics/container.yaml Show resolved Hide resolved
docs/system/container-metrics.md Outdated Show resolved Hide resolved
model/metrics/container.yaml Outdated Show resolved Hide resolved
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
@ChrsMark
Copy link
Member Author

@joaopgrassi @lmolkova thank's for reviewing! I have tried to address or answer your comments. Could have another look please?

Copy link
Contributor

@lmolkova lmolkova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments, but LGTM.

Heads up on #820 - we want to merge it (hopefully very soon) and once it's in stability will be required for all metrics, attributes, and enum members.

@ChrsMark
Copy link
Member Author

I have filed #840 to keep track of #282 (comment). Beyond this, I believe that's ready to go now @open-telemetry/specs-semconv-maintainers :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.