Expand section on profilers (perf and VTune) #381

Open
wants to merge 1 commit into
base: master
from

Conversation

amadio
Contributor

@amadio amadio commented Nov 1, 2022

I've focused more on perf than VTune, but this is intended to close #43. I think the online documentation for VTune is good enough that we can just point students there. However, if you think the VTune section should be expanded further, let me know.

@amadio amadio self-assigned this Nov 1, 2022
Contributor

@bernhardmgruber bernhardmgruber left a comment

I wonder whether the presented content is too detailed. But I would let other people comment on this.

@amadio
Contributor Author

amadio commented Nov 1, 2022

I wonder whether the presented content is too detailed. But I would let other people comment on this.

You can always skip what you don't need, but the content is useful for people just looking at the slides as a reference. That said, @hageboeck had the same concern.

Contributor

@sponce sponce left a comment

Concerning the level of detail: although it's a bit too much for perf, I would keep the slides on list/stat/record/report, as I find it nice to have one feature per slide. Maybe a couple of the complex examples can be removed, but on the other hand, it's a nice reference and we do not need to go through all the details when we give the course.

Comment on lines +51 to +75
\begin{minted}{shell-session}
$ perf
usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
The most commonly used perf commands are:
annotate Read perf.data and display annotated code
c2c Shared Data C2C/HITM Analyzer.
config Get and set variables in a configuration file.
diff Read perf.data and display the differential profile
evlist List the event names in a perf.data file
list List all symbolic event types
mem Profile memory accesses
record Run a command and record its profile into perf.data
report Read perf.data and display the profile
sched Tool to trace/measure scheduler properties (latencies)
script Read perf.data and display trace output
stat Run command and gather performance counter statistics
top System profiling tool.
version display the version of perf binary
probe Define new dynamic tracepoints
trace strace inspired tool
See 'perf help COMMAND' for more information on a specific command.
\end{minted}
\end{block}
}
\end{frame}
Contributor

Is this useful? I think I would drop it.

Contributor Author

@amadio amadio Nov 4, 2022

I use a similar slide to this to give a general overview of perf in my own presentations, mentioning that there are more commands than the ones I cover. If you don't want to go into details, this could be a useful slide for that. However, other than that, it's probably fine to drop. I did have to shorten the description of the commands to fit in the slide anyway, so this is not quite what you'd get by running perf without arguments.

Contributor

On first thought I also found this too much. On second thought, yeah, why shouldn't we leave an overview here.

Contributor

The point is that this slide would be systematically skipped when you present. So if it's a pure reference, then let's put it in a reference section at the very end. Otherwise, let's drop it.

mentioning that there are more commands than the ones I cover

Useful indeed, but then I would mention that there are a lot of commands, not list them.

Contributor Author

It seems that most people don't think it's useful, so I will drop this slide.

Comment on lines +219 to 231
\begin{frame}[fragile]
\frametitle{Intel VTune Profiler}
\centering
\includegraphics[width=0.75\textwidth]{tools/vtune.png}
\begin{itemize}
\item Very powerful GUI-based profiler for Intel CPUs and GPUs
\item Now free to use with
\href{https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html}{Intel oneAPI Base Toolkit} or
\href{https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html}{standalone}
\item See the \href{https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/}
{official online documentation} for more information
\end{itemize}
\end{frame}
Contributor

Does the picture bring anything for people who don't know the tool? I would maybe replace it with a bullet highlighting the things it can do which perf cannot (if any) and another giving the downsides.

Contributor Author

Since VTune is a graphical tool, I thought it would be nice to show what it looks like when you open it. You can use the picture to show the types of analyses that VTune is able to do instead of a bullet list, and just tell people when presenting about the extra features it has over perf. For detailed usage information, I'd point people to the online docs. One thing I'd mention while presenting is the Top-Down Microarchitecture Analysis, which is a very good method to find bottlenecks. While perf can also do it, it cannot show you detailed information for each symbol like VTune does, and the annotation of source code by VTune is also a lot easier to use than perf's.

Contributor Author

We could also link a talk from Ahmad Yasin, who was behind the creation of the Top-Down Microarchitecture Analysis Method at Intel. It's a very nice talk.

Contributor

I would even like to have more pictures. E.g. I love the microarchitecture analysis with the pipeline visualization. Or what a general hierarchical profile looks like. Or the pane showing contention between threads. Or even better, a live demonstration :)

Contributor

I do not care about the pictures themselves. I care that if there is a picture, it's understandable, that is, that we explain what appears there. In this case, a LOT of explanation is missing, and I'm not sure we actually want to include it.

@amadio
Contributor Author

amadio commented Nov 7, 2022

Are any changes needed? From my side this should be ready for merging.

Contributor

@bernhardmgruber bernhardmgruber left a comment

I read through the example commands again and found them to be quite hard to understand for someone who is only using perf casually. Maybe you can reword or simplify a few of those.

Comment on lines +148 to +149
$ # Sample CPU stack traces (via frame pointers), at 100 Hertz, for 10s:
$ perf record -F 100 -g -- sleep 10
Contributor

Is the sleep 10 here the command to be profiled or a trick to profile something systemwide? Sorry for my limited knowledge.

Contributor Author

Ah, good catch, I did intend to have -a to capture things system-wide, but the command as is records data only for the sleep command.

Comment on lines +151 to +152
$ # Sample stack traces for PID using DWARF to unwind stacks, for 10s:
$ perf record -p <PID> --call-graph=dwarf -- sleep 10
Contributor

Here, it is even more surprising for me. The PID should give the process to profile. What does the sleep 10 do? Is there no flag to tell perf to count 10s? The current command line is surprising to me.

Contributor Author

Yes, the sleep command is only used to give perf the start/stop timings (it's a very common thing to do with perf to use sleep, as there's no other easy way to tell perf to stop otherwise). The profiled process is actually the one given by <PID>.
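
A minimal sketch of this pattern follows, assuming a Linux machine with perf installed; `DURATION` and `TARGET_PID` are illustrative names chosen here, not something from the slides. The script only assembles and prints the two command lines, since actually recording may need elevated privileges.

```shell
#!/bin/sh
# Illustrative values (assumptions for this sketch, not taken from the slides).
DURATION=10      # seconds to record
TARGET_PID=$$    # this shell's own PID, only so the sketch is self-contained

# System-wide: -a samples all CPUs; 'sleep' merely bounds the recording time.
SYSTEM_WIDE="perf record -a -F 100 -g -- sleep $DURATION"

# One process: -p picks the process to profile; 'sleep' is again only the timer.
ONE_PROCESS="perf record -p $TARGET_PID --call-graph=dwarf -- sleep $DURATION"

printf '%s\n' "$SYSTEM_WIDE" "$ONE_PROCESS"
```

In both cases perf stops recording when `sleep` exits; what differs is what gets sampled in the meantime.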

Contributor

And here we suppose that people are at ease with frame pointers (previous line) and DWARF. That would require another set of slides by itself. I am more and more convinced that we should simplify drastically and give only one slide of examples, with one line each for list/stat/record/report.

Contributor Author

I tend to agree with @sponce. Maybe I'm assuming too much prior knowledge that the average student doesn't/won't have. I guess in that case, showing just how to do the simplest case, which is to collect and view a report just using the default of cycles for the event is good enough for the course, and we can point people to other sets of slides when more advanced material is needed.
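
For reference, the simplest case described here might look like the sketch below. It assumes a Linux machine with perf installed and a hypothetical executable `./my_program` (a placeholder name); if either is missing, it only prints a message.

```shell
#!/bin/sh
# Hypothetical program to profile (placeholder name, replace with a real binary).
PROGRAM=./my_program

if command -v perf >/dev/null 2>&1 && [ -x "$PROGRAM" ]; then
    # Record a profile into perf.data using perf's default event.
    perf record -o perf.data -- "$PROGRAM"
    # Show the recorded profile as plain text instead of the interactive viewer.
    perf report -i perf.data --stdio
else
    echo "perf or $PROGRAM not available; nothing recorded"
fi
```

Building the program with debug symbols generally makes the report easier to read, since perf can then map samples back to function names and source lines.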

Contributor

But I'm sure HSF people would love to create a full course dedicated to perf. And I promise I would be one of your first students :-)

Contributor Author

@amadio amadio Nov 8, 2022

I've given a few talks here and there, so I have many slides on perf (not using LaTeX, though). I could think about converting the material I have into a course on performance analysis, and including other less known tools, like bpftrace, uftrace, bcc, etc. That said, perf itself is more than enough for a full course, as I doubt many people have used perf data, perf c2c, perf mem, and other less well known commands as well. Plus there is the post-processing and data visualization as well, which is also interesting (gprof2dot, flamegraph, d3js).

{ \scriptsize
\begin{block}{}
\begin{minted}{shell-session}
$ # Sample on-CPU functions for the specified command, at 100 Hertz:
Contributor

What is an on-CPU function? Does this relate to heterogeneous computing? In the sense that you don't profile GPU functions?

I just tried that command and it counted cycles. So maybe:

Suggested change
- $ # Sample on-CPU functions for the specified command, at 100 Hertz:
+ $ # Sample cycles for the specified command, at 100 Hertz:

Contributor Author

@amadio amadio Nov 7, 2022

perf cannot take samples when the process is not running. That's why it's usually referred to as on-CPU sampling: samples are taken only when threads are scheduled on some CPU. However, you can also trace scheduling events to try to see what is going on when threads are off-CPU (i.e. being scheduled out, then back in). See https://www.brendangregg.com/offcpuanalysis.html for more information.
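
One way to see the difference between on-CPU time and elapsed time is to count the `task-clock` event for a command that mostly waits. This is a small sketch assuming perf is installed and usable by the current user; `WORKLOAD` is a name made up for the example.

```shell
#!/bin/sh
# A workload that is off-CPU almost all the time (illustrative choice).
WORKLOAD="sleep 1"

if command -v perf >/dev/null 2>&1; then
    # task-clock counts the time the task actually spent running on a CPU.
    # $WORKLOAD is left unquoted on purpose so it splits into command + argument.
    perf stat -e task-clock -- $WORKLOAD || echo "perf stat failed (permissions?)"
else
    echo "perf not found; skipping"
fi
```

For `sleep`, the reported task-clock should be a tiny fraction of the elapsed second, which is why a sampling profile of such a process contains very few samples.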

Contributor

I'm starting to wonder whether it's worth keeping examples that cannot be understood simply. The explanation you just gave is already far above the expected knowledge of the people attending the course. In order to explain that, you would need a whole set of slides starting with "thread scheduling", "sampling", etc.

$ # Sample stack traces for PID using DWARF to unwind stacks, for 10s:
$ perf record -p <PID> --call-graph=dwarf -- sleep 10

$ # Precise on-CPU user stack traces (no skid) using PEBS (Intel CPUs):
Contributor

What is an on-CPU stack trace? And what is skid? And what's PEBS? :)
I am asking because a future presenter of these slides might not know this. Is all the information relevant?

Maybe we need a slide introducing some terms of art and defining the acronyms. Or a glossary.

Contributor Author

I explained on-CPU above. As for skid: there is a margin of error in attributing samples to instructions, as a number of instructions are in flight in parallel on the CPU at any given time. This error is called the skid of the sampling (see more information here). PEBS stands for Precise Event Based Sampling, and is a feature of Intel CPUs that allows sampling with low or no skid. The rough equivalent on AMD CPUs is IBS, or Instruction-Based Sampling.

Contributor Author

I am asking because a future presenter of these slides might not know this. Is all the information relevant?

I hope that someone presenting perf to others will read the manual pages and understand these examples ahead of time. I tried to give a general overview of how to do several different things with each of the most important commands, so of course I think that what I added is relevant information for people trying to use perf. Maybe this is all too complicated for a C++ course and we should really just point people to the actual documentation or other material instead. I'm starting to think that that will be easier.

Contributor

Maybe this is all too complicated for a C++ course

Do we need a tools section in the expert part? That could be a solution.

Contributor Author

Maybe a tools course, separate from a C++ course. VTune, perf, valgrind, can all be used for much more than just C++, so we can bundle this together with bash, coreutils, and some other command line tools that are used very often and make a new course.

Comment on lines +157 to +159
$ # Sample CPU stack traces using Instruction-based sampling (AMD CPUs):
$ # (Note that you need to use system-wide sampling for IBS on AMD CPUs)
$ perf record -a -g -e cycles:pp -- <command>
Contributor

Isn't -a system-wide sampling? Why do I need a <command> then? What is IBS?

Contributor Author

@amadio amadio Nov 7, 2022

IBS is explained above. The requirement to use system-wide sampling is a hardware requirement when using IBS on AMD CPUs. This is also explained in perf's documentation (see man perf-list). I added this example to show how to use event modifiers and to remind people that IBS requires system-wide sampling to work.

@amadio
Contributor Author

amadio commented Nov 7, 2022

I would even like to have more pictures. E.g. I love the microarchitecture analysis with the pipeline visualization. Or how a general hierarchical profile looks like. Or the pane showing contention between threads. Or even better, a live demonstration :)

I could not reply directly to this, so adding as quote above.

Although I would like to, I unfortunately don't have so much more time to invest in improving the slides. I really need to go back to work on Geant4 and XRootD now. In any case, I think the online documentation of VTune is really good already. perf is harder to use just by looking at the docs, therefore my added examples, which are meant to be copy/pasted into the terminal to try out perf even without deep knowledge about it.

@bernhardmgruber
Contributor

I would like @sponce and @hageboeck to comment on the complexity of the presented material. For my part, I am fine enough to merge. If I had to present this material, I would probably skip a third of the commands because my knowledge about them is insufficient.

@sponce
Contributor

sponce commented Nov 8, 2022

I'm in general not at ease with this one. On the one hand, it's already far too complex; on the other hand, a lot of explanations are missing for concepts that are used without being presented. I can see 2 ways out: adding more, but then splitting into a standard part and an expert one; or simplifying, keeping really only the core, as we did for gdb, in 4 slides total (the first 2, with the second one split, and one example slide).

@amadio
Contributor Author

amadio commented Nov 8, 2022

Ok, I think it's better to go with the second route of simplifying things a bit and providing examples only for the more basic usage of perf, and breaking the first slide into two. I will update this pull request in the next few days when I find the time for it.

Successfully merging this pull request may close these issues.

Present vtune in the tools section
3 participants