Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MPIR_proctable external variable not accessible in gdb in v4.0.6rc #8563

Closed
louisespellacy-arm opened this issue Mar 8, 2021 · 23 comments · Fixed by #8570, #8571 or #8572
Closed

MPIR_proctable external variable not accessible in gdb in v4.0.6rc #8563

louisespellacy-arm opened this issue Mar 8, 2021 · 23 comments · Fixed by #8570, #8571 or #8572

Comments

@louisespellacy-arm
Copy link

louisespellacy-arm commented Mar 8, 2021

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.6rc2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From openmpi-v4.0.6rc2 tar gz with GCC 8.3.0 or 10.2.0 or PGI 20.1.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 18.04
  • Computer hardware: x86_64
  • Network type: self

Details of the problem

Following changes made in #7757, MPIR_proctable is not accessible via gdb.

gdb --quiet --ex start --ex "set MPIR_being_debugged=1" --ex "break MPIR_Breakpoint" --ex "continue" --ex "print MPIR_proctable_size" --ex "print MPIR_proctable" --ex "output (*MPIR_proctable)@4" --args mpirun -n 4 ./a.out

// with openmpi 4.0.5
(gdb) print MPIR_proctable_size
$1 = 4
(gdb) print MPIR_proctable
$2 = (struct MPIR_PROCDESC *) 0x721ca0
(gdb) output (*MPIR_proctable)@4
{{host_name = 0x71f7c0 "mycomputer", executable_name = 0x72aa50 "/home/louspe01/test_dir/./a.out", 
    pid = 13804}, {host_name = 0x720fc0 "mycomputer", 
    executable_name = 0x72aa80 "/home/louspe01/test_dir/./a.out", pid = 13805}, {
    host_name = 0x72bf70 "mycomputer", executable_name = 0x721d10 "/home/louspe01/test_dir/./a.out", 
    pid = 13806}, {host_name = 0x72bea0 "mycomputer", 
    executable_name = 0x72b850 "/home/louspe01/test_dir/./a.out", pid = 13807}}

//with openmpi 4.0.6rc2
(gdb) print MPIR_proctable_size
$1 = 4
(gdb) print MPIR_proctable
$2 = (struct MPIR_PROCDESC *) 0x754690
(gdb) output (*MPIR_proctable)@4
0x754690

When comparing the DWARF output, the symbols for MPIR_proctable were previously found in libopen-rte.so with the following entries:

 <1><697c>: Abbrev Number: 53 (DW_TAG_variable)
    <697d>   DW_AT_name        : (indirect string, offset: 0x40ed): MPIR_proctable
    <6981>   DW_AT_decl_file   : 1
    <6982>   DW_AT_decl_line   : 162
    <6983>   DW_AT_decl_column : 23
    <6984>   DW_AT_type        : <0x6992>
    <6988>   DW_AT_external    : 1
    <6988>   DW_AT_location    : 9 byte block: 3 90 79 2b 0 0 0 0 0     (DW_OP_addr: 2b7990)

However, the symbols are now in libopen-orted-mpir.so as expected from change #7757 but they are all marked as external:

 <1><2f0e>: Abbrev Number: 24 (DW_TAG_variable)
    <2f0f>   DW_AT_name        : (indirect string, offset: 0x2791): MPIR_proctable
    <2f13>   DW_AT_decl_file   : 63
    <2f14>   DW_AT_decl_line   : 21
    <2f15>   DW_AT_decl_column : 30
    <2f16>   DW_AT_type        : <0x2f1a>
    <2f1a>   DW_AT_external    : 1
    <2f1a>   DW_AT_declaration : 1

The resulting behaviour is that the values in the MPIR_proctable are not queriable in gdb.

@awlauria
Copy link
Contributor

awlauria commented Mar 8, 2021

@louisespellacy-arm can you try this change to see if it resolves this issue:

awlauria@6911750

It did for me testing locally.

@louisespellacy-arm
Copy link
Author

@awlauria Works for me! Thanks for the quick response. Updated dwarf looks like:

 <1><6f0>: Abbrev Number: 23 (DW_TAG_variable)
    <6f1>   DW_AT_name        : (indirect string, offset: 0x6a8): MPIR_proctable
    <6f5>   DW_AT_decl_file   : 1
    <6f6>   DW_AT_decl_line   : 16
    <6f7>   DW_AT_decl_column : 23
    <6f8>   DW_AT_type        : <0x706>
    <6fc>   DW_AT_external    : 1
    <6fc>   DW_AT_location    : 9 byte block: 3 60 10 20 0 0 0 0 0      (DW_OP_addr: 201060)

@awlauria
Copy link
Contributor

awlauria commented Mar 9, 2021

@louisespellacy-arm No problem. is this also needed on the v3.* branches?

@louisespellacy-arm
Copy link
Author

IIRC the original fix was merged into v.3.1.x branch so it will need to go where the last fix was put.

@awlauria
Copy link
Contributor

awlauria commented Mar 9, 2021

Thanks, opened on v3.1.x as well.

@jsquyres
Copy link
Member

@louisespellacy-arm Austen and I have been discussing this and trying to understand both the issue and his proposed fix. I'm concerned that we don't understand why his change fixes the issue for you.

I am admittedly not a DWARF expert, so there's probably a subtlety being lost on me.

I see the dwarf change, but I don't see how it affects anything in gdb. I constructed a small case which should be similar to Open MPI:

ompi-mpir-proctable-issue.tar.gz

It has:

  • foo.h which declares extern struct foo *the_foo and the foo_func() function
  • foo.c which defines struct foo and instantiates struct foo *the_foo, and defines the foo_func() function
  • main.c, which includes foo.h and calls foo_func()
  • a Makefile with all and clean targets

If I build this, I can still print *the_foo from gdb, and I see all the members of the foo struct.

If I move the definition of struct foo to foo.h, same result.
If I remove the declaration of extern struct foo *the_foo from foo.h, same result.

I do see some changes from dwarfdump output between these changes, but the gdb behavior is the same.

So I guess I'm wondering: exactly what is going on here? What is the change that you need, and why? This is apparently a very subtle issue, and I'd like to understand and document it properly.

@awlauria
Copy link
Contributor

awlauria commented Mar 10, 2021

I also see the same behavior as @jsquyres described on Power9 + rhel8.2. The dwarf output is different, but the gdb behavior remains intact. gcc version 8.3.1

@louisespellacy-arm
Copy link
Author

Hi both - I am unclear from @jsquyres comment above - were you not able to reproduce the issue in gdb relating to MPIR_proctable?

(gdb) output (*MPIR_proctable)@4
0x754690

@awlauria
Copy link
Contributor

awlauria commented Mar 10, 2021

@louisespellacy-arm we are able to reproduce in OMPI. But we are trying to understand why DW_AT_location matters, and why shuffling the declaration of the MPIR_proctable from a header file to a .c file changes it.

Also, we are unable to reproduce similar behavior in a stand-alone toy program. We're just looking for an explanation of why this works, so we can be sure the problem is actually fixed and not just masked somehow. Any input you could provide would be appreciated - for our own understanding and so we can be sure it is correctly fixed.

@jsquyres
Copy link
Member

@louisespellacy-arm From the toy program I attached in the above comment:

$ make clean; make all; gdb ./main
rm -f *.o main libfoo.so
cc  -g -O0 -gdwarf -c main.c
cc  -g -O0 -gdwarf -c foo.c
ar cr libfoo.so foo.o
cc -g -O0 -gdwarf main.o libfoo.so -L. -lfoo -o main
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
...snipped for brevity...
Reading symbols from ./main...done.
(gdb) b 7
Breakpoint 1 at 0x643: file main.c, line 7.
(gdb) run
Starting program: /home/jsquyres/tmp/ompi-mpir-proctable-issue/main 
manpath: warning: $MANPATH set, ignoring /etc/manpath.config
I am here in foo()!

Breakpoint 1, main () at main.c:7
7           printf("Back in main\n");
(gdb) print (*the_foo)@4
$1 = {{a = 42, b = 97 'a', c = 3.1415899999999999}, {a = 1433751584, b = 85 'U', 
    c = 0}, {a = 0, b = 0 '\000', c = 0}, {a = 0, b = 0 '\000', c = 0}}
(gdb) 

@bosilca
Copy link
Member

bosilca commented Mar 10, 2021

The reproducer is a statically built application, while in the case of OMPI we are looking at symbols pulled from a shared library. Two different cases, I would think.

@jsquyres
Copy link
Member

The reproducer is a statically built application, while in the case of OMPI we are looking at symbols pulled from a shared library. Two different cases, I would think.

Ah, how easily we get spoiled by Libtool making shared libraries for us... I had errors in Makefile. Fixed, and now I have a proper shared libfoo.so:

ompi-mpir-proctable-issue.tar.gz

$ make all
cc  -fPIC -g -O0 -gdwarf -c main.c
cc  -fPIC -g -O0 -gdwarf -c foo.c
gcc -shared -fPIC -g -O0 -gdwarf foo.o -o libfoo.so
cc -fPIC -g -O0 -gdwarf main.o -o main -L. -lfoo
$ ldd main
        linux-vdso.so.1 (0x00007ffc12562000)
        libfoo.so (0x00007eff9e902000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007eff9e511000)
        /lib64/ld-linux-x86-64.so.2 (0x00007eff9ed06000)

gdb behavior is still good:

$ gdb ./main
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
...snipped for brevity...
Reading symbols from ./main...done.
(gdb) b 7
Breakpoint 1 at 0x763: file main.c, line 7.
(gdb) run
Starting program: /home/jsquyres/tmp/ompi-mpir-proctable-issue/main 
manpath: warning: $MANPATH set, ignoring /etc/manpath.config
I am here in foo()!

Breakpoint 1, main () at main.c:7
7           printf("Back in main\n");
(gdb) p *the_foo
$1 = {a = 42, b = 97 'a', c = 3.1415899999999999}
(gdb)

@awlauria
Copy link
Contributor

I can confirm that just moving the struct declaration to the header makes it accesible in gdb.

However, DW_AT_location is still not present for MPIR_proctable. I don't know enough about this to know if that can be an issue or not on certain architectures or situations.

@louisespellacy-arm
Copy link
Author

Hi All - I think the issue may be that orted_submit.c no longer has full debug information because it is no longer compiled with -g.

I compared the make V=1 output for building 4.0.5 and 4.0.6rc2 for orted_submit.c. in 4.0.5, it is built with -g and in 4.0.6rc2 because of the changes to the makefile - it is not.

I tried to replicate what I think the scenario is in:
ompi-mpir-proctable-issue-changed.tar.gz

LD_LIBRARY_PATH=. gdb -ex "break main.c:8" -ex "run" -ex "output *(the_foo)@2" ./main

I've tested this theory by manually adding -g to the compile line below:

libtool: compile:  /opt/gcc/10.2.0/bin/gcc -DHAVE_CONFIG_H -I. -I../../orte -I../opal/include -I../ompi/include -I../oshmem/include -I../opal/mca/hwloc/hwloc201/hwloc/include/private/autogen -I../opal/mca/hwloc/hwloc201/hwloc/include/hwloc/autogen -I../ompi/mpiext/cuda/c -I../.. -I.. -I../../opal/include -I../../orte/include -I../orte/include -I../../ompi/include -I../../oshmem/include -I/home/louspe01/compile/openmpi-4.0.6rc2/build-gcc10.2.0/opal/mca/event/libevent2022/libevent/include -I/home/louspe01/compile/openmpi-4.0.6rc2/opal/mca/event/libevent2022/libevent -I/home/louspe01/compile/openmpi-4.0.6rc2/opal/mca/event/libevent2022/libevent/include -I/home/louspe01/compile/openmpi-4.0.6rc2/build-gcc10.2.0/opal/mca/hwloc/hwloc201/hwloc/include -I/home/louspe01/compile/openmpi-4.0.6rc2/opal/mca/hwloc/hwloc201/hwloc/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -mcx16 -pthread -MT orted/orted_submit.lo -MD -MP -MF orted/.deps/orted_submit.Tpo -c ../../orte/orted/orted_submit.c  -fPIC -DPIC -o orted/.libs/orted_submit.o

and re-creating libopen-rte.so. Without the -g, lib-openrte.so is missing the relevant DWARF information for the MPIR objects and hence they are not queriable. I figured this out by using readelf -w libopen-rte.so and there was no DW_AT_producer to look at.

I believe by adding the debug information for orted_submit.c then the information is available in gdb.

@awlauria
Copy link
Contributor

Thanks for investigating this further @louisespellacy-arm.

Probably the right thing to do would be to compile orted_submit.c with -g. The easy way to do that is just to append it to CFLAGS, but that will cause all of orte to be compiled with -g.

I don't think that's a bad thing, but will defer to @jsquyres

@louisespellacy-arm
Copy link
Author

louisespellacy-arm commented Mar 11, 2021

@awlauria @jsquyres I think that your fix (moving struct MPIR_PROCDESC) is still the solution - keeping all the MPIR declarations in one library - I was just providing an explanation as to the behaviour we were seeing.
I believe the debug information for MPIR_proctable was incomplete due to orted_submit.c having debug info.

@jsquyres
Copy link
Member

We had PR's like #8422 to move things into a subdirectory to ensure that they get compiled with -g.

I guess I'm confused as to why compiling orted_submit.c with -g fixes the issue, because the symbol that @louisespellacy-arm couldn't find is MPIR_Proctable, but that's not instantiated in orted_submit.c -- that's instantiated in orte/orted/orted-mpir/orted_mpir_breakpoint.c, which, per #8422, should be compiled properly with -g.

@louisespellacy-arm
Copy link
Author

louisespellacy-arm commented Mar 11, 2021

@jsquyres I could find MPIR_proctable but it was the wrong "shape" when outputted in gdb because the debug information for MPIR_proctable was incomplete.
My understanding is that the debug information was available for the pointer declaration of MPIR_proctable but not the struct itself - hence why outputting in gdb was resulting in a single address (pointer) rather the entries of MPIR_proctable.
When compiling orted_submit.c without -g there was no debug information for MPIR_PROCDESC, hence the DWARF couldn't resolve the "shape" of the object.

\\without -g orted_submit.c
(gdb) output (*MPIR_proctable)@4
0x754690
\\ with -g orted_submit.c
(gdb) output (*MPIR_proctable)@4
{{host_name = 0x71f7c0 "mycomputer", executable_name = 0x72aa50 "/home/louspe01/test_dir/./a.out", 
    pid = 13804}, {host_name = 0x720fc0 "mycomputer", 
    executable_name = 0x72aa80 "/home/louspe01/test_dir/./a.out", pid = 13805}, {
    host_name = 0x72bf70 "mycomputer", executable_name = 0x721d10 "/home/louspe01/test_dir/./a.out", 
    pid = 13806}, {host_name = 0x72bea0 "mycomputer", 
    executable_name = 0x72b850 "/home/louspe01/test_dir/./a.out", pid = 13807}}

But as I said - I'm not a DWARF expert. I think the main issue is that we need debug information for all parts of MPIR_

@awlauria
Copy link
Contributor

That makes sense. I'm fine with merging the PR's as is. Thanks @louisespellacy-arm

@jsquyres
Copy link
Member

Please give me a little time to review.

@jsquyres
Copy link
Member

I did some more testing, this time with OMPI itself instead of a small example.

I'm pretty sure that @louisespellacy-arm hit the nail on the head, above:

I could find MPIR_proctable but it was the wrong "shape" when outputted in gdb because the debug information for MPIR_proctable was incomplete.

Meaning:

  • The "move the extern" stuff is a red herring (I don't think that C's definition of extern is the same as DWARF's definition of extern).
  • The real issue is that when orted_mpir_breakpoint.c was compiled with -g, it didn't have a definition of (struct MPIR_PROCDESC) -- it only knew about (struct MPIR_PROCDESC*).

Meaning: if we move just the struct MPIR_PROCDESC definition to orted_mpir.h, everything works (for me, at least).

@louisespellacy-arm Can you verify if that's the case for you?

If so, I would propose amending all the PRs to:

  1. Move just the struct MPIR_PROCDESC definition to orted_mpir.h.
  2. Add a comment in orted_mpir.h about why the definition needs to be there (i.e., so that it can be used by multiple files, and so that it can be in the .c file that is compiled with -g, yadda yadda yadda).
  3. Re-write the git commit message to explain the move -- what the problem is, why this move fixed it, etc.

I'm sorry to be so pedantic. But this MPIR linker stuff is complicated and it really takes a deep dive to understand what and why. Putting comments and commit messages in there will definitely help whoever comes after us and has to maintain this MPIR stuff.

Thanks!

@awlauria
Copy link
Contributor

Sounds good, I'll amend the patch. @jsquyres

awlauria added a commit to awlauria/ompi that referenced this issue Mar 12, 2021
Make sure the definition of the MPIR_Proctable
is in a header file that is included in the file
orted_mpir_breakpoint.c, which is compiled with -g
and compiled without optimizations.

Otherwise, the debugger (such as gdb) won't know
the complete definition of the proctable, preventing
it from being able to read it.

Since the MPIR_proctable should be accessed from
orted_submit.c and orted_mpir_breakpoint.c, move it
to the mpir_orted.h header file.

See issue: open-mpi#8563

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
awlauria added a commit to awlauria/ompi that referenced this issue Mar 15, 2021
Make sure the definition of the MPIR_Proctable
is in a header file that is included in the file
orted_mpir_breakpoint.c, which is compiled with -g
and compiled without optimizations.

Otherwise, the debugger (such as gdb) won't know
the complete definition of the proctable, preventing
it from being able to read it.

Since the MPIR_proctable should be accessed from
orted_submit.c and orted_mpir_breakpoint.c, move it
to the mpir_orted.h header file.

See issue: open-mpi#8563

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
(cherry picked from commit a71fbaf)
awlauria added a commit to awlauria/ompi that referenced this issue Mar 15, 2021
Make sure the definition of the MPIR_Proctable
is in a header file that is included in the file
orted_mpir_breakpoint.c, which is compiled with -g
and compiled without optimizations.

Otherwise, the debugger (such as gdb) won't know
the complete definition of the proctable, preventing
it from being able to read it.

Since the MPIR_proctable should be accessed from
orted_submit.c and orted_mpir_breakpoint.c, move it
to the mpir_orted.h header file.

See issue: open-mpi#8563

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
(cherry picked from commit a71fbaf)
@awlauria
Copy link
Contributor

All PR's have been merged, closing issue.

Thanks @louisespellacy-arm for reporting and your help in this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment