NetCDF 4.9.0: segmentation fault after repeatedly opening a NetCDF 4 file, reading a vector and closing the file #2486

Closed
Alexander-Barth opened this issue Aug 22, 2022 · 21 comments

@Alexander-Barth
Contributor

Alexander-Barth commented Aug 22, 2022

The Julia user @sjdaines reported this segmentation fault (Alexander-Barth/NCDatasets.jl#187), which occurs when repeatedly opening a NetCDF 4 file, reading a vector and closing the file. After doing this ~1 000 000 times, we get a segmentation fault. For the original use case, the error occurs much earlier.

  • the version of the software with which you are encountering an issue

NetCDF 4.9.0 with HDF5 1.12.1 on Linux 5.15.0 with gcc 5.2.0 or gcc 12.1.0.

  • a description of the issue with the steps needed to reproduce it

NetCDF 4.9.0 is compiled with:

export CPPFLAGS="-I/workspace/destdir/include"
export CFLAGS="-std=c99"   
export LDFLAGS="-L/workspace/destdir/lib"    
./configure --prefix=/workspace/destdir --build=x86_64-linux-musl --host=x86_64-linux-gnu --enable-shared --disable-static --disable-dap-remote-tests --disable-plugins

The segmentation fault can also be reproduced with the following C code:

#include <stdlib.h>
#include <stdio.h>
#include <netcdf.h>
#define FILE_NAME "coords.nc"
#define NX 90
#define ERR(e) {printf("Error: %s\n", nc_strerror(e)); exit(2);}

int main() {
  int ncid, varid;
  float data_in[NX];
  int x, y, retval, niter;

  niter = 0;

  while (1) {
    if (niter % 1000 == 0) {
      printf("niter: %d\n",niter);
    }
    if ((retval = nc_open(FILE_NAME, NC_NOWRITE, &ncid)))
      ERR(retval);

    if ((retval = nc_inq_varid(ncid, "latitude", &varid)))
      ERR(retval);

    if ((retval = nc_get_var_float(ncid, varid, &data_in[0])))
      ERR(retval);

    if ((retval = nc_close(ncid)))
      ERR(retval);

    niter += 1;
  }
   
  return 0;
}

Compiled with:

gcc -g test_segfault6.c $(nc-config --cflags --libs)

After niter: 944000, the output is Segmentation fault (core dumped). Running the program under gdb, we see the following stack trace:

[...]
niter: 944000
niter: 945000

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7a991fa in __GI___libc_free (mem=0xffffffff80000400) at malloc.c:3255
3255    malloc.c: No such file or directory.
(gdb) 
(gdb) where
#0  0x00007ffff7a991fa in __GI___libc_free (mem=0xffffffff80000400) at malloc.c:3255
#1  0x00007ffff7cef600 in nc4_rec_grp_del () from /workspace/destdir/lib/libnetcdf.so.19
#2  0x00007ffff7cefa2b in nc4_nc4f_list_del () from /workspace/destdir/lib/libnetcdf.so.19
#3  0x00007ffff7c975ee in nc4_close_netcdf4_file () from /workspace/destdir/lib/libnetcdf.so.19
#4  0x00007ffff7c976d5 in nc4_close_hdf5_file () from /workspace/destdir/lib/libnetcdf.so.19
#5  0x00007ffff7c97cd6 in NC4_close () from /workspace/destdir/lib/libnetcdf.so.19
#6  0x00007ffff7c3581a in nc_close () from /workspace/destdir/lib/libnetcdf.so.19
#7  0x0000000000400a42 in main () at test_segfault6.c:28

On a different system with HDF5 1.10.0 this error could not be reproduced (tested up to 5 000 000 iterations).

The NetCDF file is available at:
https://github.com/Alexander-Barth/NCDatasets.jl/files/9393436/coords.zip and contains the following data:

$ ncdump -h -s coords.nc 
netcdf coords {
dimensions:
	latitude = 90 ;
	longitude = 144 ;
	bnds = 2 ;
variables:
	float latitude(latitude) ;
		latitude:axis = "Y" ;
		latitude:units = "degrees_north" ;
		latitude:standard_name = "latitude" ;
		latitude:_Storage = "contiguous" ;
		latitude:_Endianness = "little" ;
	float longitude(longitude) ;
		longitude:axis = "X" ;
		longitude:units = "degrees_east" ;
		longitude:standard_name = "longitude" ;
		longitude:_Storage = "contiguous" ;
		longitude:_Endianness = "little" ;

// global attributes:
		:source = "Data from Met Office Unified Model" ;
		:um_version = "11.9" ;
		:Conventions = "CF-1.7" ;
		:_NCProperties = "version=2,netcdf=4.7.4,hdf5=1.12.0," ;
		:_SuperblockVersion = 0 ;
		:_IsNetcdf4 = 1 ;
		:_Format = "netCDF-4" ;
}
@Alexander-Barth Alexander-Barth changed the title NetCDF 4.9.0: segmentation fault after repeatedly open a NetCDF 4 file, reading a vector and closing the file NetCDF 4.9.0: segmentation fault after repeatedly opening a NetCDF 4 file, reading a vector and closing the file Aug 22, 2022
@WardF
Member

WardF commented Aug 22, 2022

Interesting, and thank you for the comprehensive information! I'll set up to replicate this, thanks!

@WardF WardF self-assigned this Aug 22, 2022
@WardF WardF added this to the 4.9.1 milestone Aug 22, 2022
@WardF
Member

WardF commented Aug 22, 2022

Notes to self: v4.9.0 w/ HDF5 1.12.2 does not manifest this issue on MacOS M1. Setting up a 1.12.1 environment. @Alexander-Barth would it be possible to get the corresponding libnetcdf.settings and libhdf5.settings files from the environment where the issue is being observed?

@WardF
Member

WardF commented Aug 22, 2022

Also, @Alexander-Barth, on the system that's failing, what happens if you compile and run the program as follows:

$ gcc -g test_segfault6.c $(nc-config --cflags --libs) -fsanitize=address -fno-omit-frame-pointer

This makes some assumptions about the capabilities of the underlying compiler/system, and it's possible it will simply fail to compile due to unrecognized arguments. But if it compiles, does it still fail after iteration 944000 or does it fail sooner?

@Alexander-Barth
Contributor Author

This is libhdf5.settings (HDF5 is taken from https://anaconda.org/conda-forge/hdf5/1.12.1/download/linux-64/hdf5-1.12.1-nompi_h2750804_103.tar.bz2)

	    SUMMARY OF THE HDF5 CONFIGURATION
	    =================================

General Information:
-------------------
                   HDF5 Version: 1.12.1
                  Configured on: Mon Dec  6 11:34:37 UTC 2021
                  Configured by: conda@52cf20af83a3
                    Host system: x86_64-conda-linux-gnu
              Uname information: Linux 52cf20af83a3 5.11.0-1021-azure #22~20.04.1-Ubuntu SMP Fri Oct 29 01:11:25 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
                       Byte sex: little-endian
             Installation point: /home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_

Compiling Options:
------------------
                     Build Mode: production
              Debugging Symbols: no
                        Asserts: no
                      Profiling: no
             Optimization Level: high

Linking Options:
----------------
                      Libraries: static, shared
  Statically Linked Executables: 
                        LDFLAGS: -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib -Wl,-rpath-link,/home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib -L/home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib
                     H5_LDFLAGS: 
                     AM_LDFLAGS:  -L/home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib
                Extra libraries: -lcrypto -lcurl -lrt -lpthread -lz -ldl -lm 
                       Archiver: /home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_build_env/bin/x86_64-conda-linux-gnu-ar
                       AR_FLAGS: cr
                         Ranlib: /home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_build_env/bin/x86_64-conda-linux-gnu-ranlib

Languages:
----------
                              C: yes
                     C Compiler: /home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_build_env/bin/x86_64-conda-linux-gnu-cc
                       CPPFLAGS: -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/include
                    H5_CPPFLAGS: -D_GNU_SOURCE -D_POSIX_C_SOURCE=200809L   -DNDEBUG -UH5_DEBUG_API
                    AM_CPPFLAGS:  -I/home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/include
                        C Flags: -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/work=/usr/local/src/conda/hdf5_split-1.12.1 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_=/usr/local/src/conda-prefix
                     H5 C Flags:  -std=c99  -Wall -Wcast-qual -Wconversion -Wextra -Wfloat-equal -Wformat=2 -Winit-self -Winvalid-pch -Wmissing-include-dirs -Wno-c++-compat -Wno-format-nonliteral -Wshadow -Wundef -Wwrite-strings -pedantic -Wlarger-than=2560 -Wlogical-op -Wframe-larger-than=16384 -Wpacked-bitfield-compat -Wsync-nand -Wstrict-overflow=5 -Wno-unsuffixed-float-constants -Wdouble-promotion -Wtrampolines -Wstack-usage=8192 -Wmaybe-uninitialized -Wdate-time -Warray-bounds=2 -Wc99-c11-compat -Wduplicated-cond -Whsa -Wnormalized -Wnull-dereference -Wunused-const-variable -Walloca -Walloc-zero -Wduplicated-branches -Wformat-overflow=2 -Wformat-truncation=1 -Wrestrict -Wattribute-alias -Wcast-align=strict -Wshift-overflow=2 -Wattribute-alias=2 -Wmissing-profile -Wc11-c2x-compat -fstdarg-opt  -s -Wno-aggregate-return -Wno-inline -Wno-missing-format-attribute -Wno-missing-noreturn -Wno-overlength-strings -Wno-jump-misses-init -Wno-suggest-attribute=const -Wno-suggest-attribute=noreturn -Wno-suggest-attribute=pure -Wno-suggest-attribute=format -Wno-suggest-attribute=cold -Wno-suggest-attribute=malloc  -Wbad-function-cast -Wimplicit-function-declaration -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wold-style-definition -Wpacked -Wpointer-sign -Wpointer-to-int-cast -Wredundant-decls -Wstrict-prototypes -Wswitch -Wunused-function -Wunused-variable -Wunused-parameter -Wcast-align -Wunused-but-set-variable -Wformat -Wincompatible-pointer-types -Wshadow -Wcast-function-type -Wmaybe-uninitialized -O3
                     AM C Flags: 
               Shared C Library: yes
               Static C Library: yes


                        Fortran: yes
               Fortran Compiler: /home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_build_env/bin/x86_64-conda-linux-gnu-gfortran ( GNU Fortran (GCC) 9.4.0)
                  Fortran Flags: 
               H5 Fortran Flags:  -std=f2008  -Waliasing -Wall -Wcharacter-truncation -Wextra -Wimplicit-interface -Wsurprising -Wunderflow -pedantic -Warray-temporaries -Wintrinsics-std -Wimplicit-procedure -Wreal-q-constant -Wfunction-elimination -Wrealloc-lhs -Wrealloc-lhs-all -Wno-c-binding-type -Wuse-without-only -Winteger-division -Wfrontend-loop-interchange   -s -O3
               AM Fortran Flags: 
         Shared Fortran Library: yes
         Static Fortran Library: yes

                            C++: yes
                   C++ Compiler: /home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_build_env/bin/x86_64-conda-linux-gnu-c++
                      C++ Flags: -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/work=/usr/local/src/conda/hdf5_split-1.12.1 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/hdf5_split_1638790369387/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_=/usr/local/src/conda-prefix
                   H5 C++ Flags:   -Wall -Wcast-qual -Wconversion -Wctor-dtor-privacy -Weffc++ -Wextra -Wfloat-equal -Wformat=2 -Winit-self -Winvalid-pch -Wmissing-include-dirs -Wno-format-nonliteral -Wnon-virtual-dtor -Wold-style-cast -Woverloaded-virtual -Wreorder -Wshadow -Wsign-promo -Wundef -Wwrite-strings -pedantic -Wlarger-than=2560 -Wlogical-op -Wframe-larger-than=16384 -Wpacked-bitfield-compat -Wsync-nand -Wstrict-overflow=5 -Wdouble-promotion -Wtrampolines -Wstack-usage=8192 -Wmaybe-uninitialized -Wdate-time -Wopenmp-simd -Warray-bounds=2 -Wduplicated-cond -Whsa -Wnormalized -Wnull-dereference -Wunused-const-variable -Walloca -Walloc-zero -Wduplicated-branches -Wformat-overflow=2 -Wformat-truncation=1 -Wrestrict -Wattribute-alias -Wcast-align=strict -Wshift-overflow=2 -Wattribute-alias=2 -Wmissing-profile -Wno-deprecated-copy -fstdarg-opt  -s  -Wcast-align -Wmissing-declarations -Wpacked -Wredundant-decls -Wswitch -Wunused-but-set-variable -Wunused-function -Wunused-variable -Wunused-parameter -Wshadow -O3
                   AM C++ Flags:  -DOLD_HEADER_FILENAME -DHDF_NO_NAMESPACE -DNO_STATIC_CAST
             Shared C++ Library: yes
             Static C++ Library: yes

                           Java: no


Features:
---------
                   Parallel HDF5: no
Parallel Filtered Dataset Writes: no
              Large Parallel I/O: no
              High-level library: yes
                Build HDF5 Tests: yes
                Build HDF5 Tools: yes
                    Threadsafety: yes (recursive RW locks: no)
             Default API mapping: v112
  With deprecated public symbols: yes
          I/O filters (external): deflate(zlib)
                             MPE: no
                   Map (H5M) API: no
                      Direct VFD: yes
                      Mirror VFD: no
              (Read-Only) S3 VFD: yes
            (Read-Only) HDFS VFD: no
                         dmalloc: no
  Packages w/ extra debug output: none
                     API tracing: no
            Using memory checker: yes
 Memory allocation sanity checks: no
          Function stack tracing: no
                Use file locking: best-effort
       Strict file format checks: no
    Optimization instrumentation: no

This is libnetcdf.settings

# NetCDF C Configuration Summary
==============================

# General
-------
NetCDF Version:         4.9.0
Dispatch Version:       5
Configured On:
Host System:            x86_64-pc-linux-gnu
Build Directory:        /workspace/srcdir/netcdf-c-4.9.0
Install Prefix:         /workspace/destdir
Plugin Install Prefix:  N.A.

# Compiling Options
-----------------
C Compiler:             /opt/bin/x86_64-linux-gnu-libgfortran3-cxx11/cc
CFLAGS:                 -std=c99 -fno-strict-aliasing
CPPFLAGS:               -I/workspace/destdir/include
LDFLAGS:                -L/workspace/destdir/lib
AM_CFLAGS:
AM_CPPFLAGS:
AM_LDFLAGS:
Shared Library:         yes
Static Library:         no
Extra libraries:        -lhdf5_hl -lhdf5 -lm -lz -ldl -lxml2 -lcurl
XML Parser:             libxml2

# Features
--------
NetCDF-2 API:           yes
HDF4 Support:           no
HDF5 Support:           yes
NetCDF-4 API:           yes
NC-4 Parallel Support:  no
PnetCDF Support:        no
DAP2 Support:           yes
DAP4 Support:           yes
Byte-Range Support:     no
Diskless Support:       yes
MMap Support:           no
JNA Support:            no
CDF5 Support:           yes
ERANGE Fill Support:    no
Relaxed Boundary Check: yes
Parallel Filters:       yes
NCZarr Support:         yes
Multi-Filter Support:   yes
Quantization:           yes
Logging:                no
SZIP Write Support:     no
Standard Filters:       deflate bz2
ZSTD Support:           no
Benchmarks:             no

Interestingly, with -fsanitize=address -fno-omit-frame-pointer the code fails right away:

sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # gcc  test_segfault6.c $(nc-config --cflags --libs) -fsanitize=address -fno-omit-frame-pointer
sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # ./a.out 
==729==AddressSanitizer CHECK failed: /workspace/srcdir/gcc-5.2.0/libsanitizer/asan/asan_rtl.cc:556 "((!asan_init_is_running && "ASan init calls itself!")) != (0)" (0x0, 0x0)
    <empty stack>

@Alexander-Barth
Contributor Author

Alexander-Barth commented Aug 23, 2022

I am rerunning this test case with gcc 12.1 and NetCDF compiled with -g which gives a more complete stack trace and line numbers:

sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # gcc -g  test_segfault6.c $(nc-config --cflags --libs) -fsanitize=address -fno-omit-frame-pointer
sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # ./a.out 
niter: 0
AddressSanitizer:DEADLYSIGNAL
=================================================================
==598==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000010 (pc 0x7fa903fc92f6 bp 0x7fffe290f440 sp 0x7fffe290ebd8 T0)
==598==The signal is caused by a READ memory access.
==598==Hint: address points to the zero page.
    #0 0x7fa903fc92f6 in __sanitizer::internal_strlen(char const*) /workspace/srcdir/gcc-12.1.0/libsanitizer/sanitizer_common/sanitizer_libc.cpp:167
    #1 0x7fa903f68c70 in __interceptor_strdup /workspace/srcdir/gcc-12.1.0/libsanitizer/asan/asan_interceptors.cpp:435
    #2 0x7fa903d6a3ef in NCDISPATCH_initialize /workspace/srcdir/netcdf-c-4.9.0/libdispatch/ddispatch.c:85
    #3 0x7fa903d5ded9 in nc_initialize /workspace/srcdir/netcdf-c-4.9.0/liblib/nc_initialize.c:90
    #4 0x7fa903d622cd in NC_open /workspace/srcdir/netcdf-c-4.9.0/libdispatch/dfile.c:1982
    #5 0x7fa903d616c1 in nc_open /workspace/srcdir/netcdf-c-4.9.0/libdispatch/dfile.c:662
    #6 0x401303 in main /workspace/srcdir/netcdf-c-4.9.0/test_segfault6.c:19
    #7 0x7fa903b55156 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #8 0x7fa903b55207 in __libc_start_main_impl ../csu/libc-start.c:409
    #9 0x401138  (/workspace/srcdir/netcdf-c-4.9.0/a.out+0x401138)

sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # gcc --version
x86_64-linux-gnu-gcc (GCC) 12.1.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The failure in strdup starts to resemble another issue (possibly related) reported here:

This is the line where the error occurs:
https://github.com/Unidata/netcdf-c/blob/v4.9.0/libdispatch/ddispatch.c#L85

In the build environment the HOME variable is not set. When it is defined, we have an error in ncrc_setrchome:

sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # export HOME=/tmp
sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # gcc -g  test_segfault6.c $(nc-config --cflags --libs) -fsanitize=address -fno-omit-frame-pointer
sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # ./a.out 
niter: 0
AddressSanitizer:DEADLYSIGNAL
=================================================================
==627==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000030 (pc 0x7f722b9152f6 bp 0x7ffdf3d62fe0 sp 0x7ffdf3d62778 T0)
==627==The signal is caused by a READ memory access.
==627==Hint: address points to the zero page.
    #0 0x7f722b9152f6 in __sanitizer::internal_strlen(char const*) /workspace/srcdir/gcc-12.1.0/libsanitizer/sanitizer_common/sanitizer_libc.cpp:167
    #1 0x7f722b8b4c70 in __interceptor_strdup /workspace/srcdir/gcc-12.1.0/libsanitizer/asan/asan_interceptors.cpp:435
    #2 0x7f722b6d1fda in ncrc_setrchome /workspace/srcdir/netcdf-c-4.9.0/libdispatch/drc.c:122
    #3 0x7f722b6d21ee in NC_rcload /workspace/srcdir/netcdf-c-4.9.0/libdispatch/drc.c:187
    #4 0x7f722b6d1ed4 in ncrc_initialize /workspace/srcdir/netcdf-c-4.9.0/libdispatch/drc.c:101
    #5 0x7f722b6b646a in NCDISPATCH_initialize /workspace/srcdir/netcdf-c-4.9.0/libdispatch/ddispatch.c:105
    #6 0x7f722b6a9ed9 in nc_initialize /workspace/srcdir/netcdf-c-4.9.0/liblib/nc_initialize.c:90
    #7 0x7f722b6ae2cd in NC_open /workspace/srcdir/netcdf-c-4.9.0/libdispatch/dfile.c:1982
    #8 0x7f722b6ad6c1 in nc_open /workspace/srcdir/netcdf-c-4.9.0/libdispatch/dfile.c:662
    #9 0x401303 in main /workspace/srcdir/netcdf-c-4.9.0/test_segfault6.c:19
    #10 0x7f722b4a1156 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #11 0x7f722b4a1207 in __libc_start_main_impl ../csu/libc-start.c:409
    #12 0x401138  (/workspace/srcdir/netcdf-c-4.9.0/a.out+0x401138)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /workspace/srcdir/gcc-12.1.0/libsanitizer/sanitizer_common/sanitizer_libc.cpp:167 in __sanitizer::internal_strlen(char const*)
==627==ABORTING

This also involves the HOME directory:
https://github.com/Unidata/netcdf-c/blob/v4.9.0/libdispatch/drc.c#L122

@Alexander-Barth
Contributor Author

@sjdaines also reported an error with classical NetCDF files:
Alexander-Barth/NCDatasets.jl#187

@WardF
Member

WardF commented Aug 23, 2022

Interesting, thank you. I will continue working on duplicating this; I am unable thus far to replicate the error on MacOS or Linux, using the .nc and C code provided, with or without memory sanitizing. I'll take a closer look at the .settings files provided when I get in to the office.

@Alexander-Barth
Contributor Author

For completeness, this is the only patch that we apply to 4.9.0:

https://github.com/JuliaPackaging/Yggdrasil/blob/master/N/NetCDF/bundled/patches/0001-fix-for-nc_def_var_fletcher32-see-https-github.mirror.nvdadr.com-U.patch

Based on this:
#2401 (comment)

But I don't think this is relevant here. Thank you for looking into this!

@Alexander-Barth
Contributor Author

Surprisingly, if netcdf is compiled without -std=c99 (added in the past because of variable declarations within a function body, but no longer necessary in NetCDF 4.9.0), we no longer have this failure right at the beginning:

sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.0 # gcc -g  test_segfault6.c $(nc-config --cflags --libs) -fsanitize=address -fno-omit-frame-pointer && ./a.out
niter: 0
niter: 1000
niter: 2000
^C

I am running the long test now.

@WardF
Member

WardF commented Aug 23, 2022

No problem looking into this; we have a long list of things to address but it comes down more to resource management than anything else; we will get to everything in time XD. One hopes.

The Sanitizer is great (I was aware of it, but thanks to @edwardhartnett for bringing it into our regular toolbox), it helps flag memory management issues the moment they occur, not when they eventually become problematic. I confess I'm really curious why this is occurring on your system and not on mine; my next step will be to test using the conda-packaged HDF5 instead of the version I compile myself.

The only downside to running with the sanitizer is the (to be expected) additional overhead incurred; it changed the 1,000,000 iteration run I was doing from a 2 minute test to a 16 minute test, last night.

@WardF
Member

WardF commented Aug 23, 2022

Frustratingly, adding -std=c99 to my local netcdf-c build doesn't trigger the error in my environment, either. But, continuing to try to track this down.

@Alexander-Barth
Contributor Author

If it is helpful, a sandbox with gcc 12.1.0 (or another version) and all necessary libraries (HDF5, zlib and libcurl) can be set up with the following commands on a Linux system. You only need to have git pre-installed (which is likely :-))
Assuming your current directory is empty and writable and that you have about 5 GB free in your home partition, one can use:

DIR=$PWD
wget -O - https://julialang-s3.julialang.org/bin/linux/x64/1.8/julia-1.8.0-linux-x86_64.tar.gz | tar -xzf -
$DIR/julia-1.8.0/bin/julia --eval 'using Pkg; Pkg.add("BinaryBuilder")'
git clone https://github.com/JuliaPackaging/Yggdrasil.git
cd $DIR/Yggdrasil/N/NetCDF
sed 's/preferred_gcc_version=v"5"/preferred_gcc_version=v"12"/'  ./build_tarballs.jl >  ./build_tarballs_gcc12.jl
$DIR/julia-1.8.0/bin/julia --color=yes ./build_tarballs_gcc12.jl x86_64-linux-gnu  --debug=end

The warning "Warning: Build failed, the following log files were generated" can be ignored. I think it is due to the fact that we interrupted the build by requesting a debugging shell with --debug=end. In this shell, NetCDF is already built and installed (using this script: https://github.com/JuliaPackaging/Yggdrasil/blob/master/N/NetCDF/build_tarballs.jl#L30-L80 )

NetCDF is installed in /workspace/destdir.

To reproduce the issue, one can use:

wget -O test_segfault6.c https://dox.ulg.ac.be/index.php/s/QI3R0UKx3QdKBra/download
wget https://github.com/Alexander-Barth/NCDatasets.jl/files/9393436/coords.zip
unzip coords.zip
gcc -g  test_segfault6.c $(nc-config --cflags --libs) -fsanitize=address -fno-omit-frame-pointer && ./a.out
# -> should reproduce the error

The folder $DIR/Yggdrasil/N/NetCDF/build/x86_64-linux-gnu/ABCXYZ/srcdir/netcdf-c-4.9.0 on your host
corresponds to the source directory /workspace/srcdir/netcdf-c-4.9.0 in the sandbox, where ABCXYZ is a random string.
This is the easiest way to share files between the sandbox and the host.

Use Control-D to exit sandbox.

The sed command is used to change the version of the gcc compiler (as gcc 5.2 is a bit old).

Some commands need to download large files, which can take a while (for example a couple of minutes). To uninstall, run rm -Rf $DIR $HOME/.julia.

In any case, I would completely understand if you do not want to venture into using unfamiliar tools.

@WardF
Member

WardF commented Aug 23, 2022

@Alexander-Barth This is great, actually; it will help a lot to be able to replicate the environment. I will take a look at this tomorrow; my day-to-day machine is ARM, so I will move over to an x86-64 machine to test this out. I will also try under emulation if it comes down to it. Thanks!

@WardF
Member

WardF commented Aug 24, 2022

@Alexander-Barth So, I'm running the scripts you provided above. First, this is great, I will definitely spend some time unpicking the actual Julia scripts. Julia is one of those languages that's been on my radar/to-do list, but I haven't had the chance to explore. But this containerized build system is a big help.

I'm running the provided scripts on my x86_64 dev system, under WSL (if that makes a difference); unfortunately, I'm still not able to reproduce the error. That doesn't necessarily mean there isn't a problem; after I've let this run for a while just to make sure, I'm going to run it through some additional memory profiling tools.

Do you happen to know how much system RAM is available in the environment where this issue is being observed? What I suspect, at this point, is that something isn't being freed when it should be; if available RAM is relatively low, perhaps we're seeing an out-of-memory issue?

Beyond that, I'm happy to continue helping diagnose this issue, particularly when it's so easy to use the "same" environment as where the issue is being observed.

@WardF
Member

WardF commented Aug 24, 2022

Ah, I suspect maybe I'm not seeing the failure because of your PR from an hour ago. That might also explain it. Let me try this again and see if I recreate it if I step back before that was merged.

@WardF
Member

WardF commented Aug 24, 2022

Ok, I am able to recreate this now, and I have a couple of leads on it. Thanks!

@WardF
Member

WardF commented Aug 24, 2022

With some testing, I'm now observing the following:

(where v8.8.8 is a temporary tag I've created in WardF/netcdf-c to test fixes w/ the Julia build system).

sandbox:${WORKSPACE}/srcdir/netcdf-c-8.8.8 # gcc -g  test_segfault6.c $(nc-config --cflags --libs) -fsanitize=address -fno-omit-frame-pointer && ./a.out
niter: 0
AddressSanitizer:DEADLYSIGNAL
=================================================================
==9379==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000050 (pc 0x7fe5a6bbb3af bp 0x7ffdf99395c0 sp 0x7ffdf9939590 T0)
==9379==The signal is caused by a READ memory access.
==9379==Hint: address points to the zero page.
    #0 0x7fe5a6bbb3af in parsepath /workspace/srcdir/netcdf-c-8.8.8/libdispatch/dpathmgr.c:796
    #1 0x7fe5a6bbaabf in NCpathcanonical /workspace/srcdir/netcdf-c-8.8.8/libdispatch/dpathmgr.c:186
    #2 0x7fe5a6baf580 in NCDISPATCH_initialize /workspace/srcdir/netcdf-c-8.8.8/libdispatch/ddispatch.c:95
    #3 0x7fe5a6ba3009 in nc_initialize /workspace/srcdir/netcdf-c-8.8.8/liblib/nc_initialize.c:90
    #4 0x7fe5a6ba73fd in NC_open /workspace/srcdir/netcdf-c-8.8.8/libdispatch/dfile.c:1982
    #5 0x7fe5a6ba67f1 in nc_open /workspace/srcdir/netcdf-c-8.8.8/libdispatch/dfile.c:662
    #6 0x401303 in main /workspace/srcdir/netcdf-c-8.8.8/test_segfault6.c:19
    #7 0x7fe5a699a156 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #8 0x7fe5a699a207 in __libc_start_main_impl ../csu/libc-start.c:409
    #9 0x401138  (/workspace/srcdir/netcdf-c-8.8.8/a.out+0x401138)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /workspace/srcdir/netcdf-c-8.8.8/libdispatch/dpathmgr.c:796 in parsepath
==9379==ABORTING

Running through gdb, I'm observing the following:

(gdb) break dpathmgr.c:796
No source file named dpathmgr.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (dpathmgr.c:796) pending.
(gdb) run
Starting program: /workspace/srcdir/netcdf-c-8.8.8/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
niter: 0

Breakpoint 1, parsepath (inpath=0x619000000580 "/workspace/srcdir/netcdf-c-8.8.8", path=0x7fffffffaf40) at dpathmgr.c:796
796         for(p=tmp1;*p;p++) {if(*p == '\\') *p = '/';}
(gdb) l
791         if((stat = NCpath2utf8(inpath,&tmp1))) goto done;
792     #else
793         tmp1 = strdup(inpath);
794     #endif
795         /* Convert to forward slash to simplify later code */
796         for(p=tmp1;*p;p++) {if(*p == '\\') *p = '/';}
797
798         /* parse all paths to 2 parts:
799             1. drive letter (optional)
800             2. path after drive letter
(gdb) p inpath
$1 = 0x619000000580 "/workspace/srcdir/netcdf-c-8.8.8"
(gdb) p tmp1
$2 = 0x50 <error: Cannot access memory at address 0x50>
(gdb) p *tmp1
Cannot access memory at address 0x50
(gdb)

It appears something unexpected is happening with strdup(); whatever is happening, it does not occur in the WSL-based code. Investigating further.

@WardF
Member

WardF commented Aug 24, 2022

Note to self: I've confirmed that removing the -std=c99 corrects the issue observed above.

@WardF
Member

WardF commented Aug 24, 2022

@Alexander-Barth So, this has been a valuable and interesting exercise. I believe that the fix you have adopted here, the removal of -std=c99 from CFLAGS, is the correct one, and this issue can be closed. However, I'll be bookmarking it for reference to the scripts you provided above; thank you once again, very much.

A cursory Google search for "-std=c99" strdup turns up a lot of similar issues/headaches (like this one, for example). The common solutions appear to be to either replace -std=c99 with -std=gnu99, or to append -D_GNU_SOURCE when invoking gcc. Since we have a solution that doesn't require changing the netcdf-c code, I'm content to call this a valuable learning experience.

I'll wait to close this issue so that you have a chance to share your thoughts, @Alexander-Barth; feel free to close the issue yourself if you'd like, or I'll address it in the next couple of days.

Thanks again!

@Alexander-Barth
Contributor Author

Alexander-Barth commented Aug 25, 2022

OK, this is a very interesting find! So strdup is not part of C99, and indeed the compiler emitted a warning which I did not see.

What is surprising, is that according to config.h, strdup is available (but it is not):

/* Define to 1 if you have the `strdup' function. */
#define HAVE_STRDUP 1

Despite the test using the option -std=c99:

configure:22897: checking for strdup
configure:22897: cc -o conftest -std=c99 -fno-strict-aliasing -I/workspace/destdir/include -L/workspace/destdir/lib conftest.c -lxml2 -lcurl  >&5
configure:22897: $? = 0
configure:22897: result: yes

For reference, there is previous discussion about this:
#1408

I am closing this issue because the option is not necessary anymore in NetCDF 4.9.0. After intensive testing by @sjdaines, all reported failure cases are fixed by dropping -std=c99. Thanks a lot to all for your valuable help!!!

(And I learned a lot too; above all that C is really hard :-))

@DennisHeimbigner
Collaborator

I recall now. It turns out that, as you note, a number of functions like strdup
are not officially part of c99. However, it is also the case that, at least for gcc,
the C library actually contains a strdup implementation, and if you add
extern char* strdup(const char*) to a header, it actually finds it.
If you look at ncconfig.h, you will see a number of declarations
that try to deal with this.
