Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large number of file open during unit test testMakeTrackValidationPlots in ROOT628 #43077

Closed
smuzaffar opened this issue Oct 21, 2023 · 52 comments

Comments

@smuzaffar
Copy link
Contributor

smuzaffar commented Oct 21, 2023

Hi,
I noticed that unit tests, during recent ROOT 6.28 PRs, are failing as there were too many open file descriptors [a]. Looks like, for ROOT6.28 unit test Validation/RecoTrack/test/testMakeTrackValidationPlots tries to open too many files e.g. running [b] on lxplus8 shows that there were over 37K file open calls in root 6.28 IBs while there were only 4K calls in normal ROOT 6.26 based IBs. ROOT630 and ROOT master based IBs also shows a lot open files issue.

[a] https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6a8509/35311/unitTests/failed.html
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6a8509/35311/unitTests/src/Alignment/OfflineValidation/test/DMRall/testing.log

./cmsRun: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: Error 24
Failure running DMR single configuration standalone: status 127

/var/log/messages contains entries like

Oct 21 16:07:35 cmsbuild114.cern.ch cvmfs2[6642]: (unpacked.cern.ch) open file descriptor limit exceeded
Oct 21 16:07:35 cmsbuild114.cern.ch cvmfs2[6642]: (unpacked.cern.ch) open file descriptor limit exceeded
Oct 21 16:07:35 cmsbuild114.cern.ch cvmfs2[6642]: (unpacked.cern.ch) open file descriptor limit exceeded
Oct 21 16:07:35 cmsbuild114.cern.ch cvmfs2[6642]: (unpacked.cern.ch) open file descriptor limit exceeded
Oct 21 16:07:35 cmsbuild114.cern.ch cvmfs2[6642]: (unpacked.cern.ch) open file descriptor limit exceeded
Oct 21 16:07:35 cmsbuild114.cern.ch cvmfs2[6642]: (unpacked.cern.ch) open file descriptor limit exceeded
Oct 21 16:07:35 cmsbuild114.cern.ch cvmfs2[6642]: (unpacked.cern.ch) open file descriptor limit exceeded

[b] lxplus8

  • ROOT 6.26 Based IB
> cd /tmp/$(whoami)
> source /cvmfs/cms.cern.ch/cmsset_default.sh
> scram p CMSSW_13_3_X_2023-10-20-2300
> cd CMSSW_13_3_X_2023-10-20-2300
> cmsenv
> strace -e trace=open,openat,close,read,write,connect,accept -o run.log makeTrackValidationPlots.py ~muzaffar/public/root628/test.root
> grep '^open' run.log |wc -l
3933
  • ROT 6.28 Based IB
> cd /tmp/$(whoami)
> source /cvmfs/cms.cern.ch/cmsset_default.sh
> scram p CMSSW_13_3_ROOT628_X_2023-10-20-2300
> cd CMSSW_13_3_ROOT628_X_2023-10-20-2300
> cmsenv
> strace -e trace=open,openat,close,read,write,connect,accept -o run.log makeTrackValidationPlots.py ~muzaffar/public/root628/test.root
> grep '^open' run.log |wc -l
37636
@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 21, 2023

A new Issue was created by @smuzaffar Malik Shahzad Muzaffar.

@rappoccio, @makortel, @smuzaffar, @Dr15Jones, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@smuzaffar
Copy link
Contributor Author

Number of files open during this unit tests for ROOT 6.26 based IBs are [a] while for ROOT628 (6.20 and master) IBs are [b]

[a]

grep '^open' run.log | grep -v 'No such file or directory' | grep -v 'Read-only file system' | sed -e 's|.*, "||;s|".*||' | sort | grep '/cvmfs/' | cut -d/ -f9 | sort | uniq -c  | sort -n | tail -5
      4 coral
      6 gcc
    202 python3
    266 root
    367 cmssw

[b]

grep '^open' run.log | grep -v 'No such file or directory' | grep -v 'Read-only file system' | sed -e 's|.*, "||;s|".*||' | sort | grep '/cvmfs/' | cut -d/ -f9 | sort | uniq -c  | sort -n | tail -5
    530 sherpa
    541 llvm
    987 gcc
   2731 root
  16034 cmssw

@smuzaffar
Copy link
Contributor Author

assign dqm

@cmsbuild
Copy link
Contributor

New categories assigned: dqm

@rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks

@smuzaffar
Copy link
Contributor Author

smuzaffar commented Oct 23, 2023

By the way, running makeTrackValidationPlots.py ~muzaffar/public/root628/test.root ( this is the command run by testMakeTrackValidationPlots unit test ) shows that it starts over 150 processes [a]. Is this an expected behavior of this unit test?

[a]

> ps -u muzaffar -o pid,cmd --forest
 876671      \_ python3 /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_ROOT628_X_2023-10-22-2300/bin/el8_amd64_gcc12/makeTrackValidationPlots.py /afs/cern.ch/user/m/muzaffar/public/root628/test.root
 876873      |   \_ python3 /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_ROOT628_X_2023-10-22-2300/bin/el8_amd64_gcc12/makeTrackValidationPlots.py /afs/cern.ch/user/m/muzaffar/public/root628/test.root
 876878      |   \_ python3 /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_ROOT628_X_2023-10-22-2300/bin/el8_amd64_gcc12/makeTrackValidationPlots.py /afs/cern.ch/user/m/muzaffar/public/root628/test.root
 876880      |   \_ python3 /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_ROOT628_X_2023-10-22-2300/bin/el8_amd64_gcc12/makeTrackValidationPlots.py /afs/cern.ch/user/m/muzaffar/public/root628/test.root
 876882      |   \_ python3 /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_ROOT628_X_2023-10-22-2300/bin/el8_amd64_gcc12/makeTrackValidationPlots.py /afs/cern.ch/user/m/muzaffar/public/root628/test.root
 876884      |   \_ python3 /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_ROOT628_X_2023-10-22-2300/bin/el8_amd64_gcc12/makeTrackValidationPlots.py /afs/cern.ch/user/m/muzaffar/public/root628/test.root
 876886      |   \_ python3 /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_ROOT628_X_2023-10-22-2300/bin/el8_amd64_gcc12/makeTrackValidationPlots.py /afs/cern.ch/user/m/muzaffar/public/root628/test.root
 876888      |   \_ python3 /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_ROOT628_X_2023-10-22-2300/bin/el8_amd64_gcc12/makeTrackValidationPlots.py /afs/cern.ch/user/m/muzaffar/public/root628/test.root
 876890      |   \_ python3 /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_ROOT628_X_2023-10-22-2300/bin/el8_amd64_gcc12/makeTrackValidationPlots.py /afs/cern.ch/user/m/muzaffar/public/root628/test.root
 876892      |   \_ python3 /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_ROOT628_X_2023-10-22-2300/bin/el8_amd64_gcc12/makeTrackValidationPlots.py /afs/cern.ch/user/m/muzaffar/public/root628/test.root
> ps -u muzaffar -o pid,cmd --forest | grep python3  |wc
    157     940   36074

@makortel
Copy link
Contributor

Tagging @cms-sw/tracking-pog-l2 as well

@slava77
Copy link
Contributor

slava77 commented Oct 23, 2023

Is this an expected behavior of this unit test?

yes, it's expected. I can't say that it's desired or well justified.

If there is a suggestion on how to restrict multiprocessing.Process used in Validation/RecoTrack/python/plotting/validation.py from starting a new one, it would resolve the issue more reliably.

For my own tests I'm using ulimit -n 4096 in the shell before running

@smuzaffar
Copy link
Contributor Author

running strace -f to follow forks shows following number of openat calls

  • root 6.26: 7K
  • root 6.28: 908K
  • root 6.30: 2877K
  • root master: 50K

@mmusich
Copy link
Contributor

mmusich commented Oct 23, 2023

I can't say that it's desired or well justified.

as a reminder the unit test was introduced back then in #35957, because of the script getting recurrently broken by other developments. IIRC multiprocessing was indeed one of the reasons it stopped working once (#35827).

@slava77
Copy link
Contributor

slava77 commented Oct 23, 2023

running strace -f to follow forks shows following number of openat calls

* root 6.26: 7K

* root 6.28: 908K

* root 6.30: 2877K

* root master: 50K

uhm, I'm not sure what these counts are. Is it for simultaneously opened files or just a count of attempts to open files?

For the latter, could there be some issue with the search paths? (e.g. repeated attempts at files that do not exist)

@smuzaffar
Copy link
Contributor Author

uhm, I'm not sure what these counts are. Is it for simultaneously opened files or just a count of attempts to open files?

no, not simultaneously but is it total number of open calls by all the processes during this test

For the latter, could there be some issue with the search paths? (e.g. repeated attempts at files that do not exist)

No these are not data file search path call. The number above are the successful open file calls. Logs files, based on various root versions, are available under /afs/cern.ch/user/m/muzaffar/public/root628 .

  • root 6.26: 1464 unique file open calls which also contains 1432 calls to generate plots and few call to open root shared libs and macros [a]
  • root 6.28: 5742 unique file open calls and majority of these are open for 186 times ( I guess there are 186 threads running)
  • root 6.30 : 10952 unique file open calls

looks like root 628 and above are reading/opening all the shared libraries in LD_LIBRARY_PATH. e.g root version 6.28 and above load /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02808/el8_amd64_gcc12/external/blackhat/0.9.9-987ad1acae5cc088f5b0bffc0baf5368/lib/blackhat/libBG.so.0.0.0 which is from blackhat generator. In root 6.26 , on few of the root's shared libs were opened.

[a]

      1 plots/plots_tpPtLess09_tobTecStep/effandfakePos.pdf
      1 plots/plots_tpPtLess09_tobTecStep/effandfakePtEtaPhi.pdf
      9 root/etc/plugins/ROOT@@Math@@Minimizer
     27 root/lib/libMinuit.so
     36 root/etc/plugins/ROOT@@Math@@Minimizer/P010_Minuit2Minimizer.C
     36 root/etc/plugins/ROOT@@Math@@Minimizer/P020_TMinuitMinimizer.C
     36 root/etc/plugins/ROOT@@Math@@Minimizer/P030_GSLMinimizer.C
     36 root/etc/plugins/ROOT@@Math@@Minimizer/P040_GSLNLSMinimizer.C
     36 root/etc/plugins/ROOT@@Math@@Minimizer/P050_GSLSimAnMinimizer.C
     36 root/etc/plugins/ROOT@@Math@@Minimizer/P060_TLinearMinimizer.C
     36 root/etc/plugins/ROOT@@Math@@Minimizer/P070_TFumiliMinimizer.C
     36 root/etc/plugins/ROOT@@Math@@Minimizer/P080_GeneticMinimizer.C
     36 root/etc/plugins/ROOT@@Math@@Minimizer/P090_RMinimizer.C
     44 root/etc/plugins/TView
    115 root/etc/plugins/TVirtualGraphPainter
    115 root/etc/plugins/TVirtualHistPainter
    115 root/etc/plugins/TVirtualPS
    115 root/fonts/FreeSans.otf
    115 root/fonts/FreeSansBold.otf
    115 root/fonts/symbol.ttf
    119 /dev/null
    132 root/lib/libGraf3d.so
    176 root/etc/plugins/TView/P010_TView3D.C
    345 root/lib/libGpad.so
    345 root/lib/libHistPainter.so
    345 root/lib/libPostscript.so
    460 root/etc/plugins/TVirtualGraphPainter/P010_TGraphPainter.C
    460 root/etc/plugins/TVirtualHistPainter/P010_THistPainter.C
    460 root/etc/plugins/TVirtualPS/P010_TPostScript.C
    460 root/etc/plugins/TVirtualPS/P020_TSVG.C
    460 root/etc/plugins/TVirtualPS/P030_TPDF.C
    460 root/etc/plugins/TVirtualPS/P040_TImageDump.C
    460 root/etc/plugins/TVirtualPS/P050_TTeXDump.C

@makortel
Copy link
Contributor

makortel commented Oct 23, 2023

looks like root 628 and above are reading/opening all the shared libraries in LD_LIBRARY_PATH

@smuzaffar Have you happened to check what happens in cmsRun in this respect? In general I'd think ROOT "reading/opening all the shared libraries in LD_LIBRARY_PATH" would be bad behavior in general.

@smuzaffar
Copy link
Contributor Author

I am running a workflow

@smuzaffar
Copy link
Contributor Author

smuzaffar commented Oct 23, 2023

Running workflow 4.22 (for 5 events, all steps 2 to 5) does not show any inconsistencies in file open.
Could it be pyROOT issue?

@smuzaffar
Copy link
Contributor Author

@slava77 @mmusich , I opened #43096 to control the number of parallel process for validation plots. Default is to use #CPUs jobs in parallel. This will help a bit with number of simultaneously open file descriptors but we still have to see why root 6.28 and above are opening too many files.

@smuzaffar
Copy link
Contributor Author

smuzaffar commented Oct 24, 2023

For ROOT 6.26, when a new process is started to do the plotting then the stack for each opened shared library looks like [a] and only few (less than 100) root libs and macros were opened. But for ROOT 6.28 (and above) the stack looks like [b] and all the shared libs from LD_LIBRARY_PATH and system directory (e.g. /usr/lib64) were opened (over 22K files were opened).

Looks like these following calls are responsible for opening all shared libs in root 6.28 and above

libCling.so(cling::Dyld::ContainsSymbol((anonymous namespace)::LibraryPath const*, llvm::StringRef, unsigned int) const+0x70) [0xb736e0]
libCling.so(cling::Dyld::searchLibrariesForSymbol[abi:cxx11](llvm::StringRef, bool)+0x307) [0xb76cf7]
libCling.so(cling::DynamicLibraryManager::searchLibrariesForSymbol[abi:cxx11](llvm::StringRef, bool) const+0x13) [0xb77053]
libCling.so(AutoloadLibraryGenerator::tryToGenerate(llvm::orc::LookupState&, llvm::orc::LookupKind, llvm::orc::JITDylib&, llvm::orc::JITD

FYI @vgvassilev @Axel-Naumann

strace -ff -f -k -e trace=open,openat -o run.log makeTrackValidationPlots.py /afs/cern.ch/user/m/muzaffar/public/root628//test.root command was used to generate the following stacktraces

[a] /afs/cern.ch/user/m/muzaffar/public/root628/stack/root-6.26.11.log

openat(AT_FDCWD, "<inst>/lcg/root/6.26.11-2d833eecc2c1ccfc13843c9f9618acfd/lib/libGpad.so", O_RDONLY) = 21
 > /usr/lib64/libc-2.28.so(open+0x52) [0x11f9e2]
 > /usr/lib64/libc-2.28.so(_IO_file_open+0x25) [0x92485]
 > /usr/lib64/libc-2.28.so(_IO_file_fopen@@GLIBC_2.2.5+0xea) [0x9263a]
 > /usr/lib64/libc-2.28.so(__fopen_internal+0x7c) [0x8679c]
 > /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02808/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib64/libstdc++.so.6.0.30(std::__basic_file<char>::open(char const*, std::_Ios_Openmode, int)+0x33) [0xce1a3]
 > <inst>/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib64/libstdc++.so.6.0.30(std::basic_filebuf<char, std::char_traits<char> >::open(char const*, std::_Ios_Openmode)+0x3d) [0x11340d]
 > <inst>/lcg/root/6.26.11-2d833eecc2c1ccfc13843c9f9618acfd/lib/libCling.so(cling::DynamicLibraryManager::isSharedLibrary(llvm::StringRef, bool*) [clone .constprop.0]+0x1d2) [0x698a62]
 > <inst>/lcg/root/6.26.11-2d833eecc2c1ccfc13843c9f9618acfd/lib/libCling.so(cling::DynamicLibraryManager::lookupLibrary[abi:cxx11](llvm::StringRef, llvm::SmallVector<llvm::StringRef, 2u>, llvm::SmallVector<llvm::StringRef, 2u>, llvm::StringRef, bool) const+0x77) [0x699bd7]
 > <inst>/lcg/root/6.26.11-2d833eecc2c1ccfc13843c9f9618acfd/lib/libCling.so(TCling::Load(char const*, bool)+0x159) [0x5dd2c9]
 > <inst>/lcg/root/6.26.11-2d833eecc2c1ccfc13843c9f9618acfd/lib/libCore.so(TSystem::Load(char const*, char const*, bool)+0x738) [0x20f348]
 > <inst>/lcg/root/6.26.11-2d833eecc2c1ccfc13843c9f9618acfd/lib/libCore.so(TROOT::LoadClass(char const*, char const*, bool)+0x6b) [0x1997fb]
 > <inst>/lcg/root/6.26.11-2d833eecc2c1ccfc13843c9f9618acfd/lib/libCling.so(TCling::ShallowAutoLoadImpl(char const*)+0x2d1) [0x5dad21]
 > <inst>/lcg/root/6.26.11-2d833eecc2c1ccfc13843c9f9618acfd/lib/libCling.so(TCling::DeepAutoLoadImpl(char const*, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, bool)+0x280) [0x5f2740]
 > <inst>/lcg/root/6.26.11-2d833eecc2c1ccfc13843c9f9618acfd/lib/libCling.so(TCling::AutoLoad(char const*, bool)+0x283) [0x5f2d03]
 > <inst>/lcg/root/6.26.11-2d833eecc2c1ccfc13843c9f9618acfd/lib/libCore.so(TClass::GetClass(char const*, bool, bool, unsigned long, unsigned long)+0x4a8) [0x29e868]
 > <inst>/lcg/root/6.26.11-2d833eecc2c1ccfc13843c9f9618acfd/lib/libcppyy_backend3_9.so(Cppyy::GetScope(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xf8) [0x1

[b] /afs/cern.ch/user/m/muzaffar/public/root628/stack/root-6.28.09.log

openat(AT_FDCWD, "<inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libASImage.so", O_RDONLY|O_CLOEXEC) = 8
 > /usr/lib64/libpthread-2.28.so(__open+0x4f) [0x1221f]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::sys::fs::openFile(llvm::Twine const&, int&, llvm::sys::fs::CreationDisposition, llvm::sys::fs::FileAccess, llvm::sys::fs::OpenFlags, unsigned int)+0xbb) [0x440036b]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::sys::fs::openNativeFileForRead(llvm::Twine const&, llvm::sys::fs::OpenFlags, llvm::SmallVectorImpl<char>*)+0x35) [0x4400695]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::MemoryBuffer::getFile(llvm::Twine const&, bool, bool, bool)+0x28) [0x436f7c8]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::object::ObjectFile::createObjectFile(llvm::StringRef)+0x3d) [0x3da7d6d]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(cling::Dyld::ContainsSymbol((anonymous namespace)::LibraryPath const*, llvm::StringRef, unsigned int) const+0x70) [0xb736e0]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(cling::Dyld::searchLibrariesForSymbol[abi:cxx11](llvm::StringRef, bool)+0x307) [0xb76cf7]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(cling::DynamicLibraryManager::searchLibrariesForSymbol[abi:cxx11](llvm::StringRef, bool) const+0x13) [0xb77053]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(AutoloadLibraryGenerator::tryToGenerate(llvm::orc::LookupState&, llvm::orc::LookupKind, llvm::orc::JITDylib&, llvm::orc::JITDylibLookupFlags, llvm::orc::SymbolLookupSet const&)+0x167) [0xb01be7]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::orc::ExecutionSession::OL_applyQueryPhase1(std::unique_ptr<llvm::orc::InProgressLookupState, std::default_delete<llvm::orc::InProgressLookupState> >, llvm::Error)+0x1f3) [0x2000503]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::orc::ExecutionSession::lookup(llvm::orc::LookupKind, std::vector<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags>, std::allocator<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags> > > const&, llvm::orc::SymbolLookupSet, llvm::orc::SymbolState, llvm::unique_function<void (llvm::Expected<llvm::DenseMap<llvm::orc::SymbolStringPtr, llvm::JITEvaluatedSymbol, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr>, llvm::detail::DenseMapPair<llvm::orc::SymbolStringPtr, llvm::JITEvaluatedSymbol> > >)>, std::function<void (llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> >, llvm::DenseMapInfo<llvm::orc::JITDylib*>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> > > > const&)>)+0x34f) [0x200b1df]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::orc::ExecutionSession::lookup(std::vector<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags>, std::allocator<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags> > > const&, llvm::orc::SymbolLookupSet const&, llvm::orc::LookupKind, llvm::orc::SymbolState, std::function<void (llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> >, llvm::DenseMapInfo<llvm::orc::JITDylib*>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> > > > const&)>)+0x19c) [0x200be0c]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::orc::ExecutionSession::lookup(std::vector<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags>, std::allocator<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags> > > const&, llvm::orc::SymbolStringPtr, llvm::orc::SymbolState)+0x112) [0x200c052]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::orc::LLJIT::lookupLinkerMangled(llvm::orc::JITDylib&, llvm::orc::SymbolStringPtr)+0xc2) [0x2022922]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(cling::IncrementalJIT::getSymbolAddress(llvm::StringRef, bool)+0x116) [0xc06666]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(cling::IncrementalExecutor::getAddressOfGlobal(llvm::StringRef, bool*) const+0x21) [0xbffd31]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(cling::Interpreter::getAddressOfGlobal(clang::GlobalDecl const&, bool*) const+0x6a) [0xb8571a]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(TClingDataMemberInfo::Offset()+0x157) [0xb1f367]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libcppyy_backend3_9.so(Cppyy::GetDatamemberOffset(unsigned long, unsigned long)+0x336) [0x13566]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libcppyy3_9.so(CPyCppyy::CPPDataMember::Set(unsigned long, unsigned long)+0x7b) [0x4eccb]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libcppyy3_9.so(CPyCppyy::meta_getattro(_object*, _object*)+0x1184) [0x5d9c4]

@vgvassilev
Copy link
Contributor

looks like root 628 and above are reading/opening all the shared libraries in LD_LIBRARY_PATH. e.g root version 6.28 and above load /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02808/el8_amd64_gcc12/external/blackhat/0.9.9-987ad1acae5cc088f5b0bffc0baf5368/lib/blackhat/libBG.so.0.0.0 which is from blackhat generator. In root 6.26 , on few of the root's shared libs were opened.

Yes, we index the symbols from system libraries generally. This feature was enabled in ROOT 626. I am not sure why 628 indexes more libraries than 626 as there have not been practically changes in that area since then.

I don't see how ShallowAutoLoadImpl could have changed since then either. Have you seen changes in the dictionaries in that area recently?

cc: @pcanal.

@Axel-Naumann
Copy link

@smuzaffar :

all the shared libs from LD_LIBRARY_PATH and system directory (e.g. /usr/lib64) were opened (over 22K files were opened).

Are they opened and not closed?

@smuzaffar
Copy link
Contributor Author

Are they opened and not closed?

they were closed.

@Axel-Naumann
Copy link

Then I don't understand yet why the process is running out of file descriptors? Am I misunderstanding something?

@smuzaffar
Copy link
Contributor Author

smuzaffar commented Oct 24, 2023

@Axel-Naumann , there were two issues

  1. The cmssw unit test testMakeTrackValidationPlots was starting over 150 processes in parallel
  2. ROOT 6.28 and above open over 22K files for each such process (though not simultaneously). Most of these files are read from cvmfs

Due to high number of parallel processes and large number of files opened by each process, the open file descriptors for cvmfs user hit the 1024 limit which resulted in errors like

Oct 21 16:07:35 cmsbuild114.cern.ch cvmfs2[6642]: (unpacked.cern.ch) open file descriptor limit exceeded
Oct 21 16:07:35 cmsbuild114.cern.ch cvmfs2[6642]: (unpacked.cern.ch) open file descriptor limit exceeded

(1) should be fixed by #43096. But we still need to understand why ROOT 6.28 (and above) are opening all the shared libs in LD_LIBRARY_PATH + system dirs.

@Axel-Naumann
Copy link

But we still need to understand why ROOT 6.28 (and above) are opening all the shared libs in LD_LIBRARY_PATH + system dirs.

As @vgvassilev hinted at, that's triggered by autoloading trying to find the library that provides a symbol. I let him add to this, i.e. what can be done to improve this.

@vgvassilev
Copy link
Contributor

But we still need to understand why ROOT 6.28 (and above) are opening all the shared libs in LD_LIBRARY_PATH + system dirs.

ROOT since at least 6.26 does pretty much what the system linker does but for all possible libraries on your LD_LIBRARY_PATH -- it searches for a symbol through all available libraries. The difference is that when the symbol is not found in the "usual" set of libraries then it goes off and looks at the system libraries which we observe here. I am not sure what changed across the two versions as this code hasn't. There might be several reasons I can think of:

  • The system libraries on that particular node became a lot more than they used to be;
  • We are searching for a symbol which existed in the past but does not exist now triggering the system search

@vgvassilev
Copy link
Contributor

Ok... 626 uses LLVM13 and the orcv2 infrastructure tries more eagerly to respond on such symbol lookup requests.

@hahnjo is there a way to suppress the orcv2 callback in this stacktrace?

 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(cling::DynamicLibraryManager::searchLibrariesForSymbol[abi:cxx11](llvm::StringRef, bool) const+0x13) [0xb77053]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(AutoloadLibraryGenerator::tryToGenerate(llvm::orc::LookupState&, llvm::orc::LookupKind, llvm::orc::JITDylib&, llvm::orc::JITDylibLookupFlags, llvm::orc::SymbolLookupSet const&)+0x167) [0xb01be7]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::orc::ExecutionSession::OL_applyQueryPhase1(std::unique_ptr<llvm::orc::InProgressLookupState, std::default_delete<llvm::orc::InProgressLookupState> >, llvm::Error)+0x1f3) [0x2000503]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::orc::ExecutionSession::lookup(llvm::orc::LookupKind, std::vector<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags>, std::allocator<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags> > > const&, llvm::orc::SymbolLookupSet, llvm::orc::SymbolState, llvm::unique_function<void (llvm::Expected<llvm::DenseMap<llvm::orc::SymbolStringPtr, llvm::JITEvaluatedSymbol, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr>, llvm::detail::DenseMapPair<llvm::orc::SymbolStringPtr, llvm::JITEvaluatedSymbol> > >)>, std::function<void (llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> >, llvm::DenseMapInfo<llvm::orc::JITDylib*>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> > > > const&)>)+0x34f) [0x200b1df]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::orc::ExecutionSession::lookup(std::vector<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags>, std::allocator<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags> > > const&, llvm::orc::SymbolLookupSet const&, llvm::orc::LookupKind, llvm::orc::SymbolState, std::function<void (llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> >, llvm::DenseMapInfo<llvm::orc::JITDylib*>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr> > > > const&)>)+0x19c) [0x200be0c]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::orc::ExecutionSession::lookup(std::vector<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags>, std::allocator<std::pair<llvm::orc::JITDylib*, llvm::orc::JITDylibLookupFlags> > > const&, llvm::orc::SymbolStringPtr, llvm::orc::SymbolState)+0x112) [0x200c052]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(llvm::orc::LLJIT::lookupLinkerMangled(llvm::orc::JITDylib&, llvm::orc::SymbolStringPtr)+0xc2) [0x2022922]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(cling::IncrementalJIT::getSymbolAddress(llvm::StringRef, bool)+0x116) [0xc06666]
eb792bd7ce7d870/lib/libCling.so(cling::Interpreter::getAddressOfGlobal(clang::GlobalDecl const&, bool*) const+0x6a) [0xb8571a]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libCling.so(TClingDataMemberInfo::Offset()+0x157) [0xb1f367]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0eb792bd7ce7d870/lib/libcppyy_backend3_9.so(Cppyy::GetDatamemberOffset(unsigned long, unsigned long)+0x336) [0x13566]
 > <inst>/lcg/root/6.28.09-8486a0c5a951d64e0

My suspicion is that for some (static?) variable declaration we are requesting the offset which does not exist neither in the jit nor in the object files:

https://github.com/root-project/root/blob/a139a69cc27c186f8e1d765971a41f9f4dd43dc8/core/metacling/src/TClingDataMemberInfo.cxx#L355-L356

Perhaps that's related to what @wlav reported with recent cppyy and globals...

@smuzaffar I suspect we will need to print the mangled names. Can you apply this patch and rerun:

diff --git a/interpreter/cling/lib/Interpreter/Interpreter.cpp b/interpreter/cling/lib/Interpreter/Interpreter.cpp
index 681dbffc95..739f170d4e 100644
--- a/interpreter/cling/lib/Interpreter/Interpreter.cpp
+++ b/interpreter/cling/lib/Interpreter/Interpreter.cpp
@@ -1727,6 +1727,7 @@ namespace cling {
     // Return a symbol's address, and whether it was jitted.
     std::string mangledName;
     utils::Analyze::maybeMangleDeclName(GD, mangledName);
+    printf("mangled name=%s\n", mangledName.c_str());
 #if defined(_WIN32)
     // For some unknown reason, Clang 5.0 adds a special symbol ('\01') in front
     // of the mangled names on Windows, making them impossible to find

The output would be probably GBs...

@wlav
Copy link

wlav commented Nov 14, 2023

So two separate issues then on my side. One is problems with std::string which has a domino effect causing other problems. I've changed to a different workaround: explicit instantiation of std::string and std::allocator<char> in clingwrapper.cxx. That seems to be the easiest/most complete/portable solution.

The static symbols not being found seems to be unrelated in some cases and not necessarily a knock-on effect (although one report mentioned a load order problem: the library has to be loaded before the header). See also this one: wlav/cppyy#156

@Axel-Naumann
Copy link

Axel-Naumann commented Dec 15, 2023

root-project/root#14223 might be related.

@davidlange6
Copy link
Contributor

davidlange6 commented Dec 15, 2023

Seems unlikely. Why do you suggest that?

@Axel-Naumann
Copy link

D'oh because I failed to link the issue correctly. Edited = fixed. Anyway, the rationale (for the actual candidate of "maybe related") is: #43077 (comment) shows how ROOT is opening 100 libs looking for a symbol. That takes time, and seems very similar to the diagnosis that Markus provided for the LHCb startup issue.

@makortel
Copy link
Contributor

type root

@cmsbuild cmsbuild added the root label Dec 15, 2023
@makortel
Copy link
Contributor

In order for us to keep using ROOT 6.30, we'd need this issue fixed by the last open pre-release of CMSSW_14_0_0 (scheduled for Jan 23).

@smuzaffar
Copy link
Contributor Author

root-project/root#14261 seem to fix opening of all shared libs.

@smuzaffar
Copy link
Contributor Author

I spoke too early, python3 -c 'import ROOT;print(ROOT.kError);' still opens all shared libs.

@smuzaffar
Copy link
Contributor Author

thanks to root team specially @vepadulano , root-project/root#14276 fixed the issue with python ROOT loading all the shared libs from LD_LIBRARY_PATH. This fix has been merged for root 6.28, 6.30 and master branches. cmsdist PRs

have been merged for tonight's CMSSW_14_0_X IB.

@smuzaffar
Copy link
Contributor Author

smuzaffar commented Dec 22, 2023

FYI, I ran all CMSSW unit tests using patch mentioned in #43077 (comment) and found out that unit tests are searching for symbols in [a,b,c]

[a] cmssw

      2 reco::GsfElectron::mvaPlaceholder
      2 reco::Muon::CaloMuon
      2 reco::Muon::GEMMuon
      2 reco::Muon::GlobalMuon
      2 reco::Muon::ME0Muon
      2 reco::Muon::PFMuon
      2 reco::Muon::RPCMuon
      2 reco::Muon::StandAloneMuon
      2 reco::Muon::TrackerMuon
      2 reco::Photon::mvaPlaceholder

[b] ROOT (symbols with * still trigger searching all libs in LD_LIBRARY_PATH)

      4 TTreeReader::fgEntryStatusText
*     9 kFALSE
*    15 kTRUE
     21 gStyle
*    32 TTree::kMaxEntries
     66 TString::kNPOS
    111 gEnv
    111 gSystem
    501 kWarning
    507 gErrorIgnoreLevel
    507 kError

[c] std

     22 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::npos

@smuzaffar
Copy link
Contributor Author

smuzaffar commented Dec 22, 2023

Note #43077 (comment) is only for unit tests. I am running full relvals now but I do not expect we use python ROOT interface during Relvals

@vgvassilev
Copy link
Contributor

If we use root-project/root#14287 static constexpr variables should not need to go via the library symbol search.

@vgvassilev
Copy link
Contributor

Somebody dialed two too many digits for a float ;) https://godbolt.org/z/nYfo74rK1

@makortel
Copy link
Contributor

@dpiparo
Copy link
Contributor

dpiparo commented Jan 16, 2024

FYI, a more generic fix is available in master since a few days and has been backported to the 6.30 branch: root-project/root#14358 .

@makortel
Copy link
Contributor

makortel commented Feb 6, 2024

@smuzaffar This issue has been resolved, right?

@smuzaffar
Copy link
Contributor Author

@makortel , yes this has been fixed.

@makortel
Copy link
Contributor

makortel commented Feb 6, 2024

@cmsbuild, please close

@cmsbuild cmsbuild closed this as completed Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests