Large number of file open during unit test testMakeTrackValidationPlots in ROOT628 #43077
A new Issue was created by @smuzaffar Malik Shahzad Muzaffar. @rappoccio, @makortel, @smuzaffar, @Dr15Jones, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
The number of files opened during this unit test for ROOT 6.26 based IBs is [a], while for ROOT628 (6.30 and master) IBs it is [b] [a]
[b]
|
assign dqm |
New categories assigned: dqm @rvenditti,@syuvivida,@tjavaid,@nothingface0,@antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks |
By the way, running [a]
|
Tagging @cms-sw/tracking-pog-l2 as well |
yes, it's expected. I can't say that it's desired or well justified. If there is a suggestion on how to restrict For my own tests I'm using |
running
|
uhm, I'm not sure what these counts are. Is it for simultaneously opened files or just a count of attempts to open files? For the latter, could there be some issue with the search paths? (e.g. repeated attempts at files that do not exist) |
no, not simultaneously, but it is the total number of
No, these are not data file search path calls. The numbers above are the successful open file calls. Log files, based on various ROOT versions, are available under /afs/cern.ch/user/m/muzaffar/public/root628 .
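Totals like the ones above can be reproduced by counting the successful open calls in a trace log. The sketch below assumes an strace-style line format (for example `openat(AT_FDCWD, "/path", O_RDONLY|O_CLOEXEC) = 3`); the format of the logs in the AFS directory is not shown in this thread, and `count_successful_opens` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed strace-style line: openat(AT_FDCWD, "/path/lib.so", O_RDONLY|O_CLOEXEC) = 3
# A failed call ends with "= -1 ENOENT (...)" and is skipped.
OPEN_RE = re.compile(r'\b(?:open|openat)\((?:[^,"]+,\s*)?"([^"]+)".*\)\s*=\s*(-?\d+)')


def count_successful_opens(lines):
    """Return a Counter mapping path -> number of successful open/openat calls."""
    counts = Counter()
    for line in lines:
        match = OPEN_RE.search(line)
        if match and int(match.group(2)) >= 0:
            counts[match.group(1)] += 1
    return counts


sample = [
    'openat(AT_FDCWD, "/usr/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3',
    'openat(AT_FDCWD, "/no/such/file", O_RDONLY) = -1 ENOENT (No such file or directory)',
    'open("/usr/lib64/libm.so.6", O_RDONLY) = 4',
]
opens = count_successful_opens(sample)
print(sum(opens.values()), "successful opens of", len(opens), "distinct files")
```

Counting per path rather than only the total makes it easy to see whether the same library is opened repeatedly or whether many distinct libraries are each opened once.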
Looks like ROOT 6.28 and above are reading/opening all the shared libraries in LD_LIBRARY_PATH, e.g. ROOT version 6.28 and above load /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02808/el8_amd64_gcc12/external/blackhat/0.9.9-987ad1acae5cc088f5b0bffc0baf5368/lib/blackhat/libBG.so.0.0.0 which is from [a]
|
@smuzaffar Have you happened to check what happens in |
I am running a workflow |
Running workflow 4.22 (for 5 events, all steps 2 to 5) does not show any inconsistencies in file open. |
For ROOT 6.26, when a new process is started to do the plotting, the stack for each opened shared library looks like [a], and only a few (fewer than 100) ROOT libs and macros were opened. For ROOT 6.28 (and above) the stack looks like [b], and all the shared libs from LD_LIBRARY_PATH and the system directories (e.g. /usr/lib64) were opened (over 22K files were opened). It looks like the following calls are responsible for opening all shared libs in ROOT 6.28 and above
[a] /afs/cern.ch/user/m/muzaffar/public/root628/stack/root-6.26.11.log
[b] /afs/cern.ch/user/m/muzaffar/public/root628/stack/root-6.28.09.log
|
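To get a feel for how many files a full scan of the search path touches, the sketch below counts files that look like shared libraries in the directories on `LD_LIBRARY_PATH` plus one system directory. It is only an estimate under assumptions: the `.so` name pattern and the helper `count_shared_libs` are illustrative, and this is not how ROOT itself enumerates libraries.

```python
import os
import re

# Assumption: a shared library is any file named like libX.so or libX.so.1.2.3.
SO_RE = re.compile(r"\.so(\.\d+)*$")


def count_shared_libs(directories):
    """Count files that look like shared libraries, per directory (non-recursive)."""
    result = {}
    for directory in directories:
        try:
            names = os.listdir(directory)
        except OSError:
            continue  # skip missing or unreadable entries
        result[directory] = sum(1 for name in names if SO_RE.search(name))
    return result


search_path = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
search_path.append("/usr/lib64")  # system directory mentioned above
per_dir = count_shared_libs(search_path)
print("total candidate libraries:", sum(per_dir.values()))
```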
Yes, we index the symbols from system libraries generally. This feature was enabled in ROOT 6.26. I am not sure why 6.28 indexes more libraries than 6.26, as there have been practically no changes in that area since then. I don't see how cc: @pcanal. |
Are they opened and not closed? |
they were closed. |
Then I don't understand yet why the process is running out of file descriptors? Am I misunderstanding something? |
@Axel-Naumann , there were two issues
Due to high number of parallel processes and large number of files opened by each process, the open file descriptors for
(1) should be fixed by #43096. But we still need to understand why ROOT 6.28 (and above) are opening all the shared libs in LD_LIBRARY_PATH + system dirs. |
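Since the libraries were closed again, the total number of open calls matters less for descriptor exhaustion than the number held open at any one time. A quick way to check the per-process limit and current usage, as a sketch assuming Linux (`/proc/self/fd`) and Python's standard `resource` module:

```python
import os
import resource

# Per-process limit on open file descriptors (the soft limit is the one enforced).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Number of descriptors currently open in this process (Linux only)."""
    return len(os.listdir("/proc/self/fd"))


print(f"soft limit={soft} hard limit={hard} currently open={open_fd_count()}")
```

With many test processes running in parallel, the per-process count multiplied by the number of processes is the figure to compare against any node-wide limit.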
As @vgvassilev hinted at, that's triggered by autoloading trying to find the library that provides a symbol. I let him add to this, i.e. what can be done to improve this. |
ROOT since at least 6.26 does pretty much what the system linker does but for all possible libraries on your LD_LIBRARY_PATH -- it searches for a symbol through all available libraries. The difference is that when the symbol is not found in the "usual" set of libraries then it goes off and looks at the system libraries which we observe here. I am not sure what changed across the two versions as this code hasn't. There might be several reasons I can think of:
|
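The lookup order described above can be illustrated with a toy model: search the usual libraries first, then fall back to everything else, counting how many libraries get opened along the way. All names here (`find_symbol`, the library and symbol names) are hypothetical; this is not ROOT's or cling's implementation.

```python
def find_symbol(symbol, usual_libs, fallback_libs):
    """Return (library, opened_count); library is None if nothing provides the symbol.

    Each argument after `symbol` maps a library name to its set of exported symbols.
    """
    opened = 0
    for group in (usual_libs, fallback_libs):
        for lib, exports in group.items():
            opened += 1  # stands in for opening and scanning one library
            if symbol in exports:
                return lib, opened
    return None, opened


usual = {"libCore.so": {"sym_a"}, "libHist.so": {"sym_b"}}
system = {f"libsys{i}.so": set() for i in range(1000)}

print(find_symbol("sym_b", usual, system))    # found among the usual libraries
print(find_symbol("missing", usual, system))  # every library ends up being scanned
```

A symbol that no library provides is the worst case in this model: every candidate is opened before the search gives up, which is consistent with the thousands of opens seen per process.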
Ok... 6.28 uses LLVM13 and the orcv2 infrastructure tries more eagerly to respond to such symbol lookup requests. @hahnjo is there a way to suppress the orcv2 callback in this stacktrace?
My suspicion is that for some (static?) variable declaration we are requesting the offset, which exists neither in the JIT nor in the object files. Perhaps that's related to what @wlav reported with recent cppyy and globals... @smuzaffar I suspect we will need to print the mangled names. Can you apply this patch and rerun:
diff --git a/interpreter/cling/lib/Interpreter/Interpreter.cpp b/interpreter/cling/lib/Interpreter/Interpreter.cpp
index 681dbffc95..739f170d4e 100644
--- a/interpreter/cling/lib/Interpreter/Interpreter.cpp
+++ b/interpreter/cling/lib/Interpreter/Interpreter.cpp
@@ -1727,6 +1727,7 @@ namespace cling {
     // Return a symbol's address, and whether it was jitted.
     std::string mangledName;
     utils::Analyze::maybeMangleDeclName(GD, mangledName);
+    printf("mangled name=%s\n", mangledName.c_str());
 #if defined(_WIN32)
     // For some unknown reason, Clang 5.0 adds a special symbol ('\01') in front
     // of the mangled names on Windows, making them impossible to find
The output would probably be GBs... |
So two separate issues then on my side. One is problems with The static symbols not being found seems to be unrelated in some cases and not necessarily a knock-on effect (although one report mentioned a load order problem: the library has to be loaded before the header). See also this one: wlav/cppyy#156 |
root-project/root#14223 might be related. |
Seems unlikely. Why do you suggest that? |
D'oh because I failed to link the issue correctly. Edited = fixed. Anyway, the rationale (for the actual candidate of "maybe related") is: #43077 (comment) shows how ROOT is opening 100 libs looking for a symbol. That takes time, and seems very similar to the diagnosis that Markus provided for the LHCb startup issue. |
type root |
In order for us to keep using ROOT 6.30, we'd need this issue fixed by the last open pre-release of CMSSW_14_0_0 (scheduled for Jan 23). |
root-project/root#14261 seems to fix opening of all shared libs. |
I spoke too early, |
Thanks to the ROOT team, especially @vepadulano: root-project/root#14276 fixed the issue with python ROOT loading all the shared libs from LD_LIBRARY_PATH. This fix has been merged for the ROOT 6.28, 6.30 and master branches. cmsdist PRs
have been merged for tonight's CMSSW_14_0_X IB. |
FYI, I ran all CMSSW unit tests using patch mentioned in #43077 (comment) and found out that unit tests are searching for symbols in [a,b,c]
[a] cmssw
[b] ROOT (symbols with * still trigger searching all libs in LD_LIBRARY_PATH)
[c] std
|
Note #43077 (comment) is only for unit tests. I am running full relvals now but I do not expect we use python ROOT interface during Relvals |
If we use root-project/root#14287, static constexpr variables should not need to go via the library symbol search. |
Somebody dialed two too many digits for a float ;) https://godbolt.org/z/nYfo74rK1 |
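For context on the digit count: a single-precision float carries about 7 significant decimal digits, and 9 are always enough to round-trip it, so extra digits in a literal do not change the stored value. A small check using the standard `struct` module; the constant is an arbitrary illustration, not the one from the linked Godbolt snippet.

```python
import struct


def as_float32(value):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


long_literal = 3.14159265358   # 12 significant digits
short_literal = 3.141592654    # 10 significant digits

# Both literals collapse to the same float32 bit pattern.
print(as_float32(long_literal) == as_float32(short_literal))
print(struct.pack("<f", long_literal).hex())
```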
I guess these should be changed to |
FYI, a more generic fix is available in master since a few days and has been backported to the 6.30 branch: root-project/root#14358 . |
@smuzaffar This issue has been resolved, right? |
@makortel , yes this has been fixed. |
@cmsbuild, please close |
Hi,
I noticed that unit tests, during recent ROOT 6.28 PRs, are failing as there were too many open file descriptors [a]. Looks like, for ROOT6.28 unit test
Validation/RecoTrack/test/testMakeTrackValidationPlots
tries to open too many files, e.g. running [b] on lxplus8 shows that there were over 37K file open calls in ROOT 6.28 IBs while there were only 4K calls in normal ROOT 6.26 based IBs. ROOT630 and ROOT master based IBs also show the same issue with a large number of opened files.
[a] https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6a8509/35311/unitTests/failed.html
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-6a8509/35311/unitTests/src/Alignment/OfflineValidation/test/DMRall/testing.log
/var/log/messages contains entries like
[b] lxplus8