Regarding failures during submission in McM for WWJ + NNLOPS sample #42716

Open
sv3048 opened this issue Sep 4, 2023 · 39 comments

Comments

@sv3048

sv3048 commented Sep 4, 2023

Dear experts!

We are facing an error in the WWJ + NNLOPS sample submission; see https://cms-unified.web.cern.ch/cms-unified/showlog/?search=task_HIG-RunIISummer20UL16wmLHEGENAPV-13411

You can also check :
https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-14330__v1_T_230821_125219_1993

Here is the ongoing JIRA for this request and you can find the last few comments in the JIRA useful for debugging.
https://its.cern.ch/jira/projects/HIGHPRIOREQ/issues/HIGHPRIOREQ-631?filter=allissues

According to our previous MC contact Mattia, who created this gridpack (the discussion is also in JIRA), one of the log files [1] shows that the Powheg executable fails because it cannot find the "libgsl.so.0" library. This is unexpected to us, as 1) it never occurred during validation and 2) as far as we know, this is a common GNU library that is installed on every lxplus machine in /usr/lib (this can be checked by running "gsl-config --prefix"). Is it possible that this library is missing? We do not know where exactly the runcmsgrid script is executed and have no further clue, so we would be very grateful if someone could help us understand the root of this issue.

[1] cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-14330__v1_T_230821_125219_1993/8001/HIG-RunIISummer20UL16wmLHEGEN-14330_0/09f37ff3-c996-49d0-8391-4e670bc78024-149-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

Please let us know if we need to provide anything else.
Thanks and Regards,
Sadhana Verma

cc'ing @sunilUIET too here !
@sunilUIET Please feel free to add other responsible people to this issue who can help us in this regard.

Best,
Sadhana for HWW MC contact

@cmsbuild
Contributor

cmsbuild commented Sep 4, 2023

A new Issue was created by @sv3048 SADHANA VERMA.

@Dr15Jones, @rappoccio, @smuzaffar, @makortel, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

makortel commented Sep 5, 2023

assign generators

@cmsbuild
Contributor

cmsbuild commented Sep 5, 2023

New categories assigned: generators

@mkirsano,@menglu21,@alberto-sanchez,@SiewYan,@GurpreetSinghChahal,@Saptaparna you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dr15Jones
Contributor

@smuzaffar does cmsdist just pick up libgsl from the OS?

@smuzaffar
Contributor

No, we should be picking up gsl from cms externals

@makortel
Contributor

makortel commented Sep 5, 2023

Does ../pwhg_main get run in a way that gsl could be picked from CMSSW externals?

@sv3048
Author

sv3048 commented Sep 11, 2023

Hello @ALL!
Please let us know the status of this. I assume it is in progress, but I wanted to point out that it falls under an urgent request category, so we would like to get updates. If you need input from our side, just let us know and we would be happy to help.

Thanks and Regards,
Sadhana

@sv3048
Author

sv3048 commented Sep 16, 2023

cc'ing @bbilin @menglu21 @sunilUIET
Kindly have a look at it. The request comes in an urgent sample category.

Thanks !

@sunilUIET
Contributor

Hi,
As additional information: these WFs run fine at CERN during McM validation.

@davidlange6
Contributor

i guess the key issue is the "0" in libgsl.so.0. That library does not come with cmssw (instead libgsl.so does as does libgsl.so.25). Probably best that the executable be rebuilt including the cmssw gsl which will be available on grid nodes.
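
A dependency like this can be spotted before injection by asking the dynamic loader which shared libraries it cannot resolve for the executable. The following is a minimal sketch, assuming `ldd` and `awk` are available; the function name `missing_libs` is hypothetical, and `./pwhg_main` stands for wherever the Powheg executable sits in the unpacked gridpack.

```shell
# Hypothetical helper: read `ldd` output on stdin and print the names of
# the libraries that the loader reported as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example usage (run it in an environment that resembles the grid node):
#   ldd ./pwhg_main | missing_libs
```

On lxplus the system copy in /usr/lib satisfies the dependency, so nothing would be reported there; the check is only informative where the system gsl is absent, for example in a container without it.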

@agrohsje

The Makefile in WWJ has
GSL_path=$(shell $(shell which gsl-config) --prefix)/lib
GSL_path2=$(shell $(shell which gsl-config) --libs) .

@agrohsje

To add. You can fix the makefile and re-compile the executables. All grids can be copied. So there is no need to resubmit jobs. Just the executables should be fixed, all pre-sampled info added and you are good to go.

@davidlange6
Contributor

hum, cmssw doesn't put gsl-config into the externals bin area. @smuzaffar - is that expected? [I guess this is because the gsl toolfile does not include a path...]

@smuzaffar
Contributor

@davidlange6 , you are right. Our gsl/bin was not added to the PATH, which is why gsl-config is picked up from the system in the cmssw env. I can update the gsl tool file to add gsl/bin to the PATH. In a cmssw dev area, one can also use scram to get these paths, e.g.

lxplus8> $(scram tool tag gsl GSL_BASE)/bin/gsl-config --prefix
/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02803/el8_amd64_gcc11/external/gsl/2.6-293f1973c8de87040110bce5dc9d71f6
lxplus8> $(scram tool tag gsl GSL_BASE)/bin/gsl-config --libs  
-L/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02803/el8_amd64_gcc11/external/gsl/2.6-293f1973c8de87040110bce5dc9d71f6/lib -lgsl -lgslcblas -lm

@smuzaffar
Contributor

cms-sw/cmsdist#8711 adds our gsl/bin to PATH.

@smuzaffar
Contributor

A new scram runtime hook has been deployed on cvmfs ( cms-sw/cmsdist#8712 ) which now properly adds our gsl/bin to PATH for old release cycles and already built releases [a]. You can rebuild the executable to use/link gsl from cms externals.

[a]

lxplus> scram p  CMSSW_10_6_21
lxplus> cd CMSSW_10_6_21 
lxplus> cmsenv
lxplus> which gsl-config
/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2/bin/gsl-config
lxplus> gsl-config --prefix
/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2
lxplus> gsl-config --libs
-L/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2/lib -lgsl -lgslcblas -lm

@agrohsje

Perfect. Thanks for all your work @smuzaffar ! @sv3048 can you update the executables?

@sv3048
Author

sv3048 commented Sep 26, 2023

Hi @agrohsje !
Sorry I didn't get the notification. Sure, let me update it.

@sv3048
Author

sv3048 commented Sep 26, 2023

Hi @agrohsje @smuzaffar !

I have a query.
If I understood correctly, we simply need to recompile WWJ, which should now be able to see the gsl library in /cvmfs. We must NOT change CMSSW, just recompile (step 0) and put the new pwhg_main executable inside the gridpack. Is that what we are supposed to do?

Could you please confirm?

Thanks and Regards,
Sadhna

@smuzaffar
Contributor

Yes, that is correct @sv3048. No need to change the cmssw version; just recompiling WWJ should now pick up gsl from cvmfs.

@sv3048
Author

sv3048 commented Sep 26, 2023

Okay, Thanks !

@agrohsje

You can double-check that all works well by doing
which gsl-config
after
cmsenv .
It should point to the version in cvmfs.
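
This check can also be scripted, for example as a guard at the top of a build script. The following is a minimal sketch; the function name `is_cvmfs_path` is hypothetical, and treating any path under /cvmfs as "the CMSSW copy" is a simplifying assumption.

```shell
# Hypothetical helper: succeed only if the given path lies under /cvmfs.
is_cvmfs_path() {
    case "$1" in
        /cvmfs/*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example usage after cmsenv:
#   is_cvmfs_path "$(command -v gsl-config)" \
#       || echo "gsl-config is not taken from cvmfs" >&2
```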

@mlizzo

mlizzo commented Jan 16, 2024

Dear @smuzaffar , @agrohsje and other experts,

It hurts me a lot to resurrect this issue, which I hoped was finally solved. Following the previous discussion, I recompiled the WWJ process after the cmssw external package was updated, and indeed it could pick up the right gsl library, namely:

/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2

Therefore we proceeded with the sample injection (after some delay). However, the gsl-config --libs command, which was added to the Makefile, gives an additional flag that is passed to the gfortran compiler, i.e. -lgslcblas. The corresponding library cannot be found in /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2/lib, but it does exist on lxplus (/usr/lib/libgslcblas.so). In other words, when running the WWJ sample in central production, the libgslcblas.so library cannot be found and the pwhg_main code crashes because of it. You can find an example in the following log file.

That's really unfortunate and I'm really sorry that I couldn't catch this earlier, but I'm far from being an expert in computing. I have cross checked the make output again and there shouldn't be any additional missing libraries, everything else should be correctly imported from either /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external or auxiliary libraries that have been added to the gridpack and included in the LD_LIBRARY_PATH.

I would like to ask you if you could please fix this again, I hope it's doable. Of course, this is just my interpretation, if you think the problem can be solved differently please tell me. Thank you very much in advance, let me know if something is not clear and you need more details.

Best regards,
Mattia

@agrohsje

Dear Mattia,
I am really sorry but with my TOP L2 and other commitments, I have little time these days. :-(
Let me include @covarell @jshin96 (Jihoon is that you?).
Stupid question: You need these additional libs for compilation? Or could you modify the output from gsl-config --libs to get what is available and still manage to compile the code?
Cheers, Alexander.

@mlizzo

mlizzo commented Jan 22, 2024

Hi @agrohsje ,

Thanks for replying, I didn't mean to overload you with extra work and I totally understand if you have other business to run. To be honest, I don't have any clue about whether that specific library is actually needed by the underlying Powheg code (and where it's used eventually). I can give it a try at what you suggested, just recompiling the code without that extra flag and see what happens. Thanks again for your feedback, if that doesn't work out I will ask other people in this thread to help me (if they can).

Cheers,
Mattia

@davidlange6
Contributor

if you picked up the right gsl-config command the output should have been

gsl-config --libs 
-L/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2/lib -lgsl -L/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/OpenBLAS/0.3.5-nmpfii2/lib -lopenblas -lm

instead using the default gsl-config on lxplus (eg, from /usr/bin)

gsl-config --libs
-lgsl -lgslcblas -lm

so what is the output of
which gsl-config

@mlizzo

mlizzo commented Jan 22, 2024

Dear @davidlange6 ,

If I run the gsl-config --libs command in CMSSW_10_6_21 with SCRAM_ARCH slc7_amd64_gcc700 I get:
-L/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2/lib -lgsl -lgslcblas -lm
which is different from the first output that you shared. This was also reported by @smuzaffar in a previous message.
As I reported above, the output of which gsl-config is /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2/bin/gsl-config, so it is picking up the cvmfs installation, not the one from /usr/bin. Can you please share your setup so that we can understand how it differs from mine?

@davidlange6
Contributor

I see - I had just done
source /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2/etc/profile.d/init.csh
to set up gsl.

if instead I go to CMSSW_10_6_21, I do get the same environment as you. The difference being

setenv GSL_CBLAS_LIB "-L${OPENBLAS_ROOT}/lib -lopenblas"

that cmsenv misses. I'm not sure why that would be.. Anyway, you can either set this envvar, or perhaps use gsl-config --libs-without-cblas
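
For bash users, the csh line above corresponds to `export GSL_CBLAS_LIB="-L${OPENBLAS_ROOT}/lib -lopenblas"`. The sketch below illustrates why that one variable changes the output. It is a simplified stand-in for the `--libs` branch of gsl-config, not the real script; the function name `mock_gsl_libs` and the example prefixes are made up, and the fallback to `-lgslcblas` is assumed from the outputs quoted in this thread.

```shell
# Simplified stand-in for `gsl-config --libs` (NOT the real script):
# use $GSL_CBLAS_LIB when it is set, otherwise fall back to -lgslcblas.
mock_gsl_libs() {
    prefix="$1"
    printf '%s\n' "-L${prefix}/lib -lgsl ${GSL_CBLAS_LIB:--lgslcblas} -lm"
}

# Example usage with made-up prefixes:
#   unset GSL_CBLAS_LIB
#   mock_gsl_libs /opt/gsl     # falls back to -lgslcblas
#   GSL_CBLAS_LIB="-L/opt/openblas/lib -lopenblas"
#   mock_gsl_libs /opt/gsl     # now links against openblas
```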

@mlizzo

mlizzo commented Jan 22, 2024

Indeed, if I source that script I get what is perhaps the "right" output, thanks a lot! I was about to go with the second approach you suggested, but I think it is cleanest to source the proper initialization script as you just showed. Thank you very much, I'll try to compile the code again and see what happens. Maybe I can test the new gridpack via crab to check that it does not fail on a grid node, so that the next time we inject the sample we don't encounter any undesired behaviour.

Cheers and thanks again for your support,
Mattia

@davidlange6
Contributor

fwiw, doing

source /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2/etc/profile.d/init.csh 

just sets up gsl and its dependencies. If you are relying on other things from CMSSW, then they may not be configured properly.

@agrohsje

I would go with --libs-without-cblas. In fact that was the reason why I asked if you need that or not. I would pick all from CMSSW and remove cblas.

@mlizzo

mlizzo commented Jan 22, 2024

ok thank you both, I'll just remove that flag then, cheers

@smuzaffar
Contributor

smuzaffar commented Jan 23, 2024

The SCRAM gsl tool hook /cvmfs/cms.cern.ch/etc/scramrc/SCRAM/hooks/runtime/99-gsl-config.sh properly adds the CMSSW/gsl/bin to PATH. So after cmsenv you should be picking up gsl-config from cmssw externals. Looking at /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2/bin/gsl-config (for CMSSW_10_6_21), by default it adds -lgslcblas for gsl-config --libs . I can fix the gsl scram hook to automatically set GSL_CBLAS_LIB=-L${OPENBLAS_ROOT}/lib -lopenblas so that a call to gsl-config --libs returns the correct libs.

@smuzaffar
Contributor

smuzaffar commented Jan 23, 2024

cms-sw/cms-common#10 should fix the gsl scram runtime hook to set GSL_CBLAS_LIB=-L${OPENBLAS_ROOT}/lib -lopenblas [a] . @mlizzo , once this change is deployed you do not need any change on your side. All you need to do is recompile using the output of gsl-config --libs

[a]

Singularity> scram p  CMSSW_10_6_21
cd WARNING: Release CMSSW_10_6_21 is not available for architecture slc7_amd64_gcc10.
         Developer's area is created for available architecture slc7_amd64_gcc700.
Singularity> cd CMSSW_10_6_21/
Singularity> cmsenv
Singularity> gsl-config --libs
-L/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2/lib -lgsl -L/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/OpenBLAS/0.3.5-nmpfii2/lib -lopenblas -lm

@mlizzo

mlizzo commented Jan 23, 2024

Hi @smuzaffar thanks a lot for taking care of it. Just for my understanding, so the cblas library must be linked to gsl in any case and it's not safe to simply remove it, right? Following yesterday's comments, that's what I've been doing but I can wait for the deployment of the new feature if that's the correct way of doing this

@smuzaffar
Contributor

smuzaffar commented Jan 23, 2024

@mlizzo , the gsl library needs some cblas_* methods. By default these symbols are provided by the gslcblas library, but for CMSSW we want to use OpenBLAS ( see cms-sw/cmsdist#5528 for details). So it is not safe to simply remove -lgslcblas; one should replace it with some other cblas implementation, and for cmssw we should replace it with -L${OPENBLAS_BASE}/lib -lopenblas ( as @davidlange6 mentioned). So please wait for the deployment of cms-sw/cms-common#10
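
For illustration only, the replacement described here amounts to a textual substitution on the linker flags. The following is a minimal sketch, assuming POSIX `sed`; the function name `swap_cblas` and the example paths are hypothetical, and the replacement string must not contain `|` or `&`. With a fixed runtime hook, gsl-config --libs is meant to return the right flags directly, so a manual rewrite like this should not be needed.

```shell
# Hypothetical helper: replace -lgslcblas in a linker-flag string ($1)
# with another CBLAS implementation's flags ($2).
swap_cblas() {
    printf '%s\n' "$1" | sed "s|-lgslcblas|$2|g"
}

# Example usage with made-up paths:
#   swap_cblas "-L/opt/gsl/lib -lgsl -lgslcblas -lm" "-L/opt/openblas/lib -lopenblas"
```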

@mlizzo

mlizzo commented Jan 23, 2024

Thanks for the clarification, I will wait for the new build and recompile the code again, cheers

@smuzaffar
Contributor

@mlizzo , cms-sw/cms-common#10 has been deployed on cvmfs . Can you please try rebuilding ?

@mlizzo

mlizzo commented Jan 23, 2024

Hi @smuzaffar , it works perfectly:

gsl-config --libs
-L/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gsl/2.2.1-pafccj2/lib -lgsl -L/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/OpenBLAS/0.3.5-nmpfii2/lib -lopenblas -lm

Thank you very much for your prompt help


9 participants