[BUG] Segfault running pytest when a test fails #1169

Closed
shwina opened this issue Nov 30, 2022 · 0 comments · Fixed by #1170
Labels: bug (Something isn't working), Python (Related to RMM Python API)

shwina commented Nov 30, 2022

When a test that uses a `MemoryResource` with an `UpstreamResourceAdaptor` raises or fails, the subsequent test segfaults:

```python
# test_segfault.py

import rmm
import pytest
import gc


@pytest.fixture(scope="function", autouse=True)
def rmm_auto_reinitialize(request):
    # Run the test
    yield

    # Automatically reinitialize the current memory resource after running each
    # test
    rmm.reinitialize()


def test_one():
    mr = rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource())
    rmm.mr.set_current_device_resource(mr)
    buf = rmm.DeviceBuffer(size=10)
    bl  # raises NameError (undefined name), so this test fails


def test_two():
    gc.collect()
```

```shell
pytest test_segfault.py  # segfaults
```

I'm still trying to get to the bottom of this, but I suspect that the traceback object for the error keeps objects alive longer than expected, which causes them to be destroyed in the wrong order. Will report back when I know more.
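The suspected mechanism can be illustrated in plain Python, independently of RMM: an exception's traceback references the frame in which the exception was raised, and that frame keeps its local variables alive. The sketch below is a minimal illustration under assumptions: `Resource`, `failing_test`, and `demo` are hypothetical names, and holding on to the exception object stands in for a test runner retaining failure information. It shows only the lifetime extension, not the segfault itself.

```python
import gc
import weakref


class Resource:
    """Hypothetical stand-in for an object such as a memory resource."""


def failing_test(refs):
    res = Resource()               # local variable, like `mr` and `buf` in test_one
    refs.append(weakref.ref(res))  # a weak reference does not keep `res` alive
    raise RuntimeError("test failed")


def demo():
    refs = []
    saved = None
    try:
        failing_test(refs)
    except RuntimeError as exc:
        saved = exc  # assumption: stands in for a runner keeping the failure
    # The exception's traceback references failing_test's frame, whose
    # locals still include `res`, so the object outlives the function call.
    alive_while_saved = refs[0]() is not None
    del saved     # drop the last reference to the exception (and its traceback)
    gc.collect()  # not strictly needed under CPython refcounting; harmless
    alive_after_release = refs[0]() is not None
    return alive_while_saved, alive_after_release


print(demo())
```

Under CPython, `demo()` returns `(True, False)`: the object survives while the exception is held and is freed once the exception is released. The standard library's `traceback.clear_frames(tb)` exists to drop such frame locals explicitly.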

shwina added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and Python (Related to RMM Python API) labels Nov 30, 2022
shwina removed the ? - Needs Triage label Nov 30, 2022
shwina self-assigned this Nov 30, 2022
rapids-bot bot pushed a commit that referenced this issue Dec 13, 2022
Closes #1169.

Essentially, we are running into the situation described in https://cython.readthedocs.io/en/latest/src/userguide/extension_types.html#disabling-cycle-breaking-tp-clear with `UpstreamResourceAdaptor`.

The solution is to prevent clearing of `UpstreamResourceAdaptor` objects by decorating them with `no_gc_clear`.

Cython calls out the following:

> If you use no_gc_clear, it is important that any given reference cycle contains at least one object without no_gc_clear. Otherwise, the cycle cannot be broken, which is a memory leak.

The other object in RMM that we mark `@no_gc_clear` is `DeviceBuffer`, and a `DeviceBuffer` can keep a reference to an `UpstreamResourceAdaptor`. But an `UpstreamResourceAdaptor` cannot keep a reference to a `DeviceBuffer`, so instances of the two cannot form a reference cycle AFAICT.

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mark Harris (https://github.com/harrism)

URL: #1170
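The Cython rule quoted in the commit message can be restated as a graph property: every reference cycle contains at least one object without `no_gc_clear` exactly when the objects marked `no_gc_clear`, taken on their own, form no cycle. The sketch below checks that property on a small, hypothetical type-level reference graph. `has_unbreakable_cycle` is an illustrative helper, not part of RMM or Cython, and the graph contains only the edge discussed above (a `DeviceBuffer` referencing an `UpstreamResourceAdaptor`), ignoring any other references RMM objects may hold.

```python
from typing import Dict, Set


def has_unbreakable_cycle(edges: Dict[str, Set[str]], no_gc_clear: Set[str]) -> bool:
    """Return True if some reference cycle consists only of no_gc_clear nodes.

    `edges[a]` is the set of nodes that `a` may hold a reference to.
    """
    # Keep only no_gc_clear nodes and the edges between them: a cycle made
    # entirely of such nodes exists iff this induced subgraph has a cycle.
    sub = {n: {m for m in edges.get(n, set()) if m in no_gc_clear} for n in no_gc_clear}

    # Three-color depth-first search for cycle detection.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in sub}

    def visit(n: str) -> bool:
        color[n] = GREY  # n is on the current DFS path
        for m in sub[n]:
            if color[m] == GREY:
                return True  # back edge to an ancestor: cycle found
            if color[m] == WHITE and visit(m):
                return True
        color[n] = BLACK  # nothing reachable from n closes a cycle
        return False

    return any(color[n] == WHITE and visit(n) for n in sub)


NO_GC_CLEAR = {"DeviceBuffer", "UpstreamResourceAdaptor"}

# The situation described above: a buffer may reference an adaptor,
# but an adaptor never references a buffer.
rmm_edges = {"DeviceBuffer": {"UpstreamResourceAdaptor"}}

# Hypothetical counter-example: if an adaptor could also reference a buffer,
# the two no_gc_clear types could form a cycle that cannot be broken.
bad_edges = {
    "DeviceBuffer": {"UpstreamResourceAdaptor"},
    "UpstreamResourceAdaptor": {"DeviceBuffer"},
}

print(has_unbreakable_cycle(rmm_edges, NO_GC_CLEAR))  # False
print(has_unbreakable_cycle(bad_edges, NO_GC_CLEAR))  # True
```

Restricting the graph to `no_gc_clear` nodes turns the "at least one object without `no_gc_clear` in every cycle" condition into a plain acyclicity check.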
shwina added a commit to shwina/rmm that referenced this issue Dec 19, 2022