What are the soundness requirements for `dlopen`? #525

RalfJung · 2024-08-13T10:29:12Z

Using dlopen is a subtle art. On top of the usual requirements around symbol conflicts and ABI compatibility, Rust's handling of symbols adds certain extra assumptions that can lead to UB here: ideally, we'd make sure that symbols from "different" crates can never clash. During normal builds, this is ensured by checking that the StableCrateId is globally unique (and hashing everything into the StableCrateId that is considered as relevant for crate identity), but this check is bypassed by dlopen.

At the very least, this potential risk of collisions in dlopen seems worth documenting somewhere. On top of that, is there anything we could do to mitigate this problem? Making StableCrateId an actual cryptographic hash and 256 bits large is probably going to be prohibitively expensive, but maybe there is an alternative where only dlopen users have to pay for extra checks, and if you don't use dlopen it doesn't cost anything. One could imagine a rust_checked_dlopen or so that performs the crate ID uniqueness check at runtime, somehow. Is that realistic? Is it useful?

The text was updated successfully, but these errors were encountered:

VorpalBlade · 2024-08-13T13:30:44Z

What exactly are we trying to protect against?

Let me play devil's advocate here:

Due to lack of stable ABI you will most probably be using C ABI anyway, and no name mangling. You might be using stabby or similar (which builds on top of the C ABI), but arguably they are off doing their own thing.

dlsym is very basic and even in C++ that has a stable ABI it doesn't work well with C++ name mangling, you are generally working with extern C functions across dlopen/dlsym. You might have an extern C function that returns a more complex object full of C++ types and pointers but that can have a lot of footguns (same version of all involved types must be used and since there is no name mangling you can't detect this anyway).

So, assuming extern C API what can we even protect against? C ABI is fundamentally not safe due to lack of name mangling.

Is that not the usage scenario then what is? Both possible alternatives (stabby and abi_stable) already solve the safety concerns at a higher level. Is the list of things that such a layer needs to deal with what you want to end up with in this issue?

It would probably help to come up with some use cases, describing what could go wrong in order to figure this out. As it is, this issue seems broad and vague, or perhaps I'm misunderstanding it.

RalfJung · 2024-08-13T13:45:14Z

I'm not sure what the usual usecases here are.^^ If people only ever dlopen things that have a C ABI and nothing else (no Rust symbols exported), then indeed collisions in Rust's name mangling are entirely irrelevant. But is that really the only thing people do?

Cc @bjorn3

bjorn3 · 2024-08-13T14:11:18Z

Rustc dlopens codegen backends and uses the rust abi for this. The fact that rust has an unstable abi doesn't matter when you ensure that you use the same rustc version to compile the host and the plugin. For codegen backends using something like stabby or abi_stable is impractical as codegen backends are expected to use the exact same api's as rustc uses internally. Conversion of values at the abi boundary would result in unacceptable overhead.

VorpalBlade · 2024-08-13T14:26:10Z

Rustc dlopens codegen backends and uses the rust abi for this. The fact that rust has an unstable abi doesn't matter when you ensure that you use the same rustc version to compile the host and the plugin.

How does this deal with name mangling and dlsym though? Does it still use no_mangle or does it compute the expected mangled names and pass those to dlsym?

bjorn3 · 2024-08-13T14:28:55Z

For the functions in the plugin to be called by the host #[no_mangle] is used. For functions that the plugin calls, those are defined in a dylib which both the plugin and host use as regular rust dependency, ensuring that rustc correctly handles symbol mangling.

VorpalBlade · 2024-08-13T16:06:45Z

What bjorn3 said makes sense to me, when doing dlopen you need to use no_mangle. And ld.so takes care of resolving symbols called by the plugin. I guess there could be some possible issues there? As I understand it this is:

Program A depends on dylib B (normal load time dependency)
Program A then dlopens dylib C
Dylib C also depends on dylib B (normal load time dependency).
B exports normal mangled API
C exports a no_mangle API

What are the ways this can fail in?

Additional note: The Windows/Mac equivalents to dlopen may also have special consideration. I know that symbol resolution works differently for those (not a single global namespace) but I'm not an expert by any means, especially on those platforms.

Not sure how any of this could affect the opsem angle, and if people who don't care about portability will want to make use of the semantics of their platform of choice. I'm not entirely sure what the opsem angle on this even is, how does the AM represent dlopen/LoadLibrary even?

RalfJung · 2024-08-13T16:33:35Z

The concern is if dylib C depends on crate E, but E happens to have the same StableCrateId as B. Then the symbols of the two crates will get mixed up and everything explodes, even though it doesn't look like the dlopen is doing anything wrong.

VorpalBlade · 2024-08-13T16:47:55Z

@RalfJung from a pragmatic point of view two questions come to mind:

What is the probability of collision (assuming random chance, not nefarious actors)?
What is the threat model here? It realistically can't assume nefarious actors (if you load a malicious native library it is game over already, there is arbitrary code execution at the very moment of loading: IFUNCs (of xz fame), static constructors etc).

chorman0773 · 2024-08-13T17:03:00Z

I don't know that we can really express these soundness requirements in any tangible manner. It's like saying that you must not use #[export_name] to collide with a symbol name (especially since the language doesn't even guarantee the form of those symbols.

RalfJung · 2024-08-13T17:13:15Z

from a pragmatic point of view two questions come to mind:

StableCrateId is a 32bit hash (AFAIK), so (according to this) with 2900 crates in the overall dependency tree there is a 0.1% chance of collision if the hashes are assumed to be fully random.

See rust-lang/rust#10389 and rust-lang/rust#129030 for more of these discussions. In this issue, I am interested in exploring what could be done to fix this, not in discussing threat models. (This doesn't mean I think we must fix this, I just want to know what the options would be.)

bjorn3 · 2024-08-13T17:29:30Z

For symbol name collisions I believe you have to collide both the StableCrateId and crate name (assuming the v0 symbol mangling scheme). You are unlikely to use 2900 crates with the same name. If you have 3 crates with the same name the chance of a collision is only 10^-9 and at 93 crates with the same name you get a collision chance of 10^-6.

digama0 · 2024-08-13T17:33:50Z

Also re: "what's the opsem angle", I think the title question is clear enough: What are the requirements for a programmer to be able to call dlopen without causing UB? This requires understanding (1) what are the bad situations that can arise that we would like to classify as UB, and (2) what are the things that the programmer or user did that can lead to the bad situation, which is what we want to put on the warning label.

I think a threat model only comes up when it comes to prioritizing the safety requirements in (2) for human consumption, but abstractly it should be possible to come up with an objective answer to the question.

My knowledge of dynamic linking protocols is pretty low so I can't answer the question itself, though. Brainstorming some things based on what has been brought up:

One case of a (1) bad situation is that you dlopen a function with a name that already exists, I think this has to be UB. It's not clear to me whether we need to make an exception for "benign redeclarations".
There are many potential (2) causes for name collisions though.
- Hash collisions are technically possible but presumably unlikely. It is unclear whether we want to blame the programmer or the compiler for this though, it's not really a user facing issue.
- #[no_mangle] collisions are clearly on the programmer, that's why it's unsafe, but there are reasons it might not be obvious or it may be a distributed responsibility bug. I'd really wish we could catch these issues with a nice error message.
- Is there something else I'm missing? Collisions between a #[no_mangle] local function and an internally defined C function in the dlopen'd library?

VorpalBlade · 2024-08-13T17:39:26Z

Thank you, we now have a concrete issue that can lead to unsoundness, which is much easier to dicuss than the general issues with dlopen (which obviously have more, such as general no_mangle collisions etc).

Some thoughts:

One thing that comes to mind is that dlopen has flags that affect name resolution of later dlopen as well. In particular RTLD_GLOBAL and RTLD_DEEPBIND might have interesting interactions. In any case the flags mean there isn't just one single behaviour.

Does eager binding (RTLD_NOW and the corresponding ld flags) help at all? I think the newly loaded library will still get messed up in case of a collision, but existing code will be unaffected. So not good enough.

On glibc it looks like dlmopen would actually avoid the issue you describe though by putting everything into a separate namespace! Not portable though. Also only 16 separate namespaces are supported apparently. So not great.

digama0 · 2024-08-13T17:44:03Z

Thank you, we now have a concrete issue that can lead to unsoundness, which is much easier to dicuss than the general issues with dlopen (which obviously have more, such as general no_mangle collisions etc).

I'm interested to see the complete list though, not just some example issues. We need to be exhaustive regarding sources of UB.

VorpalBlade · 2024-08-13T17:45:37Z

One case of a (1) bad situation is that you dlopen a function with a name that already exists, I think this has to be UB. It's not clear to me whether we need to make an exception for "benign redeclarations".

That is actually fine at the top level, and standard practice for plugins: the plugin exports one or more entry points with well known names (no-mangled). These are then accessed using dlsym.

The key point here is that you resolve these with dlsym in a handle to the specific dlopened library. You don't attempt to call them directly. Also already resolved symbols stay resolved (I don't know how dlopen interacts with lazy binding actually, that is an interesting question!).

digama0 · 2024-08-13T17:48:20Z

Is there a way to use dlopen such that all the names stay in a separate universe, as it were, and you only use dlsym to look up function pointers in the library by name instead of having name coincidences cause random things to be merged? That sounds like a much saner linking strategy to me, much easier to ensure soundness that way.

dlight · 2024-08-13T17:50:10Z

Due to lack of stable ABI you will most probably be using C ABI anyway.

Well Bevy has that scheme to improve compile times where the Bevy crate itself is compiled into a dynamic library, bypassing the slow (static) linker to link at runtime. I don't know if it uses dlopen, but it could. Or if this is not allowed, it should be documented.

Also: if Rust's name mangling is a soundness issue for dlopen I suppose it is also an issue for standard dynamic linking? (using ld.so) Dynamic linking of Rust libraries with Rust ABI should be well supported when you guarantee that everything was compiled by the same compiler.

RalfJung · 2024-08-13T17:55:22Z

Also: if Rust's name mangling is a soundness issue for dlopen I suppose it is also an issue for standard dynamic linking? (using ld.so) Dynamic linking of Rust libraries with Rust ABI should be well supported when you guarantee that everything was compiled by the same compiler.

I was wondering the same thing. However, in the common case the crates you depend on will be known at compile-time to the Rust compiler, they are just not linked-in. In that case the compiler will still check for StableCrateId collisions so there should be no issues.

I don't know if there are situations where regular dynamic linking can bypass the hash collision check.

VorpalBlade · 2024-08-13T17:56:49Z

Is there a way to use dlopen such that all the names stay in a separate universe, as it were, and you only use dlsym to look up function pointers in the library by name instead of having name coincidences cause random things to be merged? That sounds like a much saner linking strategy to me, much easier to ensure soundness that way.

That seems to be what the dlmopen glibc extension is (but see caveats in my above comment on it). Also as I understand it, you would get separate copies of every loaded transitive dependency, which might interact poorly with assuming statics or functions have address stability (if you get two different instances of libstd.so for example) (see #522, I guess).

Also on non-ELF (so, Windows PE and MacOS MachO) I believe resolution doesn't use a global name space to begin with (you don't ask for symbol X, you ask for symbol X from library Y). I don't know if those are "fool-proof" though, e.g. what happens if you have two LibA.dll from two different directories.

bjorn3 · 2024-08-13T18:07:43Z

Well Bevy has that scheme to improve compile times where the Bevy crate itself is compiled into a dynamic library, bypassing the slow (static) linker to link at runtime. I don't know if it uses dlopen, but it could. Or if this is not allowed, it should be documented.

Bevy doesn't use dlopen for loading libbevy_dynamic.so, nor can it. Bevy has support for loading dylib plugins though using bevy_dynamic_plugin, but it has been marked as deprecated in bevyengine/bevy#13080 as there were several crashes caused by it.

chorman0773 · 2024-08-13T18:24:37Z

StableCrateId is a 32bit hash (AFAIK), so (according to this) with 2900 crates in the overall dependency tree there is a 0.1% chance of collision if the hashes are assumed to be fully random.

And with 64k crates, it's a 50% chance. With a whole lot of things being dlopened (my own project, lccc, comes to mind here, where each frontend, backend, and optimizer are separate dsos), hitting 64k aggregate doesn't seem unrealistic.

bjorn3 · 2024-08-13T18:48:32Z

64k crates all with the same crate name is very unlikely to happen.

Diggsey · 2024-08-13T20:51:39Z

Also on non-ELF (so, Windows PE and MacOS MachO) I believe resolution doesn't use a global name space to begin with (you don't ask for symbol X, you ask for symbol X from library Y). I don't know if those are "fool-proof" though, e.g. what happens if you have two LibA.dll from two different directories.

On windows, DLLs are more like executables that happen to be running in the same address space rather than shared libraries. You have to make sure you're loading the correct DLL in the first place, but there is no danger of symbol conflicts between DLLs. When loading a DLL by name, there is a specific search order (assuming an absolute path is not used). Manifests can be used to ensure correct versions of dependencies and transitive dependencies - https://learn.microsoft.com/en-us/windows/win32/sbscs/about-side-by-side-assemblies-

michaelwoerister · 2024-08-14T09:55:38Z

from a pragmatic point of view two questions come to mind:

StableCrateId is a 32bit hash (AFAIK), so (according to this) with 2900 crates in the overall dependency tree there is a 0.1% chance of collision if the hashes are assumed to be fully random.

To clarify: StableCrateId is a 64-bit hash.

michaelwoerister · 2024-08-14T10:06:05Z

Given 64-bit hashes, one would get a collision probability of 10^-15 when having 190 versions of the same crate in the crate graph (according to the table here).

RalfJung · 2024-08-14T12:43:04Z

Turns out there are actually two levels of hashes here, with a theoretical chance of collision in each level: cargo hashes some stuff into -Cmetadata, and then rustc hashes that together with other stuff into StableCrateId.

I don't think that changes anything about the probabilities, but it seems to make it harder to actually check for collisions, since we'd need the original data cargo hashes together to ensure it is all globally unique.

michaelwoerister · 2024-08-14T12:47:11Z

Cargo could probably switch to using 256 bit cryptographic hashes to reduce the concern here.

RalfJung · 2024-08-14T13:36:44Z

For symbol name collisions I believe you have to collide both the StableCrateId and crate name (assuming the v0 symbol mangling scheme)

The "I believe" is giving me pause, would be good to have that checked. :)
So the crate name goes into StableCrateId but is then also separately repeated in the symbol name?

Also, v0 symbol mangling is still not stable...

bjorn3 · 2024-08-14T13:44:30Z

So the crate name goes into StableCrateId but is then also separately repeated in the symbol name?

Yes. The demangled version of a crate reference in the v0 symbol mangling scheme is crate_name[crate_disambiguator] where the crate disambiguator happens to be the StableCrateId in the current rustc version, though any byte sequence unique between different crates is allowed by the specification.

Also, v0 symbol mangling is still not stable...

It is stable, but not the default. You can enable it using -Csymbol-mangling-version=v0.

michaelwoerister · 2024-08-15T09:29:03Z

We could probably harden v0 symbol names against collisions by adding a single (but wide) hash to each symbol that includes more information about the all crate-ids occurring in the name. So instead of

my_crate[{64-bit hash}]::foo<my_other_crate[{64-bit hash}]::Bar>

we could have something like

my_crate[0]::foo<my_other_crate[0]::Bar> @ {256 bit hash}

where @ {256 bit hash} is some kind of suffix on the symbol name that includes the full information that would have gone into computing the StableCrateId of both my_crate and my_other_crate. Assuming that we trust a 256 bit cryptographic hash to have truly negligible collision risk, that might solve the problem (but would not be cheap in terms of symbol name length as encoding the hash would take 42 bytes).

Some other kind of means that does not rely on hashing at all might be preferable. I don't know if something like rust_checked_dlopen is feasible.

bjorn3 · 2024-08-15T09:53:03Z

That would actually break the existing collision detection for the non-dlopen case. It also means if you have two versions of the same crate, you can no longer know which crate was used after demangling without parsing the crate metadata of all crates and trying every combination of StableCrateId to see if it matches the given combined hash.

michaelwoerister · 2024-08-15T13:03:23Z

That would actually break the existing collision detection for the non-dlopen case.

In what way?

It also means if you have two versions of the same crate, you can no longer know which crate was used after demangling without parsing the crate metadata of all crates and trying every combination of StableCrateId to see if it matches the given combined hash.

Yes, that's true. But that information is already pretty opaque, right?

bjorn3 · 2024-08-15T13:17:03Z

That would actually break the existing collision detection for the non-dlopen case.

In what way?

Rustc is guaranteed to error when StableCrateId collide if this collision occurs when both crates with the same StableCrateId are used as (indirect) dependencies by a crate. When you the StableCrateId in the symbol names together, this may result in a collision even if no StableCrateId collides.

Yes, that's true. But that information is already pretty opaque, right?

In the case of two my_crate[{64-bit hash}]::foo<my_other_crate[{64-bit hash}]::Bar> with distinct hashes, you can distinguish my_crate being a different version from my_other_crate being a different version. As well as check if my_crate[{64-bit hash}]::foo and my_crate[{64-bit hash}]::bar are from the same crate. And finally you can use rustc -Zls=root to read the disambiguator from the rlib/rmeta file of a crate.

RalfJung · 2024-08-15T13:57:25Z

Rustc is guaranteed to error when StableCrateId collide if this collision occurs when both crates with the same StableCrateId are used as (indirect) dependencies by a crate. When you the StableCrateId in the symbol names together, this may result in a collision even if no StableCrateId collides.

If you do that with a cryptographic hash, the chances of a collision are astronomically low so we may disregard that possibility. Like, there's not a single known example of a collision for SHA256 or other comparable hashes.

michaelwoerister · 2024-08-15T14:23:18Z

When you the StableCrateId in the symbol names together, this may result in a collision even if no StableCrateId collides.

Actually, the proposal would be something like:

For every crate losslessly store the full information that goes into computing StableCrateId in crate metadata, so it is accessible downstream.
When generating a symbol name that references a given crate, feed that full information into the cryptographic hash.

There would only be one level of hashing (disregarding that Cargo feeds hashes into -Cmetadata) and that would be a cryptographic hash. So I think the risk of collisions should be very low.

In the case of two my_crate[{64-bit hash}]::foo<my_other_crate[{64-bit hash}]::Bar> with distinct hashes, you can distinguish my_crate being a different version from my_other_crate being a different version. [...]

This could be mitigated by using short prefixes of the StableCrateId like git does for commit hashes. But it's not a perfect solution as it might lead to "phantom collisions" that don't affect correctness but are misleading to humans.

chorman0773 · 2024-08-15T15:49:11Z

I think that dylibs are just a pain in the ass in general for opsem - so are rlibs to an extent, because you can "guess" (or determine) the mangled name of a symbol and deliberately produce a function with that export_name.

rustc could probably fix at least some here by using STV_PROTECTED. I don't know what opsem can do here. DSOs are a mess everywhere.

bjorn3 · 2024-08-15T17:33:50Z

rustc could probably fix at least some here by using STV_PROTECTED. I don't know what opsem can do here. DSOs are a mess everywhere.

I agree we should use protected visibility for non-#[no_mangle] symbols. We already have rust-lang/rust#105518 open as issue for that. I tried to implement it once, but got linker errors on arm64.

RalfJung mentioned this issue Aug 13, 2024

Collisions in type_id rust-lang/rust#10389

Closed

This comment was marked as off-topic.

Sign in to view

VorpalBlade mentioned this issue Aug 13, 2024

What are the requirements for unloading a library (dlclose)? #526

Open

RalfJung mentioned this issue Aug 14, 2024

Language vs. implementation threat models and implications for TypeId collision resistance rust-lang/rust#129030

Open

What are the soundness requirements for dlopen? #525

What are the soundness requirements for dlopen? #525

Comments

RalfJung commented Aug 13, 2024 • edited Loading

VorpalBlade commented Aug 13, 2024

RalfJung commented Aug 13, 2024

bjorn3 commented Aug 13, 2024

VorpalBlade commented Aug 13, 2024

bjorn3 commented Aug 13, 2024

VorpalBlade commented Aug 13, 2024

RalfJung commented Aug 13, 2024 • edited Loading

VorpalBlade commented Aug 13, 2024

chorman0773 commented Aug 13, 2024

RalfJung commented Aug 13, 2024 • edited Loading

bjorn3 commented Aug 13, 2024

digama0 commented Aug 13, 2024 • edited Loading

VorpalBlade commented Aug 13, 2024

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

digama0 commented Aug 13, 2024

VorpalBlade commented Aug 13, 2024

digama0 commented Aug 13, 2024

dlight commented Aug 13, 2024

RalfJung commented Aug 13, 2024

VorpalBlade commented Aug 13, 2024 • edited Loading

bjorn3 commented Aug 13, 2024

chorman0773 commented Aug 13, 2024

bjorn3 commented Aug 13, 2024

Diggsey commented Aug 13, 2024

michaelwoerister commented Aug 14, 2024

michaelwoerister commented Aug 14, 2024

RalfJung commented Aug 14, 2024

michaelwoerister commented Aug 14, 2024

RalfJung commented Aug 14, 2024

bjorn3 commented Aug 14, 2024 • edited Loading

michaelwoerister commented Aug 15, 2024

bjorn3 commented Aug 15, 2024

michaelwoerister commented Aug 15, 2024

bjorn3 commented Aug 15, 2024

RalfJung commented Aug 15, 2024 • edited Loading

michaelwoerister commented Aug 15, 2024

chorman0773 commented Aug 15, 2024

bjorn3 commented Aug 15, 2024

What are the soundness requirements for `dlopen`? #525

What are the soundness requirements for `dlopen`? #525

RalfJung commented Aug 13, 2024 •

edited

Loading

RalfJung commented Aug 13, 2024 •

edited

Loading

RalfJung commented Aug 13, 2024 •

edited

Loading

digama0 commented Aug 13, 2024 •

edited

Loading

VorpalBlade commented Aug 13, 2024 •

edited

Loading

bjorn3 commented Aug 14, 2024 •

edited

Loading

RalfJung commented Aug 15, 2024 •

edited

Loading