Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What are the soundness requirements for dlopen? #525

Open
RalfJung opened this issue Aug 13, 2024 · 40 comments
Open

What are the soundness requirements for dlopen? #525

RalfJung opened this issue Aug 13, 2024 · 40 comments

Comments

@RalfJung
Copy link
Member

RalfJung commented Aug 13, 2024

Using dlopen is a subtle art. On top of the usual requirements around symbol conflicts and ABI compatibility, Rust's handling of symbols adds certain extra assumptions that can lead to UB here: ideally, we'd make sure that symbols from "different" crates can never clash. During normal builds, this is ensured by checking that the StableCrateId is globally unique (and hashing everything into the StableCrateId that is considered as relevant for crate identity), but this check is bypassed by dlopen.

At the very least, this potential risk of collisions in dlopen seems worth documenting somewhere. On top of that, is there anything we could do to mitigate this problem? Making StableCrateId an actual cryptographic hash and 256 bits large is probably going to be prohibitively expensive, but maybe there is an alternative where only dlopen users have to pay for extra checks, and if you don't use dlopen it doesn't cost anything. One could imagine a rust_checked_dlopen or so that performs the crate ID uniqueness check at runtime, somehow. Is that realistic? Is it useful?

@VorpalBlade
Copy link

What exactly are we trying to protect against?

Let me play devil's advocate here:

Due to lack of stable ABI you will most probably be using C ABI anyway, and no name mangling. You might be using stabby or similar (which builds on top of the C ABI), but arguably they are off doing their own thing.

dlsym is very basic and even in C++ that has a stable ABI it doesn't work well with C++ name mangling, you are generally working with extern C functions across dlopen/dlsym. You might have an extern C function that returns a more complex object full of C++ types and pointers but that can have a lot of footguns (same version of all involved types must be used and since there is no name mangling you can't detect this anyway).

So, assuming extern C API what can we even protect against? C ABI is fundamentally not safe due to lack of name mangling.

Is that not the usage scenario then what is? Both possible alternatives (stabby and abi_stable) already solve the safety concerns at a higher level. Is the list of things that such a layer needs to deal with what you want to end up with in this issue?

It would probably help to come up with some use cases, describing what could go wrong in order to figure this out. As it is, this issue seems broad and vague, or perhaps I'm misunderstanding it.

@RalfJung
Copy link
Member Author

I'm not sure what the usual usecases here are.^^ If people only ever dlopen things that have a C ABI and nothing else (no Rust symbols exported), then indeed collisions in Rust's name mangling are entirely irrelevant. But is that really the only thing people do?

Cc @bjorn3

@bjorn3
Copy link
Member

bjorn3 commented Aug 13, 2024

Rustc dlopens codegen backends and uses the rust abi for this. The fact that rust has an unstable abi doesn't matter when you ensure that you use the same rustc version to compile the host and the plugin. For codegen backends using something like stabby or abi_stable is impractical as codegen backends are expected to use the exact same api's as rustc uses internally. Conversion of values at the abi boundary would result in unacceptable overhead.

@VorpalBlade
Copy link

Rustc dlopens codegen backends and uses the rust abi for this. The fact that rust has an unstable abi doesn't matter when you ensure that you use the same rustc version to compile the host and the plugin.

How does this deal with name mangling and dlsym though? Does it still use no_mangle or does it compute the expected mangled names and pass those to dlsym?

@bjorn3
Copy link
Member

bjorn3 commented Aug 13, 2024

For the functions in the plugin to be called by the host #[no_mangle] is used. For functions that the plugin calls, those are defined in a dylib which both the plugin and host use as regular rust dependency, ensuring that rustc correctly handles symbol mangling.

@VorpalBlade
Copy link

What bjorn3 said makes sense to me, when doing dlopen you need to use no_mangle. And ld.so takes care of resolving symbols called by the plugin. I guess there could be some possible issues there? As I understand it this is:

  • Program A depends on dylib B (normal load time dependency)
  • Program A then dlopens dylib C
  • Dylib C also depends on dylib B (normal load time dependency).
  • B exports normal mangled API
  • C exports a no_mangle API

What are the ways this can fail in?

Additional note: The Windows/Mac equivalents to dlopen may also have special consideration. I know that symbol resolution works differently for those (not a single global namespace) but I'm not an expert by any means, especially on those platforms.

Not sure how any of this could affect the opsem angle, and if people who don't care about portability will want to make use of the semantics of their platform of choice. I'm not entirely sure what the opsem angle on this even is, how does the AM represent dlopen/LoadLibrary even?

@RalfJung
Copy link
Member Author

RalfJung commented Aug 13, 2024

The concern is if dylib C depends on crate E, but E happens to have the same StableCrateId as B. Then the symbols of the two crates will get mixed up and everything explodes, even though it doesn't look like the dlopen is doing anything wrong.

@VorpalBlade
Copy link

@RalfJung from a pragmatic point of view two questions come to mind:

  • What is the probability of collision (assuming random chance, not nefarious actors)?
  • What is the threat model here? It realistically can't assume nefarious actors (if you load a malicious native library it is game over already, there is arbitrary code execution at the very moment of loading: IFUNCs (of xz fame), static constructors etc).

@chorman0773
Copy link
Contributor

I don't know that we can really express these soundness requirements in any tangible manner. It's like saying that you must not use #[export_name] to collide with a symbol name (especially since the language doesn't even guarantee the form of those symbols.

@RalfJung
Copy link
Member Author

RalfJung commented Aug 13, 2024

from a pragmatic point of view two questions come to mind:

StableCrateId is a 32bit hash (AFAIK), so (according to this) with 2900 crates in the overall dependency tree there is a 0.1% chance of collision if the hashes are assumed to be fully random.

See rust-lang/rust#10389 and rust-lang/rust#129030 for more of these discussions. In this issue, I am interested in exploring what could be done to fix this, not in discussing threat models. (This doesn't mean I think we must fix this, I just want to know what the options would be.)

@bjorn3
Copy link
Member

bjorn3 commented Aug 13, 2024

For symbol name collisions I believe you have to collide both the StableCrateId and crate name (assuming the v0 symbol mangling scheme). You are unlikely to use 2900 crates with the same name. If you have 3 crates with the same name the chance of a collision is only 10^-9 and at 93 crates with the same name you get a collision chance of 10^-6.

@digama0
Copy link

digama0 commented Aug 13, 2024

Also re: "what's the opsem angle", I think the title question is clear enough: What are the requirements for a programmer to be able to call dlopen without causing UB? This requires understanding (1) what are the bad situations that can arise that we would like to classify as UB, and (2) what are the things that the programmer or user did that can lead to the bad situation, which is what we want to put on the warning label.

I think a threat model only comes up when it comes to prioritizing the safety requirements in (2) for human consumption, but abstractly it should be possible to come up with an objective answer to the question.

My knowledge of dynamic linking protocols is pretty low so I can't answer the question itself, though. Brainstorming some things based on what has been brought up:

  • One case of a (1) bad situation is that you dlopen a function with a name that already exists, I think this has to be UB. It's not clear to me whether we need to make an exception for "benign redeclarations".
  • There are many potential (2) causes for name collisions though.
    • Hash collisions are technically possible but presumably unlikely. It is unclear whether we want to blame the programmer or the compiler for this though, it's not really a user facing issue.
    • #[no_mangle] collisions are clearly on the programmer, that's why it's unsafe, but there are reasons it might not be obvious or it may be a distributed responsibility bug. I'd really wish we could catch these issues with a nice error message.
    • Is there something else I'm missing? Collisions between a #[no_mangle] local function and an internally defined C function in the dlopen'd library?

@VorpalBlade
Copy link

Thank you, we now have a concrete issue that can lead to unsoundness, which is much easier to dicuss than the general issues with dlopen (which obviously have more, such as general no_mangle collisions etc).

Some thoughts:

One thing that comes to mind is that dlopen has flags that affect name resolution of later dlopen as well. In particular RTLD_GLOBAL and RTLD_DEEPBIND might have interesting interactions. In any case the flags mean there isn't just one single behaviour.

Does eager binding (RTLD_NOW and the corresponding ld flags) help at all? I think the newly loaded library will still get messed up in case of a collision, but existing code will be unaffected. So not good enough.

On glibc it looks like dlmopen would actually avoid the issue you describe though by putting everything into a separate namespace! Not portable though. Also only 16 separate namespaces are supported apparently. So not great.

@VorpalBlade

This comment was marked as off-topic.

@digama0

This comment was marked as off-topic.

@RalfJung

This comment was marked as off-topic.

@digama0
Copy link

digama0 commented Aug 13, 2024

Thank you, we now have a concrete issue that can lead to unsoundness, which is much easier to dicuss than the general issues with dlopen (which obviously have more, such as general no_mangle collisions etc).

I'm interested to see the complete list though, not just some example issues. We need to be exhaustive regarding sources of UB.

@VorpalBlade
Copy link

One case of a (1) bad situation is that you dlopen a function with a name that already exists, I think this has to be UB. It's not clear to me whether we need to make an exception for "benign redeclarations".

That is actually fine at the top level, and standard practice for plugins: the plugin exports one or more entry points with well known names (no-mangled). These are then accessed using dlsym.

The key point here is that you resolve these with dlsym in a handle to the specific dlopened library. You don't attempt to call them directly. Also already resolved symbols stay resolved (I don't know how dlopen interacts with lazy binding actually, that is an interesting question!).

@digama0
Copy link

digama0 commented Aug 13, 2024

Is there a way to use dlopen such that all the names stay in a separate universe, as it were, and you only use dlsym to look up function pointers in the library by name instead of having name coincidences cause random things to be merged? That sounds like a much saner linking strategy to me, much easier to ensure soundness that way.

@dlight
Copy link

dlight commented Aug 13, 2024

Due to lack of stable ABI you will most probably be using C ABI anyway.

Well Bevy has that scheme to improve compile times where the Bevy crate itself is compiled into a dynamic library, bypassing the slow (static) linker to link at runtime. I don't know if it uses dlopen, but it could. Or if this is not allowed, it should be documented.

Also: if Rust's name mangling is a soundness issue for dlopen I suppose it is also an issue for standard dynamic linking? (using ld.so) Dynamic linking of Rust libraries with Rust ABI should be well supported when you guarantee that everything was compiled by the same compiler.

@RalfJung
Copy link
Member Author

Also: if Rust's name mangling is a soundness issue for dlopen I suppose it is also an issue for standard dynamic linking? (using ld.so) Dynamic linking of Rust libraries with Rust ABI should be well supported when you guarantee that everything was compiled by the same compiler.

I was wondering the same thing. However, in the common case the crates you depend on will be known at compile-time to the Rust compiler, they are just not linked-in. In that case the compiler will still check for StableCrateId collisions so there should be no issues.

I don't know if there are situations where regular dynamic linking can bypass the hash collision check.

@VorpalBlade
Copy link

VorpalBlade commented Aug 13, 2024

Is there a way to use dlopen such that all the names stay in a separate universe, as it were, and you only use dlsym to look up function pointers in the library by name instead of having name coincidences cause random things to be merged? That sounds like a much saner linking strategy to me, much easier to ensure soundness that way.

That seems to be what the dlmopen glibc extension is (but see caveats in my above comment on it). Also as I understand it, you would get separate copies of every loaded transitive dependency, which might interact poorly with assuming statics or functions have address stability (if you get two different instances of libstd.so for example) (see #522, I guess).

Also on non-ELF (so, Windows PE and MacOS MachO) I believe resolution doesn't use a global name space to begin with (you don't ask for symbol X, you ask for symbol X from library Y). I don't know if those are "fool-proof" though, e.g. what happens if you have two LibA.dll from two different directories.

@bjorn3
Copy link
Member

bjorn3 commented Aug 13, 2024

Well Bevy has that scheme to improve compile times where the Bevy crate itself is compiled into a dynamic library, bypassing the slow (static) linker to link at runtime. I don't know if it uses dlopen, but it could. Or if this is not allowed, it should be documented.

Bevy doesn't use dlopen for loading libbevy_dynamic.so, nor can it. Bevy has support for loading dylib plugins though using bevy_dynamic_plugin, but it has been marked as deprecated in bevyengine/bevy#13080 as there were several crashes caused by it.

@chorman0773
Copy link
Contributor

StableCrateId is a 32bit hash (AFAIK), so (according to this) with 2900 crates in the overall dependency tree there is a 0.1% chance of collision if the hashes are assumed to be fully random.

And with 64k crates, it's a 50% chance. With a whole lot of things being dlopened (my own project, lccc, comes to mind here, where each frontend, backend, and optimizer are separate dsos), hitting 64k aggregate doesn't seem unrealistic.

@bjorn3
Copy link
Member

bjorn3 commented Aug 13, 2024

64k crates all with the same crate name is very unlikely to happen.

@Diggsey
Copy link

Diggsey commented Aug 13, 2024

Also on non-ELF (so, Windows PE and MacOS MachO) I believe resolution doesn't use a global name space to begin with (you don't ask for symbol X, you ask for symbol X from library Y). I don't know if those are "fool-proof" though, e.g. what happens if you have two LibA.dll from two different directories.

On windows, DLLs are more like executables that happen to be running in the same address space rather than shared libraries. You have to make sure you're loading the correct DLL in the first place, but there is no danger of symbol conflicts between DLLs. When loading a DLL by name, there is a specific search order (assuming an absolute path is not used). Manifests can be used to ensure correct versions of dependencies and transitive dependencies - https://learn.microsoft.com/en-us/windows/win32/sbscs/about-side-by-side-assemblies-

@michaelwoerister
Copy link
Member

from a pragmatic point of view two questions come to mind:

StableCrateId is a 32bit hash (AFAIK), so (according to this) with 2900 crates in the overall dependency tree there is a 0.1% chance of collision if the hashes are assumed to be fully random.

To clarify: StableCrateId is a 64-bit hash.

@michaelwoerister
Copy link
Member

Given 64-bit hashes, one would get a collision probability of 10-15 when having 190 versions of the same crate in the crate graph (according to the table here).

@RalfJung
Copy link
Member Author

Turns out there are actually two levels of hashes here, with a theoretical chance of collision in each level: cargo hashes some stuff into -Cmetadata, and then rustc hashes that together with other stuff into StableCrateId.

I don't think that changes anything about the probabilities, but it seems to make it harder to actually check for collisions, since we'd need the original data cargo hashes together to ensure it is all globally unique.

@michaelwoerister
Copy link
Member

Cargo could probably switch to using 256 bit cryptographic hashes to reduce the concern here.

@RalfJung
Copy link
Member Author

For symbol name collisions I believe you have to collide both the StableCrateId and crate name (assuming the v0 symbol mangling scheme)

The "I believe" is giving me pause, would be good to have that checked. :)
So the crate name goes into StableCrateId but is then also separately repeated in the symbol name?

Also, v0 symbol mangling is still not stable...

@bjorn3
Copy link
Member

bjorn3 commented Aug 14, 2024

So the crate name goes into StableCrateId but is then also separately repeated in the symbol name?

Yes. The demangled version of a crate reference in the v0 symbol mangling scheme is crate_name[crate_disambiguator] where the crate disambiguator happens to be the StableCrateId in the current rustc version, though any byte sequence unique between different crates is allowed by the specification.

Also, v0 symbol mangling is still not stable...

It is stable, but not the default. You can enable it using -Csymbol-mangling-version=v0.

@michaelwoerister
Copy link
Member

We could probably harden v0 symbol names against collisions by adding a single (but wide) hash to each symbol that includes more information about the all crate-ids occurring in the name. So instead of

my_crate[{64-bit hash}]::foo<my_other_crate[{64-bit hash}]::Bar>

we could have something like

my_crate[0]::foo<my_other_crate[0]::Bar> @ {256 bit hash}

where @ {256 bit hash} is some kind of suffix on the symbol name that includes the full information that would have gone into computing the StableCrateId of both my_crate and my_other_crate. Assuming that we trust a 256 bit cryptographic hash to have truly negligible collision risk, that might solve the problem (but would not be cheap in terms of symbol name length as encoding the hash would take 42 bytes).

Some other kind of means that does not rely on hashing at all might be preferable. I don't know if something like rust_checked_dlopen is feasible.

@bjorn3
Copy link
Member

bjorn3 commented Aug 15, 2024

That would actually break the existing collision detection for the non-dlopen case. It also means if you have two versions of the same crate, you can no longer know which crate was used after demangling without parsing the crate metadata of all crates and trying every combination of StableCrateId to see if it matches the given combined hash.

@michaelwoerister
Copy link
Member

That would actually break the existing collision detection for the non-dlopen case.

In what way?

It also means if you have two versions of the same crate, you can no longer know which crate was used after demangling without parsing the crate metadata of all crates and trying every combination of StableCrateId to see if it matches the given combined hash.

Yes, that's true. But that information is already pretty opaque, right?

@bjorn3
Copy link
Member

bjorn3 commented Aug 15, 2024

That would actually break the existing collision detection for the non-dlopen case.

In what way?

Rustc is guaranteed to error when StableCrateId collide if this collision occurs when both crates with the same StableCrateId are used as (indirect) dependencies by a crate. When you the StableCrateId in the symbol names together, this may result in a collision even if no StableCrateId collides.

Yes, that's true. But that information is already pretty opaque, right?

In the case of two my_crate[{64-bit hash}]::foo<my_other_crate[{64-bit hash}]::Bar> with distinct hashes, you can distinguish my_crate being a different version from my_other_crate being a different version. As well as check if my_crate[{64-bit hash}]::foo and my_crate[{64-bit hash}]::bar are from the same crate. And finally you can use rustc -Zls=root to read the disambiguator from the rlib/rmeta file of a crate.

@RalfJung
Copy link
Member Author

RalfJung commented Aug 15, 2024

Rustc is guaranteed to error when StableCrateId collide if this collision occurs when both crates with the same StableCrateId are used as (indirect) dependencies by a crate. When you the StableCrateId in the symbol names together, this may result in a collision even if no StableCrateId collides.

If you do that with a cryptographic hash, the chances of a collision are astronomically low so we may disregard that possibility. Like, there's not a single known example of a collision for SHA256 or other comparable hashes.

@michaelwoerister
Copy link
Member

When you the StableCrateId in the symbol names together, this may result in a collision even if no StableCrateId collides.

Actually, the proposal would be something like:

  • For every crate losslessly store the full information that goes into computing StableCrateId in crate metadata, so it is accessible downstream.
  • When generating a symbol name that references a given crate, feed that full information into the cryptographic hash.

There would only be one level of hashing (disregarding that Cargo feeds hashes into -Cmetadata) and that would be a cryptographic hash. So I think the risk of collisions should be very low.

In the case of two my_crate[{64-bit hash}]::foo<my_other_crate[{64-bit hash}]::Bar> with distinct hashes, you can distinguish my_crate being a different version from my_other_crate being a different version. [...]

This could be mitigated by using short prefixes of the StableCrateId like git does for commit hashes. But it's not a perfect solution as it might lead to "phantom collisions" that don't affect correctness but are misleading to humans.

@chorman0773
Copy link
Contributor

I think that dylibs are just a pain in the ass in general for opsem - so are rlibs to an extent, because you can "guess" (or determine) the mangled name of a symbol and deliberately produce a function with that export_name.

rustc could probably fix at least some here by using STV_PROTECTED. I don't know what opsem can do here. DSOs are a mess everywhere.

@bjorn3
Copy link
Member

bjorn3 commented Aug 15, 2024

rustc could probably fix at least some here by using STV_PROTECTED. I don't know what opsem can do here. DSOs are a mess everywhere.

I agree we should use protected visibility for non-#[no_mangle] symbols. We already have rust-lang/rust#105518 open as issue for that. I tried to implement it once, but got linker errors on arm64.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants