
(Option to) Fingerprint by file contents instead of mtime #6529

Open
illicitonion opened this issue Jan 9, 2019 · 65 comments · May be fixed by #14137
Labels
A-rebuild-detection  Area: rebuild detection and fingerprinting
C-feature-request    Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted`
S-needs-design       Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted.

Comments

@illicitonion
Contributor

illicitonion commented Jan 9, 2019

Describe the problem you are trying to solve
The primary problem I have is that when building my code on Travis, the actual code in my workspace builds every time, even though much of it hasn't changed and I have target directory caching on. The reason is that Travis makes a fresh clone of my git repo, which doesn't preserve mtimes. This can add about 5 minutes to every Travis run. My project is mixed Rust and non-Rust code, so this adds 5 minutes to those runs even if no Rust code has been affected. I started futzing with mtimes, but that seems fragile and doesn't solve the root of the problem.

Additionally, edit-undo loops cause re-compilation locally, which is a little annoying.

Describe the solution you'd like
Add a new LocalFingerprint::ContentBased(Digest, PathBuf) variant to

#[derive(Serialize, Deserialize, Hash)]
enum LocalFingerprint {
    Precalculated(String),
    MtimeBased(MtimeSlot, PathBuf),
    EnvBased(String, Option<String>),
}
which reads the content of the PathBuf, passes it through a SipHasher, and mixes that into any aggregate fingerprints. Use this instead of LocalFingerprint::MtimeBased.
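A minimal sketch of what the digest computation for such a variant could look like (the Digest alias and helper name are hypothetical, and std's DefaultHasher merely stands in for whatever SipHash implementation cargo would actually use):

use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::Path;

// Hypothetical digest type for the proposed ContentBased(Digest, PathBuf) variant.
type Digest = u64;

// Read the file and fold its bytes into a SipHash-based digest
// (DefaultHasher is SipHash under the hood; cargo would pick its own hasher).
fn content_digest(path: &Path) -> io::Result<Digest> {
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    Ok(hasher.finish())
}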

Notes
This will probably slow down no-op builds slightly (in some circumstances, such as with large build script inputs over NFS, significantly), so it may want to be behind a flag (perhaps --fingerprint-strategy={mtime,content}).

This would probably also make more shared caching (that people talk about a lot, most recently at https://internals.rust-lang.org/t/idea-cargo-global-binary-cache/9002) easier.

I'd be happy to implement this if the PR is likely to be accepted :)

This would probably also fix

@illicitonion illicitonion added the C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` label Jan 9, 2019
@alexcrichton
Member

Funnily enough this is also highly related to recent discussions on #2426 and #6484 for entirely different reasons (filesystems with second-level mtime granularity and modifications/builds all happen in the same second).

cc @Eh2406, @matklad, this might be a good tracking issue for this!

@Eh2406
Contributor

Eh2406 commented Jan 9, 2019

At the moment I would start by removing the mtime-based system and replacing it with a hash-based one (except for the corner cases where we have to use mtime, as we don't know which files to hash until after the build; cc #5918). Then I would see how big a perf impact this has in reality. If it is small, then grand; if not, we need a hybrid system that uses mtime to decide whether to hash.

I was thinking of trying to impl this soon, but would much prefer to help you do it @illicitonion. I seem to recall someone had an abandoned branch with a start... was it @ehuss?

Do we want to use SipHasher or a different hashing scheme that is designed for fingerprinting? I only ask because of this:
"Although the SipHash algorithm is considered to be generally strong, it is not intended for cryptographic purposes. As such, all cryptographic uses of this implementation are strongly discouraged."

@alexcrichton
Member

We're not really too interested in cryptographic hashing here for its security properties, so I think picking any reasonable fast algorithm should be fine (and I think SipHasher is reasonably fast as well). I would personally prefer we not wholesale switch away from mtimes just yet, but having this as at least a runtime option to choose from is a great start!

@dwijnand
Member

Would be interesting to compare it to using FxHash.

@ehuss
Contributor

ehuss commented Jan 11, 2019

I'm not sure what I did with my old branch, so I'm just going off memory. My original attempt tried to only hash the contents if the mtime was near the start of compile time. However, since it is not known which files to hash until after compilation, it didn't work out too well. I don't know how to solve that problem. I guess one option is to pre-hash every file in the package directory and maybe hash anything that was included from outside after compilation? But that sounds like it could be very problematic. Or maybe hash protection is just not available for the very first compile (or in other words the second compile is not guaranteed to be correct). Maybe another option is to run --emit=dep-info first, which seems to be pretty fast, but there might be some races still.

I don't know of any easy answers here.

@ehuss ehuss added the A-rebuild-detection Area: rebuild detection and fingerprinting label Jan 31, 2019
@Eh2406
Contributor

Eh2406 commented Feb 8, 2019

I just came across a long blog post on the troubles with using mtime in build tools. It discusses using only hashes, and points out that Go switched to that, but recommends a hybrid approach, which is surprisingly similar to what we are doing with the fingerprint hash as a database.

@fluffysquirrels
Contributor

fluffysquirrels commented May 22, 2019

That blog post [1] is an excellent reference! It lays out some of the problems with using mtimes and hashes in build systems and describes the approach used by the tool redo, roughly:

  • Store cheap metadata (mtime, size, inode number, owner uid and gid, and, for outputs only, a sequence number) for each input and output file.
  • Rebuild whenever any of that metadata changes.

I found it helpful to consider the sequence number for an output as a logical time that we control, so it should avoid problems with low resolution, mtime going backward, or builds being missed.

This wouldn't stop the false-positive builds we see on CI systems that don't preserve mtimes, but to avoid these we could check hashes of source files when their metadata changes. One of the problems claimed in [1] is that hashing large build output files (to confirm downstream should rebuild) is expensive, but we could benchmark this to examine the trade offs. Maybe we could get away with hashing source files only?

[1]: "mtime comparison considered harmful"
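A rough sketch of a redo-style stamp as described above, using Unix-only std metadata APIs (the struct and field choices are illustrative, not redo's actual format):

use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // Unix-only: inode, uid, gid
use std::path::Path;

// Cheap per-file metadata, compared wholesale: any difference triggers a rebuild.
#[derive(Debug, PartialEq, Eq)]
struct Stamp {
    mtime_secs: i64,
    mtime_nanos: i64,
    size: u64,
    inode: u64,
    uid: u32,
    gid: u32,
    // Outputs only: a logical sequence number the build tool bumps itself.
    seq: Option<u64>,
}

fn stamp(path: &Path, seq: Option<u64>) -> io::Result<Stamp> {
    let md = fs::metadata(path)?;
    Ok(Stamp {
        mtime_secs: md.mtime(),
        mtime_nanos: md.mtime_nsec(),
        size: md.len(),
        inode: md.ino(),
        uid: md.uid(),
        gid: md.gid(),
        seq,
    })
}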

@gerion0

gerion0 commented Oct 14, 2019

Please also have a look at chash. It is a mechanism for computing a meaningful fingerprint based on the AST of the program. Maybe this could be adapted for cargo as well.

@d33tah

d33tah commented Nov 13, 2019

I was just hit by this bug. Docker doesn't refresh timestamps of already existing files, and to cache dependency builds I had hacked around it by doing echo 'fn main() {}' > src/main.rs. It was pretty painful to debug, and the touch I added to work around it looks clunky.

Steps to reproduce

  1. Create the following files:

Dockerfile

FROM rust
COPY project/Cargo.toml project/Cargo.toml
RUN cd project && mkdir src && echo 'fn main() { println!("hello1"); }' > src/main.rs && cargo build --release
COPY project/src project/src
RUN cd project && cargo run --release

project/Cargo.toml

[package]
name = "hello"
version = "0.1.0"
authors = ["Jacek Wielemborek"]
edition = "2018"

[dependencies]
flate2 = "*"

project/src/main.rs

fn main() {
    println!("Hello2");
}
  2. Run docker build .

Expected:

Hello2 is printed

Actual result:

hello1 is printed

Workaround

Before building, do touch project/src/main.rs

@matklad
Member

matklad commented Jun 20, 2020

I think this also has a chance of speeding up some dev workflows around switching branches. Specifically, I often check out branches temporarily to look at PRs (without building them). Sometimes I also run git switch master && git pull --rebase && git switch my-feature-branch && git rebase master. I think both of these workflows lead to busted mtimes for a bunch of files which haven't actually changed between compilations. For sprawling workspaces (like the rust-analyzer one), I hypothesize that this leads to some unneeded crate rebuilds.

@gilescope
Contributor

Has anyone got a PR or partial WIP PR on this yet? Having the option to do this would be huge for Docker caching, which becomes more important for enterprises as Rust builds get bigger. Would love to be able to test this out in nightly under a -Z flag.

@Eh2406
Contributor

Eh2406 commented Jul 31, 2020

If there was progress it would have been linked here. As I recall I thought it could be done straightforwardly, but the fact that Eric's branch did not work suggests I was Dunning-Kruger-ing myself. My memory of the approach was to change the code that reads the mtime to return a new type (mtime, Option<hash>). If the -Z flag is set, fill in the hash. Thread that type all the way through. When we get to the compare step, check the -Z flag for which to use. If you are offering to try, that is where I would start.
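As a sketch of that shape (type and function names are hypothetical; the real change would live inside cargo's fingerprint code):

use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::Path;
use std::time::SystemTime;

// What the mtime-reading code would return: always the mtime, plus a content
// hash when the -Z flag asks for it.
struct FileStamp {
    mtime: SystemTime,
    content_hash: Option<u64>,
}

fn read_stamp(path: &Path, hash_contents: bool) -> io::Result<FileStamp> {
    let mtime = fs::metadata(path)?.modified()?;
    let content_hash = if hash_contents {
        let mut h = DefaultHasher::new(); // placeholder; cargo would use its own hasher
        fs::read(path)?.hash(&mut h);
        Some(h.finish())
    } else {
        None
    };
    Ok(FileStamp { mtime, content_hash })
}

// At the compare step, the same flag decides which field to trust.
fn is_dirty(old: &FileStamp, new: &FileStamp, use_hash: bool) -> bool {
    if use_hash {
        old.content_hash != new.content_hash
    } else {
        old.mtime != new.mtime
    }
}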

@gilescope
Contributor

Was just checking there's no PR out there to build on. If the mtime's not the same, we could cheaply check whether the file size is the same before doing an expensive content-hash check - that could cut the perf costs down a bit.
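A sketch of that ordering (size first, hash only on a size match; the helper name is made up and DefaultHasher is just a stand-in):

use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::Path;

// Only pay for hashing when the cheap checks are inconclusive: a changed
// size already proves the content changed.
fn probably_unchanged(path: &Path, old_size: u64, old_hash: u64) -> io::Result<bool> {
    let md = fs::metadata(path)?;
    if md.len() != old_size {
        return Ok(false); // different size, no need to hash
    }
    let mut h = DefaultHasher::new();
    fs::read(path)?.hash(&mut h);
    Ok(h.finish() == old_hash)
}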

@slonopotamus

I believe this problem has already been extensively studied in the past, so there's no need to reinvent the solution; just pick a good existing one. For example, git's index is known to store mtime+ctime+device+inode+hash.

@gilescope
Contributor

gilescope commented Aug 8, 2020

ctime, device and inode will thwart clever things like copying the target dir and putting it back on a different machine or in a different directory. When the mtime check fails, it's probably best to rely on intrinsic properties of the content (size/hash). For a non-cryptographic hash, redox's seahash seems pretty quick.

@ibaryshnikov

ibaryshnikov commented Mar 31, 2021

mtime comparison considered harmful, linked above, is an excellent blog post! I've run into one of the problems described there today:

If you accidentally set your system clock ahead by a day and build some stuff, then set your clock back to the present, all the stuff you built during that time will show up as "in the future" and therefore newer than source files you edit today, preventing all rebuilds

@vlovich

vlovich commented May 2, 2024

I think I'm hitting this quite a bit - as part of interactive rebases fixing up my history, I frequently get compilation errors because cargo is reusing an older crate dependency even though that crate has had modified files. I'm kind of surprised though that mtime would be causing that as the file size should be changing too - is Cargo not even using the file size as a prefilter for mtime logic?

As for hashing, I would recommend XXH3 - the 64-bit version is very good and the 128-bit version is basically the same speed but lets you get even stronger protection if you give up back compat with 64-bit hashes. It's ~20x faster than SipHash.
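For illustration, one-shot hashing with the xxhash-rust crate (crate choice and file path are just assumptions for the sketch):

// Cargo.toml (assumed): xxhash-rust = { version = "0.8", features = ["xxh3"] }
use xxhash_rust::xxh3::{xxh3_128, xxh3_64};

fn main() {
    let bytes = std::fs::read("src/main.rs").expect("read source file");
    // The 64-bit and 128-bit variants run at roughly the same speed.
    let h64 = xxh3_64(&bytes);
    let h128 = xxh3_128(&bytes);
    println!("xxh3_64 = {h64:016x}, xxh3_128 = {h128:032x}");
}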

@adam-azarchs

The time it takes to hash files is usually going to be dominated by I/O. Faster hash algorithms are fun to talk about but ultimately irrelevant; I would probably go with sha1 simply because it leaves the option open to take shortcuts when using a git repository, since git already knows the hashes of the files. However, in the case of vendored crates, I really want to emphasize that cargo is already creating a .cargo-checksum.json file with sha256 checksums, and verifying it on build. So we're already paying the cost of a sha256 hash, but it'll still recompile if the mtime changes.
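For context, a sketch of reading those existing checksums back with serde (the struct mirrors the layout of the file cargo vendor writes, per-file sha256 hex digests plus a package checksum; the path and reading code are illustrative, and serde/serde_json are assumed dependencies):

use std::collections::BTreeMap;
use std::fs;

use serde::Deserialize;

// Shape of the .cargo-checksum.json that `cargo vendor` writes:
// per-file sha256 hex digests plus an overall package checksum.
#[derive(Deserialize)]
struct CargoChecksums {
    files: BTreeMap<String, String>,
    package: Option<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = fs::read_to_string("vendor/flate2/.cargo-checksum.json")?;
    let sums: CargoChecksums = serde_json::from_str(&raw)?;
    println!("{} files, package checksum: {:?}", sums.files.len(), sums.package);
    Ok(())
}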

@overlookmotel

@scottlamb Can I ask, did you ever publicly release "retimer"? Or would you be able/willing to do so?

@bjorn3
Member

bjorn3 commented May 8, 2024

I would probably go with sha1 simply because it leaves the option open to take shortcuts when using a git repository, since git already knows the hashes of the files.

I believe git adds a header to blobs before hashing them. While sha1 allows adding trailing bytes with minimal extra work (enabling length extension attacks), this is not the case for leading bytes. Adding leading bytes requires recomputing from scratch.
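Concretely, git's blob id is sha1 over a "blob <size>\0" header followed by the contents, so a plain sha1 of the file can't be reused; a sketch, assuming the RustCrypto sha1 crate as a dependency:

use sha1::{Digest, Sha1};

// Git's blob object id: sha1 over "blob <size>\0" followed by the contents.
// Because the header is a prefix, it can't be derived from a plain sha1 of
// the contents; you have to hash from scratch.
fn git_blob_sha1(contents: &[u8]) -> [u8; 20] {
    let mut hasher = Sha1::new();
    hasher.update(format!("blob {}\0", contents.len()).as_bytes());
    hasher.update(contents);
    hasher.finalize().into()
}

fn main() {
    let id = git_blob_sha1(b"hello\n");
    // Should match `echo hello | git hash-object --stdin`.
    println!("{}", id.iter().map(|b| format!("{b:02x}")).collect::<String>());
}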

@vlovich

vlovich commented May 8, 2024

I like the idea of using git, but it only knows the sha1 of tracked & unmodified files. You'd still have to detect dirtied / untracked files & handle them correctly. It's probably wiser to just make sure that the algorithm is stored in the metadata per file so that it can be heterogeneous & swappable (e.g. using git's metadata for the majority of files & falling back to something else for untracked / modified files).

As for the speed difference being negligible, I'd caution that codebases will often sit within the page cache, so assuming an I/O penalty for reads may be a faulty assumption. But as I said, you'd want to only bother computing a hash when the file size is the same as it was previously.

The best approach is to have a daemon that monitors for change events live so that you don't even need to bother hashing unless the daemon restarts. That's what Bazel and Buck do.

I do wish filesystems just implemented a 64-bit counter that incremented on any mutation of a file. That would solve all these constant problems in a super-cheap way without needing a complicated daemon & not have any performance or correctness footguns (just need to save & check a mapping of inode -> version number).

@bjorn3
Member

bjorn3 commented May 8, 2024

I do wish filesystems just implemented a 64-bit counter that incremented on any mutation of a file.

On Linux there is a counter for every time a file is recreated, which in practice is what most editors do to write to a file atomically. The combination of the inode id and i_generation is supposed to uniquely identify a single file for the entire lifetime of the filesystem, even across deletes and recreates. Not every filesystem implements it, though. It also isn't affected by edits that don't recreate the file.

@alexpyattaev

On another note, if you look at well-optimized IO logic, such as that found in 1BRC submissions, you can quickly see that any plausible amount of source code can be loaded into RAM in seconds if that is what is needed. Modern machines with NVMe are not IO bound in any meaningful way, and very few people would build a huge Rust codebase on a machine from the 90's.

@adam-azarchs

Most of our team has their home directory mounted over NFS; NFS can have great throughput but will generally have pretty terrible metadata latency. Regardless, even a naïve implementation of sha256 can easily handle 500 MiB/s on a modern system; the hashing time should still be negligible on any reasonably-sized codebase, especially since it's trivially parallelizable.
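A sketch of that kind of parallel hashing with the sha2 and rayon crates (assumed dependencies; the function is illustrative, not cargo's):

use rayon::prelude::*;
use sha2::{Digest, Sha256};
use std::fs;
use std::path::PathBuf;

// Hash a set of source files in parallel; with a warm page cache this is
// CPU-bound and scales roughly with core count.
fn hash_files(paths: &[PathBuf]) -> Vec<(PathBuf, [u8; 32])> {
    paths
        .par_iter()
        .filter_map(|p| {
            let bytes = fs::read(p).ok()?;
            Some((p.clone(), Sha256::digest(&bytes).into()))
        })
        .collect()
}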

@vlovich

vlovich commented May 13, 2024

It's a very odd setup to have a git repo on an NFS share. Even still, you'd expect the source code to be sitting in the page cache, which again invalidates the whole "I/O latency dominates" argument, because you're going to be doing a local memcpy+hash. The best choice is designing such that performance is optimized in cases where hash speed could be the bottleneck.

Speaking of hash algorithms, I came across gxhash, which pushes the Pareto frontier: a lot faster than xxh3 at smaller inputs and than ahash at large inputs.

As for whether or not it's embarrassingly parallel, this kind of stuff can surprisingly be less so, because there's coordination required to send and receive the result from a background thread. Given that most of the checks will find nothing changed, requiring no rebuild, the overhead of that distribution can easily be the bottleneck vs having a fast hash function running inline.

@ClementTsang

@overlookmotel here's a tool that should have similar logic to the retimer tool, though it's a bit of a bandaid solution: https://crates.io/crates/mtime-travel

@overlookmotel

@ClementTsang Thank you! Yes, a bandaid solution, but some kind of solution at least...

@briansmith

We're not really too interested in cryptographic hashing here for its security properties, so I think picking any reasonable fast algorithm should be fine (and I think SipHasher is reasonably fast as well).

I do think it is worth doing a complete security analysis. Consider a file foo.rs that contains a security bug, which has hash(foo.rs) = X. Now somebody submits a patch for foo.rs that contains a "fix" for a bug, such that the hash of the patched foo.rs also has the value X. Then when one rebuilds the project to get the security "fix," they silently keep using the broken version. This could be made to happen easily if the security bug is intentional and the developer of the security bug plans for this.

I know such a thing can seem far-fetched, but AFAICT this is why Bazel and other things that rely heavily on (distributed) caching for incremental rebuilds use strong hashes.

@bjorn3
Member

bjorn3 commented Jun 20, 2024

A partially mitigating factor for that attack: for non-local dependencies (anything from a registry like crates.io, or a git dependency), rebuilding doesn't happen because the fingerprint changes (in fact it isn't checked at all). Instead they are rebuilt because the package identity changes, due to a different package version (and, for git dependencies, a different git commit) being used, which by definition needs to be different to even pull in your changes. As such, this attack can only be pulled off for files in the project you are building. I can imagine that we do still want to check the mtime for equality before checking the hash to improve performance, which would fully mitigate the attack, as the attacker can't force the mtime to be identical between the original and the new file. It wouldn't be enough for a distributed cache like Bazel uses, though.

@adam-azarchs

I would like to emphasize once again that if you're using cargo vendor (as our team does, exclusively), cargo is already verifying sha256 sums against .cargo-checksum.json. It just then goes on to ignore that when deciding whether the file has changed.

@michaelwoerister
Member

For reference, sccache started using BLAKE3 for fingerprinting a while ago and I don't think that causes any performance problems. BLAKE3 should be fast enough for this kind of bulk-hashing scenario for the actual hashing to not factor into performance too much.
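For example, the blake3 crate's one-shot API looks like this (the crate dependency and file path are assumed for the sketch):

// Assumed dependency: the blake3 crate.
fn main() -> std::io::Result<()> {
    // One-shot hash of a file's contents; blake3::hash returns a 32-byte Hash.
    let bytes = std::fs::read("src/main.rs")?;
    let fingerprint = blake3::hash(&bytes);
    println!("{}", fingerprint.to_hex());
    Ok(())
}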

@adam-azarchs

But, it's already doing sha256 hashing. Why would we add additional hashing on top of that?

@bjorn3
Member

bjorn3 commented Jun 21, 2024

The .cargo-checksum.json check is done by cargo, while the file content fingerprinting would be done by rustc. Also, cargo doesn't actually check whether the source files of non-local dependencies (which includes vendored dependencies) are modified. As such, there is no hashing that would be saved by rustc using sha256 too. For vendored dependencies, when building the first time, both rustc and cargo do their own independent hashing either way; when checking whether rebuilding is necessary, cargo will immediately consider the vendored dependency to not be modified and check neither .cargo-checksum.json nor the hash calculated by rustc. In other words, it doesn't matter what hash is used by rustc; it is never used at all by cargo for vendored dependencies.

@adam-azarchs

Cargo absolutely does verify the checksums of files against .cargo-checksum.json. And I'm pretty sure no one here was actually suggesting adding content hashing to rustc; the request is very specifically to have at least an option for cargo to not use absolute path and mtime as part of the fingerprint that it uses for determining whether rebuilding is required, and use a content hash instead. By the time rustc sees the file in question, it's too late - we've already started the rebuild.

@bjorn3
Member

bjorn3 commented Jun 22, 2024

Cargo absolutely does verify the checksums of files against .cargo-checksum.json.

It does when building the first time. It does not check it when the crate has already been built. It considers the crate as not needing any rebuild well before it would ever check .cargo-checksum.json. I actually tried it myself.

And I'm pretty sure no one here was actually suggesting adding content hashing to rustc

The first attempt at implementing this did actually modify rustc to include the source hashes in the dep-info file and then let cargo hash the files again on later builds to check whether they changed, as this is the only race-free way of implementing it. I haven't seen any suggestion to do it differently.

@adam-azarchs

It does when building the first time.

Well, yes, but the problem is that cargo's definition of "the first time" includes some situations that it really shouldn't, and then it still uses the mtime + absolute path as the fingerprint for resulting outputs, so it still ends up unnecessarily invalidating downstream caches.

The first attempt at implementing this did actually modify rustc to include the source hashes in the dep info file and then lets cargo hash it again on later builds to check if the file is changed as this is the only race free way of implementing it.

Ok, yes, that makes sense (in general; I'm mostly focused on the CI scenario where such races should be impossible). And rustc is already reading all of the file content, so it's relatively cheap to hash as it does so. But it would still be cargo doing the hashing on subsequent builds to check whether a rebuild was required.

I think the most common use case for this feature request is to be able to do useful caching of the target directory in CI builders, e.g. GitHub actions. It's essentially pointless to cache any build outputs from within the workspace because the mtime will be different every time the repo is checked out so the cache will never be valid. We have some workspaces with dozens of crates, and most of the "core" crates that everything else depends on rarely see any changes, so it's frustrating to see cargo rebuilding them every time.

@bjorn3
Member

bjorn3 commented Jun 22, 2024

Well, yes, but the problem is that cargo's definition of "the first time" includes some situations that it really shouldn't, and then it still uses the mtime + absolute path as the fingerprint for resulting outputs, so it still ends up unnecessarily invalidating downstream caches.

Non-local dependencies (including vendored dependencies) are always assumed to be up to date if any compiled artifacts exist. It will never check the mtime, absolute path, .cargo-checksum.json, file hashes in the dep-info file, or anything else involving the actual source files. In fact, the fingerprint for non-local files doesn't even contain the list of source files (target/debug/.fingerprint/mycrate-hash/dep-lib-mycrate is empty for non-local dependencies). And if compiled artifacts don't exist, it will check .cargo-checksum.json before building, but will rebuild regardless of whether source files were modified.

@Xaeroxe

Xaeroxe commented Jun 25, 2024

Hi, I've made a tracking issue for this to go along with my two PRs, one for cargo and one for rustc: #14136
