Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stdlib Path has inconsistent normalisation behaviour #29008

Closed
aidanhs opened this issue Oct 13, 2015 · 14 comments
Closed

stdlib Path has inconsistent normalisation behaviour #29008

aidanhs opened this issue Oct 13, 2015 · 14 comments
Assignees
Labels
A-docs Area: documentation for any part of the project, including the compiler, standard library, and tools C-bug Category: This is a bug. E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. P-medium Medium priority T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@aidanhs
Copy link
Member

aidanhs commented Oct 13, 2015

Given:

use std::path::{Path, PathBuf};

fn main() {
    let mut p = PathBuf::from("/a/b/c");
    println!("{:?} {:?}", p, p.file_name());
    p.push("");
    println!("{:?} {:?}", p, p.file_name());
    p.push("");
    println!("{:?} {:?}", p, p.file_name());
    p.pop();
    println!("{:?} {:?}", p, p.file_name());

    p = PathBuf::from("/");
    println!("{:?} {:?}", p, p.file_name());
    p.push("");
    println!("{:?} {:?}", p, p.file_name());
    p.push("");
    println!("{:?} {:?}", p, p.file_name());
    p.pop();
    println!("{:?} {:?}", p, p.file_name());
}

You get:

"/a/b/c" Some("c")
"/a/b/c/" Some("c")
"/a/b/c/" Some("c")
"/a/b" Some("b")
"/" None
"/" None
"/" None
"/" None

http://is.gd/XbyUdj

First, the documentation is misleading (wrong?). file_name (https://doc.rust-lang.org/std/path/struct.PathBuf.html#method.file_name) is documented as "The final component of the path, if it is a normal file.", but it's returning the directory when there's a trailing slash, which is not (by my book) a normal file?
Looking at the implementation of components it will return the final component of the path unless it is the root directory or a relative component. Maybe it just needs clarifying.

There's a stranger issue with normalisation of trailing slashes. Initially I thought it should be documented (https://doc.rust-lang.org/std/path/index.html#normalization), because trailing slashes are ignored by components(). However, unlike any other normalisation, it is possible to reconstruct the original path (by doing .push(""))!

I'm unsure what the intended behaviour is here and what needs fixing (implementation or docs).
Given stabilisation, I'm going to assume that implemented behaviour wins at this point? It's slightly irritating because being able to detect a trailing slash can be useful (as rsync does). Failing that I might be tempted to make .push("") do nothing - this uncertain state with trailing slashes is strange. Is there an RFC I can look at maybe?

@aidanhs
Copy link
Member Author

aidanhs commented Oct 13, 2015

use std::path::PathBuf;
fn main() {
    let mut p4 = PathBuf::from("/a/b/c/..");
    println!("{:?} {:?}", p4, p4.file_name());
    println!("pop: {}", p4.pop());
    println!("{:?} {:?}", p4, p4.file_name());
}

Result:

"/a/b/c/.." None
pop: true
"/a/b/c" Some("c")

http://is.gd/E1YHHm

This one contradicts the docs for pop https://doc.rust-lang.org/std/path/struct.PathBuf.html#method.pop

Truncate self to self.parent().

Returns false and does nothing if self.file_name() is None. Otherwise, returns true.

file_name was None but it did something. Could do with looking at the most recent path RFC or whatever then I can start fixing.

@alexcrichton
Copy link
Member

cc @aturon

@aidanhs
Copy link
Member Author

aidanhs commented Oct 18, 2015

This one surprised me a lot (and wasted a fair amount of time)!

use std::path::Path;
use std::collections::HashSet;
fn main() {
    let p1 = Path::new("/a/b/");
    let p2 = Path::new("/a/b/");
    let p3 = Path::new("/a/b");
    println!("{} {} {}", p1 == p2, p2 == p3, p1 == p3);
    let mut hs = HashSet::new();
    hs.insert(p1);
    hs.insert(p2);
    hs.insert(p3);
    println!("{}", hs.len());
}

http://is.gd/yli8ls
Result:

true true true
2

@aidanhs aidanhs changed the title No way to strip a trailing empty path component stdlib Path has inconsistent normalisation behaviour Oct 18, 2015
@retep998
Copy link
Member

I'm pretty sure it's a bug if the implementation of Hash doesn't match the behavior of Eq (note that floating point numbers don't implement Eq, only PartialEq). At the same time changing this behavior will likely break existing code, so it's a lose-lose situation either way.

@aidanhs
Copy link
Member Author

aidanhs commented Oct 18, 2015

I confess, I've not found the edge-case handling of the Path library entirely intuitive (as you can probably tell from this issue).
Unfortunately, a fix for any of these problems will probably cause subtle bugs, arguably worse than the memcpy (or whatever it was) argument reordering because it'll just cause different results rather than crashes.

I'm secretly a bit pleased about the Hash vs Eq thing because it seems like the kind of thing that absolutely requires some change to be made, at which point some of the other backwards-incompatible complaints could be addressed.

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Oct 18, 2015
@alexcrichton
Copy link
Member

triage: I-nominated

@alexcrichton
Copy link
Member

triage: P-high

@rust-highfive rust-highfive added P-high High priority and removed I-nominated labels Oct 28, 2015
@alexcrichton
Copy link
Member

seems like we should fix the Hash/Eq situation at the very least

@aturon
Copy link
Member

aturon commented Nov 2, 2015

Thanks @alexcrichton for taking care of the hash issue -- it's an interesting footgun with derive.

Regarding the semantic/normalization issues: there's definitely disagreement with the docs here, which is bad (and indicates insufficient unit testing). The question about trailing / is particularly interesting, since we explored a few different normalization approaches and ended up only being sensitive to a (semantic) extra . in the front of the path (see https://doc.rust-lang.org/std/path/index.html#normalization).

In any case, I agree that the behavior here is counterintuitive and there are cases where you want trailing / sensitivity; I think it was probably a mistake to normalize the trailing slash away.

If the APIs weren't stable, I'd suggest treating foo/ and foo/. as if they have a trailing "current directory" component, which is currently normalized away. That would mean that . is significant when leading or trailing, but not when embedded. It would mean that foo/ and foo/. are equal, and that neither is equal to foo. And it would mean that file_name would return None for these cases.

The behavior of pop on .. just seems like an outright bug to me.

All that said, I'm not sure what's best to do given the stable status of these APIs. We should definitely discuss this at the next libs team meeting.

alexcrichton added a commit to alexcrichton/rust that referenced this issue Nov 2, 2015
Almost all operations on Path are based on the components iterator in one form
or another to handle equivalent paths. The `Hash` implementations, however,
mistakenly just went straight to the underlying `OsStr`, causing these
equivalent paths to not get merged together.

This commit updates the `Hash` implementation to also be based on the iterator
which should ensure that if two paths are equal they hash to the same thing.

cc rust-lang#29008, but doesn't close it
bors added a commit that referenced this issue Nov 2, 2015
Almost all operations on Path are based on the components iterator in one form
or another to handle equivalent paths. The `Hash` implementations, however,
mistakenly just went straight to the underlying `OsStr`, causing these
equivalent paths to not get merged together.

This commit updates the `Hash` implementation to also be based on the iterator
which should ensure that if two paths are equal they hash to the same thing.

cc #29008, but doesn't close it
@aidanhs
Copy link
Member Author

aidanhs commented Nov 3, 2015

Just focusing on the trailing slash aspect of this issue.

Some notes on using basename/filename, i.e. "get the final component":

  • Python 2 and C++ (boost 1.59) treats a trailing slash as implying an empty component
  • Posix (the posix basename command, i.e. coreutils and the posix function by the same name i.e. the C api), tcl 8.5, ruby 2.2, nodejs 5, java 1.8, PHP 5.6 and Go all ignore trailing slash.

As much as I hate to admit it, I think treating trailing slashes as meaningful in terms of components is probably outvoted (especially since this behaviour is stable).
This wouldn't be a problem if there was a decent way to work with OsStrings ("get me the last character of this path") - Path::new is not lossy (not sure about PathBuf), so as long as the library (or wherever you get the path from) isn't doing split+join, you'd still be able to check for the trailing slash if you really wanted to. Sadly, I have not found OsString manipulation easy.

So: document trailing slashes being ignored as part of components and (someday) make OsString easier to work with?


Python 2:

>>> import os.path
>>> os.path.basename('/a/b/c')
'c'
>>> os.path.basename('/a/b/c/')
''

C++ (boost 1.59):

    boost::filesystem::path p1("/a/b/c");
    boost::filesystem::path p2("/a/b/c/");
    cout << p1.filename() << "\n" << p2.filename();
c
.

Tcl 8.5

% llength [file split /a/b/c]
4
% llength [file split /a/b/c/]
4

Ruby 2.2

   Pathname.new("/a/b/c").basename
=> #<Pathname:c>
   Pathname.new("/a/b/c/").basename
=> #<Pathname:c>

Nodejs 5

> path.basename('/a/b/c');
'c'
> path.basename('/a/b/c/');
'c'

Java 1.8

    System.out.println(Paths.get("/a/b/c").getFileName());
    System.out.println(Paths.get("/a/b/c/").getFileName());
[...]
c
c

PHP 5.6

echo basename("/a/b/c") . " " . basename("/a/b/c/");
[...]
c c

Go

Base returns the last element of path. Trailing slashes are removed before extracting the last element. [...]

@aturon aturon self-assigned this Nov 18, 2015
@alexcrichton
Copy link
Member

@aturon is going to send a PR about docs to clarify trailing slash behavior and some of the other issues going on here.

@jminer
Copy link

jminer commented Jan 9, 2016

.NET handles it like Python and C++, which does seem like more useful behavior: https://msdn.microsoft.com/en-us/library/system.io.path.getfilename%28v=vs.110%29.aspx

string path = @"C:\mydir\";
Console.WriteLine("GetFileName('{0}') returns '{1}'",
    path, Path.GetFileName(path););
[...]
GetFileName('C:\mydir\') returns ''

@aidanhs
Copy link
Member Author

aidanhs commented Jan 10, 2016

The stdlib Python 3 pathlib module strips the trailing slash:

>>> from pathlib import Path
>>> print(Path('/a/b/c/').parts[-1] + ' ' + Path('/a/b/c').parts[-1])
c c

@aturon aturon added P-medium Medium priority and removed P-high High priority labels Feb 10, 2016
@brson brson added A-libs P-low Low priority labels Aug 25, 2016
@brson brson added E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. and removed P-medium Medium priority labels Aug 25, 2016
@brson
Copy link
Contributor

brson commented Aug 25, 2016

Just needs to update docs pr @aturon's last comment.

@steveklabnik steveklabnik added the A-docs Area: documentation for any part of the project, including the compiler, standard library, and tools label Mar 10, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-bug Category: This is a bug. label Jul 24, 2017
@steveklabnik steveklabnik added P-medium Medium priority and removed P-low Low priority labels Sep 11, 2017
@steveklabnik steveklabnik added E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. and removed E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue. labels May 28, 2018
steveklabnik added a commit to steveklabnik/rust that referenced this issue Jan 10, 2019
Centril added a commit to Centril/rust that referenced this issue Jan 12, 2019
make note of one more normalization that Paths do

Fixes rust-lang#29008
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-docs Area: documentation for any part of the project, including the compiler, standard library, and tools C-bug Category: This is a bug. E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. P-medium Medium priority T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

10 participants