stat_summary_bin gives 1 bin more than specified. #3824

Closed
JoFAM opened this issue Feb 18, 2020 · 6 comments · Fixed by #4433
Labels: bug (an unexpected problem or unintended behavior), layers 📈

Comments

@JoFAM
JoFAM commented Feb 18, 2020

stat_summary_bin() consistently gives one more bin than specified in the function call:

library(ggplot2)

# `fun` replaces `fun.y`, which is deprecated as of ggplot2 3.3.0
ggplot(data = mtcars, aes(y = mpg, x = wt)) +
  stat_summary_bin(fun = "mean", geom = "point",
                   bins = 3)

This shows 4 points. If you change bins to 4, it'll show 5 points and so on.

@yutannihilation
Member

Maybe the same issue as #1739 (comment)? At the least, I feel this needs better documentation.

@JoFAM
Author

JoFAM commented Feb 19, 2020

I highly doubt that floating point errors are the reason for a consistent off-by-one error, especially since this is specific to that function and doesn't occur with hist, cut, ...

@thomasp85
Member

This is a bug in bin2d_breaks, and it basically boils down to how we choose binwidth from a range that is closed at both ends.
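
A minimal sketch of the effect described above, written in base R rather than taken from the actual bin2d_breaks code. It assumes left-closed bins of width `diff(range(x)) / bins`, which is a simplification for illustration; the names `x`, `bins` and `bin_id` are made up for the example.

```r
# Simplified illustration only; this is NOT ggplot2's bin2d_breaks().
x <- 0:6          # toy data whose range [0, 6] is closed at both ends
bins <- 3

binwidth <- diff(range(x)) / bins            # 6 / 3 = 2
# Left-closed bins [0, 2), [2, 4), [4, 6): the maximum, 6, fits in none of
# them, so it lands in an extra bin of its own.
bin_id <- floor((x - min(x)) / binwidth) + 1

table(bin_id)
length(unique(bin_id))   # 4, one more than `bins`
```

Under these assumptions, the extra bin always holds only the observations equal to the maximum, which would explain why the count is consistently off by exactly one.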

@thomasp85 added the bug (an unexpected problem or unintended behavior) and layers 📈 labels on Feb 19, 2020
@thomasp85
Member

We need to extend the range before we calculate the bin width, but it is not obvious to me what the best approach is...

Or maybe we should just add a small padding to the last bin so it includes the last data points?

@JoFAM
Author

JoFAM commented Feb 19, 2020

@thomasp85 thx for the pointer, I didn't have time to dig deeper yet (the stack of function calls to weed through is massive, to be honest...). fwiw, cut adds a small padding like this (with nb being the number of breaks, i.e. 1 more than the number of bins):

rx <- range(x)
dx <- diff(rx)
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
breaks[c(1L, nb)] <- c(rx[1L] - dx/1000, rx[2L] + dx/1000)

Judging from the code of bin2d_breaks, adding a similar approach is pretty trivial.
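
For reference, a self-contained version of the snippet above that can be run as-is. The inputs `x` and `bins` are assumed toy values, not from this thread, and `findInterval()` stands in for whatever bin assignment ggplot2 uses, so treat it as a sketch of the padding idea rather than a proposed patch.

```r
# Assumed example inputs; `x` and `bins` are hypothetical toy values.
x <- 0:6
bins <- 3
nb <- bins + 1L                      # number of breaks = bins + 1

rx <- range(x)
dx <- diff(rx)
breaks <- seq.int(rx[1L], rx[2L], length.out = nb)
# Widen the outermost breaks slightly so both extremes fall inside a bin
breaks[c(1L, nb)] <- c(rx[1L] - dx / 1000, rx[2L] + dx / 1000)

bin_id <- findInterval(x, breaks)    # left-closed intervals [b_i, b_(i+1))
length(unique(bin_id))               # 3, matching `bins`
```

With the padded upper break, the maximum value falls inside the last bin instead of opening a new one. `nlevels(cut(x, bins))` returns the same count here.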

@yutannihilation
Member

Thanks, I didn't notice this is such a problem... Here's a rendered version of the reprex from #1739:

library(ggplot2)

x <- seq(0, 1, length.out = 1e4)
y <- x + rnorm(length(x))
dt <- data.frame(x, y)

# NOT OK
ggplot(dt, aes(x, y)) +
  geom_point(colour = "grey80") +
  stat_summary_bin(fun = mean, bins = 10, geom = "point", colour = "red")

Created on 2020-02-21 by the reprex package (v0.3.0)
