Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantics of change_on_boundary in ceiling_date #443

Closed
hadley opened this issue Jul 27, 2016 · 37 comments
Closed

Semantics of change_on_boundary in ceiling_date #443

hadley opened this issue Jul 27, 2016 · 37 comments

Comments

@hadley
Copy link
Member

hadley commented Jul 27, 2016

I'm reasonably certain that change_on_bounary is really about whether the intervals are [) or (]. By correspondence with cut(), right would be a better name.

I also think it's a really really really bad idea to make the default value an option.

@vspinu
Copy link
Member

vspinu commented Jul 28, 2016

right open or right closed? The fact that it's used in cut doesn't make it an intuitive name. The interval indirection complicates the logic, and for most people who face the problem of not ceiling on the boundary, spotting change_on_boundary should ring all relevant bells. I don't see that happening with right or even right_open.

As for defaults, I don't mind changing that. I find it ugly as well. But what is so "very^3" bad about it? It's just as bad or good as any other defaults in other functions that change the computation.

@hadley
Copy link
Member Author

hadley commented Jul 28, 2016

I think it's a suboptimal name currently because it's very long, and it doesn't connect to any bigger concepts. Sure right is not the most intuitive idea, but you'll encounter it in most of other places. Maybe a better option is to copy geom_histogram() and have closed = c("right", "left").

Having an option that changes the behaviour of computation is a terrible idea because it means an option set a very long way away in the code (potentially in another package) can affect computation. (i.e. if you ever use one of these functions in a package, you have to set it every time in order to ensure consistent behaviour). I feel very strongly about this because of many bad experiences with stringsAsFactors.

@vspinu
Copy link
Member

vspinu commented Aug 2, 2016

I have removed the global option.

As to the name, that change has been released and I know for a fact that quite a few people are already using it (there are 5 related issues only on our github page). The naming was discussed in #390 and no-one came with a better name.

I don't think long name is a problem. It hardly matters with completion and you can always partial it in interactive work. I also think that if we decide to add similar option to floor_date it should have the same name. Any interval-inspired name will have to be different for symmetry reasons.

Your interpretation of open/close interval is also not quite a right model of what's happening. For instance when ceiling 2016-09-01 you would like to get 2016-10-01, it might appear that (...] is the right paradigm, but that would work only for first application of the function. For ceiling_date(ceiling_date(...)) you would need to mentally slide the left-open interval for each application of the function. If you prefer to think in terms of intervals, a sequence of open intervals would be a better mental model - (...)(...)....

To wrap up, if we are to change the name it must be much, much better to be worth the pain of change. So I am closing this one for now; feel free to re-open if you have further ideas.

@vspinu vspinu closed this as completed Aug 2, 2016
@shrektan
Copy link
Contributor

shrektan commented Aug 3, 2016

I don't use the global option personally. It's ok for me. Another solution would be to define a new function name with change-on-boundary being setting to TRUE…

@vspinu
Copy link
Member

vspinu commented Aug 18, 2016

@hadley, how about sticky = TRUE instead of change_on_boundary? It's short and reasonably suggestive in my opinion.

@hadley
Copy link
Member Author

hadley commented Aug 18, 2016

Hmmm, let's think about this a bit more from first principles. It seems reasonable for repeated invocations of floor or ceiling to always return the same value, i.e. ceiling(ceiling(x)) = ceiling(x) should always be true.

change_on_boundary = TRUE breaks that principle so that repeated invocation will always increment (after the first call, they'll always increment by unit. So maybe the key difference is >= vs >: change_on_boundary = TRUE will always generate a value that is greater than the current value.

So I think this does come down to the side of the interval that's closed. Ceiling is usually defined as min(n in Z, n >= x), but when change_on_boundary = TRUE it's defined as min(n in Z, n > x).

@hadley hadley reopened this Aug 18, 2016
@hadley
Copy link
Member Author

hadley commented Aug 18, 2016

Oh that's just floor(x) + 1. So I think that implies that ceiling_date(x, unit, TRUE) == floor(date, unit ) + unit(1).

In which case does it even need to be an option?

@hadley
Copy link
Member Author

hadley commented Aug 18, 2016

That would also fix the lack of symmetry with floor_date().

So I'd recommend deprecating change_on_boundary() - keep the current behaviour for now, but warn suggesting floor + 1, and then remove in the next release.

@vspinu
Copy link
Member

vspinu commented Aug 18, 2016

Oh that's just floor(x) + 1. So I think that implies that ceiling_date(x, unit, TRUE) == floor(date, unit ) + unit(1).

Hmm. I wonder why didn't this occur to me :/. I was always computing it from the first principles. Deprecation seems to be the right way to go. Some efficiency will be lost but that's not a big deal.

@vspinu
Copy link
Member

vspinu commented Aug 19, 2016

There is an inconvenience in this approach. First, not all ceiling units have a constructor counterpart ("bimonth", "halfyear"). Second, with multi-unit rounding you would need something like floor_date(x, "2months") + months(2) which is not so nice if you have to do it programatically.

I am also doubting the readability of the code. For a person who is not aware of the boundary issue, this code might not be at all clear.

@hadley
Copy link
Member Author

hadley commented Aug 19, 2016

Well, the insight should at least make it easier to implement, and I think it argues that closed is an appropriate argument name. (And now "bimonth" and "halfyear" could be trivially transformed to "2 months" and "6 months") (Which makes me wonder if we should support periods directly, i.e. ceiling_data(x, months(2))).

However, it's not clear to me that this is such a common operation that lubridate needs to bend over backwards to make it happen, and I wonder if forcing people to realise this is a floor + increment might improve their mental model of the underlying data manipulation.

@vspinu
Copy link
Member

vspinu commented Aug 19, 2016

Which makes me wonder if we should support periods directly, i.e. ceiling_data(x, months(2))

I think this makes sense and is very much related to #459. There is no reason to stick to multiples of a unit either; one should be able to do floor_date(x, "1m 30s") just as well.

However, it's not clear to me that this is such a common operation that lubridate needs to bend over backwards to make it happen,

The use case is primarily for units that don't have natural 0 - months, weeks and multi-days. If your observation is on Monday, you would probably want to shift it towards next Monday together with all other observations within that week. Not changing on boundary rarely makes sense for weeks and months.

and I think it argues that closed is an appropriate argument name.

I don't see why that would be. "closed" is a shortcut for "closed_restriction", "open" is for "open_restriction", "change" is for "change_on_boundary".

I think your chain of thought starts with the ceiling formula as ceiling(x)=min(n \in Z, n > x) where Z is a set of boundary points. While correct, that formula would need a thousand words to be explained. You need to define (Z) set of boundary points, (R) set of all points, (n> x) set of restricted points, and then take a min over the intersection of Z and the restricted set. That seems like an excursion into topology not a user friendly explanation of a simple concept.

Note that open/closed in this definition refers to the restriction that starts with x. So it's not "close_boundary" of the unit interval. I think people tend to think of a unit intervals between two boundary points when they think about floor/ceiling, and not about closed/open interval that starts with x. This is why I believe that our parameter should refer to the interval boundary. If we call it "closed" people would think about "closed-boundary-of-the-unit-interval", but that doesn't make much sense.

@shrektan
Copy link
Contributor

Not sure if my idea is relevant to your topic...

One of my observation is that, at least for days, weeks, or months, when we talk about the monthly ceiling of 2016.08.01, what we really want to say is the ceiling of the middle of 2016.08.01, instead of the 2016.08.01 00:00:00. However, what lubridate::ceiling_date do is always treat the date as the inception of the date, instead of the middle.

I think this might be the root of the issue...

@vspinu
Copy link
Member

vspinu commented Aug 20, 2016

owever, what lubridate::ceiling_date do is always treat the date as the inception of the date, instead of the middle.

That's a valid point and it applies to smaller units as well. Hour is also an interval not an instant. When you think about events occurring in the first hour after midnight you also think about the "middle" between 00:00 and 01:00. So if you would need to round 2016-01-01 01 you would end up in the same conceptual problem - confound 2016-01-01 01 with 2016-01-01 01:00:01 or not.

@vspinu
Copy link
Member

vspinu commented Aug 26, 2016

@hadley, how about always change on boundary for "months", "weeks" and "days" and removing the argument altogether?

I fail to see a single use case when current behavior for these units would be useful. The mathematical argument of ceiling_date(ceiling_date(x)) = ceiling_date(x) has very little practical importance and its validity is highly questionable due to the confusing definition of the boundary as discussed above.

I think this would solve the issue, if anyone would ever want the current behavior it can always be achieved as ceiling_date(x - 1). I like this option the best so far.

@hadley
Copy link
Member Author

hadley commented Aug 26, 2016

I think if you want to do that, you have to give the function a different name, because it's no longer implementing the ceiling.

Thinking of it as an (] or a [) interval seems totally naturally to me, but obviously not to anyone else 😓

@shrektan
Copy link
Contributor

A new function is a good idea, and it's what I do for myself, actually. The name could be something like date_end and date_begin

@vspinu
Copy link
Member

vspinu commented Aug 26, 2016

because it's no longer implementing the ceiling.

This is open to debate and entirely depends on our confounding of 2000-01-01 with 2000-01-01 00:00:00. 2000-01-01 is not an instant and that's the root of the problem. You cannot simply put it on a real line and start doing interval math with it.

If you define 2000-01-01 as an open interval starting at 2000-01-01 00:00:00 and ending at 24:00:00, it would suddenly (after trivial generalization) comply with the mathematical definition of the ceiling because every instant from the interval would comply with it.

A new function is a good idea,

I wouldn't be that sure about that but we might not have a better solution indeed. Two virtually identical functions one of which seem to be doing a wrong thing and is unusable for practical applications will likely bring more confusion.

@hadley
Copy link
Member Author

hadley commented Aug 26, 2016

Well technically speaking floating point numbers are also intervals that we discuss as if they were points 😉

@vspinu
Copy link
Member

vspinu commented Aug 26, 2016

Not sure what you are trying to say by this. The set of machine floating numbers is well defined and each of them is a point on a real line. Date 2000-01-01 is conceptually set and we need to expand our ceiling semantics to sets. IMO whatever semantic we pick wouldn't contradict the mathematical definition of ceiling.

In any case, I am afraid we need to reach a conclusion on this. Would you have a proposal for a new function?

Unless we can find a good name for an extra function I will be changing the semantics. It's all about tradeoffs and my balance is heavily leaning towards changing the semantics. The month ceiling issue has been running for too long. I would like to settle it once and for all.

@hadley
Copy link
Member Author

hadley commented Aug 27, 2016

I feel very strongly that your proposed change would not be a ceiling function.

@vspinu
Copy link
Member

vspinu commented Aug 27, 2016

The current version is not a ceiling function either. Technically 2001-01-01 00:00:00 is not part of 2000-01-01, the day hasn't even started then. I think the correct way to interpret a day for the purpose of ceiling is as open interval (2001-01-01 00:00:00 - 2001-01-01 24:00:00]. The ceiling is well defined then.

In any case, this issue is directly connected with the new POSIX <> Date comparison so I am delaying the release till we are fully clarified on the semantics of Date <> POSIX interface.

@vspinu
Copy link
Member

vspinu commented Aug 27, 2016

I have checked for how other languages deal with rounding.

In Java world, java.util.Date and java.util.Calendar for Java < 8 have nor functionality for rounding. New java.time library in Java 8 has truncation but no ceiling or rounding. org.joda.time has no rounding either. The only library that does rounding or ceiling is apache commons DateUtils. And DateUtils does change the value on exact boundaries.

C++ boost.date_time and cctz, and python's dateutils don't deal with rounding either.

So, leaving alone the thorny issue of converting Dates to Instants, even for ceiling of instants there is little to build upon.

@vspinu vspinu changed the title change_on_boundary -> right? Semantics of change_on_boundary in ceiling_date Aug 27, 2016
@shrektan
Copy link
Contributor

The current version is not a ceiling function either. Technically 2001-01-01 00:00:00 is not part of 2000-01-01, the day hasn't even started then.

I agree. Also, maybe the way of treating date as special datetime is not proper at all. If date is just date, then the inception of the August 2016 should be 2016-08-00, which is essentially 2016-07-31. So naturally, the month ceiling of 2016-08-01 should be 2016-09-00 = 2016-08-31. Although slightly different, it's the most used case in my experience that is to find the beginning or last date of month, quarter, or year.

@vspinu
Copy link
Member

vspinu commented Aug 28, 2016

I think @shrektan, your intuition is right. The stuff is confusing due to an implicit convolution of several things. I am preparing a formal statement on the semantics of lubridate operations which I will be including into the official docs.

The problem that you are referring to is the forward looking nature of 1-based units (day,months). If months and dates were 0-based then 2000-01-01 01:02:03 would have been written as 2000-00-00 01:02:03. That is, when we write 1H:2M in 0-based units we mean that 1h has been completed and 20min is from the second hour. When we write 2000-01-01 01:00:00 we mean that 0 days have been completed and 1hour passed from the first day. So when rounding date up you really should be rounding 2000-01-01 to 2000-01-02 00:00:00 not to 2000-01-01 00:00:00. The latter doesn't make much sense.

@hadley
Copy link
Member Author

hadley commented Aug 28, 2016

Why is 2001-01-01 00:00 not part of 2001-01-01?

@vspinu
Copy link
Member

vspinu commented Aug 28, 2016

If something has started at 0 and ended at 0, did it happen? If you include 0 into your time measurement then an abstract event that lasted for 0 seconds must have happened. This doesn't make sense IMO.

On the other hand, inclusion of the upper bound into the measurement always makes sense. When you measure 1 meter, do you think of it as upper open interval? When you run for 10sec would you think of it as 9.9999999... seconds?

I couldn't find any reference to open/close time intervals on the web. ISO8601 seem to explicitly avoid defining it in mathematical terms.

Leaving out 0 provides a clean way to generalize ceiling functions for sets, and it would work both for higher day/month units and smaller HMS units. I don't see such a definition for right closed interval.

While R/lubridate don't support other partials than Date, thinking about more "friendly" partials like hours and minutes can help getting more insight. Would you round up first hour of the day to 00:00:00 or to 01:00:00? If the latter, then same should hold for first day of the month.

@hadley
Copy link
Member Author

hadley commented Aug 28, 2016

I don't understand the reasoning. A day starts at particular time and ends before the start of the next day.

9.999999 repeating is usually consider to be the same as 10

@hadley
Copy link
Member Author

hadley commented Aug 28, 2016

I think it would be useful if you used standard interval notation. I think we might be arguing about whether a day is (00:00, 23:59] or [00:00, 23:59). Maybe to make it more clear: is a day (0, 86400] or [0, 86400)?

@vspinu
Copy link
Member

vspinu commented Aug 28, 2016

A day starts at particular time and ends before the start of the next day

You are measuring days (just like kilos, meters or whatever else). Starting from the unix origin you have to measure the passage of first day, then second. etc. So if you accept that first day is (0, 24:00:00] and not [0, 24:00:00) then you should accept the same for all the days that followed.

9.999999 repeating is usually consider to be the same as 10

Right, so it doesn't even make sense to say that mass of water in 1liter is open from above - 1kg). If you say you measured 1kg, it's 1kg. Why cannot you say same for days? 1 day is exactly 84600 seconds from 0 with upper bound included, just like 1kg is 1000g from 0.

Maybe to make it more clear: is a day (0, 86400] or [0, 86400)?

Yes, or (00:00, 24:00:00] vs [00:00:00 , 24:00:00).

This mathematical aside adds rigor but it ads a small point to the argument. I think intuitive and practical arguments should have already made it clear what the right solution should be.

@vspinu
Copy link
Member

vspinu commented Aug 28, 2016

A day starts at particular time and ends before the start of the next day.

Or, a day ends at particular time and next day starts after the end of previous day ;)

We are used to treat 00:00:00 is part of the day but that doesn't necessary make it useful or meaningful. Can it be part of the day if the first second of that day hasn't started ticking yet?

@vspinu
Copy link
Member

vspinu commented Aug 28, 2016

Let say INT is a partial like 2000-01-01 which is an open interval (2000-01-01 00:00:00 -- 2000-01-02 00:00:00]. Let Z be the set (a grid) of all ceiling points (in case of ceiling to days these are all points of the form Y-m-d 00:00:00) then ceiling of the interval INT is defined as:

ceiling(INT) = min{ z ∈ Z | ∃i ∈ INT, z ≥ i}

This above definition has some interesting properties:

  1. If INT is a point it defaults to the standard definition of ceiling function
  2. If unit of the Z grid is higher or equal to the partial's unit, we get the following meaningful behavior. First hour of the day is rounded up to 01:00:00, first day of the months Y-m-01 is rounded to Y-m-02 00:00:00 for day rounding, and to Y-(m+1)-01 00:00:00 for month rounding.
  3. It is defined for units smaller than the partial's unit. That is, rounding Y-m-d to 1 minute grid, will produce Y-m-d 00:01:00.

The 3rd implication is defenitely lubridate's stretch, but I guess it's exactly what users would expect.It also provides a way to reconcile ceiling with with the Date<>POSIX comparison for which we are confounding date with 00:00 instant.

I think this is the most uniform solution to all our problems. So I would suggest setting change_on_boundary to NULL and make it follow the above semantics. For backward compatibility I would still leave TRUE and FALSE options. Who knows who would ever need them.

@vspinu vspinu closed this as completed in 0ab5c86 Sep 7, 2016
@vspinu
Copy link
Member

vspinu commented Sep 7, 2016

I have converged on a simple algorithmic solution that IMO achieves the right tradeoff between rigor, intuitiveness and usefulness. Due to current date<>instant conversions and R's Date class limitations whichever way we would define ceiling is bound to be problematic in one regard or another.

Details:

     In ‘lubridate’ rounding of a date-time objects tries to preserve
     the class of the input object whenever that is meaningful. This is
     done by first rounding to an instant and then converting to the
     original class by usual R conventions.

Rounding Up Date Objects:

     By default rounding up ‘Date’ objects follows 3 steps:

       1. Convert to an instant representing lower bound of the Date:
          ‘2000-01-01’ -> ‘2000-01-01 00:00:00’

       2. Round up to the *next* closest rounding unit boundary. For
          example, if the rounding unit is ‘month’ then next boundary
          for ‘2000-01-01’ will be ‘2000-02-01 00:00:00’.

          The motivation for this behavior is that ‘2000-01-01’ is
          conceptually an interval ‘(2000-01-01 00:00:00 -- 2000-01-02
          00:00:00)’ and the day hasn't started clocking yet at the
          exact boundary ‘00:00:00’. Thus, it seems wrong to round up a
          day to its lower boundary.

       3. If rounding unit is smaller than a day, return the instant
          from step 2 above (‘POSIXct’), otherwise return the ‘Date’
          immediately following that instant.

     The behavior on the boundary in the second step above can be
     changed by setting ‘change_on_boundary’ to a non-‘NULL’ value.

@hadley
Copy link
Member Author

hadley commented Sep 7, 2016

Why is the date interval open ended on both sides ((2000-01-01 00:00:00 -- 2000-01-02 00:00:00))? Is that a typo?

@vspinu
Copy link
Member

vspinu commented Sep 7, 2016

That's intentional. It's just a representation of an interval without focus on open/closed boundaries. Now I think [] would be more customary, so I will change that.

That asymmetric open/closed interval idea brought more complications than it solved. From measure theoretic prospect it doesn't matter anyways - measure of (0,1) and [0, 1] is 1, but defining ceiling in terms of asymmetric interval causes problems with the definition of floor_date. Add to this that R's conversion of Date produces lower POSIX bound and that you need to deal with smaller than day rounding units, you quicly end up in a messy soup of inconsistencies. So I just gave up on trying to rigorously generalize the rounding to intervals.

I think that mental model that most people have in mind is that days are part of the body of the month like in

...|Day1|Day2|...|Day31|...
...|      January      |...

So rounding up of Day1 to boundary between January and February is a very intuitive behavior. The issue is that we still need to convert to a Date class. R's Dates are timezoneless and producing POSIX would add tz complications which we probably want to avoid. This is why I made the 3rd step in the algorithm explicit. It is also the step that results in POSIXct or Date depending on the rounding unit. I see no other solution there except producing UTC POSIXct regardless, but that would break backward compatibility.

@hadley
Copy link
Member Author

hadley commented Sep 7, 2016

Ok, that sounds reasonable to me.

@hg77
Copy link

hg77 commented Oct 4, 2016

This patch caused some tears. "Bug compatibility" matters.
I'm with hadley when it comes to the idempotence of ceiling_date() and don't see anything had to be fixed here.

But please when you change the behavoir of a function either,
a) rename it
b) make the old behavoir the default
c) at least make sure an error is thrown by any code posibly unaware of the behavior change.

Sure, I can read the changelog. Of course, I should not trust lubridate and validate its correct behavoir with a test suite. Remember all that the day when R people decide that read.csv can delete the file after reading, because it is already in memory.

And finally thanks for providing a handy library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants