-
Notifications
You must be signed in to change notification settings - Fork 45
Date Column Design
Proposal: a new "date" column type.
Users currently use "timestamp" to represent dates. This is ambiguous: to different people, the same timestamp happens at different dates.
You should be allowed to format a "timestamp" in any timezone. That makes no sense with dates.
A new "date" format will store an int32 number of days since 1970-01-01. (Why not integer number of periods, like Pandas? Because Parquet/Arrow tools wouldn't format the integers correctly.)
A "date" column has a "unit": "year", "month", "day", "week", "quarter".
The unit determines which dates are allowed. "year" allows 1970-01-01
and not 1970-01-02
, for example. Only the first date in its unit is valid. The others are invalid: Workbench will not store them or display them. (Weeks start Monday, as per ISO8601.)
Let's defer deciding upon a "format" akin to the Number type's "format". For now, format one "auto" format per unit. These are how the data is displayed, not how it is stored. (Exception: if you "convert to text", it will be stored this way.)
- Year:
2021
- Month:
2021-03
- Quarter:
2021-Q1
- Week:
2021-03-01
- Day:
2021-03-03
The export format is always YYYY-MM-DD
.
Rationale: ISO8601 is a widely-accepted standard; there's no widely-accepted standard for "Quarter" or "Week starting Sunday".
Code-wise, this keeps our generic Arrow/Parquet tools generic. They don't need to understand "unit".
- Arrow has
date32
but no notion of "unit": only days are allowed. - Pandas has
Period
. It's very flexible: e.g., it allows "3-hour intervals starting from 2021-03-03T13:48:12.142Z", and it supports int64. There's also "business day" and "business hour" functionality. All this flexibility makes for ValueErrors and edge cases when alternating between Timestamp and Period.
- We don't deviate from Arrow
date32
, for better or worse. In most units, mostdate32
values are invalid and Workbench modules must take care to only produce valid dates. - We use the name "Date", not Pandas' "Period", because "Period" has annoying nuance. "Period" means, "continuous span between two Timestamps": our users must think of timezone when they think of the period. "Date" omits the concept of timezone. Most users don't think about timezones when they're thinking of dates. (Rephrased: many users will convert Timestamps to Dates, but few users will want to convert Dates to Timestamps.)
- What is a unit (period's) origin? ISO "Week" starts Monday, but in USA culture Sunday is common. There are 12 typical financial quarter origins (Q1=Jan-Mar, Q1=Feb-Apr, Q1=Mar-May, etc.) and week-starts-Saturday is the norm in some cultures. For now, let's stick with origin=Monday, origin=January for now. Later, we can add an "origin" field to go alongside the "unit" field.
- Are we locking ourselves away from "Hours" and "Minutes"? (These would involve timezones.)
The column header should show the column type as "Date (week starting Monday)" or "Date (Year)".
The user can convert from "year" to "day": Workbench will impute the first day of the year. Ditto for all casts to a more-specific type ("week-monday => day", "month=>quarter", etc.)
The user can "round" from "day" to "year" or "week" that the date appears in.
The user can "round" _Timestamp_s to Dates, by supplying a timezone and unit.
It should be difficult to convert from "week" to "month", because the data isn't there. The user can explicitly "convert + round" ... and forcing the user go to through those two steps should encourage them to realize they're creating errors in their workflow.
We should have something similar to Timestamp Math: how many days between two date columns, for instance? The dates must be in the same unit.
The most obvious use case is helping users chart with aggregation.
As of 2021-03-02, the "group by" module lets users group timestamps into "rounded timestamps". That was a mistake that led to misleading workflows: those timestamps have no timezone.
Ideally, "group by" + "by date" users will all migrate to grouping by a regular "date".
First, we'll build a "Timestamp to Date" conversion module.
Second, let's consider whether to retrofit the "group by" date-math section (there may be huge backwards-compatibility problems) or to nudge the user towards "timestamp to date" with a quick-fix button. Nudging existing users would be easy: they already indicated the unit they want. How do we nudge new users?
Line charts will also have an "auto" format per unit. This should look good enough for now.
TODO put these on Pivotal:
- ✓ Make
cjwpandasmodule
create and supportpandas.PeriodArray
. (Nix most ofcjwkernel.pandas.types
to get there.) - ✓ Make a
DateCellFormatter
. - ✓ Write a first module that converts to Date.
- ✓ Write more modules that use Date.
- Prototype a Group By that groups by Date.
- Help users migrate to Group By Date.
- convert-date (??? What to do with this one???)
- ✓ converttotext
- countbydate (??? what to do with this one???)
- ✓ filter (condition)
- ✓ formula (need to convert from int32 to Excel int date)
- groupby (??? what to do with this one???)
- ✓ linechart
- ✓ reshape
- selectsql (maybe)
- splittimestamp (output dates, not text)
- timestampmath
- ✓ transpose
- ✓ converttimestamptodate (requires timezone)
- converttexttodate (no timezone? We could warn when data has timestamps)
- convertdatetotimestamp?
- datemath