This library implements Python classes that convert numerous time series datasets into standardized `pandas` DataFrames. The sub-modules are `ts_datasets.anomaly` for time series anomaly detection, and `ts_datasets.forecast` for time series forecasting. Install the package by running `pip install -e .` from the command line. Then, you can load a dataset (e.g. the "realAWSCloudwatch" split of the Numenta Anomaly Benchmark) by calling:
from ts_datasets.anomaly import NAB
dataset = NAB(subset="realAWSCloudwatch", rootdir=path_to_NAB)
Note that if you have installed this package in editable mode (i.e. by specifying `-e`), the root directory need not be specified.
Each dataset supports the following features:
- `__getitem__`: you may call `ts, metadata = dataset[i]`. `ts` is a time-indexed `pandas` DataFrame, with each column representing a different variable (in the case of multivariate time series). `metadata` is a dict or `pd.DataFrame` with the same index as `ts`, with different keys indicating different dataset-specific metadata (train/test split, anomaly labels, etc.) for each timestamp.
- `__len__`: calling `len(dataset)` will return the number of time series in the dataset.
- `__iter__`: you may iterate over the `pandas` representations of the time series in the dataset with `for ts, metadata in dataset: ...`
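To make the three features above concrete, here is a minimal sketch of a class that follows the same conventions. `ToyDataset` is a hypothetical stand-in written for illustration only; it is not part of `ts_datasets`, and the real loaders read their data from disk rather than generating it.

```python
import pandas as pd


class ToyDataset:
    """Hypothetical stand-in mimicking the interface described above."""

    def __init__(self):
        index = pd.date_range("2021-01-01", periods=6, freq="D")
        self.time_series = []
        self.metadata = []
        for offset in (0.0, 10.0):
            # One univariate time series per offset, indexed by timestamp.
            ts = pd.DataFrame({"value": [offset + i for i in range(6)]}, index=index)
            # Metadata shares the index of the time series it describes.
            md = pd.DataFrame({"trainval": [True] * 4 + [False] * 2}, index=index)
            self.time_series.append(ts)
            self.metadata.append(md)

    def __len__(self):
        # Number of time series in the dataset.
        return len(self.time_series)

    def __getitem__(self, i):
        # Returns the (ts, metadata) pair for the i-th time series.
        return self.time_series[i], self.metadata[i]

    def __iter__(self):
        # Yields (ts, metadata) pairs in order.
        return (self[i] for i in range(len(self)))


dataset = ToyDataset()
ts, metadata = dataset[0]
n_series = len(dataset)
shapes = [ts_i.shape for ts_i, _ in dataset]
```

Code written against this interface (indexing, `len`, and iteration) should carry over to any of the real dataset classes, since they expose the same three methods.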
For each time series in the dataset, `metadata` is a dict or `pd.DataFrame` that will always have the following keys:
- `trainval`: (`bool`) a `pd.Series` indicating whether each timestamp of the time series should be used for training/validation (if `True`) or testing (if `False`)
For anomaly detection datasets, `metadata` will also have the key:
- `anomaly`: (`bool`) a `pd.Series` indicating whether each timestamp is anomalous
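As a sketch of how these keys are typically used, the snippet below splits a time series into train and test portions with `trainval` and pulls out the anomaly labels for the test portion. The values in `ts` and `metadata` are made up for illustration; with a real dataset they would come from `ts, metadata = dataset[i]`.

```python
import pandas as pd

# Illustrative data standing in for one (ts, metadata) pair.
index = pd.date_range("2021-01-01", periods=6, freq="D")
ts = pd.DataFrame({"value": [1.0, 1.1, 0.9, 1.0, 9.5, 1.2]}, index=index)
metadata = pd.DataFrame(
    {
        "trainval": [True, True, True, True, False, False],
        "anomaly": [False, False, False, False, True, False],
    },
    index=index,
)

# Boolean masks select rows because metadata shares the index of ts.
is_trainval = metadata["trainval"]
train = ts[is_trainval]
test = ts[~is_trainval]

# Ground-truth anomaly labels restricted to the test timestamps.
test_labels = metadata.loc[~is_trainval, "anomaly"]
```

Key access with `metadata["trainval"]` works the same way whether `metadata` is a dict of `pd.Series` or a `pd.DataFrame`; the `.loc` lookup on the last line assumes the `pd.DataFrame` form.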
We currently support the following datasets for time series anomaly detection (`ts_datasets.anomaly`):
- IOps Competition
- Numenta Anomaly Benchmark
- Synthetic (synthetic data generated using this script)
- SMAP & MSL (multivariate time series anomaly detection datasets from NASA)
- SMD (server machine dataset)
We currently support the following datasets for time series forecasting (`ts_datasets.forecast`):
- M4 Competition
  - There are 100,000 univariate time series at different granularities: Yearly (23,000 sequences), Quarterly (24,000 sequences), Monthly (48,000 sequences), Weekly (359 sequences), Daily (4,227 sequences) and Hourly (414 sequences).
- Energy Power Grid
  - There is one 10-variable time series.
  - Each univariate series records the energy power usage in a particular region.
- Seattle Trail for Bike and Pedestrian
  - There is one 5-variable time series.
  - Each univariate series records the bicycle/pedestrian flow along a different direction on the trail.
- Solar Energy Plant
  - There is one 405-variable time series.
  - Each univariate series records the solar energy power at one detector in the plant.
  - By default, the data loader returns only the first 100 of the 405 univariate series.
More details on each dataset can be found in its class-level docstring, or in the API doc.