ts_datasets

This library implements Python classes that manipulate numerous time series datasets into standardized pandas DataFrames. The sub-modules are ts_datasets.anomaly for time series anomaly detection and ts_datasets.forecast for time series forecasting. Install the package by calling pip install -e . from the command line. You can then load a dataset (e.g. the "realAWSCloudwatch" subset of the Numenta Anomaly Benchmark) by calling

from ts_datasets.anomaly import NAB
dataset = NAB(subset="realAWSCloudwatch", rootdir=path_to_NAB)

Note that if you have installed this package in editable mode (i.e. with the -e flag), you need not specify the root directory.

Each dataset supports the following features:

  1. __getitem__: you may call ts, metadata = dataset[i]. ts is a time-indexed pandas DataFrame, with each column representing a different variable (in the case of multivariate time series). metadata is a dict or pd.DataFrame with the same index as ts, with different keys indicating different dataset-specific metadata (train/test split, anomaly labels, etc.) for each timestamp.
  2. __len__: Calling len(dataset) will return the number of time series in the dataset.
  3. __iter__: You may iterate over the pandas representations of the time series in the dataset with for ts, metadata in dataset: ...
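The three methods above can be illustrated with a minimal stand-in class; the class and the data below are hypothetical and only mirror the interface described, they are not part of ts_datasets:

```python
import pandas as pd

class ToyDataset:
    """Illustrative stand-in mirroring the ts_datasets interface."""

    def __init__(self):
        index = pd.date_range("2021-01-01", periods=4, freq="D")
        ts = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=index)
        metadata = pd.DataFrame({"trainval": [True, True, False, False]},
                                index=index)
        self._series = [(ts, metadata)]

    def __getitem__(self, i):
        # Returns (time-indexed DataFrame, metadata aligned on the same index)
        return self._series[i]

    def __len__(self):
        # Number of time series in the dataset
        return len(self._series)

    def __iter__(self):
        return iter(self._series)

dataset = ToyDataset()
ts, metadata = dataset[0]
for ts, metadata in dataset:
    pass  # iterate over every (ts, metadata) pair
```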

For each time series in the dataset, metadata is a dict or pd.DataFrame that will always have the following keys:

  • trainval: a boolean pd.Series indicating whether each timestamp of the time series should be used for training/validation (if True) or testing (if False)

For anomaly detection datasets, metadata will also have the key:

  • anomaly: a boolean pd.Series indicating whether each timestamp is anomalous
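These metadata keys make it easy to split each series into train/test portions and pull out the anomaly labels. A sketch with hand-built data (the values are illustrative, not drawn from any real dataset):

```python
import pandas as pd

index = pd.date_range("2021-01-01", periods=6, freq="D")
ts = pd.DataFrame({"value": [1.0, 2.0, 3.0, 9.0, 4.0, 5.0]}, index=index)
metadata = pd.DataFrame({
    "trainval": [True, True, True, False, False, False],
    "anomaly":  [False, False, False, True, False, False],
}, index=index)

# The trainval flag carves out the training/validation vs. test portions.
train = ts[metadata.trainval]
test = ts[~metadata.trainval]

# Anomaly labels restricted to the test portion.
test_labels = metadata.anomaly[~metadata.trainval]
```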

We currently support a number of datasets for time series anomaly detection (ts_datasets.anomaly), including the Numenta Anomaly Benchmark (NAB) used in the example above.

We currently support the following datasets for time series forecasting (ts_datasets.forecast):

  • M4 Competition
    • There are 100,000 univariate time series with different granularity, including Yearly (23,000 sequences), Quarterly (24,000 sequences), Monthly (48,000 sequences), Weekly (359 sequences), Daily (4,227 sequences) and Hourly (414 sequences) data.
  • Energy Power Grid
    • There is one 10-variable time series.
    • Each univariate records the energy power usage in a particular region.
  • Seattle Trail for Bike and Pedestrian
    • There is one 5-variable time series.
    • Each univariate records the bicycle/pedestrian flow along a different direction on the trail.
  • Solar Energy Plant
    • There is one 405-variable time series.
    • Each univariate records the solar energy power at each detector in the plant.
    • By default, the data loader returns only the first 100 of the 405 univariates.
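A wide multivariate series like the solar plant data is simply a DataFrame with one column per variable, so keeping only the first 100 univariates amounts to a column slice. A sketch with synthetic data (this is not the actual loader, just the shape of the operation):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a 405-variable time series.
index = pd.date_range("2021-01-01", periods=24, freq="D")
ts = pd.DataFrame(np.random.randn(24, 405),
                  index=index,
                  columns=[f"var_{i}" for i in range(405)])

# Keeping the first 100 univariates is a simple positional column slice.
ts_subset = ts.iloc[:, :100]
```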

More details on each dataset can be found in their class-level docstrings, or in the API doc.