Ahmet Zamanis
In time series classification, the goal is to predict the class label attributed to an entire time series rather than a single observation. A common example is classifying ECG readings as normal or abnormal, or predicting whether a machine is faulty from a sequence of sensor readings.
In this report, we’ll take multivariate sequences of weather measurements from several locations, and try to predict which location these sequences were observed in. We’ll test and compare several time series classifiers from the sktime package. We’ll also try the method of transforming our time series into images (with the pyts package) and classifying the images with a convolutional neural network built in PyTorch Lightning.
The data consists of daily mean temperature and total precipitation measurements across 13 Canadian locations, some ranging from 1940 to 2020, others starting from 1960. The data was downloaded from OpenML, shared by user Elif Ceren Gök.
The Python scripts for this analysis are available on the GitHub repository. They may not be fully up to date with the report.
Show imports
# Data handling
import pandas as pd
import numpy as np
import arff # Installed as liac-arff
import warnings
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# PyTorch Lightning & Optuna
import torch
import lightning as L
import optuna
# Time series classifiers
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.classification.kernel_based import RocketClassifier, Arsenal
from sktime.classification.dictionary_based import MUSE
# Transformations
from pyts.image import RecurrencePlot
from sklearn.preprocessing import OneHotEncoder
# Performance metrics
from sklearn.metrics import accuracy_score, log_loss, confusion_matrix
# Custom Lightning classes
from X_LightningClassesClassif import TrainDataset, TestDataset, CNN, OptunaPruning
# Helper functions
from X_HelperFunctionsClassif import plot_confusion, scale_dims, get_images, plot_images, validate_cnn, test_model
Show settings
# Set print options
np.set_printoptions(suppress=True, precision=4)
pd.options.display.float_format = '{:.4f}'.format
pd.set_option('display.max_columns', None)
# Set plotting options
plt.rcParams['figure.dpi'] = 200
plt.rcParams['savefig.dpi'] = 200
plt.rcParams["figure.autolayout"] = True
sns.set_style("darkgrid")
# Set Torch settings
torch.set_default_dtype(torch.float32)
torch.set_float32_matmul_precision('high')
L.seed_everything(1923, workers = True)
warnings.filterwarnings("ignore", ".*does not have many workers.*")
The data is in .arff format, and we’ll use the liac-arff package to read it. We have daily mean temperature and total precipitation values for each of the 13 locations.
# Load raw data
raw_data = arff.load(open("./InputData/canadian_climate.arff", "r"))
# Convert to pandas dataframe & view
df = pd.DataFrame(
raw_data["data"], columns = [x[0] for x in raw_data["attributes"]])
print(df.iloc[:,0:4])
LOCAL_DATE MEAN_TEMPERATURE_CALGARY \
0 01-Jan-1940 00:00:00 -11.4000
1 02-Jan-1940 00:00:00 -12.0000
2 03-Jan-1940 00:00:00 -12.0000
3 04-Jan-1940 00:00:00 -11.4000
4 05-Jan-1940 00:00:00 -13.1000
... ... ...
29216 28-Dec-2019 00:00:00 -7.7000
29217 29-Dec-2019 00:00:00 -3.3000
29218 30-Dec-2019 00:00:00 -1.6000
29219 31-Dec-2019 00:00:00 4.3000
29220 01-Jan-2020 00:00:00 -0.3000
TOTAL_PRECIPITATION_CALGARY MEAN_TEMPERATURE_EDMONTON
0 0.5000 NaN
1 0.5000 NaN
2 1.0000 NaN
3 0.8000 NaN
4 0.5000 NaN
... ... ...
29216 0.0000 -10.4000
29217 0.0000 -8.6000
29218 0.0000 -10.3000
29219 0.0000 -2.6000
29220 0.0000 -4.0000
[29221 rows x 4 columns]
# Check missing values
pd.isnull(df).sum()
LOCAL_DATE 0
MEAN_TEMPERATURE_CALGARY 189
TOTAL_PRECIPITATION_CALGARY 173
MEAN_TEMPERATURE_EDMONTON 7657
TOTAL_PRECIPITATION_EDMONTON 7646
MEAN_TEMPERATURE_HALIFAX 7164
TOTAL_PRECIPITATION_HALIFAX 7226
MEAN_TEMPERATURE_MONCTON 2336
TOTAL_PRECIPITATION_MONCTON 2767
MEAN_TEMPERATURE_MONTREAL 755
TOTAL_PRECIPITATION_MONTREAL 730
MEAN_TEMPERATURE_OTTAWA 76
TOTAL_PRECIPITATION_OTTAWA 81
MEAN_TEMPERATURE_QUEBEC 1214
TOTAL_PRECIPITATION_QUEBEC 1227
MEAN_TEMPERATURE_SASKATOON 2473
TOTAL_PRECIPITATION_SASKATOON 3755
MEAN_TEMPERATURE_STJOHNS 808
TOTAL_PRECIPITATION_STJOHNS 822
MEAN_TEMPERATURE_TORONTO 74
TOTAL_PRECIPITATION_TORONTO 80
MEAN_TEMPERATURE_VANCOUVER 53
TOTAL_PRECIPITATION_VANCOUVER 55
MEAN_TEMPERATURE_WHITEHORSE 1691
TOTAL_PRECIPITATION_WHITEHORSE 4095
MEAN_TEMPERATURE_WINNIPEG 124
TOTAL_PRECIPITATION_WINNIPEG 247
dtype: int64
We’ll focus our attention on three of the pre-1960 locations: Ottawa, Toronto and Vancouver. First we’ll convert the data from wide to long format, and retrieve the rows only for these three cities.
# Wide to long conversion
df = pd.wide_to_long(
df, stubnames = ["MEAN_TEMPERATURE", "TOTAL_PRECIPITATION"],
i = "LOCAL_DATE", j = "LOCATION", sep = "_", suffix = r"\w+")
df = df.reset_index()
# Select observations only for Ottawa, Toronto, Vancouver
df = df[df["LOCATION"].isin(["OTTAWA", "TORONTO", "VANCOUVER"])]
# Convert LOCAL_DATE to datetime, set index for NA interpolation
df["LOCAL_DATE"] = pd.to_datetime(df["LOCAL_DATE"])
df = df.set_index("LOCAL_DATE")
# View long data
print(df)
LOCATION MEAN_TEMPERATURE TOTAL_PRECIPITATION
LOCAL_DATE
1940-01-01 OTTAWA -17.0000 0.0000
1940-01-02 OTTAWA -16.7000 0.5000
1940-01-03 OTTAWA -12.3000 0.0000
1940-01-04 OTTAWA -16.4000 0.0000
1940-01-05 OTTAWA -19.5000 0.0000
... ... ... ...
2019-12-28 VANCOUVER 5.3000 3.0000
2019-12-29 VANCOUVER 7.1000 2.4000
2019-12-30 VANCOUVER 7.5000 5.8000
2019-12-31 VANCOUVER 8.4000 21.2000
2020-01-01 VANCOUVER 7.8000 0.8000
[87663 rows x 3 columns]
We have a few missing weather measurements across the time period. We’ll fill them with time-based interpolation, which weights values by the time difference between observations.
# Interpolate missing values in Ottawa, Toronto, Vancouver
df = df.groupby("LOCATION", group_keys = False).apply(
lambda g: g.interpolate(method = "time"))
df = df.reset_index()
The daily mean temperature and total precipitation values will be our weather variables. To predict the city a sequence of weather measurements came from, we’ll need to relate these variables to the dates they were observed on.

- To this end, we’ll extract month, week of year and day of year features from our date column.
- We’ll use cyclical encoding to turn these categorical variables into numeric ones while conveying their cyclical nature to the algorithms we’ll use. This creates a pair of sine and cosine values for each time variable, for a total of 8 variables per location.
# Add cyclic terms for month, week of year and day of year
df["month_sin"] = np.sin(2 * np.pi * df["LOCAL_DATE"].dt.month / 12)
df["month_cos"] = np.cos(2 * np.pi * df["LOCAL_DATE"].dt.month / 12)
df["week_sin"] = np.sin(2 * np.pi * df["LOCAL_DATE"].dt.isocalendar().week / 53)
df["week_cos"] = np.cos(2 * np.pi * df["LOCAL_DATE"].dt.isocalendar().week / 53)
df["day_sin"] = np.sin(2 * np.pi * df["LOCAL_DATE"].dt.dayofyear / 366)
df["day_cos"] = np.cos(2 * np.pi * df["LOCAL_DATE"].dt.dayofyear / 366)
We’ll split the data for each city into 28-day sequences. We’ll store the data in a 3D numpy array, which is accepted as input by both sktime and pyts.
Show code to split data into 28-day sequences
# Assign n as one less than the target sequence length
n = 27
# Add counter for days
df["DAYCOUNT"] = df.groupby("LOCATION").LOCATION.cumcount().add(1)
# Add rowgroups: A unique number for each n+1 day sequence
df["ROWGROUP"] = (df["DAYCOUNT"] // (n + 1)).astype(str)
df = df.drop("DAYCOUNT", axis = 1)
# Eliminate rowgroups which are not of length N + 1 per city
n_rowgroups = len(df["ROWGROUP"].unique())
rowgroup_lengths = [
int(len(df[df["ROWGROUP"] == str(x)]) / 3) for x in range(0, n_rowgroups)]
rowgroups_to_keep = np.where(np.array(rowgroup_lengths) == n + 1)[0].tolist()
rowgroups_to_keep = [str(x) for x in rowgroups_to_keep]
df = df.loc[df["ROWGROUP"].isin(rowgroups_to_keep)]
# Retrieve targets for each sequence
y = df.groupby(["LOCATION", "ROWGROUP"]).head(1)["LOCATION"]
y = y.reset_index().drop("index", axis = 1).values.flatten()
# Get class labels
classes = np.unique(y)
# Retrieve features as 3Darray of shape (n_sequences, n_dimensions, seq_length)
# Get 2D arrays of (n_dimensions, seq_length) for each sequence
x = df.groupby(["LOCATION", "ROWGROUP"], as_index = False).apply(
lambda g: np.array(
[g["MEAN_TEMPERATURE"].values,
g["TOTAL_PRECIPITATION"].values,
g["month_sin"].values,
g["month_cos"].values,
g["week_sin"].values,
g["week_cos"].values,
g["day_sin"].values,
g["day_cos"].values
]
)
)
# Get a single 3Darray
x = np.array([x[i] for i in range(0, len(x))], dtype = np.float64)
# Print data shape
print("Shape of data (n_sequences, n_dimensions, seq_length): " + str(x.shape))
Shape of data (n_sequences, n_dimensions, seq_length): (3126, 8, 28)
We have 3126 multivariate sequences of 8 variables, each sequence 28 days long. Each of the three cities has an equal number of sequences (1042). Many time series classifiers don’t accept sequences of unequal length, so we dropped the remainder days that could not make up a full 28-day sequence.
Next, we’ll split the training and testing data while minding the time order of the sequences. The training data will consist of the first 80% of sequences for each city, and the testing data will consist of the last 20%. We’ll do this split manually, as shown in the cell below.
Show code to split training & testing data for each city
l = len(y) # Length of entire data, for all 3 cities
len_test = int(l / 3 * 0.2) # Length of test data for one city
len_train = int(l / 3 - len_test) # Length of train data for one city
j = int(l / 3) # Length of entire data for one city
# Get indices for training set, for each city
idx_train = list(range(0, len_train)) + list(range(j, len_train + j)) + list(range(j * 2, len_train + (j * 2)))
# Get indices for testing set as the difference from the training set
idx_test = list(range(0, l))
idx_test = list(set(idx_test).difference(idx_train))
# Perform train-test split
y_train, y_test = y[idx_train], y[idx_test]
x_train, x_test = x[idx_train], x[idx_test]
# Print data shapes
print(
"Shape of training data (n_sequences, n_dimensions, seq_length): " +
str(x_train.shape))
print(
"Shape of testing data (n_sequences, n_dimensions, seq_length): " +
str(x_test.shape))
Shape of training data (n_sequences, n_dimensions, seq_length): (2502, 8, 28)
Shape of testing data (n_sequences, n_dimensions, seq_length): (624, 8, 28)
The pyts package offers several methods to convert time series into images, which can then be classified with image classification models. We’ll convert our sequences into recurrence plots: black-and-white images that represent the pairwise distances between the time steps (called trajectories) in a sequence.

- For each sequence, we’ll end up with a 28 x 28 matrix of distances for each of the 8 variables.
- The problem then practically turns into image classification, with 8 images (channels) of size 28 x 28 for each sequence.
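To make the transformation concrete, below is a minimal numpy sketch of an unthresholded recurrence plot for a single univariate sequence. With pyts’ default settings (single-point trajectories, no threshold), RecurrencePlot computes essentially this for each dimension of each sequence.

import numpy as np

def recurrence_plot(x):
    # Unthresholded recurrence plot of a 1D sequence:
    # R[i, j] = |x_i - x_j| for trajectories of length 1
    return np.abs(x[:, None] - x[None, :])

seq = np.sin(np.linspace(0, 4 * np.pi, 28)) # Toy 28-step sequence
print(recurrence_plot(seq).shape) # (28, 28)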
After the recurrence plot transformation, we’ll also scale each channel between 0 and 1. We’ll do the scaling manually, as the usual scaler functions do not accept our data dimensions; see the helper functions script for more details.
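The sketch below illustrates the idea with a hypothetical helper (not the actual scale_dims function), assuming the training set’s per-channel minima and maxima are reused to scale the test set.

import numpy as np

def minmax_scale_channels(train, test):
    # Scale each channel of (n_seq, n_channels, H, W) arrays to [0, 1],
    # using the training set's per-channel minima & maxima for both sets
    mins = train.min(axis = (0, 2, 3), keepdims = True)
    maxs = train.max(axis = (0, 2, 3), keepdims = True)
    span = np.where(maxs > mins, maxs - mins, 1.0) # Avoid division by zero
    return (train - mins) / span, (test - mins) / span

# Example call on dummy data with the same layout as our image arrays
a, b = minmax_scale_channels(
    np.random.rand(4, 8, 28, 28), np.random.rand(2, 8, 28, 28))
print(a.min(), a.max()) # 0.0, 1.0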
# Create RecurrencePlot transformer
trafo_image = RecurrencePlot()
# Transform the features
x_train_img, x_test_img = get_images(x_train, x_test, trafo_image)
# Print data shape
print(
"Shape of one image-transformed sequence (n_dims, seq_length, seq_length): " + str(x_train_img[0].shape))
Shape of one image-transformed sequence (n_dims, seq_length, seq_length): (8, 28, 28)
We can plot our transformed data as black and white images. We’ll plot two consecutive sequences of weather measurements for each city.
# Plot the recurrence plot for two consecutive sequences per city, for the weather
# dimensions
plot_images(x_train_img, 0, "MeanTemp", 0, 1, len_train)
plot_images(x_train_img, 1, "TotalPrecip", 0, 1, len_train)
Each plot is the pairwise distance matrix of the trajectories in one sequence, for one variable. If the transformation is informative, images from the same city should resemble one another more than images from different cities.
We can also plot the time dimensions, and confirm they are the same for each city.

- These dimensions won’t identify the cities by themselves; they are only fed into the models to interact with the weather variables.
- A particular weather measurement alone may not identify the city, but the same measurement at a specific time of year could.
# Plot the recurrence plot for two consecutive sequences per city, for the month dimensions
plot_images(x_train_img, 2, "MonthSin", 0, 1, len_train)
plot_images(x_train_img, 3, "MonthCos", 0, 1, len_train)
We’ll try several time series classification methods, focusing on those that natively handle multivariate time series.

- The idea behind most of these multivariate methods is to perform certain transformations on the time series, or to extract certain features from them. These can then be used as inputs to regular classification algorithms.
- The image transformation we applied is one such method. Other methods take a text classification approach, transforming time series into word representations and deriving features from those.
- Some methods even rely on extracting simple summary statistics from the time series.
The first approach we’ll use is the k-nearest neighbors classifier adapted to time series data. The main difference is the distance metric: dynamic time warping (DTW) distance.

- DTW calculates the optimal match between the observations of two time series, even if they are of unequal length, by minimizing the differences between matched pairs subject to certain conditions.
- DTW is often used in tasks like speech recognition, where the analyzed sequences may be of different lengths.
- DTW outputs a similarity measure for two time series, which serves as the distance metric in the kNN classifier. A minimal sketch of the computation follows below.
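As a rough illustration, here is a minimal dynamic-programming sketch of the unconstrained DTW distance between two univariate sequences. sktime’s implementation adds refinements such as windowing constraints and multivariate support.

import numpy as np

def dtw_distance(a, b):
    # Unconstrained DTW: accumulate the minimal cost of matching
    # each pair of observations, allowing stretching & compression
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# The same shape traced at different speeds yields a small DTW distance
print(dtw_distance(
    np.sin(np.linspace(0, 2 * np.pi, 30)),
    np.sin(np.linspace(0, 2 * np.pi, 40))))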
The number of nearest neighbors used to classify each observation is the key parameter in kNN. We’ll just use 3, but the result may considerably change with different values.
# Create KNN classifier
model_knn = KNeighborsTimeSeriesClassifier(
n_neighbors = 3, # N. of nearest neighbors to compare each observation
n_jobs = -1)
We’ll train our models, test their performances and retrieve their predictions on the test set. We’ll aggregate and compare the performance metrics at the end.
# Test KNN classifier
preds_knn, probs_knn, acc_knn, loss_knn = test_model(
model_knn, x_train, x_test, y_train, y_test, scale = True)
# View predicted probabilities for each city
print("KNN classifier predicted probabilities for each city:")
print(pd.DataFrame(probs_knn, columns = ["Ottawa", "Toronto", "Vancouver"]))
KNN classifier predicted probabilities for each city:
Ottawa Toronto Vancouver
0 0.6667 0.3333 0.0000
1 0.6667 0.3333 0.0000
2 0.0000 0.6667 0.3333
3 0.0000 0.3333 0.6667
4 0.0000 1.0000 0.0000
.. ... ... ...
619 0.0000 0.0000 1.0000
620 0.0000 0.0000 1.0000
621 0.0000 0.0000 1.0000
622 0.0000 0.0000 1.0000
623 0.0000 0.0000 1.0000
[624 rows x 3 columns]
Since our model makes predictions with the 3 nearest neighbors, the probability predictions are fairly simplistic: each probability is just the fraction of the 3 neighbors belonging to that class, so it can only take the values 0, 1/3, 2/3 and 1.
ROCKET is a fast and efficient transformation used to derive features for time series classification. It relies on convolutional kernels, not unlike the convolutional neural network approach we’ll use later. See here for a good explanation of convolution, which is best made visually.

- Instead of learning the kernel weights during training (as we will do with a CNN), ROCKET simply generates a large number of kernels randomly. In the sktime implementation, each kernel is randomly assigned a length of 7, 9 or 11.
- The convolutions are applied to each observation (a univariate or multivariate time series, not a single value), and two features are extracted from each resulting feature map: the maximum value and the proportion of positive values. These can then be used as features in classifier algorithms.
- With a univariate time series of length N, the convolution is between an input vector of length N and a kernel vector of length L.
- With a multivariate time series, each kernel is assigned a random selection of D features as its input channels, and a set of weights is randomly generated for each selected channel. The convolution is then between an input matrix of size N x D and a kernel matrix of size L x D.
- In effect, we convert our 3D data of shape (n_sequences, n_features, seq_length) into 2D data of shape (n_sequences, n_kernels * 2).
- sktime’s RocketClassifier combines a ROCKET transformation with sklearn’s RidgeClassifierCV, an L2-regularized linear classifier with built-in cross-validation. We’ll stick to this approach, but other classifiers can also be used. A toy sketch of the kernel features follows below.
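As a toy illustration of the two pooled features (not sktime’s implementation, which also randomizes dilations, paddings and biases per kernel), consider a single random kernel applied to one univariate sequence:

import numpy as np

rng = np.random.default_rng(1923)

def rocket_features(x, kernel, bias):
    # Convolve one series with one kernel; np.convolve flips the kernel,
    # so reverse it to get the cross-correlation used in ML convolutions
    fmap = np.convolve(x, kernel[::-1], mode = "valid") + bias
    # Pool the feature map into two features: the maximum value and the
    # proportion of positive values (PPV)
    return fmap.max(), (fmap > 0).mean()

x = rng.normal(size = 28) # Toy univariate sequence
kernel = rng.normal(size = 9) # Random weights, length drawn from {7, 9, 11}
print(rocket_features(x, kernel, bias = rng.uniform(-1, 1)))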
We’ll also try Arsenal, which is simply an ensemble of ROCKET + RidgeClassifierCV classifiers, 25 of them by default. The ensemble weighs the predictions of each classifier according to its built-in cross-validation accuracy.
Besides the default ROCKET transformation described above, there are two more variations of the ROCKET transform: MiniRocket and MultiRocket.

- MiniRocket is an almost entirely deterministic version of ROCKET: the kernels are fixed at length 9 and the kernel weights are restricted to two values. The only feature derived from the feature maps is the proportion of positive values.
- MultiRocket applies MiniRocket to both the original and the first-order differenced series. It also derives three more features from the feature maps: the mean of positive values, the mean of the indices of positive values, and the longest stretch of positive values.
- Both MiniRocket and MultiRocket performed considerably better than the default ROCKET on this problem. We’ll continue with MultiRocket in this demonstration.
- RocketClassifier automatically chooses the univariate or multivariate version of either ROCKET variant based on the input data shape. MultiRocket is not to be confused with a multivariate version of the default ROCKET; both variants have univariate and multivariate versions.
# Create RocketClassifier
model_rocket = RocketClassifier(
rocket_transform = "multirocket", # Use MultiRocket variant
n_jobs = -1, random_state = 1923)
# Create Arsenal classifier (probabilistic ROCKET ensemble, memory intensive)
model_arsenal = Arsenal(
rocket_transform = "multirocket", # Use MultiRocket variant
random_state = 1923)
# Test RocketClassifier
preds_rocket, probs_rocket, acc_rocket, loss_rocket = test_model(
model_rocket, x_train, x_test, y_train, y_test, scale = False)
# View predicted probabilities for each city (ROCKET is non-probabilistic)
print("MultiRocket classifier predicted probabilities for each city:")
print(pd.DataFrame(probs_rocket, columns = ["Ottawa", "Toronto", "Vancouver"]))
MultiRocket classifier predicted probabilities for each city:
Ottawa Toronto Vancouver
0 0.0000 1.0000 0.0000
1 0.0000 1.0000 0.0000
2 1.0000 0.0000 0.0000
3 1.0000 0.0000 0.0000
4 0.0000 1.0000 0.0000
.. ... ... ...
619 0.0000 0.0000 1.0000
620 0.0000 0.0000 1.0000
621 0.0000 0.0000 1.0000
622 0.0000 0.0000 1.0000
623 0.0000 0.0000 1.0000
[624 rows x 3 columns]
# Test Arsenal classifier
preds_arsenal, probs_arsenal, acc_arsenal, loss_arsenal = test_model(
model_arsenal, x_train, x_test, y_train, y_test, scale = False)
# View predicted probabilities for each city
print("Arsenal classifier predicted probabilities for each city:")
print(pd.DataFrame(probs_arsenal, columns = ["Ottawa", "Toronto", "Vancouver"]))
Arsenal classifier predicted probabilities for each city:
Ottawa Toronto Vancouver
0 0.7535 0.2465 -0.0000
1 0.8332 0.1668 -0.0000
2 0.8313 0.1687 -0.0000
3 0.7561 0.2439 -0.0000
4 0.5687 0.4313 -0.0000
.. ... ... ...
619 -0.0000 -0.0000 1.0000
620 -0.0000 -0.0000 1.0000
621 0.0430 -0.0000 0.9570
622 -0.0000 -0.0000 1.0000
623 -0.0000 -0.0000 1.0000
[624 rows x 3 columns]
A single RocketClassifier does not yield probabilistic predictions, but an Arsenal ensemble does, even though its probabilities are derived from the accuracy-weighted average of 25 hard class predictions.
The next method we’ll use is a bag-of-words approach.

- In general, these approaches assign “letters” to values in a time series, binning similar values into the same categories.
- Subsequences can be extracted from the time series and transformed into sequences of letters, i.e. “words”.
- Predictions can be made from the occurrence frequencies of words, borrowing text classification concepts such as TF-IDF, or by using the derived words as features in regular classifiers.
The WEASEL method extracts words from a time series with multiple sliding windows of varying sizes, and selects the most predictive ones, with the chi-squared test by default. MUSE extends this approach to multivariate time series.
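Below is a toy sketch of the letters-and-words idea, using simple quantile binning on raw values. Note that WEASEL actually derives its letters from Fourier coefficients (Symbolic Fourier Approximation), so this only conveys the general flavor.

import numpy as np
from collections import Counter

def bag_of_words(x, window = 4, n_bins = 3):
    # Bin each value into a letter by its quantile, slide a window over
    # the letter sequence, and count the resulting words
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    symbols = ["abc"[np.searchsorted(edges, v)] for v in x]
    words = ["".join(symbols[i:(i + window)])
             for i in range(len(x) - window + 1)]
    return Counter(words)

print(bag_of_words(np.sin(np.linspace(0, 4 * np.pi, 28))).most_common(3))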
The sktime implementation of MUSE combines the WEASEL + MUSE transformation with a logistic regression classifier, trained on the features chosen by a one-way ANOVA test. We’ll use this method with pretty much the default settings.

- We won’t use the first-order differences of the features, as they are likely not meaningful for our time features. MiniRocket does not use first-order differences while MultiRocket does, and both methods perform very similarly on this problem.
# Create MUSE classifier
model_muse = MUSE(
use_first_order_differences = False, # Not meaningful for time features
support_probabilities = True, # Train LogisticRegression which outputs probs.
n_jobs = -1, random_state = 1923)
# Test classifier
preds_muse, probs_muse, acc_muse, loss_muse = test_model(
model_muse, x_train, x_test, y_train, y_test, scale = True)
# View predicted probabilities for each city
print("MUSE classifier predicted probabilities for each city:")
print(pd.DataFrame(probs_muse, columns = ["Ottawa", "Toronto", "Vancouver"]))
MUSE classifier predicted probabilities for each city:
Ottawa Toronto Vancouver
0 0.9806 0.0193 0.0001
1 0.9182 0.0815 0.0004
2 0.1718 0.8188 0.0094
3 0.8385 0.1569 0.0046
4 0.3225 0.5561 0.1215
.. ... ... ...
619 0.1121 0.0023 0.8856
620 0.0054 0.0128 0.9818
621 0.0651 0.0010 0.9339
622 0.0006 0.0230 0.9764
623 0.0001 0.0289 0.9710
[624 rows x 3 columns]
The probability predictions of the MUSE classifier are likely more sophisticated than those of the previous methods.
Finally, we’ll perform image classification on the data we transformed into recurrence plots, using a typical convolutional neural network. For details on the code & architecture, see the custom Lightning classes script. Also see here for an excellent source on CNNs and deep learning in general. Below, we’ll talk briefly about the intuition behind a CNN and our implementation.
Convolutional layers apply convolution kernels to their inputs, similar to ROCKET. Unlike the kernels in ROCKET though, the kernel weights in a convolutional layer are optimized as parameters during model training, just like weights and bias terms in a dense layer.
The main idea in CNNs is to take in a high-dimensional input, and to reduce its dimensionality with multiple convolutional layers, while also increasing the number of channels. This basically trades image resolution in favor of more features, allowing us to capture non-linear relationships & interaction effects from the data in a computationally efficient manner.
- The convolution operations “summarize” the values of neighboring points in a matrix, which may be a matrix of pixels of an image. In our case, it is a matrix of pairwise distances between the time steps of a sequence.
- The performance of CNNs depends on several assumptions about the nature of the problem and the data (see the source above for detailed explanations). The way I understand it, we’re assuming we can solve our problem by “locally summarizing” neighboring values, and by looking for relationships & interactions between these “local summaries”.
Our network inputs (one sequence, ignoring the training batch size) are of shape (8, 28, 28): images of size 28 x 28, with 8 channels (features). The network starts with three successive blocks of convolutional + maximum pooling layers. Each block roughly halves the input size while roughly doubling the number of channels.
- Pooling simply applies a window to its input (the output of a convolution) and pools the values in the window according to a function, usually the maximum. This can help the network generalize better.
- The fourth convolutional + pooling block reduces the number of channels to 3, the number of target class labels in our classification problem.
- Finally, we apply global average pooling & flattening to arrive at a vector of length 3 as the final output: the predicted logits of belonging to each target class. To convert these into probability predictions, we can simply apply a softmax activation function.
- Older CNN architectures feed the outputs of convolutional layers into dense layers to arrive at the final predictions. More recent architectures instead leverage global average pooling, which reduces model complexity and often performs better.
- See here for a detailed breakdown of some modern CNN architectures, which were referenced while designing the network for this problem.
- The CNN used in this report mostly resembles the Network in Network (NiN) architecture, which avoids dense layers at the end. Instead, two 1x1 convolutions follow each “main” convolutional layer, inducing local non-linearities while avoiding large & computationally expensive dense layers. A rough sketch of this pattern follows below.
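The actual network is defined in the custom Lightning classes script; the sketch below only illustrates the NiN-style pattern described in this section, with illustrative (untuned) channel sizes.

import torch
from torch import nn

def nin_block(in_ch, out_ch):
    # One "main" convolution followed by two 1x1 convolutions (local
    # non-linearities), then max pooling to halve the image size
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size = 3, padding = 1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size = 1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size = 1), nn.ReLU(),
        nn.MaxPool2d(kernel_size = 2))

net = nn.Sequential(
    nin_block(8, 16), # (8, 28, 28) -> (16, 14, 14)
    nin_block(16, 32), # -> (32, 7, 7)
    nin_block(32, 64), # -> (64, 3, 3)
    nn.Conv2d(64, 3, kernel_size = 1), # -> (3, 3, 3), one channel per class
    nn.AdaptiveAvgPool2d(1), # Global average pooling -> (3, 1, 1)
    nn.Flatten()) # -> logits of length 3, no dense layers

logits = net(torch.randn(1, 8, 28, 28)) # One dummy sequence
print(torch.softmax(logits, dim = 1)) # Class probabilities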
Below are the additional data preparation steps needed to train our CNN model with Lightning. One advantage of CNNs over many other neural network architectures is the relatively small number of hyperparameters to optimize. In this case, I only tuned the learning rate & learning rate decay with Optuna. Here we’ll load the previously saved best tune; see here for the code used for hyperparameter tuning.
Show code to prepare data for CNN training
# Convert multiclass targets into binary matrices of shape (n_seq, n_classes)
encoder_onehot = OneHotEncoder(sparse_output = False)
y_train_img = encoder_onehot.fit_transform(y_train.reshape(-1, 1))
y_test_img = encoder_onehot.transform(y_test.reshape(-1, 1)) # Reuse the encoder fitted on the training labels
# Load data into TrainDataset
train_data = TrainDataset(x_train_img, y_train_img)
test_data = TestDataset(x_test_img)
# Create dataloaders
train_loader = torch.utils.data.DataLoader(
train_data, batch_size = 128, num_workers = 0, shuffle = True)
test_loader = torch.utils.data.DataLoader(
test_data, batch_size = len(test_data), num_workers = 0, shuffle = False)
# Import best trial
best_trial_cnn = pd.read_csv("./OutputData/trials_cnn1.csv").iloc[0,]
# Retrieve best hyperparameters
hyperparams_dict = {
"input_channels": x_train.shape[1],
"learning_rate": best_trial_cnn["params_learning_rate"],
"lr_decay": best_trial_cnn["params_lr_decay"]
}
We’ll train our CNN, predict the testing data and retrieve both the probability & class predictions in the same format as the previous models.
# Create trainer
trainer = L.Trainer(
max_epochs = int(best_trial_cnn["user_attrs_n_epochs"]),
accelerator = "gpu", devices = "auto", precision = "16-mixed",
enable_model_summary = True,
logger = True,
log_every_n_steps = 20,
enable_progress_bar = True,
enable_checkpointing = True
)
# Create & train model
model = CNN(hyperparams_dict)
trainer.fit(model, train_loader)
# Predict testing data
probs_cnn = trainer.predict(model, test_loader)
probs_cnn = probs_cnn[0].cpu().numpy().astype(np.float32)
# Convert to class predictions
preds_cnn = classes[np.argmax(probs_cnn, axis = 1)]
# Calculate performance metrics
acc_cnn = accuracy_score(classes[np.argmax(y_test_img, axis = 1)], preds_cnn)
loss_cnn = log_loss(y_test_img, probs_cnn)
# View predicted probabilities for each city
print("CNN classifier predicted probabilities for each city:")
print(pd.DataFrame(probs_cnn, columns = ["Ottawa", "Toronto", "Vancouver"]))
CNN classifier predicted probabilities for each city:
Ottawa Toronto Vancouver
0 0.4491 0.5261 0.0247
1 0.7688 0.2198 0.0113
2 0.6921 0.2930 0.0149
3 0.7281 0.2501 0.0218
4 0.6701 0.3141 0.0158
.. ... ... ...
619 0.0002 0.0000 0.9998
620 0.0001 0.0000 0.9998
621 0.0005 0.0000 0.9995
622 0.0001 0.0000 0.9999
623 0.0000 0.0000 1.0000
[624 rows x 3 columns]
Again, one benefit of the CNN model over some of the previous models is meaningful probability predictions.
Now we can compare the performances of our classifiers. Since our data is perfectly balanced between the three target classes, we can use accuracy as a simple and intuitive metric for class predictions overall. We’ll also assess the quality of probability predictions with log loss.
We’ll compare the performance metrics of our classifiers with random chance predictions as a baseline. With 3 perfectly balanced target classes, random chance would yield an accuracy of roughly 33%. We’ll also compute log loss for random chance, assuming each class is assigned a probability of 0.33 for all sequences.
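As a sanity check: with a uniform probability of 1/3 per class, every sequence contributes −ln(1/3) to the log loss, so the baseline works out to ln 3.

import numpy as np
print(np.log(3)) # 1.0986..., matching the random chance row below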
Show code to create performance metrics table
# Generate random chance performance metrics
p = 0.33
probs_random = np.repeat(np.repeat(p, 3), len(y_test))
probs_random = np.reshape(probs_random, (len(y_test), 3))
loss_random = log_loss(y_test, probs_random)
# Gather performance metrics as dictionary
dict_metrics = {
"Accuracy": [round(x, 4) for x in [
p, acc_knn, acc_rocket, acc_arsenal, acc_muse, acc_cnn]],
"Log loss": [round(x, 4) for x in [
loss_random, loss_knn, loss_rocket, loss_arsenal, loss_muse, loss_cnn]]
}
# Print as table
print(pd.DataFrame(dict_metrics, index = [
"Random choice", "kNN with DTW", "MultiRocket", "Arsenal", "MUSE", "CNN"]))
Accuracy Log loss
Random choice 0.3300 1.0986
kNN with DTW 0.6603 5.5750
MultiRocket 0.6667 12.0146
Arsenal 0.7452 0.8516
MUSE 0.6074 1.1958
CNN 0.6554 0.7633
We see the Arsenal method performs considerably better than the others in accuracy, and even a single MultiRocket + ridge model comes second.

- The CNN model is second from last in accuracy, though it comes very close to kNN and MultiRocket.
- MUSE takes last place in accuracy, though it’s not far behind the other methods, and all methods are a huge improvement over random chance.

The quality of the probability predictions tells a different story: the CNN has the best log loss, followed by Arsenal.

- MUSE performs slightly worse in log loss than the random chance baseline.
- kNN and MultiRocket do not output meaningful probability predictions, which shows in their very high log loss values.
We have three target classes, and neither performance metric tells us about the isolated performance for a single class. We’ll plot the confusion matrices of each model’s class predictions to assess this visually. We won’t plot one for MultiRocket, as Arsenal is an ensemble of MultiRocket + ridge models.
Show code to create confusion matrix plots
# Gather class predictions as dictionary
dict_preds = {
"kNN with DTW": preds_knn,
# "ROCKET + ridge": preds_rocket,
"Arsenal + ridge": preds_arsenal,
"MUSE + logistic": preds_muse,
"CNN with recurrence plots": preds_cnn
}
# Gather axes as dictionary
dict_axes = {
"kNN with DTW": (0, 0),
"Arsenal + ridge": (0, 1),
"MUSE + logistic": (1, 0),
"CNN with recurrence plots": (1, 1)
}
# Create figure
fig, ax = plt.subplots(2,2, sharey = True, sharex = True)
_ = fig.suptitle("Confusion matrices of time series classifiers")
# Generate plots
for key in dict_preds.keys():
# Get confusion matrix
matrix = confusion_matrix(y_test, dict_preds[key], labels = classes)
# Create plot
idx = dict_axes[key]
_ = sns.heatmap(
matrix, xticklabels = classes, yticklabels = classes, cmap = "Reds",
annot = True, fmt = "g", square = True, cbar = False, linecolor = "black",
linewidths = 0.5, ax = ax[idx])
_ = ax[idx].set_xlabel("Predicted classes")
_ = ax[idx].set_ylabel("True classes")
_ = ax[idx].set_title(key)
plt.show()
plt.close("all")
A simple way to assess the confusion matrix plots is to look at the diagonals, which represent cases of true classes matching predicted classes. This way, we can see Arsenal does the best job in class predictions, overall and for each class individually.
Looking at the rows gives us an idea of the isolated performance for each class.

- All models do a great job of identifying sequences from Vancouver, but generally struggle to tell apart Ottawa and Toronto. This is expected, as Vancouver is on the west coast with a considerably different climate, while Ottawa and Toronto are relatively close to one another in eastern Canada.
- kNN often confuses Ottawa & Toronto for Vancouver, in addition to one another. The other models mostly confuse Ottawa & Toronto for one another, and not for Vancouver.
- Arsenal has the best performance in identifying all cities.
We compared several different approaches to multivariate time series classification, and which one is “best” depends on what we value.

- While the Arsenal ensemble of MultiRocket + ridge classifiers clearly performed better in class predictions, the CNN model trained on image-transformed data performed better in probability predictions.
- The latter can be more important in many cases, especially when predicting inherently unstable phenomena like the weather. We may prefer a realistic probability estimate over a simplified class prediction, so we can assess the uncertainty ourselves.
- Still, the Arsenal ensemble also achieved a comparable log loss, which is impressive considering the simple nature of its probability predictions. Also keep in mind that a ROCKET transformation can in theory be combined with any classifier, not just the ridge classifier used here.
- Personally, I believe in finding the simplest tool that solves a problem. In future time series classification problems, ROCKET variations will likely be my first choice.
Any comments or suggestions are welcome.