ABOUT SPEECH COMMANDS DATASET
__________________________________
Of the datasets available in torchaudio.datasets, I chose the SPEECHCOMMANDS dataset for the following reasons:
1. It is a small dataset.
2. Annotations are available that specify the spoken word in each audio file.
3. It is a dataset of 35 commands spoken by different people.
4. All audio files are about 1 second long.
-- The actual loading and formatting steps happen when a data point is accessed; torchaudio converts the audio to tensors automatically.
-- If you want to load an audio file directly, use ```torchaudio.load()```, which returns a tuple containing the newly created tensor along with the sampling frequency of the audio file.
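As a minimal sketch of direct loading (the file path below is hypothetical; any WAV inside the downloaded dataset works):
```
import torchaudio

# Hypothetical path to one recording inside the downloaded dataset.
path = "./SpeechCommands/speech_commands_v0.02/yes/0a7c2a8d_nohash_0.wav"

waveform, sample_rate = torchaudio.load(path)
print(waveform.shape)  # e.g. torch.Size([1, 16000]) for a 1-second mono clip
print(sample_rate)     # e.g. 16000
```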
IMPORTS
_________________
** framework - a set of tools and libraries that provides a foundation for developing software. Frameworks also provide pre-built components and functionality.
** module - a self-contained unit of software that encapsulates related functions, variables, and other elements.
** interface - defines how other parts of the program can interact with the functionality provided by a module without needing to know the internal implementation details.
- torch : provides tensor computation (similar to NumPy) with GPU acceleration support.
- torch.nn : nn is the neural network module, which includes predefined layers, loss functions, and other utilities for building neural networks.
- torch.nn.functional : the functional interface of the nn module; it provides activation functions, convolution operations, and other functions commonly used in neural networks.
- torch.optim : contains optimization algorithms like SGD and Adam.
- tqdm : provides progress bars for loops and iterable tasks to monitor lengthy operations.
- torchaudio.functional : contains functional operations for audio processing, such as resampling and filtering.
- torchaudio.transforms : contains transformations like Spectrogram and other time-frequency transforms.
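For reference, an import block consistent with the list above (a sketch; the aliases are conventional choices, not necessarily the original's):
```
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torchaudio
import torchaudio.functional as AF  # aliased to avoid clashing with torch.nn.functional as F
import torchaudio.transforms as T
from torchaudio.datasets import SPEECHCOMMANDS

from tqdm import tqdm
```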
SPLIT THE AUDIO DATA INTO TRAINING, TESTING, AND VALIDATION SUBSETS
______________________________________________________________________
The SubsetSC class is initialized with an optional argument subset, which specifies which subset of the data to load ("training", "testing", or "validation").
--> ``` def __init__(self, subset: str = None): ```
This means that the constructor of this class takes an argument subset of string type, which defaults to None.
So, if no string is passed as an argument, the default value None is used.
--> ``` super().__init__("./", download=True) ```
This calls the constructor of the inherited class 'SPEECHCOMMANDS' with the argument "./", the path to the directory where the data should be downloaded; download=True tells it to download the dataset if it is not already present there.
--> ```def load_list(filename):
        filepath = os.path.join(self._path, filename)```
- os.path.join() is a function from the os.path module which is used to construct a path name by concatenating various path components.
- self._path refers to attribute named _path of the object instance.
Note that, by convention, attributes prefixed with _ are intended to be private.
- So, os.path.join(self._path, filename) constructs a full path to the file by joining the directory path (self._path) and the filename (filename). The resulting full path is stored in the variable filepath.
--> ``` with open(filepath) as fileobj:
        return [os.path.normpath(os.path.join(self._path, line.strip())) for line in fileobj]
```
- The with statement is used in Python to acquire and release a resource around the execution of the indented block that follows it.
- open(filepath) returns a file object, which is captured by the variable fileobj. Inside the indented block you can read from (and write to) the file using fileobj.
--> ``` return [os.path.normpath(os.path.join(self._path, line.strip())) for line in fileobj] ```
- This is a list comprehension; the square brackets denote the list comprehension.
- os.path.normpath normalizes a path name by collapsing redundant separators. It takes a string as input and returns a normalized version of that string.
- os.path.join(self._path, line.strip())
This expression constructs a full path by joining the directory path (self._path) and the stripped line.
-- line.strip() removes leading and trailing whitespace from each line in fileobj.
- The return statement produces a list of full paths, one for each line in the file.
--> ```if subset == "validation":
        self._walker = load_list("validation_list.txt")
    elif subset == "testing":
        self._walker = load_list("testing_list.txt")
    elif subset == "training":
        excludes = load_list("validation_list.txt") + load_list("testing_list.txt")
        excludes = set(excludes)
        self._walker = [w for w in self._walker if w not in excludes]```
- This code is a conditional block that sets the _walker attribute of an instance based on the value of the subset parameter passed to the class initializer.
- In the training branch, it concatenates the validation and testing lists and converts the result into a set to remove any duplicates and allow fast membership tests (excludes = set(excludes)).
- Next, it filters the existing _walker list to exclude any paths that are present in the excludes set. This is done using a list comprehension that iterates over each path in _walker and checks if it is not present in the excludes set.
Inside the constructor (__init__ method) of SubsetSC:
It calls the constructor of the parent class SPEECHCOMMANDS, passing "./" as the path and download=True.
It defines a nested function load_list(filename) to load a list of file paths from a text file.
Depending on the value of the subset argument, it loads different lists of file paths:
If subset is "validation", it loads the validation list from "validation_list.txt".
If subset is "testing", it loads the testing list from "testing_list.txt".
If subset is "training", it keeps the dataset's full file list but filters out paths that appear in the validation and testing lists, to avoid data leakage.
After defining the SubsetSC class, it creates instances of SubsetSC for training and testing data (train_set and test_set respectively) by passing "training" and "testing" as arguments to the constructor.
Finally, it accesses the first item in the training set (train_set[0]), which returns a tuple containing waveform data, sample rate, label, speaker ID, and utterance number.
Overall, this code snippet is designed to manage subsets of the Speech Commands dataset for training and testing machine learning models.
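Assembled from the snippets above, the full class and its use (following the PyTorch Speech Commands tutorial these notes are based on):
```
import os
from torchaudio.datasets import SPEECHCOMMANDS


class SubsetSC(SPEECHCOMMANDS):
    def __init__(self, subset: str = None):
        super().__init__("./", download=True)

        def load_list(filename):
            filepath = os.path.join(self._path, filename)
            with open(filepath) as fileobj:
                return [os.path.normpath(os.path.join(self._path, line.strip())) for line in fileobj]

        if subset == "validation":
            self._walker = load_list("validation_list.txt")
        elif subset == "testing":
            self._walker = load_list("testing_list.txt")
        elif subset == "training":
            excludes = set(load_list("validation_list.txt") + load_list("testing_list.txt"))
            self._walker = [w for w in self._walker if w not in excludes]


train_set = SubsetSC("training")
test_set = SubsetSC("testing")
```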
EXPLORING THE DATA
___________________
The Speech Commands dataset object returned by datasets.SPEECHCOMMANDS() does not have attributes like 'labels', 'speaker_ids', or 'utterance_numbers';
instead, each sample in the dataset is a tuple containing the waveform tensor, sample rate, label (spoken word), speaker ID, and utterance number.
To explore the contents of the Speech Commands dataset, you typically access the data samples directly using indexing.
-- sample: This variable will be assigned the waveform tensor of the audio sample. It represents the actual audio data in tensor format, which you can analyze, preprocess, or use as input to machine learning models.
-- sample_rate: This variable will be assigned the sample rate of the audio sample. The sample rate indicates how many samples per second are included in the waveform tensor. It is typically measured in Hertz (Hz).
-- label: This variable will be assigned the label (spoken word) associated with the audio sample. For example, if the audio sample contains someone saying "yes", then the label will be "yes".
-- speaker_id: This variable will be assigned the ID of the speaker who uttered the audio sample. This information can be useful for tasks involving speaker recognition or diarization.
-- utterance_number: This variable will be assigned the utterance number of the audio sample. It identifies the specific instance of the spoken word within the dataset.
Overall, the line sample, sample_rate, label, speaker_id, utterance_number = data[0] retrieves the properties of the first audio sample in the Speech Commands dataset and assigns them to individual variables for further analysis or processing.
The waveform tensor of an audio sample has two dimensions: a channel dimension and a time dimension,
with the number of time samples equal to the sample rate times the clip duration.
For example, if the sample rate is 16000 and the clip is 1 second long, the shape will be
torch.Size([1, 16000]): one (mono) channel with 16000 samples along time;
in other words, it is effectively a 1D signal with 16000 elements.
Note that it contains the amplitude values of the audio signal sampled over time.
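A quick sketch of this access pattern, using the train_set instance created earlier:
```
# Unpack the first sample of the training set.
sample, sample_rate, label, speaker_id, utterance_number = train_set[0]

print(sample.shape)  # torch.Size([1, 16000]) -> [channels, time]
print(sample_rate)   # 16000
print(label, speaker_id, utterance_number)
```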
FORMATTING THE DATA
____________________
This is a good place to apply transformations to the data. For the waveform, we downsample the audio for faster processing without losing too much classification power.
This process is called downsampling. It is like squeezing the audio to make it take up less space, so the computer can process it faster.
```new_sample_rate = 8000
transform = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=new_sample_rate)
transformed = transform(waveform)
```
Earlier the sample rate was 16000 Hz; after resampling it is 8000 Hz.
Now, sometimes with other collections of sound recordings, you might have files that have more than one audio channel, like having sound come from both the left and right side of your headphones.
But in our case, each recording we have only has one channel, which is perfect for what we need.
So, we don't need to do anything special to handle multiple channels because there's only one channel to begin with. We can just focus on the single channel we have and work with that.
--> ENCODING EACH WORD USING ITS INDEX IN THE LIST OF LABELS
NOTE : after instantiating the class SubsetSC we got a list of audio recordings.
Now we want to turn that list into two batched tensors for the model.
(Also note that data in the SPEECHCOMMANDS dataset is in the form of tuples which already contain the tensor representation of audio)
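A minimal sketch of the index-based encoding (following the tutorial; labels is the sorted list of distinct command words, built from train_set):
```
import torch

# Sorted list of the 35 distinct command words in the dataset.
labels = sorted(list(set(datapoint[2] for datapoint in train_set)))


def label_to_index(word):
    # Return the position of the word in labels, as a tensor.
    return torch.tensor(labels.index(word))


def index_to_label(index):
    # Return the word at the given index in labels (inverse of label_to_index).
    return labels[index]


print(index_to_label(label_to_index("yes")))  # "yes"
```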
WHAT ARE THE TWO BATCHED TENSORS, WHAT ARE THEY USED FOR, AND HOW DO THEY RELATE TO THE TENSORS IN THE TUPLES?
The two batched tensors (one stacking the waveforms, one stacking the encoded labels) allow parallel processing of multiple data instances; they are built from the per-sample tensors contained in the tuples.
- Before batching, make all the tensors in a batch the same length by padding the shorter ones with zeros.
- PyTorch provides a pad_sequence function in the torch.nn.utils.rnn module.
This function is commonly used in NLP, particularly when dealing with sequences of variable lengths.
batch = [item.t() for item in batch]: This line transposes each tensor in the batch using .t(). Transposing is performed to ensure that padding is applied correctly along the correct dimension. After this operation, the sequences are represented as columns in each tensor.
--> ```torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)```
- batch : a list of tensors each representing a sequence which may have different lengths.
- padding_value=0.: This argument specifies the value used for padding shorter sequences to match the length of the longest sequence in the batch. In this case, 0. (float zero) is used as the padding value.
- batch_first=True: This argument specifies whether the batch dimension should be the first dimension in the output tensor. If True, the output tensor will have shape (batch_size, max_sequence_length, *). If False, the output tensor will have shape (max_sequence_length, batch_size, *), where max_sequence_length is the length of the longest sequence in the batch.
return batch.permute(0, 2, 1): After padding, the dimensions of the resulting tensor are rearranged using the .permute() method. With batch_first=True, pad_sequence returns a tensor of shape (batch, time, channel); permute(0, 2, 1) swaps the last two dimensions to give (batch, channel, time). This step ensures that the tensor has the shape expected for further processing, since PyTorch's 1D convolution layers take input as (batch, channel, time).
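Assembled, the padding helper described above (as in the tutorial):
```
def pad_sequence(batch):
    # Make all tensors in a batch the same length by padding with zeros.
    batch = [item.t() for item in batch]  # transpose each [channel, time] tensor to [time, channel]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
    return batch.permute(0, 2, 1)  # back to [batch, channel, time]
```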
COLLATE FUNCTION
This code defines a function named collate_fn that serves as a custom collation function for data loading in PyTorch. Let's break down each part of the function:
Input Argument:
batch: This argument represents a list of data tuples. Each data tuple contains information about a single sample, typically consisting of the waveform (audio data), sample rate, label (class/category), speaker ID, and utterance number.
Initialization:
tensors, targets = [], []: Two empty lists, tensors and targets, are initialized. These lists will be used to gather the waveforms and labels from the batch, respectively.
Processing Data Tuples:
The function iterates over each data tuple in the batch using a for loop. For each tuple, it extracts the waveform and label information. Since the function only requires the waveform and label, other elements such as sample rate, speaker ID, and utterance number are ignored (using _ to denote unused variables).
The label is then encoded as an index using the label_to_index function. This likely converts the label from its original string format to a numerical index, which is commonly used in machine learning tasks.
Batch Preparation:
After gathering the waveforms and labels from the batch, the list of tensors (waveforms) is passed to the pad_sequence function. This function pads the individual tensors in the list to make them the same length, producing a batched tensor where sequences are padded with zeros.
The list of encoded labels is converted into a single tensor using torch.stack(). This function concatenates the tensors along a new dimension, creating a single tensor representing the labels for the entire batch.
Output:
The function returns two objects: the batched tensor of waveforms (tensors) and the tensor of labels (targets). These objects are typically used as input to a neural network model during training or evaluation.
Overall, this collate_fn function is responsible for preprocessing a batch of data tuples, preparing the waveforms as a batched tensor and encoding the labels as numerical indices, ready to be consumed by a machine learning model in PyTorch.
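Following the walkthrough above, the collate function (as in the tutorial, with label_to_index and pad_sequence defined earlier):
```
def collate_fn(batch):
    # A data tuple has the form:
    # waveform, sample_rate, label, speaker_id, utterance_number
    tensors, targets = [], []

    # Gather waveforms and encode labels as indices; the other fields are ignored.
    for waveform, _, label, *_ in batch:
        tensors += [waveform]
        targets += [label_to_index(label)]

    # Pad the waveforms to a common length and stack the label indices.
    tensors = pad_sequence(tensors)
    targets = torch.stack(targets)

    return tensors, targets
```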
DATA LOADERS
Data loaders play a crucial role in the training and evaluation pipeline of machine learning models by efficiently loading and preprocessing data from datasets. They enable batch processing, shuffling, preprocessing, and parallelism, making it easier to work with large-scale datasets and train complex models.
Two data loaders are created: train_loader for the training dataset and test_loader for the testing dataset.
Each data loader is created using torch.utils.data.DataLoader, which provides an iterable over a dataset, supporting batch processing and multi-threading.
For both loaders, the dataset (train_set for train_loader and test_set for test_loader) is specified as the first argument.
batch_size is set to the previously defined value.
shuffle=True is used for the training loader to shuffle the dataset before creating batches, while shuffle=False is used for the testing loader to maintain the order of the dataset.
drop_last=False ensures that the last batch is not dropped if it's smaller than the specified batch size.
collate_fn=collate_fn specifies a custom collation function collate_fn to be used for batch creation.
num_workers and pin_memory are set based on the device as determined earlier.
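A sketch of the two loaders as described (batch_size = 256 is an assumed value; device is the one determined earlier):
```
batch_size = 256  # assumed; defined earlier in the original code

if device == "cuda":
    num_workers = 1
    pin_memory = True
else:
    num_workers = 0
    pin_memory = False

train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=collate_fn,
    num_workers=num_workers,
    pin_memory=pin_memory,
)
test_loader = torch.utils.data.DataLoader(
    test_set,
    batch_size=batch_size,
    shuffle=False,
    drop_last=False,
    collate_fn=collate_fn,
    num_workers=num_workers,
    pin_memory=pin_memory,
)
```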
VISUALIZATION PARAMETERS
________________________
There are approximately 100,000 audio samples in the dataset;
visualizing all of them would be impractical and overwhelming.
Instead, you can select a subset of samples to visualize that represents
the variety of spoken words, speakers, and recording conditions present in the dataset.
A few things to keep in mind for a holistic visualization:
-- Random Sampling: Randomly select a subset of samples from the dataset to visualize (see the sketch after this list).
This approach ensures that you get a diverse representation of the dataset without biasing towards specific words or speakers.
-- Stratified Sampling: If the dataset is labeled with spoken words (labels), you can perform stratified sampling to
ensure that each spoken word is represented in the subset proportionally to its occurrence in the dataset.
This helps ensure that you visualize samples from all spoken words in the dataset.
-- Speaker-Based Sampling: If the dataset includes speaker IDs, you can select samples from different speakers to visualize the
variability in speech patterns and accents across speakers.
-- Keyword-Based Sampling: If you're interested in specific spoken words or keywords, you can select samples corresponding to those words
for visualization. This can be useful for understanding how different words are pronounced and how their waveforms vary.
-- Environmental Conditions: Consider visualizing samples recorded under different environmental conditions,
such as background noise levels, to understand how the audio quality varies across recordings.
-- Error Analysis: If you've already trained a model on the dataset, you can visualize samples that were misclassified or
samples that the model struggled with. This can help identify challenging cases and areas for improvement.
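A minimal sketch of the random-sampling approach (matplotlib is assumed to be available; this plotting code is illustrative, not from the original):
```
import random
import matplotlib.pyplot as plt

# Randomly pick a handful of samples to plot.
indices = random.sample(range(len(train_set)), 4)

fig, axes = plt.subplots(len(indices), 1, figsize=(8, 8))
for ax, i in zip(axes, indices):
    waveform, sample_rate, label, speaker_id, _ = train_set[i]
    ax.plot(waveform.t().numpy())
    ax.set_title(f"label: {label}, speaker: {speaker_id}")
plt.tight_layout()
plt.show()
```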
NEURAL NETWORKS
_________________
This code defines a convolutional neural network (CNN) model called M5 using PyTorch's nn.Module class. Let's break down each part of the code:
Class Definition (M5):
The M5 class inherits from nn.Module, which is the base class for all neural network modules in PyTorch.
The class contains an __init__ method and a forward method, which define the model's architecture and the forward pass computation, respectively.
Initialization (__init__ method):
The __init__ method initializes the model's layers and parameters.
It takes several arguments:
n_input: Number of input channels (default: 1, for single-channel mono audio).
n_output: Number of output classes (default: 35).
stride: Stride value for the first convolutional layer (default: 16).
n_channel: Number of channels in the convolutional layers (default: 32).
The method defines the following layers:
Four convolutional layers (conv1, conv2, conv3, conv4) with specified kernel sizes and number of channels.
Four batch normalization layers (bn1, bn2, bn3, bn4) to stabilize and accelerate training.
Four max pooling layers (pool1, pool2, pool3, pool4) to reduce the temporal dimension.
One fully connected layer (fc1) to produce the final output.
Forward Pass (forward method):
The forward method defines the computation performed during the forward pass of the model.
It applies the defined layers sequentially to the input tensor x, followed by activation functions (ReLU) after each convolutional and batch normalization layer.
The output of the last convolutional layer is passed through an average pooling layer (F.avg_pool1d) to compute the average value along the temporal dimension.
Finally, the output is passed through a linear layer (fc1) followed by a logarithmic softmax function (F.log_softmax) along the last dimension (dimension 2).
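For reference, the M5 architecture just described (as defined in the PyTorch Speech Commands tutorial these notes follow):
```
class M5(nn.Module):
    def __init__(self, n_input=1, n_output=35, stride=16, n_channel=32):
        super().__init__()
        self.conv1 = nn.Conv1d(n_input, n_channel, kernel_size=80, stride=stride)
        self.bn1 = nn.BatchNorm1d(n_channel)
        self.pool1 = nn.MaxPool1d(4)
        self.conv2 = nn.Conv1d(n_channel, n_channel, kernel_size=3)
        self.bn2 = nn.BatchNorm1d(n_channel)
        self.pool2 = nn.MaxPool1d(4)
        self.conv3 = nn.Conv1d(n_channel, 2 * n_channel, kernel_size=3)
        self.bn3 = nn.BatchNorm1d(2 * n_channel)
        self.pool3 = nn.MaxPool1d(4)
        self.conv4 = nn.Conv1d(2 * n_channel, 2 * n_channel, kernel_size=3)
        self.bn4 = nn.BatchNorm1d(2 * n_channel)
        self.pool4 = nn.MaxPool1d(4)
        self.fc1 = nn.Linear(2 * n_channel, n_output)

    def forward(self, x):
        x = self.pool1(F.relu(self.bn1(self.conv1(x))))
        x = self.pool2(F.relu(self.bn2(self.conv2(x))))
        x = self.pool3(F.relu(self.bn3(self.conv3(x))))
        x = self.pool4(F.relu(self.bn4(self.conv4(x))))
        x = F.avg_pool1d(x, x.shape[-1])  # average over the temporal dimension
        x = x.permute(0, 2, 1)            # [batch, 1, 2 * n_channel]
        x = self.fc1(x)
        return F.log_softmax(x, dim=2)
```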
Model Instantiation and Parameter Counting:
After defining the model, an instance of M5 is created with specific arguments (n_input and n_output) based on the transformed input shape and the number of labels.
The model is then moved to the specified device (e.g., CPU or GPU).
The function count_parameters is defined to count the total number of trainable parameters in the model. It iterates through the model's parameters and counts the number of elements (numel()) for each parameter that requires gradients (requires_grad).
The total number of parameters is printed out to the console.
Overall, this code defines a CNN model architecture for processing audio data and computes the total number of trainable parameters in the model. It demonstrates how to define and instantiate a neural network model in PyTorch for audio classification tasks.
model = M5(n_input=transformed.shape[0], n_output=len(labels)): This line instantiates an instance of the M5 class, which represents the convolutional neural network (CNN) model.
n_input=transformed.shape[0]: The number of input channels (n_input) is set to the first dimension of the transformed tensor's shape,
which is the number of channels in the input data (1 for these mono recordings).
n_output=len(labels): The number of output classes (n_output) is set to the length of the labels list, indicating the number of classes the model needs to predict.
Moving the Model to the Specified Device:
model.to(device): This line moves the model (including all its parameters and buffers) to the specified device. The device variable typically represents a computing device such as a GPU or CPU.
This step is crucial for utilizing hardware acceleration (e.g., GPU) during model training and inference.
Printing Model Summary:
print(model): This line prints a summary of the model, showing its architecture (the layers and their configurations).
The model summary provides a concise overview of the model's structure, making it easier to verify the correctness of the architecture.
Counting the Number of Parameters:
The count_parameters function is defined to count the total number of trainable parameters in the model.
n = count_parameters(model): This line calculates the total number of trainable parameters in the model by calling the count_parameters function.
The function iterates through all model parameters and counts the number of elements for each parameter that requires gradients (i.e., trainable parameters).
Printing the Number of Parameters:
print("Number of parameters: %s" % n): This line prints the total number of trainable parameters in the model.
Knowing the number of parameters is useful for understanding the complexity of the model and estimating its memory requirements.
In summary, this code snippet demonstrates model instantiation, moving the model to a specified device, printing a summary of the model architecture, and calculating the total number of trainable parameters in the model. These steps are common practices when working with deep learning models in PyTorch.
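Collected together, the instantiation and parameter counting described above:
```
model = M5(n_input=transformed.shape[0], n_output=len(labels))
model.to(device)
print(model)


def count_parameters(model):
    # Sum the element counts of all parameters that require gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


n = count_parameters(model)
print("Number of parameters: %s" % n)
```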