Preamble: PyTorch Datasets
A few words about PyTorch Datasets
This short preamble will briefly go through the basic notions of Dataset offered natively by PyTorch. A solid grasp of these notions is needed to understand:
How PyTorch data loading works in general
How AvalancheDatasets differ from PyTorch Datasets
📚 Dataset: general definition
In PyTorch, a Dataset is a class exposing two methods:
__len__(), which returns the number of instances in the dataset (as an int).
__getitem__(idx), which returns the data point at index idx.
In other words, a Dataset instance is just an object for which, similarly to a list, one can:
Obtain its length using the Python len(dataset) function.
Obtain a single data point using the x, y = dataset[idx] syntax.
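For instance, a minimal map-style dataset can be written in a few lines. The following sketch (the class name and the toy (x, x**2) data are made up for illustration) implements both methods:

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """A toy dataset returning (x, x**2) pairs."""

    def __init__(self, size: int):
        self.size = size

    def __len__(self):
        # Number of instances in the dataset.
        return self.size

    def __getitem__(self, idx):
        # Return the data point at index idx.
        x = torch.tensor(float(idx))
        return x, x ** 2


dataset = SquaresDataset(10)
print(len(dataset))  # 10
x, y = dataset[3]    # x = 3.0, y = 9.0
```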
The content of the dataset can be either loaded in memory when the dataset is instantiated (like the torchvision MNIST dataset does) or, for big datasets like ImageNet, the content is kept on disk, with the dataset keeping the list of files in an internal field. In this case, data is loaded from the storage on-the-fly when __getitem__(idx)
is called. The way those things are managed is specific to each dataset implementation.
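The on-disk pattern can be sketched as follows; the folder layout, file extension, and class name below are hypothetical:

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class FolderImageDataset(Dataset):
    def __init__(self, root):
        # Only the list of file paths is kept in memory.
        self.files = sorted(Path(root).glob("*.png"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # The actual image content is loaded from storage on-the-fly.
        return Image.open(self.files[idx])
```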
PyTorch Datasets
The PyTorch library offers 4 Dataset implementations:
Dataset: an interface defining the __len__ and __getitem__ methods.
TensorDataset: instantiated by passing X and Y tensors. Each row of the X and Y tensors is interpreted as a data point. The __getitem__(idx) method will simply return the idx-th row of the X and Y tensors.
ConcatDataset: instantiated by passing a list of datasets. The resulting dataset is a concatenation of those datasets.
Subset: instantiated by passing a dataset and a list of indices. The resulting dataset will only contain the data points described by that list of indices.
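A quick sketch of the three concrete implementations in action (the tensor shapes and index choices below are arbitrary):

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, Subset

# 100 data points: each X row has 3 features, each Y entry is a class label.
x = torch.randn(100, 3)
y = torch.randint(0, 2, (100,))

whole = TensorDataset(x, y)                 # whole[idx] -> (x[idx], y[idx])
doubled = ConcatDataset([whole, whole])     # concatenation of two datasets
first_ten = Subset(whole, list(range(10)))  # only the points at indices 0..9

print(len(whole), len(doubled), len(first_ten))  # 100 200 10
```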
As explained in the mini How-Tos, Avalanche offers a customized version of each of these 4 datasets.
Transformations
Most datasets from the torchvision library (as well as datasets found "in the wild") allow a transformation function to be passed to the dataset constructor. Support for transformations is not mandatory for a dataset, but it is quite common. The transformation is used to process the X value of a data point before returning it. This is used to normalize values, apply augmentations, etc.
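For instance, here is a sketch of how a transformation is typically passed to the torchvision MNIST dataset; the normalization values below are the commonly used MNIST mean and standard deviation:

```python
from torchvision import transforms
from torchvision.datasets import MNIST

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# The transform is applied to each image when __getitem__ is called.
mnist = MNIST(root="./data", train=True, download=True, transform=transform)
x, y = mnist[0]  # x is a normalized 1x28x28 tensor, y is an int label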
As explained in the mini How-Tos, the AvalancheDataset class implements a very rich and powerful set of functionalities for managing transformations.
Quick note on the IterableDataset class
A variation of the standard Dataset exists in PyTorch: the IterableDataset. When using an IterableDataset, one can only load the data points sequentially (using a tape-like approach). The dataset[idx] syntax and the len(dataset) function are not allowed. Avalanche does NOT support IterableDatasets. You shouldn't worry about this because, realistically, you will never encounter such datasets.
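To make the difference concrete, here is a minimal sketch of an IterableDataset (the class is made up for illustration); data points can only be obtained by iterating, not by indexing:

```python
from torch.utils.data import IterableDataset


class CounterStream(IterableDataset):
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # Data is produced sequentially; there is no __getitem__.
        return iter(range(self.n))


stream = CounterStream(5)
for x in stream:
    print(x)  # 0, 1, 2, 3, 4
# stream[2] and len(stream) would both raise an error.
```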
DataLoader
The Dataset is a very simple object that only returns one data point given its index. In order to create minibatches and speed up the data loading process, a DataLoader is required.
The PyTorch DataLoader class is a very efficient mechanism that, given a Dataset, will return minibatches by optionally shuffling the data before each epoch and by loading data in parallel using multiple workers.
Preamble wrap-up
To wrap up, let's see how the native, non-Avalanche, PyTorch components work in practice. In the following code we create a TensorDataset and then load it in minibatches using a DataLoader.
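A minimal sketch of such an example (the tensor sizes, batch size, and worker count are arbitrary choices):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 32 instances: each X row has 3 features, each Y entry is an integer label.
x_data = torch.rand(32, 3)
y_data = torch.randint(0, 10, (32,))

dataset = TensorDataset(x_data, y_data)

loader = DataLoader(
    dataset,
    batch_size=8,   # minibatches of 8 instances
    shuffle=True,   # reshuffle the data before each epoch
    num_workers=2,  # load data in parallel using 2 workers
)

for x_minibatch, y_minibatch in loader:
    print(x_minibatch.shape, y_minibatch.shape)
    # torch.Size([8, 3]) torch.Size([8])
```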
Next steps
With these notions in mind, you can start your journey toward understanding the functionalities offered by AvalancheDatasets by going through the Mini How-Tos.
Please refer to the list of Mini How-Tos regarding AvalancheDatasets for the complete series. It is recommended to start with the "Creating AvalancheDatasets" Mini How-To.
🤝 Run it on Google Colab
You can run this chapter and play with it on Google Colaboratory by clicking here: