A few words about PyTorch Datasets
This short preamble will briefly go through the basic notions of Dataset offered natively by PyTorch. A solid grasp of these notions is needed to understand:
How PyTorch data loading works in general
How AvalancheDatasets differs from PyTorch Datasets
In PyTorch, a Dataset is a class exposing two methods:
__len__(), which returns the number of instances in the dataset (as an int).
__getitem__(idx), which returns the data point at index idx.
In other words, a Dataset instance is just an object for which, similarly to a list, one can simply:
Obtain its length using the Python len(dataset) function.
Obtain a single data point using the x, y = dataset[idx] syntax.
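For illustration, here is a minimal sketch of a custom Dataset implementing those two methods (the SquaresDataset class and its contents are a made-up example, not part of PyTorch):

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """A toy dataset: x is an integer, y is its square."""

    def __init__(self, n_instances):
        self.n_instances = n_instances

    def __len__(self):
        # Number of data points in the dataset.
        return self.n_instances

    def __getitem__(self, idx):
        # Return the (x, y) pair at position idx.
        x = torch.tensor(idx, dtype=torch.float32)
        return x, x ** 2

dataset = SquaresDataset(10)
print(len(dataset))  # 10
x, y = dataset[3]
print(x, y)          # tensor(3.) tensor(9.)
```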
The content of the dataset can either be loaded in memory when the dataset is instantiated (like the torchvision MNIST dataset does) or, for big datasets like ImageNet, kept on disk, with the dataset storing the list of files in an internal field. In that case, data is loaded from storage on-the-fly when __getitem__(idx) is called. The way these aspects are managed is specific to each dataset implementation.
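As a sketch of the on-disk approach (the folder layout, the .png extension, and the single shared label are illustrative assumptions):

```python
from pathlib import Path
from PIL import Image
import torchvision.transforms.functional as F
from torch.utils.data import Dataset

class FolderOfImagesDataset(Dataset):
    """Keeps only the list of file paths in memory;
    the actual images are loaded on-the-fly in __getitem__."""

    def __init__(self, root, label):
        # Only the paths are stored, not the image contents.
        self.paths = sorted(Path(root).glob("*.png"))
        self.label = label

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Data is read from storage only when requested.
        image = Image.open(self.paths[idx]).convert("RGB")
        return F.to_tensor(image), self.label
```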
The PyTorch library offers 4 Dataset implementations:
Dataset: an interface defining the __len__ and __getitem__ methods.
TensorDataset: instantiated by passing X and Y tensors. Each row of the X and Y tensors is interpreted as a data point. The __getitem__(idx) method will simply return the idx-th row of the X and Y tensors.
ConcatDataset: instantiated by passing a list of datasets. The resulting dataset is a concatenation of those datasets.
Subset: instantiated by passing a dataset and a list of indices. The resulting dataset will only contain the data points described by that list of indices.
As explained in the mini How-Tos, Avalanche offers a customized version for all these 4 datasets.
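For illustration, here is a small sketch exercising the last three implementations (the tensor shapes and indices are arbitrary):

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, Subset

# TensorDataset: each row of x and y is one data point.
x = torch.randn(100, 16)
y = torch.randint(0, 2, (100,))
tensor_dataset = TensorDataset(x, y)
print(len(tensor_dataset))   # 100
xi, yi = tensor_dataset[42]  # row 42 of x and y

# ConcatDataset: the concatenation of multiple datasets.
concat = ConcatDataset([tensor_dataset, tensor_dataset])
print(len(concat))           # 200

# Subset: only the data points at the given indices.
subset = Subset(tensor_dataset, indices=[0, 5, 10])
print(len(subset))           # 3
```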
Most datasets from the torchvision libraries (as well as datasets found "in the wild") allow for a transformation function to be passed to the dataset constructor. Support for transformations is not mandatory for a dataset, but it is quite common. The transformation is used to process the X value of a data point before returning it. This is used to normalize values, apply augmentations, etc.
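For example, with the torchvision MNIST dataset (the normalization statistics below are the values commonly used for MNIST, not something prescribed by this text):

```python
from torchvision import transforms
from torchvision.datasets import MNIST

# The transformation processes the X value before it is returned.
transform = transforms.Compose([
    transforms.ToTensor(),                       # PIL image -> tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),  # commonly used MNIST statistics
])

mnist = MNIST(root="./data", train=True, download=True, transform=transform)
x, y = mnist[0]  # x is already transformed (a normalized tensor)
```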
As explained in the mini How-Tos, the AvalancheDataset class implements a very rich and powerful set of functionalities for managing transformations.
A variation of the standard Dataset exists in PyTorch: the IterableDataset. When using an IterableDataset, one can only load the data points sequentially (using a tape-like approach). The dataset[idx] syntax and the len(dataset) function are not allowed. Avalanche does NOT support IterableDatasets. You shouldn't worry about this because, realistically, you will never encounter such datasets.
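For completeness, a minimal sketch of what an IterableDataset looks like (the class and its contents are purely illustrative):

```python
import torch
from torch.utils.data import IterableDataset

class StreamOfNumbers(IterableDataset):
    """Data points can only be read sequentially, like a tape."""

    def __init__(self, n_instances):
        self.n_instances = n_instances

    def __iter__(self):
        for i in range(self.n_instances):
            yield torch.tensor(i)

stream = StreamOfNumbers(5)
for x in stream:  # sequential access works
    print(x)
# stream[2] and len(stream) would both raise a TypeError
```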
The Dataset is a very simple object that only returns one data point given its index. In order to create minibatches and speed up the data loading process, a DataLoader is required.
The PyTorch DataLoader class is a very efficient mechanism that, given a Dataset, will return minibatches by optionally shuffling the data before each epoch and by loading data in parallel using multiple workers.
To wrap up, let's see how the native, non-Avalanche PyTorch components work in practice. In the following code, we create a TensorDataset and then load it in minibatches using a DataLoader.
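A minimal version of such code might look like this (the tensor shapes, batch size, and worker count are illustrative):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

if __name__ == "__main__":  # needed for multi-worker loading on some platforms
    # Create a TensorDataset from X and Y tensors.
    x = torch.randn(200, 3, 32, 32)   # 200 fake RGB images
    y = torch.randint(0, 10, (200,))  # one label per image
    dataset = TensorDataset(x, y)

    # Minibatches of 32, shuffled before each epoch,
    # loaded in parallel by 2 worker processes.
    loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

    for batch_x, batch_y in loader:
        print(batch_x.shape, batch_y.shape)  # torch.Size([32, 3, 32, 32]) ...
        break
```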
With these notions in mind, you can start your journey into the functionalities offered by AvalancheDatasets by going through the Mini How-Tos.
Please refer to the list of Mini How-Tos regarding AvalancheDatasets for a complete overview. It is recommended to start with the "Creating AvalancheDatasets" Mini How-To.