avalanche-datasets
Converting PyTorch Datasets to Avalanche Dataset
Datasets are a fundamental data structure for continual learning. Unlike offline training, in continual learning we often need to manipulate datasets to create streams, benchmarks, or to manage replay buffers. High-level utilities and predefined benchmarks already take care of the details for you, but you can easily manipulate the data yourself if you need to. These how-to will explain:
PyTorch datasets and data loading
How to instantiate Avalanche Datasets
AvalancheDataset features
In Avalanche, the AvalancheDataset
is everywhere:
The dataset carried by the
experience.dataset
field is always an AvalancheDataset.Many benchmark creation functions accept AvalancheDatasets to create benchmarks.
Avalanche benchmarks are created by manipulating AvalancheDatasets.
Replay buffers also use
AvalancheDataset
to easily concanate data and handle transformations.
📚 PyTorch Dataset: general definition
In PyTorch, a Dataset
is a class exposing two methods:
__len__()
, which returns the amount of instances in the dataset (as anint
).__getitem__(idx)
, which returns the data point at indexidx
.
In other words, a Dataset instance is just an object for which, similarly to a list, one can simply:
Obtain its length using the Python
len(dataset)
function.Obtain a single data point using the
x, y = dataset[idx]
syntax.
The content of the dataset can be either loaded in memory when the dataset is instantiated (like the torchvision MNIST dataset does) or, for big datasets like ImageNet, the content is kept on disk, with the dataset keeping the list of files in an internal field. In this case, data is loaded from the storage on-the-fly when __getitem__(idx)
is called. The way those things are managed is specific to each dataset implementation.
Quick note on the IterableDataset class
How to Create an AvalancheDataset
To create an AvalancheDataset
from a PyTorch you only need to pass the original data to the constructor as follows
The dataset is equivalent to the original one:
Classification Datasets
Classification dataset
returns triplets of the form <x, y, t>, where t is the task label (which defaults to 0).
The wrapped dataset must contain a valid targets field.
Avalanche provides some utility functions to create supervised classification datasets such as:
make_tensor_classification_dataset
for tensor datasets all of these will automatically create thetargets
andtargets_task_labels
attributes.
DataLoader
Dataset Operations: Concatenation and SubSampling
While PyTorch provides two different classes for concatenation and subsampling (ConcatDataset
and Subset
), Avalanche implements them as dataset methods. These operations return a new dataset, leaving the original one unchanged.
Dataset Attributes
AvalancheDataset allows to add attributes to datasets. Attributes are named arrays that carry some information that is propagated by concatenation and subsampling operations. For example, classification datasets use this functionality to manage class and task labels.
Thanks to DataAttribute
s, you can freely operate on your data (e.g. to manage a replay buffer) without losing class or task labels. This makes it easy to manage multi-task datasets or to balance datasets by class.
Transformations
Most datasets from the torchvision libraries (as well as datasets found "in the wild") allow for a transformation
function to be passed to the dataset constructor. The support for transformations is not mandatory for a dataset, but it is quite common to support them. The transformation is used to process the X value of a data point before returning it. This is used to normalize values, apply augmentations, etcetera.
Next steps
With these notions in mind, you can start start your journey on understanding the functionalities offered by the AvalancheDatasets by going through the Mini How-Tos.
🤝 Run it on Google Colab
Was this helpful?