`Dataset` is a class exposing two methods:

- `__len__()`, which returns the number of instances in the dataset (as an `int`);
- `__getitem__(idx)`, which returns the data point at index `idx`. This method is usually invoked through the `x, y = dataset[idx]` syntax.

A data point is typically loaded (and preprocessed) only when `__getitem__(idx)` is called. The way those things are managed is specific to each dataset implementation.
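The two methods above can be sketched with a minimal custom dataset. The class name `SquaresDataset` and its content are purely illustrative, not part of any library:

```python
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """A toy dataset: the i-th data point is the pair (i, i**2).

    Each data point is computed lazily, inside __getitem__: nothing is
    loaded or created until an index is actually requested.
    """

    def __init__(self, n_items):
        self.n_items = n_items

    def __len__(self):
        # Number of instances in the dataset (an int).
        return self.n_items

    def __getitem__(self, idx):
        if not 0 <= idx < self.n_items:
            raise IndexError(idx)
        # The (x, y) pair is produced on demand.
        return idx, idx ** 2


dataset = SquaresDataset(5)
x, y = dataset[3]  # same as calling dataset.__getitem__(3)
```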
PyTorch (in the `torch.utils.data` module) offers some common `Dataset` implementations:

- `Dataset`: an interface defining the `__len__` and `__getitem__` methods described above.
- `TensorDataset`: instantiated by passing X and Y tensors. Each row of the X and Y tensors is interpreted as a data point. The `__getitem__(idx)` method simply returns the `idx`-th row of the X and Y tensors.
- `ConcatDataset`: instantiated by passing a list of datasets. The resulting dataset is the concatenation of those datasets.
- `Subset`: instantiated by passing a dataset and a list of indices. The resulting dataset only contains the data points described by that list of indices.
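The three concrete classes above can be exercised in a few lines. The tensor values here are arbitrary, chosen only to make the row indexing visible:

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, Subset

# TensorDataset: each row of X and Y is one data point.
X = torch.arange(8, dtype=torch.float32).reshape(4, 2)  # 4 points, 2 features
Y = torch.tensor([0, 1, 0, 1])
tensor_ds = TensorDataset(X, Y)
x, y = tensor_ds[2]  # third row of X and Y

# ConcatDataset: the concatenation of the given datasets.
concat_ds = ConcatDataset([tensor_ds, tensor_ds])  # 8 data points

# Subset: only the data points at the given indices.
subset_ds = Subset(tensor_ds, [0, 3])  # 2 data points
```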
Many datasets accept a `transformation` function to be passed to the dataset constructor. Support for transformations is not mandatory for a dataset, but it is quite common. The transformation is used to process the X value of a data point before returning it. This is used to normalize values, apply augmentations, etcetera. The `AvalancheDataset` class implements a very rich and powerful set of functionalities for managing transformations.
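A sketch of the transformation pattern, assuming the `transform` constructor parameter name used by torchvision-style datasets (the class and data below are illustrative only):

```python
from torch.utils.data import Dataset


class TransformableDataset(Dataset):
    """A dataset that optionally applies a transformation to X."""

    def __init__(self, xs, ys, transform=None):
        self.xs = xs
        self.ys = ys
        self.transform = transform

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, idx):
        x, y = self.xs[idx], self.ys[idx]
        if self.transform is not None:
            # Only the X value is processed before being returned.
            x = self.transform(x)
        return x, y


# Example transformation: normalize X values into [0, 1].
ds = TransformableDataset([0.0, 5.0, 10.0], [0, 1, 2],
                          transform=lambda x: x / 10.0)
x, y = ds[1]
```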
Another kind of dataset exists in PyTorch: the `IterableDataset`. When using an `IterableDataset`, one can load the data points in a sequential way only (by using a tape-alike approach). The `__getitem__(idx)` method and the `len(dataset)` function are not allowed. Avalanche does NOT support `IterableDataset`s. You shouldn't worry about this because, realistically, you will never encounter such datasets.
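For completeness, the tape-like access pattern looks like this (the `StreamDataset` class is an illustrative sketch, not a library class):

```python
from torch.utils.data import IterableDataset


class StreamDataset(IterableDataset):
    """Data points can only be read sequentially, like from a tape:
    there is no __getitem__(idx) and no __len__()."""

    def __init__(self, n_items):
        self.n_items = n_items

    def __iter__(self):
        for i in range(self.n_items):
            yield i, i * 2


stream = StreamDataset(3)
points = list(stream)  # sequential access is the only option
```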
A `Dataset` is a very simple object that only returns one data point given its index. In order to create minibatches and speed up the data loading process, a `DataLoader` should be used. The `DataLoader` class is a very efficient mechanism that, given a `Dataset`, will return minibatches by optionally shuffling data before each epoch and by loading data in parallel using multiple workers.
In the following example we create a `TensorDataset` and then we load it in minibatches using a `DataLoader`.
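A minimal sketch of that pattern, with arbitrary random data standing in for a real dataset:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(10, 3)           # 10 data points, 3 features each
Y = torch.randint(0, 2, (10,))   # one binary label per data point
dataset = TensorDataset(X, Y)

# shuffle=True reshuffles the data before each epoch; num_workers > 0
# would load data in parallel worker processes (0 keeps it in-process).
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

for batch_x, batch_y in loader:
    # batch_x has shape [batch_size, 3], batch_y has shape [batch_size];
    # the last batch holds the 2 leftover points (10 = 4 + 4 + 2).
    pass
```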