Datasets
Each Mammoth dataset defines a complete and separate Continual Learning benchmark. This means that each dataset must statically define all the information necessary to run a continual learning experiment, including:
Required properties
- Name of the dataset: NAME attribute (str).
- Incremental setting (class-il, domain-il, or general-continual): SETTING attribute (str). See more in section Experimental settings.
- Size of the input data: SIZE attribute (tuple[int]).
Required properties for class-il and domain-il settings
- Number of tasks: TASKS attribute (int).
- Number of classes per task: N_CLASSES_PER_TASK attribute (int|tuple[int]). This can be a single integer, or a sequence of integers with one entry per task (class-il setting only).
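To illustrate how the two forms of N_CLASSES_PER_TASK can be read, the following minimal sketch expands either form into per-task class ranges. The helper name task_class_ranges is hypothetical and not part of Mammoth; it only mirrors the description above.

```python
from typing import List, Sequence, Tuple, Union


def task_class_ranges(n_classes_per_task: Union[int, Sequence[int]],
                      n_tasks: int) -> List[Tuple[int, int]]:
    """Return one half-open (start, end) class range per task.

    Hypothetical helper for illustration; not a Mammoth function.
    """
    if isinstance(n_classes_per_task, int):
        # A single integer means every task has the same number of classes.
        sizes = [n_classes_per_task] * n_tasks
    else:
        # A sequence gives the number of classes of each task explicitly.
        sizes = list(n_classes_per_task)
        if len(sizes) != n_tasks:
            raise ValueError("expected one entry per task")

    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges


uniform = task_class_ranges(2, 5)            # same size for every task
uneven = task_class_ranges((50, 10, 10), 3)  # one size per task
print(uniform)
print(uneven)
```

With the tuple form, the first task can be larger than the following ones, which is why this form only makes sense when classes are added task by task.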
Required methods for all settings
- get_epochs static method (int): returns the number of epochs for each task. This method is optional only for datasets that follow the general-continual setting.
- get_batch_size static method (int): returns the batch size for each task.
- get_data_loaders static method ([DataLoader, DataLoader]): returns the train and test data loaders for each task. See more in Utils.
- get_backbone static method (nn.Module): returns the backbone model for the experiment. Backbones are defined in the backbones folder. See more in Backbones.
- get_transform static method (callable): returns the data-augmentation transform to apply to the data during training.
- get_loss static method (callable): returns the loss function to use during training.
- get_normalization_transform static method (callable): returns the normalization transform to apply on torch tensors (no ToTensor() required).
- get_denormalization_transform static method (callable): returns the transform to apply on the tensors to revert the normalization. You can use the DeNormalize function defined in datasets/transforms/denormalization.py.
See continual_dataset for more details or SequentialCIFAR10 in seq_cifar10 for an example.
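As a rough picture of what "statically define" means here, the sketch below declares the attributes and two of the methods listed above on a plain Python class and checks that none is missing. It deliberately does not import Mammoth: ToyDataset, REQUIRED_NAMES and missing_members are hypothetical names, the values are placeholders, and a real dataset would inherit from ContinualDataset and implement the remaining methods as well.

```python
# Subset of the required members, used only for this illustration.
REQUIRED_NAMES = (
    "NAME", "SETTING", "SIZE", "TASKS", "N_CLASSES_PER_TASK",
    "get_epochs", "get_batch_size",
)


class ToyDataset:
    """Stand-in for a dataset class; a real one would inherit from ContinualDataset."""
    NAME = "toy-dataset"       # placeholder value
    SETTING = "class-il"
    SIZE = (32, 32)            # placeholder input size
    TASKS = 5
    N_CLASSES_PER_TASK = 2

    @staticmethod
    def get_epochs() -> int:
        return 1               # placeholder value

    @staticmethod
    def get_batch_size() -> int:
        return 32              # placeholder value


def missing_members(cls) -> list:
    """Return the required names that the class does not define."""
    return [name for name in REQUIRED_NAMES if not hasattr(cls, name)]


missing = missing_members(ToyDataset)
total_classes = ToyDataset.TASKS * ToyDataset.N_CLASSES_PER_TASK
print(missing, total_classes)
```

Because everything is defined at class level, the benchmark can be inspected without instantiating the dataset or downloading any data.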
Note
Datasets are downloaded by default in the data folder. You can change this default location by setting the base_path function in conf.
Experimental settings
Experimental settings follow and extend the notation of Three Scenarios for Continual Learning, and are defined in the SETTING attribute of each dataset. The following settings are available:
- class-il: the total number of classes increases at each task, following the N_CLASSES_PER_TASK attribute.
On task-il and class-il
With this setting, metrics are computed for both class-il and task-il. Metrics for task-il are computed by masking the correct task for each sample during inference. This makes it possible to compute metrics for both settings without running the experiment twice.
- domain-il: the total number of classes is fixed, but the distribution of the input data changes at each task.
- general-continual: the distribution of the classes changes gradually over time, without any notion of task boundaries. In this setting, the TASKS and N_CLASSES_PER_TASK attributes are ignored, as there is only a single long task that changes over time.
- cssl: this setting is the same as class-il, but with some of the labels missing due to limited supervision. It simulates the case where a percentage of the labels is not available for training. For example, if --label_perc is set to 0.5, only 50% of the labels will be available for training. The remaining 50% will be masked with a label of -1 and ignored during training if the currently used method does not support partial labels (check out the COMPATIBILITY attribute in Models).
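The difference between class-il and task-il evaluation mentioned in the note above can be pictured as follows. This is a minimal sketch of one common way to mask outputs by task, assuming an equal number of classes per task; the function names are hypothetical and the code is not claimed to match Mammoth's implementation.

```python
import math
from typing import List


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def predict_class_il(logits: List[float]) -> int:
    # Class-il: choose among all the classes.
    return argmax(logits)


def predict_task_il(logits: List[float], task_id: int,
                    n_classes_per_task: int) -> int:
    # Task-il: keep only the outputs of the sample's task, mask the rest.
    start = task_id * n_classes_per_task
    end = start + n_classes_per_task
    masked = [v if start <= i < end else -math.inf
              for i, v in enumerate(logits)]
    return argmax(masked)


# Four classes split into two tasks of two classes each.
logits = [0.1, 0.3, 2.0, 0.5]
cil_pred = predict_class_il(logits)
til_pred = predict_task_il(logits, task_id=0, n_classes_per_task=2)
print(cil_pred, til_pred)
```

The same forward pass yields both predictions, which is what allows the two sets of metrics to come from a single run.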
Experiments on the joint setting
Mammoth datasets support the joint setting, which is a special case of the class-il setting where all the classes are available at each task. This is useful to compare the performance of a method against what is usually considered the upper bound for the class-il setting. To run an experiment on the joint setting, set the --joint argument to 1. This will automatically set the N_CLASSES_PER_TASK attribute to the total number of classes in the dataset and the TASKS attribute to 1.
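The effect of the joint setting on the two attributes amounts to simple arithmetic, sketched below. The function joint_configuration is a hypothetical helper written for this example and only reproduces the rule stated above.

```python
from typing import Sequence, Tuple, Union


def joint_configuration(tasks: int,
                        n_classes_per_task: Union[int, Sequence[int]]
                        ) -> Tuple[int, int]:
    """Return (TASKS, N_CLASSES_PER_TASK) as described for the joint setting."""
    if isinstance(n_classes_per_task, int):
        total = tasks * n_classes_per_task
    else:
        total = sum(n_classes_per_task)
    # A single task that contains every class of the dataset.
    return 1, total


joint_tasks, joint_classes = joint_configuration(tasks=5, n_classes_per_task=2)
print(joint_tasks, joint_classes)
```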
Steps to create a new dataset
All datasets must inherit from the ContinualDataset class, which is defined in continual_dataset. The only exceptions are datasets that follow the general-continual setting, which inherit from the GCLDataset class (defined in gcl_dataset). These classes provide some useful methods to create data loaders and store masked data loaders for continual learning experiments. See more in section Utils.
- Create a new file in the datasets folder, e.g. my_dataset.py.
- Define a new class that inherits from ContinualDataset or GCLDataset and implements all the required methods and attributes.
- Define the get_data_loaders method, which returns a list of train and test data loaders for each task (see more in section Utils).
Tip
For convenience, most datasets are initially created with all classes and then masked appropriately by the store_masked_loaders function. For example, in seq_cifar10 the get_data_loaders function of the SequentialCIFAR10 dataset first initializes the MyCIFAR10 and TCIFAR10 datasets with train and test data for all classes respectively, and then masks the data loaders to return only the data for the current task.
Important
The train data loader must return both augmented and non-augmented data. This is done to allow the storage of raw data for replay-based methods (for more information, check out Rethinking Experience Replay: a Bag of Tricks for Continual Learning). The train data loader should return (augmented_data, labels, non_augmented_data), while the test data loader should return (data, labels).
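The three-element return of the train loader can be sketched with a plain map-style dataset. The example below uses nested lists instead of tensors so that it runs without extra dependencies; ToyTrainDataset and horizontal_flip are hypothetical names, and the real datasets (such as MyCIFAR10) may handle the non-augmented sample differently.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]  # toy stand-in for an image tensor


def horizontal_flip(image: Image) -> Image:
    """Toy augmentation: reverse every row of the image."""
    return [list(reversed(row)) for row in image]


class ToyTrainDataset:
    """Map-style dataset returning (augmented_data, label, non_augmented_data)."""

    def __init__(self, data: Sequence[Image], targets: Sequence[int],
                 transform: Callable[[Image], Image]) -> None:
        self.data = data
        self.targets = targets
        self.transform = transform

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> Tuple[Image, int, Image]:
        original = self.data[index]
        augmented = self.transform(original)
        # The raw sample is returned too, so that it can be stored for replay.
        return augmented, self.targets[index], original


train_set = ToyTrainDataset(data=[[[1.0, 2.0], [3.0, 4.0]]], targets=[7],
                            transform=horizontal_flip)
augmented, label, not_augmented = train_set[0]
print(augmented, label, not_augmented)
```

Returning the untouched sample alongside the augmented one lets a replay buffer store raw data and apply a fresh augmentation each time the sample is replayed.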
If all goes well, your dataset should be picked up by the get_dataset function and you should be able to run an experiment with it.
Utils
get_data_loaders: This function should take care of downloading the dataset if necessary, make sure that it contains samples and labels for only the current task (you can use the store_masked_loaders function), and create the data loaders.
store_masked_loaders: This function is defined in continual_dataset and takes care of masking the data loaders to return only the data for the current task.
- It is used by most datasets to create the data loaders for each task.
- If the --permute_classes flag is set to 1, it also applies the appropriate permutation to the classes before splitting the data.
- If the --label_perc argument is set to a value between 0 and 1, it also randomly masks a percentage of the labels for each task.
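The two masking steps described above (keeping only the current task's classes, and hiding part of the labels) can be sketched on plain lists. This is a simplified illustration under the assumption of equally sized tasks; select_task_samples and mask_labels are hypothetical helpers, not the code of store_masked_loaders.

```python
import random
from typing import List, Sequence


def select_task_samples(targets: Sequence[int], task_id: int,
                        n_classes_per_task: int) -> List[int]:
    """Return the indices of the samples whose class belongs to the task."""
    start = task_id * n_classes_per_task
    end = start + n_classes_per_task
    return [i for i, y in enumerate(targets) if start <= y < end]


def mask_labels(targets: Sequence[int], label_perc: float,
                seed: int = 0) -> List[int]:
    """Keep a label_perc fraction of the labels (rounded); set the others to -1."""
    rng = random.Random(seed)
    n_keep = round(len(targets) * label_perc)
    keep = set(rng.sample(range(len(targets)), n_keep))
    return [y if i in keep else -1 for i, y in enumerate(targets)]


all_targets = [0, 1, 2, 3, 0, 1, 2, 3]
# The second task (task_id=1) contains classes 2 and 3.
task1_indices = select_task_samples(all_targets, task_id=1, n_classes_per_task=2)
task1_targets = [all_targets[i] for i in task1_indices]
partially_labelled = mask_labels(task1_targets, label_perc=0.5)
print(task1_indices, task1_targets, partially_labelled)
```

Which labels end up masked depends on the random seed; only their number is fixed by label_perc.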
Module attributes and functions
- datasets.get_all_datasets()
Returns the list of all the available datasets in the datasets folder.
- datasets.get_dataset(args)
Creates and returns a continual dataset among those that are available. If an error was detected while loading the available datasets, it raises the appropriate error message.
- Parameters:
args (Namespace) – the arguments, which contain the hyperparameters
- Exceptions:
AssertionError: if the dataset is not available
Exception: if an error is detected in the dataset
- Returns:
the continual dataset instance
- Return type:
- datasets.get_dataset_class(args)
Return the class of the selected continual dataset among those that are available. If an error was detected while loading the available datasets, it raises the appropriate error message.
- Parameters:
args (Namespace) – the arguments, which contain the --dataset attribute
- Exceptions:
AssertionError: if the dataset is not available
Exception: if an error is detected in the dataset
- Returns:
the continual dataset class
- Return type:
- datasets.get_dataset_names()
Returns the names of the continual datasets that are available. If an error was detected while loading the available datasets, it raises the appropriate error message.
- Parameters:
args (Namespace) – the arguments, which contain the --dataset attribute
- Exceptions:
AssertionError: if the dataset is not available
Exception: if an error is detected in the dataset
- Returns:
the continual dataset class names
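The relationship between these helpers can be pictured with a small name-to-class registry. The sketch below is an illustration built on assumptions, not Mammoth's code: REGISTRY, ToySeqDataset and the functions ending in _sketch are hypothetical, and the real functions discover the datasets from the datasets folder rather than from a hand-written dictionary.

```python
from argparse import Namespace


class ToySeqDataset:
    """Hypothetical dataset class used only for this illustration."""
    NAME = "toy-seq"

    def __init__(self, args: Namespace) -> None:
        self.args = args


# Hypothetical registry mapping dataset names to dataset classes.
REGISTRY = {ToySeqDataset.NAME: ToySeqDataset}


def get_dataset_names_sketch() -> list:
    """List the names of the registered datasets."""
    return sorted(REGISTRY)


def get_dataset_class_sketch(args: Namespace) -> type:
    """Look up the class selected by args.dataset."""
    # Fail early, with a readable message, when the name is unknown.
    assert args.dataset in REGISTRY, f"unknown dataset: {args.dataset}"
    return REGISTRY[args.dataset]


def get_dataset_sketch(args: Namespace):
    """Instantiate the class selected by args.dataset."""
    return get_dataset_class_sketch(args)(args)


args = Namespace(dataset="toy-seq")
dataset = get_dataset_sketch(args)
names = get_dataset_names_sketch()
print(names, type(dataset).__name__)
```

Separating the class lookup from the instantiation allows static attributes such as NAME or SETTING to be read without building the dataset.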