Elementary, Dear Data
One of the most important aspects of any machine learning problem is the data. The dp library provides the following data-related facilities:
- BaseSet : abstract class;
- DataSet : a dataset for input and target Views;
- SentenceSet : container of sentences (used for language modeling);
- TextSet : container of text (used for language modeling);
- ImageClassSet : container for large-scale image-classification datasets;
- Batch : a mini-batch of inputs and targets;
- DataSource : a container of train, valid and test DataSets;
- Mnist : the ubiquitous MNIST dataset;
- NotMnist : the lesser known NotMNIST dataset;
- FaceDetection : the Purdue face detection dataset;
- Cifar10 : the CIFAR-10 dataset;
- Cifar100 : the very difficult to generalize CIFAR-100 dataset;
- Svhn : the Google Street View House Numbers dataset;
- ImageNet : the Large Scale Visual Recognition Challenge 2014 (ILSVRC2014) dataset;
- PennTreeBank : the Penn Tree Bank language model dataset;
- BillionWords : the Google 1-Billion Words language model dataset;
- ImageSource : generic large image classification data wrapper;
- SmallImageSource : generic small image classification data wrapper;
- TextSource : generic text language model dataset wrapper;
- Sampler : ordered dataset iterator;
- ShuffleSampler : shuffled dataset iterator;
- SentenceSampler : samples sentences for recurrent models;
- RandomSampler : iterates through batches of random examples;
- TextSampler : iterates through a text (a single contiguous sequence of words) dataset;
BaseSet
This is the base (abstract) class inherited by subclasses like DataSet, SentenceSet and Batch. It is used for training or evaluating a model. It supports multiple-input and multiple-output datasets using ListView. Multiple targets are useful for multi-task learning, or for learning from hints. Multiple inputs allow richer input representations, for example images combined with tags, or text with images. Multi-input/target facilities can be used with nn.ParallelTable and nn.ConcatTable. If the BaseSet is used for unsupervised learning, only inputs need to be provided.
dp.BaseSet{inputs, [targets, which_set]}
Constructs a dataset from inputs and targets. Arguments should be specified as key-value pairs.
- inputs is an instance of View or a table of these. In the latter case, they will be automatically encapsulated by a ListView. The encapsulated Tensor is used as inputs to a model.
- targets is an instance of View or a table of these. In the latter case, they will be automatically encapsulated by a ListView. These are used as targets for training a Model. The indices of examples in targets must be aligned with those in inputs.
- which_set is a string identifying the purpose of the dataset. Valid values are:
  - train for training, i.e. for fitting a model to a dataset;
  - valid for cross-validation, i.e. for early-stopping and hyper-optimization;
  - test for testing, i.e. comparing your model to the current state-of-the-art and such.
preprocess([input_preprocess, target_preprocess, can_fit])
Preprocesses the BaseSet.
- input_preprocess is a Preprocess to be applied to the input View of the BaseSet.
- target_preprocess is a Preprocess to be applied to the target View of the BaseSet.
- can_fit is a boolean. When true, allows measuring of statistics on the View of BaseSet to initialize the Preprocess. Should normally only be done on the training set. Default is to fit the training set.
[inputs] inputs()
Returns inputs View.
[targets] targets()
Returns targets View.
DataSet
A subclass of BaseSet. Contains input and optional target Views used for training or evaluating models.
batch(batch_size)
A factory method that builds a Batch of size batch_size. It effectively calls sub with arguments start=1 and stop=batch_size. This method reuses the DataSet's inputs and targets, such that these shouldn't be modified, unless the intent is to modify the original DataSet.
sub(start, stop, [new])
A factory method that builds a Batch by calling sub with arguments start and stop on the DataSet's inputs and targets. This method reuses the DataSet's inputs and targets, such that these shouldn't be modified, unless the intent is to modify the original DataSet.
index([batch,] indices)
SentenceSet
A DataSet used for language modeling. Takes a sequence of words stored as a tensor of word IDs and a Tensor holding the start index of the sentence of its commensurate word id (the one at the same index). Unlike DataSets, for memory efficiency reasons, this class does not store its data in Views. However, the outputs of factory methods batch, sub, and index are Batches containing input and target ClassViews. The returned batch:inputs() are filled according to Google 1-Billion Words guidelines.
TextSet
A DataSet used for language modeling.
Takes a sequence of words stored as a tensor of word ids.
Contrary to SentenceSet, this wrapper assumes a continuous stream of words.
If consecutive sentences are completely unrelated, you might be better off using SentenceSet,
unless your model can learn to forget (like LSTMs).
Like SentenceSet, this class does not store its data in Views.
However, the outputs of batch(), sub() and index() are Batches containing ClassViews of inputs and targets.
ImageClassSet
A DataSet for image classification tasks stored in a flat folder structure :
[data_path]/[class]/[imagename].[JPEG,png,...]
Optimized for extremely large datasets (14 million images+). This DataSet is very memory efficient in that the images are loaded from disk into memory only when requested as a Batch. It is used to wrap the training and validation sets of the ImageNet DataSource.
When first initialized, the dataset needs to build an index of all image paths, which it encapsulates into a torch.CharTensor for efficiency. The index is built using some heavy command-line magic, but this only needs to be executed once, as the resulting index is cached to disk for the next time the dataset is used.
During queries of the dataset using sample or sub, the index is used to retrieve images from disk. This can be a major bottleneck. We strongly encourage storing your dataset on a Solid-State Drive (SSD). Furthermore, if threads-ffi is installed, the dataset can be used for asynchronous batch requests. This is implemented using multi-threading, which is necessary to speed up reading all those files.
dp.ImageClassSet{...}
ImageClassSet constructor. Arguments should be specified as key-value pairs.
- data_path is a string (or table thereof) specifying one or many paths to the data.
- load_size is a table specifying the approximate size (nChannel x Height x Width) to which the images are initially loaded.
- sample_size is a table specifying a consistent sample size to resize the images to (or crop them). Defaults to load_size.
- verbose is a boolean specifying whether or not to display verbose messages. Defaults to true.
- sample_func is a string or function f(self, dst, path) that fills the dst Tensor with one or many images taken from the image located at path. Strings "sampleDefault", "sampleTrain" or "sampleTest" can also be provided as they refer to existing methods. Defaults to sampleDefault.
- sort_func is a comparison function used for sorting the class directories. The order is used to assign each class an index. Defaults to the < operator.
- cache_mode is a string with default value "writeonce". Valid options include:
  - "writeonce" : read from cache if exists, else write to cache.
  - "overwrite" : write to cache, regardless if exists.
  - "nocache" : don't read or write from cache.
  - "readonly" : only read from cache, fail otherwise.
- cache_path is a string specifying the path of a cache file. Defaults to [data_path[1]]/cache.th7.
The DataSet constructor arguments also apply.
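The four cache_mode options can be read as a small decision table. The following is a minimal sketch in plain Lua, not the library's implementation: the function name cacheAction and its return strings are hypothetical and only mirror the descriptions given for each option.

```lua
-- Hypothetical helper (not part of dp): decides what to do with the index
-- cache given a cache_mode string and whether the cache file already exists.
function cacheAction(cacheMode, cacheExists)
   if cacheMode == 'writeonce' then
      -- read the cache when present, otherwise build the index and write it
      return cacheExists and 'read' or 'build+write'
   elseif cacheMode == 'overwrite' then
      -- always rebuild the index and write it, even if a cache exists
      return 'build+write'
   elseif cacheMode == 'nocache' then
      -- rebuild the index every time and never touch the cache file
      return 'build'
   elseif cacheMode == 'readonly' then
      -- only ever read; a missing cache is an error
      if not cacheExists then
         error('cache_mode "readonly" requires an existing cache file')
      end
      return 'read'
   end
   error('unknown cache_mode: ' .. tostring(cacheMode))
end

print(cacheAction('writeonce', false))
```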
[batch] sample([batch,] nSample, [sampleFunc])
For nSample examples, uniformly samples a class, and then uniformly samples an example from that class. This keeps the class distribution balanced. Argument sampleFunc is a function or string used for sampling patches from a loaded image (see constructor for details). Defaults to whatever was passed to the constructor. The optional batch argument, a Batch instance, is recommended for minimizing memory allocations (see sub for details). This Batch factory method is called by the RandomSampler. Note that depending on the sampleFunc, the number of returned samples may be greater than nSample (see sampleTest for an example).
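The two-stage draw (class first, then example) can be illustrated with a short plain-Lua sketch. This is not the library's code: sampleBalanced and the classIndex table layout are hypothetical names chosen for the illustration.

```lua
-- Hypothetical sketch (not dp's code): pick a class uniformly at random,
-- then pick one example uniformly from within that class.
-- classIndex maps each class id to the list of its example paths.
function sampleBalanced(classIndex, nSample)
   local samples = {}
   for i = 1, nSample do
      local class = math.random(#classIndex)        -- uniform over classes
      local examples = classIndex[class]
      local path = examples[math.random(#examples)] -- uniform within class
      samples[i] = {class = class, path = path}
   end
   return samples
end

-- Class 1 has three images and class 2 only one, yet each class is
-- drawn with probability 1/2.
local classIndex = {
   {'cat/a.jpg', 'cat/b.jpg', 'cat/c.jpg'},
   {'dog/a.jpg'},
}
for _, s in ipairs(sampleBalanced(classIndex, 4)) do
   print(s.class, s.path)
end
```

Because the class is drawn before the example, rare classes are seen as often as frequent ones, at the cost of repeating their examples more often.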
[dst] sampleDefault([dst,] path)
Loads the image located at path. The returned dst Tensor will have size sample_size.
[dst] sampleTrain([dst,] path)
Loads the image of size load_size located at path. Does a random crop of size sample_size from the loaded image and returns it as dst.
[dst] sampleTest([dst,] path)
Loads the image of size load_size located at path. Does 10 crops: the center and the 4 corners, plus their horizontal flips. Works with the TopCrop feedback.
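To make the 10 crops concrete, here is a minimal plain-Lua sketch that only enumerates crop positions; it does not touch image data and is not the library's implementation. The function name tenCropOffsets and the 0-based (x, y) convention are assumptions made for the example.

```lua
-- Hypothetical sketch: enumerate the 10 crops as (x, y, flip) records,
-- where (x, y) is the 0-based top-left corner of the crop.
function tenCropOffsets(loadW, loadH, cropW, cropH)
   local maxX, maxY = loadW - cropW, loadH - cropH
   local corners = {
      {math.floor(maxX / 2), math.floor(maxY / 2)}, -- center
      {0, 0},                                       -- top-left
      {maxX, 0},                                    -- top-right
      {0, maxY},                                    -- bottom-left
      {maxX, maxY},                                 -- bottom-right
   }
   local crops = {}
   for _, c in ipairs(corners) do
      crops[#crops + 1] = {x = c[1], y = c[2], flip = false}
      crops[#crops + 1] = {x = c[1], y = c[2], flip = true} -- mirrored copy
   end
   return crops
end

local crops = tenCropOffsets(256, 256, 224, 224)
print(#crops)  -- 10
```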
multithread([nThread, queueSize])
Uses threads-ffi to spawn a Threads pool of nThread threads communicating with the current main thread through a queue of size queueSize. Each thread will load an ImageClassSet instance using the image path index cached on disk. A Sampler or RandomSampler can then be set in async mode to query the threads for Batches asynchronously. To do this, the Samplers begin by sending queueSize batch requests to the Threads pool. After that, for each batch requested from the dataset, another request is sent to the pool using either subAsyncPut or sampleAsyncPut.
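The request pattern described above (prime the queue, then send one new request per batch consumed) can be sketched without any real threads. This plain-Lua example is an illustration only: newQueue and iterateAsync are hypothetical names, and a simple FIFO stands in for the queue shared with the Threads pool.

```lua
-- Hypothetical sketch of the request pattern only: no real threads are used.
-- A FIFO stands in for the queue shared with the thread pool.
function newQueue()
   local q = {first = 1, last = 0, items = {}}
   function q.put(item) q.last = q.last + 1; q.items[q.last] = item end
   function q.get()
      local item = q.items[q.first]
      q.items[q.first] = nil
      q.first = q.first + 1
      return item
   end
   function q.size() return q.last - q.first + 1 end
   return q
end

function iterateAsync(nBatch, queueSize)
   local queue, requested, order = newQueue(), 0, {}
   -- prime the pipeline with up to queueSize requests
   while requested < math.min(queueSize, nBatch) do
      requested = requested + 1
      queue.put(requested)
   end
   -- every batch consumed triggers one new request, keeping the queue full
   while queue.size() > 0 do
      order[#order + 1] = queue.get()
      if requested < nBatch then
         requested = requested + 1
         queue.put(requested)
      end
   end
   return order
end

print(table.concat(iterateAsync(5, 2), ' '))  -- 1 2 3 4 5
```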
subAsyncPut(batch, start, stop, callback)
Puts a sub request onto the queue such that it will be executed by one of the threads in the Threads pool. The callback is a function that will be executed on the resulting Batch once it returns to the main thread. Must be preceded by a call to multithread.
sampleAsyncPut(batch, nSample, sampleFunc, callback)
Same as subAsyncPut, but for sending a sample request.
[batch] asyncGet()
Retrieves a Batch request from the queue and returns it to the caller. The call must be preceded by a call to subAsyncPut or sampleAsyncPut.
synchronize()
Empties the queue of asynchronous requests.
Batch
A subclass of BaseSet. A mini-batch of input and target Views. The encapsulated Tensors are to be fed into a Module and Criterion. The batch of examples is usually sampled from a DataSet via a Sampler iterator by calling the DataSet's different factory methods : batch, sub, and index.
DataSource
Abstract class used to generate up to 3 DataSets : train, valid and test:
- train for training, i.e. for fitting a model to a dataset;
- valid for cross-validation, i.e. for early-stopping and hyper-optimization;
- test for testing, i.e. comparing your model to the current state-of-the-art and such.
It can also perform preprocessing using Preprocess on all DataSets by fitting only the training set.
dp.DataSource{...}
DataSource constructor. Arguments should be specified as key-value pairs.
- train_set is an optional DataSet used for training, i.e. optimizing a model to minimize a loss.
- valid_set is an optional DataSet used for cross-validation, i.e. for early-stopping and hyper-optimization.
- test_set is an optional DataSet used to evaluate generalization performance after training (e.g. to compare different models).
- input_preprocess is a Preprocess that will be applied to the inputs. Statistics are measured (fitted) on the train_set only, and then reused to preprocess all provided sets. This argument may also be provided as a list (table) of Preprocesses, in which case, they will be wrapped in the composite Pipeline Preprocess.
- target_preprocess is like input_preprocess, but for preprocessing the targets.
Note that at least one of the 3 set arguments should be specified. If you need guidance to build your own DataSource, the Facial Keypoint Tutorial also includes a section demonstrating how a DataSource can be built to wrap facial keypoint detection data.
[tensor, dataview, dataset] get(which_set, attribute, view, type)
This method simplifies access to tensors. This is best demonstrated with an example. Say you want to access the input tensor of the training set, you can call :
tensor = ds:trainSet():inputs():forward('default')
That is a lot of function calls. You can use the get method instead:
tensor = ds:get('train', 'input', 'default')
These are also the default arguments, so the above are equivalent to :
tensor = ds:get()
All arguments are optional strings :
* which_set specifies which DataSet : train, valid or test. Defaults to train;
* attribute specifies which attribute of the DataSet : input or target. Defaults to input;
* view specifies the axis order of the tensor to get : bwc, bchw, b, etc. Defaults to default. See Views;
* type specifies the type of the Tensor to get : float, torch.FloatTensor, Float, cuda, etc.
[dataview, dataset] set(which_set, attribute, view, tensor)
This method allows for setting one of the encapsulated Tensors. Say you want to set the input tensor of the training set, you can call :
ds:trainSet():inputs():forward('bf', torch.randn(3,4))
That is a lot of function calls. You can use the set method instead:
ds:set('train', 'input', 'bf', torch.randn(3,4))
All arguments are mandatory :
* which_set specifies which DataSet : train, valid or test;
* attribute specifies which attribute of the DataSet : input or target;
* view specifies the axis order of the tensor to be encapsulated in a View : bwc, bchw, b, etc;
* tensor is the Tensor that you want to encapsulate.
preprocess()
If they exist, applies the input_preprocess and target_preprocess Preprocess attributes specified in the constructor or via the set[Input,Target]Preprocess methods to the inputs and targets, respectively. Statistics are measured (fitted) on the train_set only, and then reused to preprocess all contained DataSets. This method is invoked by the constructor.
[path] getDataPath{name, url, data_dir, decompress_file}
A static function (not to be called via the method operator :) that looks for a file data_dir/name/decompress_file and, if it is missing, downloads it. Returns the path to the resulting data file.
- name is a string specifying the name of the DataSource (e.g. "Mnist", "BillionWords", etc). It is also the name of the directory where the file should be located: a directory with this name is created within data_dir to contain the downloaded files, or the data files are expected to be found in it.
- url is a string specifying the URL from which data can be downloaded in case it is not found in the path.
- data_dir is a string specifying the path to the directory containing directory name, which is expected to contain the data, or where it will be downloaded.
- decompress_file is a string that, when non-nil, causes the downloaded data to be decompressed if data_dir/name/decompress_file is not found. In that case, data_dir/name/decompress_file is returned.
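The lookup half of this function can be sketched in plain Lua. This is an illustration under stated assumptions, not the library's code: resolveDataPath is a hypothetical name, the directory and file names in the usage are made up, and the download and decompression steps are deliberately left out.

```lua
-- Hypothetical sketch (not dp's code): resolve data_dir/name/decompress_file
-- and report whether the file is already on disk. Downloading from url and
-- decompressing are deliberately left out.
function resolveDataPath(args)
   local path = table.concat(
      {args.data_dir, args.name, args.decompress_file}, '/'
   )
   local file = io.open(path, 'r')
   if file then
      file:close()
      return path, true    -- found locally, nothing to download
   end
   return path, false      -- caller would download args.url, then decompress
end

local path, found = resolveDataPath{
   name = 'Mnist',
   data_dir = '/tmp/dp-data-example',   -- assumed location, for illustration
   decompress_file = 'mnist.t7',        -- assumed file name, for illustration
}
print(path, found)
```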
TextSource
A generic text language model dataset wrapper. Creates a DataSource out of 1 to 3 strings or text files. Text files are assumed to be arranged one sentence per line, each line beginning with a space and ending with a space and a newline.
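The expected line layout (leading space, words separated by spaces, trailing space, newline) is easy to get wrong when preparing files by hand. The plain-Lua sketch below shows one way to produce and read such lines; formatSentence and splitWords are hypothetical helpers, not functions provided by the library.

```lua
-- Hypothetical sketch: write sentences in the layout described above
-- (leading space, trailing space, newline) and read the words back.
function formatSentence(words)
   return ' ' .. table.concat(words, ' ') .. ' \n'
end

function splitWords(line)
   local words = {}
   for word in line:gmatch('%S+') do
      words[#words + 1] = word
   end
   return words
end

local line = formatSentence({'the', 'cat', 'sat'})
io.write(line)                    -- " the cat sat " followed by a newline
print(#splitWords(line))          -- 3
```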
ImageSource
A generic version of the ImageNet DataSource used for wrapping large image classification datasets. Each class is a directory of image files. The images can be asynchronously loaded into memory in one or many different threads to speed up training.
SmallImageSource
A generic DataSource used for wrapping small image classification datasets. Each class is a directory of image files. The loaded input (images) and target (classes) tensors are cached to disk the first time the constructor is called. The next time it is called, these cached files will be loaded instead of the directory of images, thereby speeding up the loading process.
Mnist
A DataSource subclass wrapping the simple but widely used handwritten digits classification problem (see MNIST). The images are of size 28x28x1. The classes are : 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
NotMnist
A DataSource subclass wrapping the much larger alternative to MNIST: NotMNIST. If not found on the local machine, the object downloads the dataset from the original source. It contains 500k+ examples of 10 characters using unicode fonts: A,B,C,D,E,F,G,H,I,J. Like Mnist, the images are of size 28x28x1.
Cifar10
A DataSource subclass wrapping the CIFAR-10 dataset. It is a 3x32x32 color-image set of 10 different objects. The small dataset size makes it hard to generalize from train to test set (Regime : overfitting).
Cifar100
A DataSource subclass wrapping the CIFAR-100 dataset. It is a 3x32x32 color-image set of 100 different objects. The small dataset size (even fewer images per class than CIFAR-10) makes it hard to generalize from train to test set (Regime : overfitting).
Svhn
The Google Street View House Numbers (SVHN) DataSource wraps the original source. If not found on the local machine, the object downloads the dataset from nikopia.com. It contains 73257 digits for training, 26032 digits for testing, and 531131 additional, somewhat less difficult samples, to use as extra training data. Like CIFAR, the images are of size 3x32x32.
FaceDetection
A Purdue face detection dataset used in the data tutorial.
ImageNet
Ref.: A. http://image-net.org/challenges/LSVRC/2014/download-images-5jj5.php
This DataSource wraps the Large Scale Visual Recognition Challenge 2014 (ILSVRC2014) image classification dataset (commonly known as ImageNet). The dataset hasn't changed from 2012-2014.
Requirements
Due to its size, the data first needs to be prepared offline. Use downloadimagenet.lua to download and extract the data :
th downloadimagenet.lua --savePath '/path/to/diskspace/ImageNet'
The entire download and extraction process requires about 360 GB of disk space. This can be reduced to about 150 GB if the training set is downloaded and extracted first, and all the .tar files are manually deleted. Repeat for the validation set, devkit and metadata. If you still don't have enough space in one partition, you can divide the data among different partitions. We recommend a good internet connection (>60Mbs download) and a good Solid-State Drive (SSD).
Use harmonizeimagenet.lua to harmonize the train and validation sets:
th scripts/harmonizeimagenet.lua --dataPath /path/to/diskspace/ImageNet --progress --forReal
The sets will then contain a directory of images for each class with name class[id], where [id] is a class index, between 1 and 1000, used for the ILSVRC2014 competition.
Then we need to install graphicsmagick and torchx:
sudo luarocks install graphicsmagick
sudo luarocks install torchx
Memory Efficiency
Unlike most DataSources, ImageNet doesn't read all images into memory when it is first loaded. Instead it uses ImageClassSet to encapsulate the different datasets; in this case the train and valid sets. The ImageClassSet builds a list of all images and indexes them per class. In this way, each Batch is only loaded from disk and created when requested from a Sampler, making it very memory efficient. This is also the reason why we recommend storing the dataset on SSD.
Sampling
As in the famous (Krizhevsky et al. 2012) paper, the ImageNet training dataset samples random 224x224 patches cropped from images that have been resized so that the smallest dimension has size 256. As for the validation set, ten 224x224 patches are cropped per image, i.e. center, four corners and their horizontal flips, and their predictions are averaged.
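The geometry of the training-time sampling can be worked through in a few lines of plain Lua. This sketch only computes sizes and crop origins and is not how the library manipulates images: resizeSmallestSide and randomCropOrigin are hypothetical names, and rounding to the nearest pixel is an assumption.

```lua
-- Hypothetical sketch: scale (width, height) so the smaller side equals
-- `target`, then draw a random origin for a crop x crop patch.
function resizeSmallestSide(width, height, target)
   local scale = target / math.min(width, height)
   -- round to the nearest pixel (an assumption of this sketch)
   return math.floor(width * scale + 0.5), math.floor(height * scale + 0.5)
end

function randomCropOrigin(width, height, crop)
   -- 0-based top-left corner, chosen uniformly over all valid positions
   return math.random(0, width - crop), math.random(0, height - crop)
end

local w, h = resizeSmallestSide(500, 375, 256)
print(w, h)                       -- 341 256
print(randomCropOrigin(w, h, 224))
```

For a 500x375 image, the shorter side becomes 256 and the longer side about 341, which leaves 118 possible horizontal and 33 possible vertical positions for a 224x224 patch.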
Example
You can test out your ImageNet datasource using the alexnet.lua training script.
BillionWords
A DataSource subclass wrapping the corpus derived from the training-monolingual.tokenized/news.20??.en.shuffled.tokenized data distributed for WMT11. The preprocessing suggested by the Google 1-Billion Words language modeling benchmark was used to prepare the data. The task consists of predicting the next word given the n previous ones, where n is the context size (it can be set in the constructor). The data consists of approximately 30 million sentences of an average length of about 25 words. In total, there are about 800 thousand (unique) words in the vocabulary, which makes it a very memory intensive problem. The DataSource includes data for building hierarchical softmaxes to accelerate training. As usual, the actual data is downloaded automatically when not found on disk. It is stored as a serialized torch.Tensor (see code for details).
Penn Tree Bank
A small TextSource used to train and evaluate language models. It has a relatively small vocabulary of 10000 words.
Sampler
A DataSet iterator which sequentially samples Batches from a DataSet for a Propagator. This iterator calls the sub Batch factory method.
dp.Sampler{batch_size, epoch_size}
A constructor having the following arguments:
- batch_size is a number specifying the number of examples per batch. Defaults to 1024.
- epoch_size specifies the number of examples presented per epoch. When epoch_size is less than the size of the dataset, the sampler resumes processing the dataset from its ending position the next time sampleEpoch is called. When epoch_size is greater, it loops through the dataset until enough samples are drawn. The default (-1) is to use the entire dataset per epoch.
- ppf is an optional function that preprocesses a Batch into another Batch.
[sampleBatch] sampleEpoch(dataset)
Returns an iterator over a dataset for one epoch. The returned sampleBatch iterator is a function taking one optional argument : batch, which is a Batch. When this argument is provided, its encapsulated memory will be reused. When called, the sampleBatch function returns the next Batch of examples, until the dataset has been iterated through for the duration of an epoch. This method is typically overridden by sub-classes.
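The interaction between epoch_size and the iterator (resuming where the last epoch stopped, and wrapping around at the end of the dataset) can be sketched in plain Lua. This is an illustration, not the library's implementation: newSampler is a hypothetical name, the iterator yields index ranges rather than Batches, and trimming the last batch so that an epoch presents exactly epochSize examples is an assumption of the sketch.

```lua
-- Hypothetical sketch (not dp's code) of an ordered sampler over a dataset of
-- nExample examples. Each call to sampleEpoch returns an iterator yielding
-- (start, stop) index ranges until epochSize examples have been presented.
function newSampler(nExample, batchSize, epochSize)
   local position = 1                       -- persists between epochs
   if epochSize == -1 then epochSize = nExample end
   local sampler = {}
   function sampler.sampleEpoch()
      local nSampled = 0
      return function()
         if nSampled >= epochSize then return nil end
         local start = position
         local stop = math.min(position + batchSize - 1, nExample,
                               position + (epochSize - nSampled) - 1)
         nSampled = nSampled + (stop - start + 1)
         -- wrap around once the end of the dataset is reached
         position = (stop == nExample) and 1 or (stop + 1)
         return start, stop
      end
   end
   return sampler
end

-- 10 examples, batches of 4, 6 examples per epoch.
local sampler = newSampler(10, 4, 6)
for epoch = 1, 2 do
   for start, stop in sampler.sampleEpoch() do
      print(epoch, start, stop)
   end
end
```

With these numbers the first epoch covers examples 1-4 and 5-6, and the second epoch resumes at 7-10 before wrapping around to 1-2.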
async()
Used in conjunction with a multithreaded dataset to iterate through Batches asynchronously. You can test the async functionality via the alexnet.lua training script.
ShuffleSampler
A subclass of Sampler which iterates over Batches in a dataset by shuffling the example indices before each epoch. This iterator calls the index Batch factory method.
dp.ShuffleSampler{batch_size, random_seed}
A constructor having the following arguments:
- batch_size specifies the number of examples per sampled batch. The default is 128.
- random_seed is a number used to initialize the shuffle generator.
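Shuffling indices before each epoch guarantees that every example is visited exactly once per pass. A minimal plain-Lua sketch of the idea follows; it is not the library's code, shuffledBatches is a hypothetical name, and seeding the global generator with math.randomseed merely stands in for the random_seed argument.

```lua
-- Hypothetical sketch: Fisher-Yates shuffle of the example indices, redone
-- before every epoch, then cut into batches of batchSize indices.
function shuffledBatches(nExample, batchSize)
   local indices = {}
   for i = 1, nExample do indices[i] = i end
   for i = nExample, 2, -1 do
      local j = math.random(i)              -- uniform in [1, i]
      indices[i], indices[j] = indices[j], indices[i]
   end
   local batches = {}
   for start = 1, nExample, batchSize do
      local batch = {}
      for k = start, math.min(start + batchSize - 1, nExample) do
         batch[#batch + 1] = indices[k]
      end
      batches[#batches + 1] = batch
   end
   return batches
end

math.randomseed(7)                          -- plays the role of random_seed
for _, batch in ipairs(shuffledBatches(10, 4)) do
   print(table.concat(batch, ' '))
end
```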
SentenceSampler
A subclass of Sampler which iterates over parallel sentences of equal size one word at a time. The sentence sizes are iterated through randomly. Publishes to the "beginSequence" Mediator Channel before each new Sequence, which prompts the recurrent Modules (like Recurrent and LSTM) to forget the previous sequence of inputs. Note that epoch_size only guarantees the minimum number of samples per epoch (more could be sampled). Used for Recurrent Neural Network Language Models.
RandomSampler
A DataSet iterator which randomly samples batches from a dataset. This iterator calls the sample Batch factory method. Unlike the ShuffleSampler, this iterator is not guaranteed to sample each example in the dataset during a complete epoch. You can see this Sampler in action via the alexnet.lua training script.
TextSampler
A TextSet iterator which is used mostly for recurrent language models. Successive sampled batches are contiguous to each other, so sample i of the jth sampled batch is the sequence of words preceding the sequence of words in sample i of the (j+1)th batch. When batch_size=1, the entire text will be sampled one sequence at a time in order of appearance in the text.
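One way to obtain this contiguity is to cut the text into batch_size parallel streams and advance all streams together. The plain-Lua sketch below illustrates that layout with word positions only; it is not the library's implementation, contiguousBatches is a hypothetical name, and seqLen (the number of words per sample) is an assumed parameter introduced for the example.

```lua
-- Hypothetical sketch: cut a text of nWord word ids into batchSize parallel
-- streams, then read seqLen words from each stream per batch. seqLen is an
-- assumed parameter name, used here only for illustration.
function contiguousBatches(nWord, batchSize, seqLen)
   local streamLen = math.floor(nWord / batchSize)
   local nBatch = math.floor(streamLen / seqLen)
   local batches = {}
   for j = 1, nBatch do
      local batch = {}
      for i = 1, batchSize do
         local start = (i - 1) * streamLen + (j - 1) * seqLen + 1
         batch[i] = {start = start, stop = start + seqLen - 1}
      end
      batches[j] = batch
   end
   return batches
end

-- 12 words, 2 samples per batch, 3 words per sample:
-- sample 1 covers words 1-3 then 4-6; sample 2 covers words 7-9 then 10-12.
for j, batch in ipairs(contiguousBatches(12, 2, 3)) do
   for i, s in ipairs(batch) do print(j, i, s.start, s.stop) end
end
```

With a single stream (batchSize of 1), the samples simply walk through the text from the first word to the last.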