Neural Network Tutorial
We begin with a simple neural network example. The first line loads the dp package, whose first order of business is to load its dependencies (see init.lua):
```lua
require 'dp'
```
Note: the Moses package is imported as `_`, so `_` shouldn't be used for dummy variables. Instead use the much more annoying `__`, or whatnot.
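For instance (a minimal sketch, not from the tutorial itself), using `__` as the dummy name leaves the Moses library untouched:

```lua
require 'dp'

-- after require 'dp', _ is the Moses library, so use another dummy name:
for __, v in ipairs{10, 20, 30} do
   print(v) -- the unused index goes into the dummy variable __
end
```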
Command-line Arguments
Let's define some command-line arguments. These will be stored in the table `opt`, which is printed when the script is launched. Command-line arguments make it easy to control the experiment and try out different hyper-parameters without needing to modify any code.
```lua
--[[command line arguments]]--
cmd = torch.CmdLine()
cmd:text()
cmd:text('Image Classification using MLP Training/Optimization')
cmd:text('Example:')
cmd:text('$> th neuralnetwork.lua --batchSize 128 --momentum 0.5')
cmd:text('Options:')
cmd:option('--learningRate', 0.1, 'learning rate at t=0')
cmd:option('--lrDecay', 'linear', 'type of learning rate decay : adaptive | linear | schedule | none')
cmd:option('--minLR', 0.00001, 'minimum learning rate')
cmd:option('--saturateEpoch', 300, 'epoch at which linear decayed LR will reach minLR')
cmd:option('--schedule', '{}', 'learning rate schedule')
cmd:option('--maxWait', 4, 'maximum number of epochs to wait for a new minimum to be found. After that, the learning rate is decayed by decayFactor.')
cmd:option('--decayFactor', 0.001, 'factor by which learning rate is decayed for adaptive decay.')
cmd:option('--maxOutNorm', 1, 'max norm of the output neuron weights of each layer')
cmd:option('--momentum', 0, 'momentum')
cmd:option('--hiddenSize', '{200,200}', 'number of hidden units per layer')
cmd:option('--batchSize', 32, 'number of examples per batch')
cmd:option('--accUpdate', false, 'accumulate gradients inplace')
cmd:option('--cuda', false, 'use CUDA')
cmd:option('--useDevice', 1, 'sets the device (GPU) to use')
cmd:option('--maxEpoch', 100, 'maximum number of epochs to run')
cmd:option('--maxTries', 30, 'maximum number of epochs to try to find a better local minimum for early-stopping')
cmd:option('--dropout', false, 'apply dropout on hidden neurons')
cmd:option('--batchNorm', false, 'use batch normalization. dropout is mostly redundant with this')
cmd:option('--dataset', 'Mnist', 'which dataset to use : Mnist | NotMnist | Cifar10 | Cifar100')
cmd:option('--standardize', false, 'apply Standardize preprocessing')
cmd:option('--zca', false, 'apply Zero-Component Analysis whitening')
cmd:option('--lecunlcn', false, 'apply Yann LeCun Local Contrast Normalization')
cmd:option('--progress', false, 'display progress bar')
cmd:option('--silent', false, 'do not print anything to stdout')
cmd:text()
opt = cmd:parse(arg or {})
opt.schedule = dp.returnString(opt.schedule)
opt.hiddenSize = dp.returnString(opt.hiddenSize)
if not opt.silent then
table.print(opt)
end
```
Preprocess
The `--standardize`, `--zca` and `--lecunlcn` cmd-line arguments can be toggled on to perform some preprocessing on the data.
```lua
--[[preprocessing]]--
local input_preprocess = {}
if opt.standardize then
table.insert(input_preprocess, dp.Standardize())
end
if opt.zca then
table.insert(input_preprocess, dp.ZCA())
end
if opt.lecunlcn then
table.insert(input_preprocess, dp.GCN())
table.insert(input_preprocess, dp.LeCunLCN{progress=true})
end
```
A very common and easy preprocessing technique is to Standardize the datasource, which subtracts the mean and divides by the standard deviation. Both statistics (mean and standard deviation) are measured on the `train` set only, which is a common pattern when preprocessing data. When statistics need to be measured across different examples (as in the ZCA and LeCunLCN preprocesses), we fit the preprocessor on the `train` set and apply it to all sets (`train`, `valid` and `test`). However, some preprocesses require that statistics be measured only on each example, as is the case for global contrast normalization ([GCN](preprocess.md#dp.GCN)), such that there is no fitting.
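To make the fit-on-train pattern concrete, here is a minimal sketch of standardization using plain torch Tensors and hypothetical variable names (dp.Standardize encapsulates this logic for you):

```lua
require 'torch'

-- hypothetical 2D datasets: one row per example, one column per feature
local trainData = torch.randn(1000, 784)
local validData = torch.randn(100, 784)

-- fit: both statistics are measured on the train set only
local mean = trainData:mean(1) -- 1 x 784
local std = trainData:std(1)   -- 1 x 784

-- apply: the same train-set statistics transform every set
local function standardize(data)
   local centered = data:clone():add(-1, mean:expandAs(data))
   return centered:cdiv(std:expandAs(data))
end
local trainStd = standardize(trainData)
local validStd = standardize(validData)
```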
DataSource
We intend to build and train a neural network, so we need some data, which we encapsulate in a DataSource object. dp provides the option of training on different datasets, notably MNIST, NotMNIST, CIFAR-10 or CIFAR-100. The default for this script is the archetypal MNIST (don't leave home without it). However, you can use the `--dataset` argument to specify a different image classification dataset.
```lua
--[[data]]--
if opt.dataset == 'Mnist' then
ds = dp.Mnist{input_preprocess = input_preprocess}
elseif opt.dataset == 'NotMnist' then
ds = dp.NotMnist{input_preprocess = input_preprocess}
elseif opt.dataset == 'Cifar10' then
ds = dp.Cifar10{input_preprocess = input_preprocess}
elseif opt.dataset == 'Cifar100' then
ds = dp.Cifar100{input_preprocess = input_preprocess}
else
error("Unknown Dataset")
end
```
A DataSource contains up to three DataSets: `train`, `valid` and `test`. The first is for training the model. The second is used for early-stopping and cross-validation. The third is used for publishing papers and comparing results across different models.
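For example (a sketch assuming dp's usual accessors, `trainSet`, `validSet` and `testSet`), the individual DataSets can be retrieved from the datasource like so:

```lua
require 'dp'

local ds = dp.Mnist()
-- a DataSource wraps up to three DataSets:
print(ds:trainSet():nSample()) -- number of training examples
print(ds:validSet():nSample()) -- number of validation examples
print(ds:testSet():nSample())  -- number of test examples
```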
Model of Modules
Ok so we have a DataSource, now we need a model. Let's build a multi-layer perceptron (MLP) with one or more parameterized non-linear layers (note that when the hidden layers are omitted via `--hiddenSize '{}'`, the model is just a linear classifier):
```lua
--[[Model]]--
model = nn.Sequential()
model:add(nn.Convert(ds:ioShapes(), 'bf')) -- to batchSize x nFeature (also type converts)
-- hidden layers
inputSize = ds:featureSize()
for i,hiddenSize in ipairs(opt.hiddenSize) do
model:add(nn.Linear(inputSize, hiddenSize)) -- parameters
if opt.batchNorm then
model:add(nn.BatchNormalization(hiddenSize))
end
model:add(nn.Tanh())
if opt.dropout then
model:add(nn.Dropout())
end
inputSize = hiddenSize
end
-- output layer
model:add(nn.Linear(inputSize, #(ds:classes())))
model:add(nn.LogSoftMax())
```
Output and hidden layers are defined using a Linear, which contains the parameters that will be learned, followed by a non-linear transfer function like Tanh (for the hidden neurons) and LogSoftMax (for the output layer). The latter might seem odd (why not use SoftMax instead?), but the ClassNLLCriterion only works with LogSoftMax (or with SoftMax + Log).
The Linear modules are constructed using two arguments: `inputSize` (the number of input units) and `outputSize` (the number of output units). For the first layer, the `inputSize` is the number of features in the input image. In our case, that is 1x28x28 = 784, which is what `ds:featureSize()` returns.
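As a quick standalone sanity check (independent of the tutorial's model), forwarding a batch through a Linear module shows how `inputSize` and `outputSize` relate to the tensor shapes:

```lua
require 'nn'

local linear = nn.Linear(784, 200)  -- inputSize = 784, outputSize = 200
local batch = torch.randn(32, 784)  -- 32 examples of 784 features each
local output = linear:forward(batch)
print(output:size())                -- 32 x 200 : one 200-unit activation per example
```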
Now for the odd-looking nn.Convert Module. It has two purposes. First, whatever type of Tensor it receives, it will output the type of Tensor used by the Module. Second, it can convert between different Tensor shapes. The shape of a typical image is bchw, short for batch, color/channel, height and width. Modules like SpatialConvolution and SpatialMaxPooling expect this type of input. Our MLP, on the other hand, expects an input of shape bf, short for batch, feature. It's a pretty simple conversion, actually: all you need to do is flatten the chw dimensions into a single f dimension (in this case, of size 784).
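Conceptually, the bchw-to-bf conversion amounts to a reshape (a minimal sketch with plain Tensor operations, not nn.Convert's actual implementation):

```lua
require 'torch'

local x = torch.randn(32, 1, 28, 28) -- bchw: batch x channel x height x width
local y = x:view(32, 1 * 28 * 28)    -- bf: batch x feature, i.e. 32 x 784
print(y:size())
```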
For those not familiar with the nn package, all the `nn.*` objects in the above snippet of code are Module subclasses. This is true even for the Sequential, although the latter is special: it is a Container of other Modules, i.e. a composite.
Propagator
Next we initialize some Propagators. Each such Propagator will propagate examples from a different DataSet. Samplers iterate over DataSets to generate Batches of examples (inputs and targets) to propagate through the model:
```lua
--[[Propagators]]--
if opt.lrDecay == 'adaptive' then
ad = dp.AdaptiveDecay{max_wait = opt.maxWait, decay_factor=opt.decayFactor}
elseif opt.lrDecay == 'linear' then
opt.decayFactor = (opt.minLR - opt.learningRate)/opt.saturateEpoch
end
train = dp.Optimizer{
acc_update = opt.accUpdate,
loss = nn.ModuleCriterion(nn.ClassNLLCriterion(), nil, nn.Convert()),
epoch_callback = function(model, report) -- called every epoch
-- learning rate decay
if report.epoch > 0 then
if opt.lrDecay == 'adaptive' then
opt.learningRate = opt.learningRate*ad.decay
ad.decay = 1
elseif opt.lrDecay == 'schedule' and opt.schedule[report.epoch] then
opt.learningRate = opt.schedule[report.epoch]
elseif opt.lrDecay == 'linear' then
opt.learningRate = opt.learningRate + opt.decayFactor
end
opt.learningRate = math.max(opt.minLR, opt.learningRate)
if not opt.silent then
print("learningRate", opt.learningRate)
end
end
end,
callback = function(model, report) -- called every batch
if opt.accUpdate then
model:accUpdateGradParameters(model.dpnn_input, model.output, opt.learningRate)
else
model:updateGradParameters(opt.momentum) -- affects gradParams
model:updateParameters(opt.learningRate) -- affects params
end
model:maxParamNorm(opt.maxOutNorm) -- affects params
model:zeroGradParameters() -- affects gradParams
end,
feedback = dp.Confusion(),
sampler = dp.ShuffleSampler{batch_size = opt.batchSize},
progress = opt.progress
}
valid = dp.Evaluator{
feedback = dp.Confusion(),
sampler = dp.Sampler{batch_size = opt.batchSize}
}
test = dp.Evaluator{
feedback = dp.Confusion(),
sampler = dp.Sampler{batch_size = opt.batchSize}
}
```
For this example, we use an Optimizer for the training DataSet, and two Evaluators, one for cross-validation and another for testing. Now let's explore the different constructor arguments.
sampler
The Evaluators use a simple Sampler which iterates sequentially through the DataSet. On the other hand, the Optimizer uses a ShuffleSampler, which shuffles the (indices of the) DataSet before each pass over all of its examples. This shuffling is useful for training since the model must learn from varying sequences of batches, which makes the training algorithm more stochastic (subject to the constraint that each example is presented once and only once per epoch).
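Here is a minimal sketch of shuffled batch iteration in plain torch (hypothetical code; dp's ShuffleSampler handles this internally):

```lua
require 'torch'

local nExample, batchSize = 100, 32
local indices = torch.randperm(nExample) -- a fresh shuffle for each epoch
for start = 1, nExample, batchSize do
   local size = math.min(batchSize, nExample - start + 1)
   local batchIndices = indices:narrow(1, start, size):long()
   -- batchIndices can now be used to index into the dataset's tensors,
   -- e.g. inputs:index(1, batchIndices)
end
```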
loss
Each Propagator can also specify a `loss` for training or evaluation. This argument is only mandatory for the Optimizer, as it is required for backpropagation. If you have previously used the nn package, there is nothing new here: the `loss` is a Criterion. Each example has a single target class and our Model output is LogSoftMax, so we use a ClassNLLCriterion. The criterion is wrapped in a ModuleCriterion, which is a decorator that allows us to pass each `input` and `target` through a module before it is passed on to the decorated criterion. In our case, we want to make sure each `target` gets converted to the type of the `loss`.
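For reference, here is the criterion in isolation (a standalone sketch): ClassNLLCriterion expects log-probabilities, which is why the model ends with LogSoftMax:

```lua
require 'nn'

local criterion = nn.ClassNLLCriterion()
local logProbs = nn.LogSoftMax():forward(torch.randn(4, 10)) -- 4 examples, 10 classes
local targets = torch.Tensor{1, 5, 3, 10}                    -- one class index per example
local nll = criterion:forward(logProbs, targets)
print(nll) -- average negative log-likelihood over the batch
```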
feedback
The `feedback` parameter is used to provide us with, you guessed it, feedback (like performance measures and statistics after each epoch). We use Confusion, which is a wrapper for the optim package's ConfusionMatrix. While our Loss measures the Negative Log-Likelihood (NLL) of the model on different datasets, our Feedback measures classification accuracy (which is what we will use for early-stopping and comparing our model to the state-of-the-art).
callback
Since the Optimizer is used to train the model on a DataSet, we need to specify a `callback` function that will be called after successive forward/backward calls. Among other things, the callback should either `updateParameters` or `accUpdateGradParameters`. Depending on what is specified in the command-line, it can also be used to `updateGradParameters` (commonly known as momentum learning). You can also choose to regularize with weightDecay or maxParamNorm (personally, I prefer the latter). In any case, the `callback` is a function that you can define to fit your needs.
epoch_callback
While the `callback` argument is called every batch, the `epoch_callback` is called between epochs. This is useful for decaying hyper-parameters such as the learning rate, which is what we do in this example. Since the learning rate is the most important hyper-parameter, it is a good idea to try different learning rate decay schedules during hyper-optimization.

The `--lrDecay 'linear'` decay is the easiest to use (it is the default cmd-line argument). It involves specifying a starting learning rate `--learningRate`, a minimum learning rate `--minLR` and the epoch at which that minimum will be reached: `--saturateEpoch`.

The `--lrDecay 'adaptive'` mode uses an exponentially decaying learning rate. By default this mode only decays the learning rate when a new minimum hasn't been found for `--maxWait` epochs. But by using `--maxWait -1`, we can decay the learning rate every epoch with the following rule: `lr = lr*decayFactor`. This will decay the learning rate much faster than a linear decay.

Another option (`--lrDecay schedule`) is to specify the learning rate schedule manually via `--schedule`, a table mapping epochs to learning rates like `'{[200] = 0.01, [300] = 0.001}'`, which will decay the learning rate to the given value at the given epoch.

Of course, because this argument is just another callback function, you can use it however you please by coding your own function.
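As a worked example of the linear decay arithmetic (a standalone sketch of the rule used in the `epoch_callback` above, with the script's default values):

```lua
-- defaults: --learningRate 0.1 --minLR 0.00001 --saturateEpoch 300
local learningRate, minLR, saturateEpoch = 0.1, 0.00001, 300
local decayFactor = (minLR - learningRate) / saturateEpoch -- about -0.000333 per epoch
for epoch = 1, 3 do
   learningRate = math.max(minLR, learningRate + decayFactor)
   print(epoch, learningRate) -- approximately 0.09967, 0.09933, 0.09900
end
```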
acc_update
When set to true, the gradients w.r.t. parameters (a.k.a. `gradParameters`) are accumulated directly into the parameters (a.k.a. `parameters`) to produce an update after the `forward` and `backward` pass. In other words, for `acc_update=true`, the sequence for propagating a batch is essentially:

1. updateOutput
2. updateGradInput
3. accUpdateGradParameters

Instead of the more common:

1. updateOutput
2. updateGradInput
3. accGradParameters
4. updateParameters

This means that no `gradParameters` are actually used internally. The default value is false. Some methods do not work with `acc_update` as they require that the `gradParameters` tensors be populated before being added to the `parameters`. This is the case for `updateGradParameters` (momentum learning) and `weightDecay`.
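To make the two sequences concrete, here is a minimal standalone sketch of both update styles using plain nn calls (dp's `callback` wires this up for you):

```lua
require 'nn'

local model = nn.Linear(10, 2)
local criterion = nn.MSECriterion()
local input, target = torch.randn(10), torch.randn(2)
local lr = 0.1

-- common sequence (acc_update = false): gradParameters are populated, then applied
model:zeroGradParameters()
local output = model:forward(input)                   -- updateOutput
criterion:forward(output, target)
local gradOutput = criterion:backward(output, target)
model:backward(input, gradOutput)                     -- updateGradInput + accGradParameters
model:updateParameters(lr)                            -- params = params - lr * gradParams

-- acc_update = true: the scaled gradient is folded directly into the parameters
local output2 = model:forward(input)                  -- updateOutput
criterion:forward(output2, target)
local gradOutput2 = criterion:backward(output2, target)
model:updateGradInput(input, gradOutput2)             -- updateGradInput
model:accUpdateGradParameters(input, gradOutput2, lr) -- update without touching gradParameters
```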
progress
Finally, we allow for the Optimizer's `progress` bar to be switched on so that we can monitor training progress.
Experiment
Now it's time to put this all together to form an Experiment:
```lua
--[[Experiment]]--
xp = dp.Experiment{
model = model,
optimizer = train,
validator = valid,
tester = test,
observer = {
dp.FileLogger(),
dp.EarlyStopper{
error_report = {'validator','feedback','confusion','accuracy'},
maximize = true,
max_epochs = opt.maxTries
}
},
random_seed = os.time(),
max_epoch = opt.maxEpoch
}
```
Observer
The Experiment can be initialized using a list of Observers. The order is not important. Observers listen to mediator Channels. The Mediator calls them back when certain events occur. In particular, they may listen to the doneEpoch Channel to receive a report from the Experiment after each epoch. A report is nothing more than a bunch of nested tables matching the object structure of the experiment. After each epoch, the component objects of the Experiment (except Observers) can each submit a report to its composite parent thereby forming a tree of reports. The Observers can analyse these and modify the components which they are assigned to (in this case, Experiment). Observers may be attached to Experiments, Propagators, Visitors, etc.
FileLogger
Here we use a simple FileLogger which will store serialized reports in a simple text file for later use. Each experiment has a unique ID which is included in the corresponding reports, thus allowing the FileLogger to name its file appropriately.
EarlyStopper
The EarlyStopper is used for stopping the Experiment when the error has not decreased, or the accuracy has not increased, for some number of epochs. It also saves to disk the best version of the Experiment whenever it finds a new one. It is initialized with a channel to maximize or minimize (the default is to minimize). In this case, we intend to early-stop the experiment on a field of the report, in particular the accuracy field of the confusion table of the feedback table of the validator. This `{'validator','feedback','confusion','accuracy'}` happens to measure the accuracy of the Model on the validation DataSet after each training epoch. So by early-stopping on this measure, we hope to find a Model that generalizes well. The parameter `max_epochs` indicates how many consecutive epochs of training can occur without finding a new best model before the experiment is signaled to stop via the doneExperiment Mediator Channel.
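Conceptually, the `error_report` argument is just a sequence of keys indexing into the nested report table (a sketch, assuming the report structure described above):

```lua
-- follow the path {'validator','feedback','confusion','accuracy'} through a report
local function getField(report, path)
   local value = report
   for __, key in ipairs(path) do -- __ because _ is Moses
      value = value[key]
   end
   return value
end

-- hypothetical report fragment matching the object structure of the experiment
local report = {validator = {feedback = {confusion = {accuracy = 0.9603}}}}
print(getField(report, {'validator','feedback','confusion','accuracy'})) -- 0.9603
```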
Running the Experiment
Once we have initialized the experiment, we need only run it on the datasource to begin training.
```lua
xp:run(ds)
```
We don't initialize the Experiment with the DataSource so that we may easily save it to disk, thereby keeping this snapshot separate from its data (which shouldn't be modified by the experiment).
Let's run the script from the cmd-line (with default arguments):
```
nicholas@xps:~/projects/dp$ th examples/neuralnetwork.lua
```
First it prints the command-line arguments stored in `opt`:
```
{
batchNorm : false
batchSize : 32
cuda : false
dataset : "Mnist"
dropout : false
hiddenSize : {200,200}
learningRate : 0.1
lecunlcn : false
maxEpoch : 100
maxOutNorm : 1
maxTries : 30
momentum : 0
progress : false
schedule : {[200]=0.01,[400]=0.001}
silent : false
standardize : false
useDevice : 1
zca : false
}
```
After that, it prints the model:

```
Model :
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> output]
(1): nn.Convert
(2): nn.Linear(784 -> 200)
(3): nn.Tanh
(4): nn.Linear(200 -> 200)
(5): nn.Tanh
(6): nn.Linear(200 -> 10)
(7): nn.LogSoftMax
}
```
The FileLogger then prints where the epoch logs will be saved. This can be controlled with the `$TORCH_DATA_PATH` or `$DEEP_SAVE_PATH` environment variables. It defaults to `$HOME/save`.
```
FileLogger: log will be written to /home/nicholas/save/xps:1432747515:1/log
```
Finally, we get to the fun part: the actual training. Every epoch, some performance data gets printed to stdout:
```
==> epoch # 1 for optimizer :
==> example speed = 4508.3691689025 examples/s
xps:1432747515:1:optimizer:loss avgErr 0.012714211946021
xps:1432747515:1:optimizer:confusion accuracy = 0.8877
xps:1432747515:1:validator:confusion accuracy = 0.9211
xps:1432747515:1:tester:confusion accuracy = 0.9292
==> epoch # 2 for optimizer :
==> example speed = 4526.7213369494 examples/s
xps:1432747515:1:optimizer:loss avgErr 0.0072034133582363
xps:1432747515:1:optimizer:confusion accuracy = 0.93302
xps:1432747515:1:validator:confusion accuracy = 0.9405
xps:1432747515:1:tester:confusion accuracy = 0.9428
==> epoch # 3 for optimizer :
==> example speed = 4486.8207535058 examples/s
xps:1432747515:1:optimizer:loss avgErr 0.0056732489919492
xps:1432747515:1:optimizer:confusion accuracy = 0.94704
xps:1432747515:1:validator:confusion accuracy = 0.9512
xps:1432747515:1:tester:confusion accuracy = 0.9518
==> epoch # 4 for optimizer :
==> example speed = 4524.4831336064 examples/s
xps:1432747515:1:optimizer:loss avgErr 0.0047361240094285
xps:1432747515:1:optimizer:confusion accuracy = 0.95672
xps:1432747515:1:validator:confusion accuracy = 0.9565
xps:1432747515:1:tester:confusion accuracy = 0.9584
==> epoch # 5 for optimizer :
==> example speed = 4527.7260154406 examples/s
xps:1432747515:1:optimizer:loss avgErr 0.0041567858616232
xps:1432747515:1:optimizer:confusion accuracy = 0.96188
xps:1432747515:1:validator:confusion accuracy = 0.9603
xps:1432747515:1:tester:confusion accuracy = 0.9613
SaveToFile: saving to /home/nicholas/save/xps:1432747515:1.dat
==> epoch # 6 for optimizer :
==> example speed = 4519.2735741475 examples/s
xps:1432747515:1:optimizer:loss avgErr 0.0037086909102431
xps:1432747515:1:optimizer:confusion accuracy = 0.9665
xps:1432747515:1:validator:confusion accuracy = 0.9602
xps:1432747515:1:tester:confusion accuracy = 0.9629
==> epoch # 7 for optimizer :
==> example speed = 4528.1356378239 examples/s
xps:1432747515:1:optimizer:loss avgErr 0.0033203622647625
xps:1432747515:1:optimizer:confusion accuracy = 0.97062
xps:1432747515:1:validator:confusion accuracy = 0.966
xps:1432747515:1:tester:confusion accuracy = 0.9665
SaveToFile: saving to /home/nicholas/save/xps:1432747515:1.dat
```
After 5 epochs, the experiment starts early-stopping: it saves to disk the version of the model with the highest `xps:1432747515:1:validator:confusion accuracy`. The first part of that string (`xps:1432747515:1`) is the unique id of the experiment. It concatenates the hostname of the computer (`xps` in this case) and a time-stamp.
Loading the saved Experiment
The experiment is saved at `/home/nicholas/save/xps:1432747515:1.dat`. You can load it and access the `model` with:
```lua
require 'dp'
require 'cunn' -- only if you used the cmd-line argument --cuda
require 'optim'
xp = torch.load("/home/nicholas/save/xps:1432747515:1.dat")
model = xp:model()
print(torch.type(model))
```
```
nn.Serial
```
For efficiency, the `model` here is decorated with a nn.Serial. You can access the `model` you passed to the experiment by adding:
```lua
model = model.module
print(torch.type(model))
```

```
nn.Sequential
```
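From here you can run the deserialized model on new inputs (a hedged sketch: it assumes the leading nn.Convert accepts an already-flattened bf batch, as it does for this MLP):

```lua
-- classify a single (random) example with the reloaded MLP
local input = torch.randn(1, 784)          -- one flattened 28x28 image
local logProbs = model:forward(input)      -- 1 x 10 log-probabilities
local maxLogProb, class = logProbs:max(2)  -- the most likely class index
print(class[1][1], math.exp(maxLogProb[1][1]))
```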
Hyper-optimization
Hyper-optimization is the hardest part of deep learning. In many ways, it can feel more like an art than a science.
I got this question from a dp user:

> If I am attempting to train a NN with a custom dataset, how do I optimize the parameters? I mean, if the NN takes about 3-5 days to train completely, how do you do small experiments / tuning (learning rate, nodes in each layer) so that you don't spend 5 days with each? Basically, what is the quickest way to find the best parameters?
This is what I answered:

> In my opinion it is a mix of experience and computing resources. The first comes from playing around with different combinations of datasets and models; eventually, you just know what kinds of regimes work well. As for the second, it lets you try different hyper-parameters at the same time. If you have access to 4-8 GPUs, that is 4-8 experiments to try at once in parallel. You will also quickly see that some of those experiments are much slower to converge than the others, in which case you can kill them and try something similar to the ones that work. Also, the most important hyper-parameter is the learning rate: find the highest value that doesn't make the training diverge, so that your experiments converge faster. You can also decay it towards the end to get a little accuracy boost (not always).
I could have said more. For one, I recommend regularizing with `maxParamNorm`. A `max_out_norm` around 2 is usually a good starting point; continue with 1 and 10, and only try -1 when out of ideas.
Second, you can vary the epoch sizes to better divide processing time between evaluation and training. It's often best to keep the evaluation sets small when you can (say, less than 10% of all the data). Also, the more training data, the better.
But these are all arbitrary guidelines. No one can tell you how to hyper-optimize. You need to try optimizing a dataset for yourself to find your own methodology and tricks. The dp GitHub repository also provides a wiki that can be used to share hyper-parameter configurations as well as corresponding performance metrics and observations.
Finally, it is easier to hyper-optimize as a team than alone. Everyone has a piece of the puzzle. For example, not too long ago I was reminded of the importance of a good spreadsheet for keeping track of which experiments were tried. This guy would have a column for each hyper-parameter in the cmd-line arguments, columns for error/accuracy and observations, and a row for each experiment. As for me, I was too lazy to make such a nice spreadsheet on my own; I just used paper and pen... But then I noticed how easy it was for him to find the best hyper-parameter configurations. Anyway, the point is, I learned from this guy who had this part of the puzzle all figured out. So keep your eyes peeled for such lessons in the art of hyper-optimization.