Stochasticity in Gradient Descent
– learning from the diploma thesis –
In the previous article I talked about a basic Machine Learning method – Gradient Descent.
Gradient Descent, albeit simple, is a strong method. It works well for small amounts of data, since it computes the gradient from all the data in the set. With few independent variables a small data set is sufficient, but more independent variables require more data. More data, in turn, means higher computational cost per iteration, so with roughly the same number of iterations the method needs much more time.
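To make the per-iteration cost concrete, here is a minimal sketch of full-batch Gradient Descent for linear regression. It is only an illustration under assumed names (`X`, `y`, `lr`, `n_iters`), not the exact setup from the thesis; the point is that every iteration touches the entire data set.

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Fit weights w by minimising mean squared error over the WHOLE data set."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        # The gradient is computed from all n_samples rows on every iteration,
        # so each step costs O(n_samples * n_features).
        grad = (2.0 / n_samples) * X.T @ (X @ w - y)
        w -= lr * grad
    return w
```

As the data set grows, the `X.T @ (X @ w - y)` product is what dominates the running time of every single step.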
Possible improvements
The time needed to reach sufficient precision grows with the number of independent variables and with the amount of data the derivative is computed over. Decreasing the number of independent variables would certainly help, but it is not trivial to determine which independent variables actually correlate with the dependent variables. Another way to improve the time complexity is to decrease the amount of data processed in a single iteration.
That is possible, simple, and fast to do. Processing less data makes each iteration cheaper, but it also (in most cases) improves the accuracy of the model less per iteration. Unfortunately this is not the only problem. How should the data for a single iteration be chosen? How much data is sufficient for a single iteration? How many iterations will be needed to reach a given accuracy? And does decreasing the amount of data per iteration actually improve the total time compared to processing the whole data set (to the same accuracy)?
The goal is to end up with a good enough model in the smallest possible amount of time and number of iterations, with each iteration having the biggest possible impact on the accuracy of the model.
Solutions
Decreasing the amount of data to a single sample speeds up a single iteration in proportion to the amount of data processed: the cost drops to roughly $\frac{1}{n}$ of the full-batch cost, where $n$ is the size of the data set. On some CPUs and GPUs a matrix multiplication over a small batch can be almost as fast as a single vector multiplication. In those cases the amount of data processed per iteration can be increased to some batch size $m$ instead of a single vector with (almost) no penalty. And even on processors that do not possess this capability it is often faster to process several vectors at once than to run more iterations.
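As a sketch of that point, assuming the same hypothetical linear-regression setup as above, the gradient over a batch of $m$ rows is a single matrix–vector product, so on hardware with efficient matrix products it costs little more than a single-sample update:

```python
import numpy as np

def batch_gradient(X_batch, y_batch, w):
    """Gradient of the mean squared error over a batch of m rows.

    For m == 1 this reduces to a vector-vector product; for m > 1 it is one
    matrix-vector product, which many CPUs/GPUs execute almost as fast.
    """
    m = X_batch.shape[0]
    return (2.0 / m) * X_batch.T @ (X_batch @ w - y_batch)
```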
The other problem is how to iterate over the data set. Iterating one-by-one (or, on GPUs, batch-by-batch) may sound like a wise choice, but it can be counterproductive. Some data points may be outliers. In many cases outliers can be removed, but in others they indicate the existence of another property or phenomenon. Hitting an outlier can make the model diverge strongly. The data does not even have to be an outlier: the variance present in any real model means there is a non-zero probability of a step that diverges in some direction. The other problem with iterating through the whole data set in order is that the model has to be run through all of it, or it may remain incomplete (with data exhibiting some property missing).
Surprisingly enough, these problems are mostly mitigated by introducing randomness. Each data point $x_i$ has a chance of $\frac{m}{n}$ of being chosen, where $m$ is the number of chosen samples, $n$ is the size of the whole data set, and $m \le n$. With randomness introduced it is no longer necessary to run through the entire data set. The likelihood of hitting problematic data is reduced because the algorithm does not pass over every sample, and a single iteration is sped up by a factor of roughly $\frac{n}{m}$. Since it processes less data per iteration it is more fragile, and in some iterations it may diverge. It also has to run for more iterations than full Gradient Descent (over the whole data set), but the time in which a single iteration completes is unmatched.