This is the last post in the series about the Support Vector Machine classifier. We already have a feel for the basics of SVM. We have our data preprocessed. Finally, we know how some major hyperparameters influence the classifier. Now, let's choose proper hyperparameters for a given problem. This is done by validation or cross-validation. These techniques are very common in Machine Learning and are also helpful in finding a proper SVM model. The example will cover building the classifier for the foreground/background estimation problem in the Flover project.
Is it a "black art"?
Or can we automate something?
The most common hyperparameters to choose in an SVM model are complexity (margin softness) and gamma (or, in an equivalent parameterization, sigma), which controls the width of a Gaussian kernel. More complicated Machine Learning architectures have many more hyperparameters to optimize. For example, for deep neural networks one has to choose a proper learning rate, learning rate schedule, number of training iterations, number of hidden layers or momentum. Do we have any fixed routine to find these values, or do we have to rely on our experience with a given problem and architecture?
We can list some common methods of finding hyperparameters for our classifier:
- Manual Search
- Grid Search
- Random Search
- Automated optimization
Manual search: when we have some knowledge of a given topic and know the basics of a certain classifier, we can just search the space of hyperparameters manually. We simply take any set of parameters and train the model. Observing the generalization error, we tweak the parameters until we are satisfied with the results. It sounds like a more academic approach, but it's still very common in the industry as well.
Grid search: here, we select ranges for our hyperparameters and choose intervals for sampling them. This way we obtain a grid of parameter values which lets us do an exhaustive search. Now, we just run a model training for every parameter set in the grid. It's quite computationally expensive, especially when the training takes a long time and there are many hyperparameters, because the number of grid points is the product of the number of values chosen for each hyperparameter. On the other hand, it's very easy to parallelize such a process.
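The grid search procedure can be sketched in a few lines of plain Python. This is only an illustration: `validation_accuracy` is a hypothetical stand-in for "train the model and measure its accuracy on held-out data" (a smooth toy function, not a real classifier), and the names `C` and `gamma` as well as the grid values are assumptions chosen for the example.

```python
import itertools
import math


def validation_accuracy(C, gamma):
    """Hypothetical stand-in for training a model and scoring it.

    A smooth toy function that peaks at C=10, gamma=0.1.
    """
    return 1.0 / (1.0 + (math.log10(C) - 1.0) ** 2 + (math.log10(gamma) + 1.0) ** 2)


def grid_search(score, grid):
    """Evaluate every combination in the grid; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Logarithmically spaced values: 6 x 6 = 36 trainings in total.
grid = {
    "C": [10.0 ** e for e in range(-2, 4)],
    "gamma": [10.0 ** e for e in range(-4, 2)],
}
best_params, best_score = grid_search(validation_accuracy, grid)
print(best_params, best_score)
```

Each call to `score` is independent of the others, which is why the loop is easy to spread over several processes or machines.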
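Random search: like in grid search, we have to pick ranges for the hyperparameters. This time, the values of the parameters are chosen randomly within those ranges. With a fixed number of trials, this method is usually faster than an exhaustive grid search, and it can also be parallelized.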
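A minimal sketch of random search, under the same assumptions as before: `validation_accuracy` is a hypothetical toy function standing in for a real training run, and the ranges, the number of trials and the log-uniform sampling are illustrative choices rather than anything prescribed by the method.

```python
import math
import random


def validation_accuracy(C, gamma):
    """Hypothetical stand-in for training a model and scoring it."""
    return 1.0 / (1.0 + (math.log10(C) - 1.0) ** 2 + (math.log10(gamma) + 1.0) ** 2)


def random_search(score, ranges, n_trials=30, seed=0):
    """Sample each hyperparameter log-uniformly from its range, n_trials times."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            name: 10.0 ** rng.uniform(math.log10(low), math.log10(high))
            for name, (low, high) in ranges.items()
        }
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


ranges = {"C": (1e-2, 1e3), "gamma": (1e-4, 1e1)}
best_params, best_score = random_search(validation_accuracy, ranges)
print(best_params, best_score)
```

The cost is set by `n_trials` alone, so adding another hyperparameter does not multiply the number of trainings the way an extra grid axis does.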
Automated optimization: researchers constantly propose new methods for the automatic search of hyperparameters. The new set of parameters is chosen after each iteration of training in order to converge to the best available set. The most common method is Bayesian optimization [6]. A gradient-based method, specifically for SVM, is presented in [7]. There are also some gradient-free methods like Nelder-Mead optimization or evolutionary methods like genetic algorithms or particle swarm optimization.
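To illustrate the idea of choosing the next parameter set based on the previous results, here is a deliberately simple (1+1) evolution strategy. It is a much simpler technique than Bayesian optimization or a full genetic algorithm, and it is shown only as a sketch. The toy `validation_accuracy` function, the starting point and the step size are all assumptions; the search works on the base-10 logarithms of the two hyperparameters.

```python
import random


def validation_accuracy(log_C, log_gamma):
    """Hypothetical stand-in for a training run; peaks at log_C=1, log_gamma=-1."""
    return 1.0 / (1.0 + (log_C - 1.0) ** 2 + (log_gamma + 1.0) ** 2)


def evolve(score, start, n_iters=200, step=0.5, seed=0):
    """(1+1) evolution strategy: mutate the current best, keep the child if not worse."""
    rng = random.Random(seed)
    best = list(start)
    best_score = score(*best)
    for _ in range(n_iters):
        candidate = [x + rng.gauss(0.0, step) for x in best]
        current = score(*candidate)
        if current >= best_score:
            best, best_score = candidate, current
    return best, best_score


start = [-2.0, 2.0]  # arbitrary starting point in log10 space
best, best_score = evolve(validation_accuracy, start)
print(best, best_score)
```

Unlike grid or random search, each iteration depends on the previous one, so this kind of loop is harder to parallelize.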
Validation and cross-validation
It's very important to split our data set into training samples and testing samples. This is a very common approach to detect and prevent overfitting. A model is overfit when it shows excellent performance on the training set but performs poorly on previously unseen data. That is why we set aside a testing set to check the model's real performance. When we observe the learning curves for the chosen model, it's quite normal that at some point the testing (generalization) error reaches its minimum while the training error continues to decrease. In theory, we should stop the learning at this point to prevent overfitting.
So, we can evaluate different sets of hyperparameters until we find the minimum testing error. It turns out that this way we can overfit to the testing data! One can say that some information about the test set leaks into the model search. Therefore, we should extract one more set, which is called a validation set. Training is still performed on the training set, while hyperparameters are chosen based on the error on the validation set. Then, when we are satisfied, we can perform a final check on the testing set.
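The three-way partition can be sketched as follows. The helper name `three_way_split`, the set sizes and the integer sample IDs are placeholders for the example; in practice the samples would be feature vectors with labels.

```python
import random


def three_way_split(samples, n_train, n_val, seed=0):
    """Shuffle once, then cut into disjoint training, validation and testing sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything that is left
    return train, val, test


samples = list(range(2000))  # placeholder sample IDs
train, val, test = three_way_split(samples, n_train=1000, n_val=500)
print(len(train), len(val), len(test))
```

The testing set is touched only once, after the hyperparameters have been fixed using the validation set.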
Unfortunately, in this method, by partitioning data into three sets we lose some of our valuable data samples which could be used for the training. Here comes cross-validation.
We get rid of the separate validation set, leaving the test set for a final check. The training set is divided into k parts. Then, we train the model on k-1 parts while the validation is performed on the remaining part. This procedure is repeated k times, each time with a different part used for validation. The overall performance of a model is represented by the average of these k validation results. This is the most common type of cross-validation, named k-fold. Usually k is set to 5 or 10. It's a quite computationally expensive method, but it is very helpful if we don't have much data and can't afford to extract an additional validation set.
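The k-fold procedure can be sketched in plain Python. The functions `k_fold_indices` and `cross_validate` are hypothetical helpers written for this example, and the "model" is a trivial majority-class baseline on made-up labels, used only to keep the sketch self-contained.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(train_and_score, n_samples, k=5, seed=0):
    """Average the validation score over the k folds."""
    scores = [train_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# Made-up labels: one third of the samples are positive.
labels = [i % 3 == 0 for i in range(30)]


def majority_baseline(train_idx, val_idx):
    """'Train' by picking the majority label, then score it on the validation fold."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == majority for i in val_idx) / len(val_idx)


mean_accuracy = cross_validate(majority_baseline, n_samples=len(labels), k=5)
print(mean_accuracy)
```

Every sample is used for validation exactly once and for training k-1 times, which is where the extra computational cost comes from.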
SVM model selection
Let's go back to my dataset from the Flover project. It consists of superpixels with features like color, color variance and position on the image. They are labeled as foreground or background (FG/BG). I chose to perform a grid search because of its simplicity. I divided the dataset into 10000 learning samples, 5000 validation samples and 5000 testing samples. I had many more samples to use, so I could set aside a validation set without consequences. Below are the obtained results. I present the quality of FG/BG estimation in percent (accuracy of FG/BG prediction) for different values of the hyperparameters discussed before. Green fields indicate the best results, red - the worst.
We can see an interesting phenomenon. The model with the best learning performance - 98.7% - is clearly overfit, because its validation accuracy equals 84.7%, which is certainly not the best result. The SVM model which is the best here has the following hyperparameters: =0.08, =20. Its learning accuracy equals 90% and its validation accuracy - 87.5%. The performance obtained on the final testing set is very similar to the validation accuracy - 87.6%.
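The same kind of experiment can be sketched with scikit-learn. This is not the code used for the results above: the data is synthetic (generated by `make_classification`), the grid values are arbitrary, and the use of scikit-learn's `SVC` with an RBF kernel is an assumption made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for the FG/BG superpixel features.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5, random_state=0)

# 50% training, 25% validation, 25% testing.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit the scaler on the training set only, to avoid leaking information.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(a) for a in (X_train, X_val, X_test))

# Grid search: record (training accuracy, validation accuracy) per combination.
results = {}
for C in [0.1, 1, 10, 100]:
    for gamma in [0.001, 0.01, 0.1, 1]:
        model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
        results[(C, gamma)] = (model.score(X_train, y_train), model.score(X_val, y_val))

# Select by validation accuracy, not by training accuracy.
best_C, best_gamma = max(results, key=lambda params: results[params][1])

# Final check on the testing set, performed once.
final_model = SVC(kernel="rbf", C=best_C, gamma=best_gamma).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_C, best_gamma, results[(best_C, best_gamma)], test_accuracy)
```

Comparing the two numbers stored for each grid point shows which combinations fit the training data much better than the validation data, which is the sign of overfitting discussed above.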
1. Documentation of scikit-learn library for Python.
2. Ben-Hur, Weston - "A User’s Guide to Support Vector Machines".
3. "Guideline to select the hyperparameters in Deep Learning" on StackExchange.
4. Yoshua Bengio - "Practical Recommendations for Gradient-Based Training of Deep Architectures".
5. Hyperparameter optimization on Wikipedia.
6. Ryan P. Adams - "Practical Bayesian Optimization of Machine Learning Algorithms".
7. Olivier Chapelle et al. - "Choosing multiple parameters for support vector machines".