Data preprocessing is an important and often underestimated step in Machine Learning pipelines. It provides cleaned, relevant datasets that can then be used in further steps such as classification or regression. I will describe a case study of data fed to an SVM classifier that predicts whether a given image segment belongs to the foreground or the background. This is the second article about the Support Vector Machine, which is used for image segmentation in my flower species recognition project, Flover.
Blogging in English
First, let me mention that I decided to switch to English on my blog. There are several reasons for it. The main one is the much larger community I can communicate with. It will definitely be a great chance to polish my English, which will probably make me a frequent guest of online dictionaries 🙂 It may be a bit harder, of course, but I feel it's time to raise the bar after a month of blogging in Polish. On the other hand, English may turn out to be easier for some technical explanations: I came across many strange-sounding Polish translations when writing previous posts. Since there are not many articles on ProggBlogg so far, it's a good time to switch and maybe translate some of them. So, let's start the discussion about data preprocessing techniques.
The first and most important thing to do after data collection is to understand the problem we have to solve and extract the relevant data. In the Flover project, to eliminate the background (process described here), I use about 1300 images of flowers from this dataset and their ground-truth images defining foreground masks. On each image, SLIC segmentation is performed to obtain 200 superpixels per image, which are then the subject of foreground/background classification. For such problems, proper feature extraction is key. We have to decide which features carry relevant information about the object and can differentiate our samples between the clusters they belong to. In Flover, for FG/BG estimation, I extracted for each superpixel its color (R, G, B), color variances (R, G, B) and relative position (X, Y) in the picture. So, I get 200 vectors of size 8 for each image. I rejected the shape and area of a superpixel, as these features seem to be similar for background and foreground segments. Each segment, of course, has a label indicating the cluster it belongs to.
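To make the 8-element feature vector more concrete, here is a minimal sketch in Python using only the standard library. It is not the actual Flover code: the function name `superpixel_features`, the pixel tuple layout and the way the relative position is computed (centroid divided by image size) are my assumptions for illustration. The superpixel membership itself would come from a SLIC implementation, which is outside this sketch.

```python
from statistics import fmean, pvariance


def superpixel_features(pixels, width, height):
    """Build one 8-element feature vector for a single superpixel.

    pixels: list of (x, y, r, g, b) tuples belonging to the superpixel.
    width, height: image size, used to make the position relative.
    """
    xs, ys, rs, gs, bs = zip(*pixels)
    return (
        fmean(rs), fmean(gs), fmean(bs),              # mean color
        pvariance(rs), pvariance(gs), pvariance(bs),  # color variance
        fmean(xs) / width, fmean(ys) / height,        # relative centroid
    )


# Toy superpixel of four pixels in a 10x10 image (made-up values).
segment = [(2, 4, 200, 10, 30), (3, 4, 210, 12, 30),
           (2, 5, 190, 8, 30), (3, 5, 200, 10, 30)]
features = superpixel_features(segment, width=10, height=10)
```

Calling this once per superpixel would give the 200 vectors of size 8 per image mentioned above; the label is stored separately.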
Data preprocessing steps
Now we have 200 segments * 1300 images = 260k vectors, quite a large number of tuples, which may be redundant, unscaled or even improperly formatted. We would like to prepare this data for classifier training. Let's distinguish the following steps, which are commonly taken in Machine Learning for data preprocessing:
- Data Cleaning
Filling the missing values, resolving data inconsistencies, smoothing noisy data and removing outliers,
- Data Integration
Combining data with different representations, from multiple databases, removing redundant/similar data from many sources,
- Data Transformation
Normalization, aggregation and generalization of data,
- Data Reduction
Obtaining a reduced representation of the data that produces similar analytical results,
- Data Discretization
Dividing the range of continuous attributes into intervals.
We are going to explore some actions from this list which were taken to prepare the set of feature vectors for SVM training.
Sometimes it turns out that we have incomplete data. For example, in the discussed case of FG/BG estimation, some pictures from the training set had to be rejected because no foreground mask had been created for them. We cannot train a classifier without knowing the expected response for a sample, so let's remove these samples. Every flower species will still be well represented by a few pictures. This left 850 sample images, which should still be a sufficient number for FG/BG estimation. In my opinion, data formatting can also fall into the category of data cleaning. Here, I prepared an easily readable .csv file that contains all the feature vectors.
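The two cleaning actions above (rejecting images without a mask and writing the vectors to a .csv file) could look roughly like this. It is only a sketch: the record layout, the file names and the column names in `FIELDS` are hypothetical, not taken from Flover.

```python
import csv
import io

# Hypothetical column names: 8 features plus the FG/BG label.
FIELDS = ["r", "g", "b", "var_r", "var_g", "var_b", "x", "y", "label"]


def drop_unlabeled(images):
    """Keep only images that come with a ground-truth foreground mask."""
    return [img for img in images if img.get("mask") is not None]


def write_feature_csv(rows, stream):
    """Write feature vectors (8 features + label) as CSV with a header row."""
    writer = csv.writer(stream)
    writer.writerow(FIELDS)
    writer.writerows(rows)


# Made-up records: the second image has no mask and is rejected.
images = [
    {"name": "a.jpg", "mask": "a_mask.png"},
    {"name": "b.jpg", "mask": None},
    {"name": "c.jpg", "mask": "c_mask.png"},
]
kept = drop_unlabeled(images)

# An in-memory buffer stands in for a real file opened with newline="".
buffer = io.StringIO()
write_feature_csv([(200, 10, 30, 50, 2, 0, 0.25, 0.45, 1)], buffer)
csv_text = buffer.getvalue()
```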
From the previous step we are left with 200 segments * 850 images = 170k feature vectors. Further analysis showed that, for this dataset, the ratio of foreground to background superpixels is roughly 1:2. It is often advised to balance the training data before training an SVM classifier. To do so, we could either add new foreground segments or remove half of the background segments. Since the dataset is still large, we can choose the latter option to equalize the number of feature vectors in both clusters. After this operation we are left with 110k feature vectors. To test this approach I trained SVM classifiers before and after data balancing, which confirmed the advantage of this step:
- Before data balancing - FG/BG segmentation accuracy for learning set: 83%, testing set: 80%
- After data balancing - FG/BG segmentation accuracy for learning set: 90%, testing set: 88%
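Removing the surplus background segments is a form of random undersampling. A minimal sketch of it is shown below; the function name `undersample`, the fixed seed and the toy data are my assumptions, and the real selection in Flover may have been done differently.

```python
import random


def undersample(samples, label_of, seed=0):
    """Randomly drop samples from larger classes until all classes are equal."""
    rng = random.Random(seed)
    by_class = {}
    for sample in samples:
        by_class.setdefault(label_of(sample), []).append(sample)
    # Every class is cut down to the size of the smallest one.
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy data with a 1:2 foreground/background ratio; the label is the last item.
data = [(i, "fg") for i in range(100)] + [(i, "bg") for i in range(200)]
balanced = undersample(data, label_of=lambda s: s[-1])
```

With a 1:2 ratio this keeps all foreground samples and a random half of the background ones.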
Another necessary transformation is scaling each attribute in the dataset. If we don't do it, the classifier can depend much more on attributes whose scale is larger than that of others. Here, for example, R, G, B color values are much greater than relative segment position values. To prevent this, one could perform linear (min-max) scaling, which maps each attribute to a fixed range, e.g. 0 to 1. It is based on the maximum and minimum of a given attribute over its whole population, so it is not a good idea when we expect outliers: a single extreme value compresses all the others into a narrow band. Let's try a different approach. We can center each value on the calculated mean. The ranges still differ between features, though, so we also divide the result by the standard deviation of the population. This method is called Z-score (standard score) scaling. It is calculated from the following formula:
$$z = \frac{X - \mu}{\sigma}$$
where z is the z-score, X is the value of the element, µ is the population mean, and σ is the standard deviation. I found this explanation to be very comprehensive and concise. This is the method I chose for data normalization in my set.
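Z-score scaling is applied to each feature column separately. Below is a small sketch of it with the standard library only; the function name `zscore_columns` and the sample values are made up for illustration.

```python
from statistics import fmean, pstdev


def zscore_columns(rows):
    """Standardize each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population standard deviation
    scaled = [
        # A constant column (sigma == 0) carries no information; map it to 0.
        tuple((value - mu) / sigma if sigma > 0 else 0.0
              for value, mu, sigma in zip(row, means, stds))
        for row in rows
    ]
    return scaled, means, stds


# Two features on very different scales: a color value and a relative position.
rows = [(200, 0.25), (100, 0.75), (150, 0.50)]
scaled, means, stds = zscore_columns(rows)
```

The function also returns the means and standard deviations, because the same values have to be reused later to scale any new sample passed to the trained classifier.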
It's natural that the more data we have and the more features are defined, the longer and harder it will be for a classifier to converge. If we have a high-dimensional dataset, we should think about selecting proper, uncorrelated features.
"Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other" 
Such redundant features can be detected and removed by correlation analysis, for example. Popular metrics like the Pearson correlation coefficient indicate whether two features are linearly dependent on each other. If so, one of them can be removed, as it provides little information beyond the other. Luckily, my dataset of image superpixels showed low correlation between features. With only 8 features, it is also not a high-dimensional set.
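A correlation check of this kind can be sketched as follows. The 0.9 threshold, the function names and the toy columns are my assumptions, and the code assumes that no column is constant (otherwise the denominator would be zero).

```python
from math import sqrt
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(rows, names, threshold=0.9):
    """List feature pairs whose absolute correlation reaches the threshold."""
    columns = list(zip(*rows))
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            r = pearson(columns[i], columns[j])
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], round(r, 3)))
    return pairs


# Toy data: "b" is roughly 2 * "a", while "c" is unrelated to both.
rows = [(1, 2.1, 5), (2, 3.9, 1), (3, 6.2, 4), (4, 7.8, 2)]
names = ["a", "b", "c"]
redundant = correlated_pairs(rows, names)
```

Here only the pair ("a", "b") is reported, so one of these two features would be a candidate for removal.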
Another well-known technique for dimensionality reduction is Principal Component Analysis (PCA). It transforms the dataset to find new directions along which the variance of the data is highest. These vectors form a new uncorrelated, orthogonal basis for the data. As the vectors (principal components) are ordered from the highest variance to the lowest, many of the lower ones can be rejected, which reduces the number of dimensions. I once did a research project where the dataset could be reduced 32 times by PCA without losing the model's generalization ability.
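For illustration, PCA can be written in a few lines with NumPy by taking the eigenvectors of the covariance matrix. This is a generic textbook sketch on synthetic data, not the code from the research project mentioned above; the function name `pca` and the generated dataset are assumptions.

```python
import numpy as np


def pca(data, n_components):
    """Project data onto its n_components highest-variance directions."""
    centered = data - data.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # Fraction of the total variance kept by each selected component.
    explained = eigenvalues[order] / eigenvalues.sum()
    return centered @ components, components, explained


rng = np.random.default_rng(0)
t = rng.normal(size=500)
# Three features; the second is almost a scaled copy of the first.
data = np.column_stack([
    t,
    2 * t + 0.01 * rng.normal(size=500),
    rng.normal(size=500),
])
projected, components, explained = pca(data, n_components=2)
```

Because two of the three features carry nearly the same signal, two components are enough to keep almost all of the variance in this toy example.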
Regarding the number of data samples, there is no rule of thumb; we have to choose it experimentally. The more complex the model or the noisier the data, the more data is necessary. In the case of my dataset, I found that the SVM model is sufficiently trained on 10 000 vector samples chosen at random from the whole dataset (previously 110k samples). Increasing this number strongly increases the computational time of SVM training. Apart from training samples, we should also reserve some samples for validation and testing, which will be the main topic of the next post.
Basic data preprocessing techniques were presented. Together with feature extraction, they can be called the data preparation process. Stages like data cleaning, transformation and reduction were described using the case from the Flover project, where an SVM classifier is used to segment an image into foreground and background parts.
1. Data preprocessing definition on Techopedia
2. Comprehensive presentation about data preprocessing by Nguyen Hung Son
3. Blog article "How to Prepare Data For Machine Learning"
4. Z-Score method explanation
5. Mark A. Hall - "Correlation-based Feature Selection for Machine Learning"
6. SVM Classification - minimum number of input sets for each class - a StackOverflow discussion
7. Kacmajor T., Michalski J.J., "Principal Component Analysis in Application for Filter Tuning Algorithm"