My Understandings of Data Pre-processing

Jodie Heqi Qiu
Apr 5, 2020 · 3 min read
1. When pre-processing training data and test data, does it make a difference to pre-process before splitting or to split before pre-processing?

The principle here is to make sure that the model does not gain any insight from the test data, or, put another way, that information from the test data does not leak into the training data.

For instance, if it's filling nulls with a fixed value, pre-processing before splitting or splitting before pre-processing makes no difference.

But if it's data scaling, say standard-scaling features, splitting must come before pre-processing: otherwise the test data would affect the computed scale, and if that scale were then used to pre-process the training data, information from the test data would have leaked into the training data.
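A minimal sketch of what this looks like in practice, using scikit-learn and made-up data (the array names, the 80/20 split, and the mean-imputation strategy are my assumptions, not part of the original post). The point is that every statistic is learned from the training split only and then reused on the test split:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature matrix with some missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on the training split only, then apply the same transforms to the test split,
# so no statistic (mean, standard deviation) is ever computed from test data.
imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()
X_train_prep = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_prep = scaler.transform(imputer.transform(X_test))

The same fit-on-train, transform-on-test pattern applies to any pre-processing step whose parameters are estimated from the data.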

2. How to properly pre-process categorical features?

I didn't put much thought into this question until I started working on the Boston house price project, where I found it surprisingly difficult to pre-process so many categorical features. Here is a summary of what I tried:

a. The most common and quick way: get_dummies. The problem is its simplicity: dropping categorical features would throw away valuable information and significantly hurt the model's accuracy, particularly when categorical features make up a large part of the dataset. In such a case, value encoding is a better way (a short code sketch of both options appears after item b below).

b. Value encoding: one-hot encoding.

We know that the data type of categorical features is string. One-hot encoding transforms strings into binary vectors. For instance, neighborhoods x, y, and z can be transformed into:

[1, 0, 0, 0]

[0, 1, 0, 0]

[0, 0, 1, 0]

*the last 0 represents every other neighborhood (an "other" slot)

But be careful with this method. When a feature has many possible values, the vectors become very long, which makes computation slow. In our example, if there were 500 kinds of neighborhood, the vectors would be 501 elements long!
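To make both options concrete, here is a minimal, hypothetical sketch (the Neighborhood column and its values are made up) using pandas' get_dummies and scikit-learn's OneHotEncoder. One difference from the hand-written example above: with handle_unknown='ignore', an unseen neighborhood becomes an all-zero vector rather than getting its own explicit "other" slot.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single categorical Neighborhood column.
train = pd.DataFrame({'Neighborhood': ['x', 'y', 'z', 'x']})
test = pd.DataFrame({'Neighborhood': ['y', 'somewhere_else']})

# Option a: pandas get_dummies (quick, but applied per DataFrame).
print(pd.get_dummies(train, columns=['Neighborhood']))

# Option b: OneHotEncoder fitted on training data, then reused on test data.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['Neighborhood']])
print(encoder.transform(test[['Neighborhood']]).toarray())
# [[0. 1. 0.]
#  [0. 0. 0.]]   <- the unseen neighborhood becomes an all-zero vector

Fitting the encoder once on the training data and reusing it on the test data also keeps the column layout consistent between the two splits, which get_dummies alone does not guarantee.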

3. Check skewness (normality testing)!

I think this could also count as data visualization, since it's common to check a feature's distribution over the sample with a histogram. But it is always better to check by actually calculating the skewness (I use skew from scipy.stats).
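A hypothetical check might look like the snippet below (the data here is simulated with a lognormal distribution, not taken from the actual dataset); a skewness value far above 0 indicates a long right tail:

import numpy as np
from scipy.stats import skew

# Simulated, right-skewed 'prices' standing in for a real feature column.
prices = np.random.lognormal(mean=12, sigma=0.5, size=1000)
print(skew(prices))  # clearly > 0, so the distribution is right-skewed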

The more skewed the data is, the less accurate the model tends to be, because the handful of extreme values ends up dominating the fit. Why does this matter here? Take the sale price in the Boston house price dataset as an example: as the distribution graph shows, most of the prices sit at the lower end.

Using such data to train my model would make the model inaccurate for houses at the high end.

I used log1p to normalize the distribution. The code is below:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
SalePrice_normalized = np.log1p(SalePrice)  # log(1 + SalePrice) compresses the long right tail
sns.distplot(SalePrice_normalized).set(xlabel='SalePrice_normalized')
plt.show()

The result is what I wanted: the distribution of SalePrice_normalized looks much closer to normal.
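One caveat worth adding (not covered in the code above): if the model is trained on the log1p-transformed target, its predictions come out on that same log scale, so they need to be mapped back with np.expm1, the inverse of np.log1p, before being read as prices. A hypothetical one-liner, where predictions_log is a made-up name for the model's raw output:

predicted_prices = np.expm1(predictions_log)  # undo log1p to return to the original price scale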

