### 1. First, build a test dataset

```python
# coding: utf-8
import numpy
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer

def t2():
    # One string-valued column (pet) and two integer-valued columns (age, salary)
    testdata = pd.DataFrame({'pet': ['chinese', 'english', 'english', 'math'],
                             'age': [6, 5, 2, 2],
                             'salary': [7, 5, 2, 5]})
    print(testdata)
    return testdata

# Keep the frame at module level so the snippets in the later sections can use it
testdata = t2()
```

### 2. Handling numeric categorical variables

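`OneHotEncoder` expects 2-D input, so pass the column as a one-column DataFrame, `testdata[['age']]`. The 1-D Series `testdata.age` is not equivalent: recent scikit-learn versions reject it with a `ValueError`. Note also that scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`, and 1.4 removed the old name.

`OneHotEncoder(sparse = False).fit_transform(testdata[['age']])`

The result is: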

```
array([[ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.]])
```
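To make the column layout concrete, here is a minimal pure-Python sketch of the mapping. It assumes, as scikit-learn does for numeric input, that the categories are the sorted unique values of the column. The helper name `one_hot` is hypothetical and is not part of scikit-learn.

```python
def one_hot(values):
    """Return (categories, rows) for a 1-D list of category values."""
    # Assumption: one output column per sorted unique value.
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0  # exactly one "hot" entry per row
        rows.append(row)
    return categories, rows


age = [6, 5, 2, 2]
categories, encoded = one_hot(age)
print(categories)  # [2, 5, 6]
for row in encoded:
    print(row)
```

With `age = [6, 5, 2, 2]`, the value 2 owns the first column, 5 the second, and 6 the third, so the first row (age 6) is hot in the last column.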

```python
import numpy

# Encode each column separately, then stack the two blocks side by side
result1 = OneHotEncoder(sparse=False).fit_transform(testdata[['age']])
result2 = OneHotEncoder(sparse=False).fit_transform(testdata[['salary']])
final_output = numpy.hstack((result1, result2))
print(final_output)
```

Both columns can also be encoded in a single call:

`result = OneHotEncoder(sparse = False).fit_transform(testdata[['age', 'salary']])`

The result is:

```
array([[ 0.,  0.,  1.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  1.,  0.]])
```
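The six output columns are easier to read once you know which input column and category each one stands for. The following pure-Python sketch encodes every column independently and records that layout. It assumes sorted unique values per column, and the helper `one_hot_columns` is a hypothetical name used only for illustration.

```python
def one_hot_columns(columns):
    """One-hot encode each column independently and join the blocks row-wise."""
    n_rows = len(next(iter(columns.values())))
    rows = [[] for _ in range(n_rows)]
    layout = []  # (column name, category) for every output column
    for name, values in columns.items():
        categories = sorted(set(values))  # assumption: sorted unique values
        layout.extend((name, cat) for cat in categories)
        for row, v in zip(rows, values):
            row.extend(1.0 if v == cat else 0.0 for cat in categories)
    return layout, rows


layout, encoded = one_hot_columns({'age': [6, 5, 2, 2], 'salary': [7, 5, 2, 5]})
print(layout)
for row in encoded:
    print(row)
```

The first three output columns belong to `age` (2, 5, 6) and the last three to `salary` (2, 5, 7), which is why stacking the per-column results and encoding both columns at once give the same matrix.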

### 3. Handling string categorical variables

• Method 1: use LabelEncoder() to convert the strings to consecutive integer codes, then binarize them with OneHotEncoder()

• Method 2: binarize directly with LabelBinarizer()

```python
# Method 1: LabelEncoder() + OneHotEncoder()
a = LabelEncoder().fit_transform(testdata['pet'])
OneHotEncoder(sparse=False).fit_transform(a.reshape(-1, 1))  # Note: reshape turns a into a 2-D array

# Method 2: use LabelBinarizer() directly
LabelBinarizer().fit_transform(testdata['pet'])
```

```
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
```
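The two steps of Method 1 can be sketched in plain Python to show what happens to the strings along the way. This is an illustration under the assumption that labels are numbered in sorted order. `label_encode` and `binarize` are hypothetical helper names, not scikit-learn functions.

```python
def label_encode(values):
    """Map each distinct string to an integer code (sorted order assumed)."""
    classes = sorted(set(values))
    codes = [classes.index(v) for v in values]
    return classes, codes


def binarize(codes, n_classes):
    """Turn integer codes into one-hot rows with n_classes columns."""
    return [[1 if c == k else 0 for k in range(n_classes)] for c in codes]


pet = ['chinese', 'english', 'english', 'math']
classes, codes = label_encode(pet)
print(classes)  # ['chinese', 'english', 'math']
print(codes)    # [0, 1, 1, 2]
for row in binarize(codes, len(classes)):
    print(row)
```

Method 2 collapses both steps into one call, which is why the two methods produce the same indicator matrix for `pet`.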