I've been learning sklearn recently and built a model that predicts gender from a name.
It turned out to be quite fun.
Here is the reference link (in English): Working with Text Data.
I followed the steps in that tutorial one by one to build my own model.
The records are read from a csv file using pandas. To run machine learning on text documents, the text content first has to be converted into numerical feature vectors.
A name can be broken into a bag of words, as follows:
Assign a fixed integer id to every word that occurs in any document of the training set (for instance, by building a dictionary from words to integer indices).
For each document #i, count the occurrences of each word w and store that count as the value of feature X[i, j], where j is the index of word w in the dictionary.
A bag of words is usually represented as a high-dimensional sparse matrix.
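The bag-of-words mapping above can be sketched on a tiny toy corpus (the titles below are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the real product titles.
corpus = ["red slip dress", "blue denim jacket", "red denim dress"]

vect = CountVectorizer()
X = vect.fit_transform(corpus)  # high-dimensional sparse matrix

print(vect.vocabulary_)  # word -> fixed integer id
print(X.toarray())       # X[i, j] = count of word j in document i
```

With six distinct words across three documents, `X` has shape (3, 6), and each row holds the word counts for one document.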
In my own case, I split the loaded data into a feature set X and a label set Y.
# step1: prepare data, establish an iterable object to use in CountVectorizer
# word_list, to be used as X
# gender_list, to be used as Y
word_list = []
gender_list = []
df = pd.read_csv('your_filename.csv')
for i in range(len(df)):
    gender = df['Gender'][i]
    title = df['Title'][i]
    if pd.notnull(title):
        if gender == 'men':
            word_list.append(title)
            gender_list.append(0)
        elif gender == 'women':
            word_list.append(title)
            gender_list.append(1)
With the two full data sets ready, use train_test_split
to divide them into four: training features, training labels, test features, and test labels. Since this is supervised learning, the model needs training labels, and the two test sets are used to measure the model's accuracy.
For a more detailed introduction to train_test_split,
see the link in my other blog post: sklearn.train_test_split.
# split data set into train set and test set using train_test_split
# always use 70% for train data, 30% for test data
X_train, X_test, Y_train, Y_test = train_test_split(word_list, gender_list,
                                                    test_size=0.3,
                                                    random_state=42)
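If the two gender classes are imbalanced, `train_test_split` also accepts a `stratify` argument that preserves the class ratio in both splits. A minimal sketch with made-up toy data standing in for `word_list` / `gender_list`:

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 70% class 0 ("men"), 30% class 1 ("women").
titles = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

X_tr, X_te, y_tr, y_te = train_test_split(
    titles, labels, test_size=0.3, random_state=42,
    stratify=labels)  # keep the 70/30 class ratio in both splits

print(sorted(y_tr), sorted(y_te))
```

Without `stratify`, a random split of a skewed data set can leave the rarer class under-represented in the test set.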
Text preprocessing, tokenizing, and stop-word filtering are all included in a high-level component that builds a dictionary of features and transforms documents into feature vectors.
CountVectorizer
supports counts of words or of consecutive characters. Once fitted, the vectorizer has built a dictionary of feature indices.
# step2: tokenize text
count_vect = CountVectorizer()
X_train_sparse_matrix = count_vect.fit_transform(X_train)
X_test_sparse_matrix = count_vect.transform(X_test)  # reuse the training vocabulary
dense_numpy_matrix = X_train_sparse_matrix.todense()
Counting words is useful, but it has a problem: even when they discuss the same topics, longer documents have higher average counts than shorter ones. The counts therefore need to be converted into frequencies.
We first use the fit
method to fit our estimator to the data, and then the transform
method to convert the count matrix to a tf-idf
representation. The two steps can be combined with fit_transform,
which skips the redundant processing and reaches the same final result faster.
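What the tf-idf step does can be seen on a small hand-made count matrix (the numbers below are made up, standing in for `X_train_sparse_matrix`):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 documents x 4 words (hypothetical counts).
counts = np.array([[3, 0, 1, 0],
                   [2, 0, 0, 1],
                   [3, 0, 0, 1]])

tfidf = TfidfTransformer()
X = tfidf.fit_transform(counts)  # fit + transform in one step

print(X.toarray().round(2))
# Each row is L2-normalised, so long documents no longer dominate:
print(np.linalg.norm(X.toarray(), axis=1))
```

By default `TfidfTransformer` normalises each row to unit length, which is exactly what removes the bias toward longer documents described above.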
# step3: get frequencies (features)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_sparse_matrix)
X_test_tfidf = tfidf_transformer.transform(X_test_sparse_matrix)  # transform only: reuse idf learned from the training set
We start with a naive Bayes
classifier.
# step4: training a classifier (after having features now)
clf = MultinomialNB().fit(X_train_tfidf, Y_train)
Later this is swapped for other methods, such as an SVM.
predicted = clf.predict(X_test_tfidf)
print(predicted)
accuracy = np.mean(predicted == Y_test)
print(accuracy)
The final output is:
[0 0 1 ... 1 1 1]
4621
0.9091105821250811
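Accuracy alone can be misleading when the two classes are imbalanced, so it is worth also looking at per-class precision and recall via `sklearn.metrics`. A minimal sketch with hypothetical predictions standing in for `predicted` / `Y_test`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels and predictions for illustration.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 0, 0])

# Same value as np.mean(y_pred == y_true):
print(accuracy_score(y_true, y_pred))

# Precision, recall and f1-score per class:
print(classification_report(y_true, y_pred, target_names=['men', 'women']))
```

The per-class report would reveal, for instance, a model that scores high accuracy simply by always predicting the majority class.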
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import numpy as np

if __name__ == '__main__':
    # step1: prepare data, establish an iterable object to use in CountVectorizer
    # word_list, to be used as X
    # gender_list, to be used as Y
    word_list = []
    gender_list = []
    df = pd.read_csv('your_filename.csv')
    for i in range(len(df)):
        gender = df['Gender'][i]
        title = df['Title'][i]
        if pd.notnull(title):
            if gender == 'men':
                word_list.append(title)
                gender_list.append(0)
            elif gender == 'women':
                word_list.append(title)
                gender_list.append(1)
    # split data set into train set and test set using train_test_split
    # always use 70% for train data, 30% for test data
    X_train, X_test, Y_train, Y_test = train_test_split(word_list, gender_list,
                                                        test_size=0.3,
                                                        random_state=42)
    # step2: tokenize text
    count_vect = CountVectorizer()
    X_train_sparse_matrix = count_vect.fit_transform(X_train)
    X_test_sparse_matrix = count_vect.transform(X_test)
    # step3: get frequencies (features)
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_sparse_matrix)
    X_test_tfidf = tfidf_transformer.transform(X_test_sparse_matrix)
    # step4: training a classifier (after having features now)
    clf = MultinomialNB().fit(X_train_tfidf, Y_train)
    predicted = clf.predict(X_test_tfidf)
    print(predicted)
    print(len(predicted))
    accuracy = np.mean(predicted == Y_test)
    print(accuracy)
sklearn
provides Pipeline
to simplify the process above.
To make the vectorizer => transformer => classifier
sequence easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier.
# step2: using pipeline as a compound classifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42,
                                           max_iter=5, tol=None)),
                     ])
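One convenience of the pipeline is that it accepts raw strings directly, so the manual vectorize and tf-idf steps disappear. A minimal end-to-end sketch with made-up toy titles standing in for the real csv data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical training data for illustration only.
X_train = ['slim fit shirt', 'denim jacket', 'floral dress',
           'lace skirt', 'cargo shorts', 'maxi dress']
Y_train = [0, 0, 1, 1, 0, 1]  # 0 = men, 1 = women

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42,
                                           max_iter=5, tol=None))])

# fit and predict both take raw strings; the pipeline vectorises internally.
text_clf.fit(X_train, Y_train)
print(text_clf.predict(['floral dress', 'denim jacket']))
```

Each step's output feeds the next step's input, which also guarantees the test data is transformed with exactly the vocabulary and idf weights learned during fitting.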
SVM stands for support vector machine. This model is a little slower than naive Bayes but more accurate.
The SGDClassifier
above, with loss='hinge', trains a linear SVM.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
import numpy as np

if __name__ == '__main__':
    # step1: prepare data, establish an iterable object
    # word_list, to be used as X
    # gender_list, to be used as Y
    word_list = []
    gender_list = []
    df = pd.read_csv('your_filename.csv')
    for i in range(len(df)):
        gender = df['Gender'][i]
        title = df['Title'][i]
        if pd.notnull(title):
            if gender == 'men':
                word_list.append(title)
                gender_list.append(0)
            elif gender == 'women':
                word_list.append(title)
                gender_list.append(1)
    # split data set into train set and test set using train_test_split
    # always use 70% for train data, 30% for test data
    X_train, X_test, Y_train, Y_test = train_test_split(word_list, gender_list,
                                                        test_size=0.3,
                                                        random_state=42)
    # step2: using pipeline as a compound classifier
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, random_state=42,
                                               max_iter=5, tol=None)),
                         ])
    # train data
    text_clf.fit(X_train, Y_train)
    # test: the pipeline accepts the raw strings directly
    predicted = text_clf.predict(X_test)
    accuracy = np.mean(predicted == Y_test)
    print(predicted)
    print(accuracy)
Still learning…
Note: the training data I used is all English text.