Learning sklearn: Training a Gender-Classification Model


I have been learning sklearn recently and built a model that predicts gender from product names.
It turned out to be quite fun.
Here is the reference tutorial (in English): Working with Text Data

I followed the steps in that tutorial to build my own model.

Version1

1. Preparing the training data

Read the records from the CSV file using pandas. To run machine learning on text documents, we first need to turn the text content into numerical feature vectors.
The names can be represented as a bag of words, like this:

Assign a fixed integer id to every word that occurs in any document of the training set (for example, by building a dictionary that maps words to integer indices).
For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where j is the index of word w in the dictionary.

Bags of words are typically represented as high-dimensional sparse matrices.

In my own case, I split the loaded data into a feature set X and a label set Y.
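The bag-of-words mapping can be seen on a tiny made-up example (the three titles here are hypothetical, not from the real dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# three hypothetical product titles
docs = ["red slip dress", "leather belt", "red leather boots"]

vect = CountVectorizer()
X = vect.fit_transform(docs)     # high-dimensional sparse count matrix

print(sorted(vect.vocabulary_))  # each word gets a fixed integer id
print(X.shape)                   # (3 documents, 6 distinct words)
print(X.toarray())               # X[i, j] = count of word j in document i
```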

    # step1: prepare data, establish an iterable object,to use in CountVectorizer
    # word_list, to be used as X
    # gender_list, to be used as Y

    word_list = []
    gender_list = []

    df = pd.read_csv('your_filename.csv')  # read_csv already returns a DataFrame

    for i in range(len(df)):
        gender = df['Gender'][i]
        title = df['Title'][i]
        if pd.isnull(title):
            continue
        if gender == 'men':
            word_list.append(title)
            gender_list.append(0)
        elif gender == 'women':
            word_list.append(title)
            gender_list.append(1)

With the two big datasets ready, use train_test_split to divide them into four: training features, training labels, test features, and test labels. Since this is supervised learning, we need the training labels, while the two test sets are used to evaluate the model's accuracy.
For a more detailed introduction to train_test_split, see my other blog post: sklearn.train_test_split

    # split data set into train set and test set using train_test_split
    # always use 70% for train data, 30% for test data
    X_train, X_test, Y_train, Y_test = train_test_split(word_list, gender_list,
                                                        test_size=0.3,
                                                        random_state=42)

2. Tokenizing the text with sklearn

Text preprocessing, tokenization, and stop-word filtering are all bundled into a high-level component that builds a dictionary of features and transforms documents into feature vectors.
CountVectorizer supports counts of words or of consecutive characters. Once fitted, the vectorizer has built a dictionary of feature indices.

    # step2: tokenize text
    count_vect = CountVectorizer()
    X_train_sparse_matrix = count_vect.fit_transform(X_train)
    X_test_sparse_matrix = count_vect.transform(X_test)
    dense_numpy_matrix = X_train_sparse_matrix.todense()
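One detail worth noting in the snippet above: the vectorizer is fitted only on the training set, and calling transform on the test set reuses that same vocabulary, silently ignoring words never seen during training. A small sketch with hypothetical titles:

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
train = ["blue denim jacket", "floral summer dress"]

X_train = count_vect.fit_transform(train)            # learns the vocabulary
X_test = count_vect.transform(["silk denim dress"])  # 'silk' is unseen, so it is dropped

print(sorted(count_vect.vocabulary_))
print(X_test.toarray())  # counts only for words known from training
```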

3. Computing word frequencies

Counting words is useful but has a problem: longer documents have higher average counts than shorter ones, even when they talk about the same topic. The counts therefore need to be converted into frequencies.

We first use the fit method to fit our estimator to the data, and then the transform method to convert the count matrix into a tf-idf representation. The two steps can be combined, skipping the redundant processing and reaching the same result faster, by calling fit_transform.

    # step3: get frequencies (features)
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_sparse_matrix)
    X_test_tfidf = tfidf_transformer.transform(X_test_sparse_matrix)
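What the tf-idf transform actually does can be checked on a toy count matrix: each row is rescaled so that document length stops mattering (rows are L2-normalized by default). The titles here are made up:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["red red red dress", "red belt"]  # hypothetical titles of different lengths
counts = CountVectorizer().fit_transform(docs)

tfidf = TfidfTransformer().fit_transform(counts)

print(counts.toarray())                          # raw counts depend on document length
print(np.linalg.norm(tfidf.toarray(), axis=1))   # every tf-idf row has unit length
```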

4. Training a classifier

Here I start with a naive Bayes classifier.

    # step4: training a classifier (after having features now)
    clf = MultinomialNB().fit(X_train_tfidf, Y_train)

Later on this is swapped for other methods, such as an SVM.

5. Prediction and evaluation

    predicted = clf.predict(X_test_tfidf)
    print(predicted)
    accuracy = np.mean(predicted == Y_test)
    print(accuracy)

The final output is:
[0 0 1 ... 1 1 1]
4621
0.9091105821250811
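Beyond mean accuracy, scikit-learn's metrics module can report per-class precision and recall, which is useful when the two classes are imbalanced. A sketch with hypothetical labels (0 = men, 1 = women):

```python
from sklearn.metrics import accuracy_score, classification_report

Y_test = [0, 1, 1, 0, 1]      # hypothetical true labels
predicted = [0, 1, 0, 0, 1]   # hypothetical predictions

print(accuracy_score(Y_test, predicted))  # same number as np.mean(predicted == Y_test)
print(classification_report(Y_test, predicted, target_names=['men', 'women']))
```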

6. Full code for Version 1

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import numpy as np


if __name__ == '__main__':
    # step1: prepare data, establish an iterable object,to use in CountVectorizer
    # word_list, to be used as X
    # gender_list, to be used as Y

    word_list = []
    gender_list = []

    df = pd.read_csv('your_filename.csv')  # read_csv already returns a DataFrame

    for i in range(len(df)):
        gender = df['Gender'][i]
        title = df['Title'][i]
        if pd.isnull(title):
            continue
        if gender == 'men':
            word_list.append(title)
            gender_list.append(0)
        elif gender == 'women':
            word_list.append(title)
            gender_list.append(1)

    # split data set into train set and test set using train_test_split
    # always use 70% for train data, 30% for test data
    X_train, X_test, Y_train, Y_test = train_test_split(word_list, gender_list,
                                                        test_size=0.3,
                                                        random_state=42)


    # step2: tokenize text
    count_vect = CountVectorizer()
    X_train_sparse_matrix = count_vect.fit_transform(X_train)
    X_test_sparse_matrix = count_vect.transform(X_test)
    dense_numpy_matrix = X_train_sparse_matrix.todense()
    # print(dense_numpy_matrix)
    # print(count_vect.vocabulary_)
    # print(X_train_sparse_matrix.shape) #(9,13) 9 sentences, 13 words
    # print(count_vect.vocabulary_.get(u'slip')) # integer id of 'slip' in the vocabulary

    # step3: get frequencies (features)
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_sparse_matrix)
    X_test_tfidf = tfidf_transformer.transform(X_test_sparse_matrix)
    # print(X_train_tfidf,'\n','\n')
    #
    # print(X_test_tfidf)
    # print(X_train_tfidf)#.toarray())
    # print(type(X_train_tfidf))
    # print(X_train_tfidf.shape)

    # step4: training a classifier (after having features now)
    clf = MultinomialNB().fit(X_train_tfidf, Y_train)

    predicted = clf.predict(X_test_tfidf)
    print(predicted)
    print(len(predicted))
    accuracy = np.mean(predicted == Y_test)
    print(accuracy)
    # for doc,category in zip(word_list,predicted):
    # print(('{}=>{}').format(doc,word_list[category]))

Version 2

1. Simplifying the process with Pipeline

sklearn provides Pipeline to simplify the steps above.
To make the vectorizer => transformer => classifier chain easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier.

    # step2: using pipeline as a compound classifier
    text_clf = Pipeline([('vect',CountVectorizer()),
                         ('tfidf',TfidfTransformer()),
                         ('clf',SGDClassifier(loss='hinge',penalty='l2',
                                              alpha=1e-3,random_state=42,
                                              max_iter=5,tol=None)),
                         ])

2. Using an SVM model

SVM stands for support vector machine. It is a bit slower than naive Bayes but gives higher accuracy.
The SGDClassifier above, with loss='hinge' and penalty='l2', is exactly a linear SVM trained with stochastic gradient descent.
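As a quick sanity check, the same pipeline can be fitted on a few hypothetical titles and asked to classify new ones (all titles and labels below are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# hypothetical training titles, 0 = men, 1 = women
titles = ["leather wallet", "classic oxford shoes", "slim fit chinos",
          "floral summer dress", "pleated midi skirt", "lace blouse"]
labels = [0, 0, 0, 1, 1, 1]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42,
                                           max_iter=5, tol=None))])
text_clf.fit(titles, labels)

# the pipeline applies vectorizer => transformer => classifier in one call
print(text_clf.predict(["floral dress", "leather shoes"]))
```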

3. Full code for Version 2

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
import numpy as np

if __name__ == '__main__':
    # step1: prepare data, establish an iterable object
    # word_list, to be used as X
    # gender_list, to be used as Y
    word_list = []
    gender_list = []

    df = pd.read_csv('your_filename.csv')  # read_csv already returns a DataFrame

    for i in range(len(df)):
        gender = df['Gender'][i]
        title = df['Title'][i]
        if pd.isnull(title):
            continue
        if gender == 'men':
            word_list.append(title)
            gender_list.append(0)
        elif gender == 'women':
            word_list.append(title)
            gender_list.append(1)

    # split data set into train set and test set using train_test_split
    # always use 70% for train data, 30% for test data
    X_train,X_test,Y_train,Y_test = train_test_split(word_list,gender_list,
                                                     test_size=0.3,
                                                     random_state=42)


    # step2: using pipeline as a compound classifier
    text_clf = Pipeline([('vect',CountVectorizer()),
                         ('tfidf',TfidfTransformer()),
                         ('clf',SGDClassifier(loss='hinge',penalty='l2',
                                              alpha=1e-3,random_state=42,
                                              max_iter=5,tol=None)),
                         ])

    # train data
    text_clf.fit(X_train,Y_train)

    # test (the titles are already strings, so they can go straight into the pipeline)
    predicted = text_clf.predict(X_test)

    accuracy = np.mean(predicted == Y_test)
    print(predicted)
    print(accuracy)

Still learning…


Note: the training data used here is entirely English-language material.

