Building a Logistic Regression Model with Spark to Help Helen Find a Boyfriend


Notice: All rights reserved. To repost, please contact the author and cite the source: http://blog.csdn.net/u013719780?viewmode=contents


About the author: 風雪夜歸子 (Allen), a machine learning algorithm engineer who enjoys digging into Machine Learning techniques and has a strong interest in Deep Learning and Artificial Intelligence. He regularly follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning and Artificial Intelligence are welcome to get in touch. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents




Suppose Helen has been using an online dating site to look for a suitable partner. Although the site keeps recommending candidates, she has not found anyone she likes. After some reflection, she realized that the people she has dated fall into three types:

□ people she didn't like
□ people of average charm
□ people of great charm

Despite spotting this pattern, Helen still cannot sort the site's recommended matches into the right category. She feels she could date people of average charm on weekdays, while on weekends she would rather spend time with the highly charming ones. Helen hopes our classification algorithm can help her assign each match to the correct category. She has also collected some data that the dating site never recorded, which she believes will make the classification more accurate.



Helen has been collecting dating data for some time. The data is stored in the text file datingTestSet, one sample per line, 1000 lines in total. Each sample has the following three features:
□ frequent flyer miles earned per year
□ percentage of time spent playing video games
□ litres of ice cream consumed per week


 
         
file_content = sc.textFile('/Users/youwei.tan/Desktop/datingTestSet.txt')
df = file_content.map(lambda x: x.split('\t'))
df.take(2)


The output is as follows:


[[u'40920', u'8.326976', u'0.953952', u'largeDoses'],
[u'14488', u'7.153469', u'1.673904', u'smallDoses']]
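The tab-splitting step can be sketched in plain Python. The two sample lines below are copied from the output above; the rest of the file follows the same four-column, tab-separated format:

```python
# Two sample lines in the datingTestSet.txt format:
# miles flown \t video game time % \t ice cream litres \t label
lines = [
    "40920\t8.326976\t0.953952\tlargeDoses",
    "14488\t7.153469\t1.673904\tsmallDoses",
]

# Equivalent of file_content.map(lambda x: x.split('\t')) on these lines
rows = [line.split('\t') for line in lines]
print(rows[0])  # ['40920', '8.326976', '0.953952', 'largeDoses']
```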



Next, convert the dataset into a DataFrame, as follows:


 
         
 
         
dataset = sqlContext.createDataFrame(df, ['Mileage ', 'Gametime', 'Icecream', 'label'])
dataset.show(5, False)
dataset.printSchema  # note: without (), this displays the bound method rather than printing the schema


The output is as follows:

+--------+---------+--------+----------+
|Mileage |Gametime |Icecream|label |
+--------+---------+--------+----------+
|40920 |8.326976 |0.953952|largeDoses|
|14488 |7.153469 |1.673904|smallDoses|
|26052 |1.441871 |0.805124|didntLike |
|75136 |13.147394|0.428964|didntLike |
|38344 |1.669788 |0.134296|didntLike |
+--------+---------+--------+----------+
only showing top 5 rows


<bound method DataFrame.printSchema of DataFrame[Mileage : string, Gametime: string, Icecream: string, label: string]>


Build an index dictionary for the labels, so that the string-valued label can be converted into a numeric one.






 
         
label_set = dataset.map(lambda x: x[3]).distinct().collect()
label_dict = dict()
i = 0
for key in label_set:
    if key not in label_dict.keys():
        label_dict[key] = i
        i = i + 1
label_dict


Output:


{u'didntLike': 0, u'largeDoses': 1, u'smallDoses': 2}
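Once the distinct labels are collected, the loop above is plain Python. A self-contained sketch, with the label order fixed by hand since `distinct().collect()` makes no ordering guarantee:

```python
# Distinct labels as collected from the dataset; this order is an assumption,
# distinct().collect() may return them in any order
label_set = ['didntLike', 'largeDoses', 'smallDoses']

label_dict = dict()
i = 0
for key in label_set:
    if key not in label_dict:
        label_dict[key] = i
        i = i + 1

print(label_dict)  # {'didntLike': 0, 'largeDoses': 1, 'smallDoses': 2}
```

The same mapping could be built in one line with `dict((k, i) for i, k in enumerate(label_set))`.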



The dataset's columns are currently all strings; they need to be converted to numeric types. The code is as follows:


 
         
 
         
data = dataset.map(lambda x: ([x[i] for i in range(3)], label_dict[x[3]])).\
              map(lambda (x, y): [int(x[0]), float(x[1]), float(x[2]), y])
data = sqlContext.createDataFrame(data, ['Mileage ', 'Gametime', 'Icecream', 'label'])
data.show(5, False)
data.printSchema
# data.selectExpr('Mileage', 'Gametime', 'Icecream', 'label').show()


Output:


+--------+---------+--------+-----+
|Mileage |Gametime |Icecream|label|
+--------+---------+--------+-----+
|40920 |8.326976 |0.953952|1 |
|14488 |7.153469 |1.673904|2 |
|26052 |1.441871 |0.805124|0 |
|75136 |13.147394|0.428964|0 |
|38344 |1.669788 |0.134296|0 |
+--------+---------+--------+-----+
only showing top 5 rows


<bound method DataFrame.printSchema of DataFrame[Mileage : bigint, Gametime: double, Icecream: double, label: bigint]>
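For a single row, the conversion above is ordinary Python casting plus a dictionary lookup:

```python
label_dict = {'didntLike': 0, 'largeDoses': 1, 'smallDoses': 2}

# One string row as produced by the earlier tab-split step
row = ['40920', '8.326976', '0.953952', 'largeDoses']

# Cast the three features to numbers and look up the numeric label
typed = [int(row[0]), float(row[1]), float(row[2]), label_dict[row[3]]]
print(typed)  # [40920, 8.326976, 0.953952, 1]
```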

The dataset is now in the shape we need, so the next step is building the model. Before fitting, I first standardize the features, then reduce the dimensionality with principal component analysis (PCA), and finally use a logistic regression model for classification and probability prediction. The code is as follows:
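As a reference for what the standardization stage does with `withStd=True, withMean=False`: each feature column is divided by its standard deviation, and the mean is left untouched. A minimal sketch on a hypothetical column (Spark's StandardScaler uses the unbiased sample standard deviation, i.e. division by n - 1):

```python
# Hypothetical feature column (values loosely modelled on the Gametime column)
values = [8.326976, 7.153469, 1.441871, 13.147394, 1.669788]
n = len(values)

mean = sum(values) / n
# unbiased sample standard deviation (divide by n - 1)
std = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5

# withMean=False: only rescale, do not centre
scaled = [v / std for v in values]
```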



 
         
from __future__ import print_function
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import PCA, StandardScaler
from pyspark.mllib.linalg import Vectors

# Merge class 2 into class 1, i.e. Helen's impression of a match is either
# "charming" or "not charming". We merge because
# pyspark.ml.classification.LogisticRegression currently supports only
# binary classification.
feature_data = data.map(lambda x: (Vectors.dense([x[i] for i in range(0, 3)]),
                                   float(1 if x[3] == 2 else x[3])))
feature_data = sqlContext.createDataFrame(feature_data, ['features', 'labels'])
# feature_data.show()

train_data, test_data = feature_data.randomSplit([0.7, 0.3], 6)
# train_data.show()

scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures',
                        withStd=True, withMean=False)
pca = PCA(k=2, inputCol='scaledFeatures', outputCol='pcaFeatures')
lr = LogisticRegression(maxIter=10, featuresCol='pcaFeatures', labelCol='labels')
pipeline = Pipeline(stages=[scaler, pca, lr])

Model = pipeline.fit(train_data)
results = Model.transform(test_data)
# note: 'prediction' is selected twice here; selecting 'labels' instead of the
# second 'prediction' would show predictions next to the ground truth
results.select('probability', 'prediction', 'prediction').show(truncate=False)


The output is as follows:

 
+----------------------------------------+----------+----------+
|probability                             |prediction|prediction|
+----------------------------------------+----------+----------+
|[0.22285193760551922,0.7771480623944808]|1.0       |1.0       |
|[0.19145196324973038,0.8085480367502696]|1.0       |1.0       |
|[0.25815968118089555,0.7418403188191045]|1.0       |1.0       |
|[0.1904557879847662,0.8095442120152337] |1.0       |1.0       |
|[0.23649048307318044,0.7635095169268196]|1.0       |1.0       |
|[0.19581773456064858,0.8041822654393515]|1.0       |1.0       |
|[0.17595295700627253,0.8240470429937274]|1.0       |1.0       |
|[0.2693008979176928,0.7306991020823073] |1.0       |1.0       |
|[0.19489995345665115,0.8051000465433488]|1.0       |1.0       |
|[0.2790706794240234,0.7209293205759766] |1.0       |1.0       |
|[0.2074274685125254,0.7925725314874746] |1.0       |1.0       |
|[0.2225838179162865,0.7774161820837134] |1.0       |1.0       |
|[0.23520083542636305,0.764799164573637] |1.0       |1.0       |
|[0.16390109775004727,0.8360989022499528]|1.0       |1.0       |
|[0.2032817412585787,0.7967182587414213] |1.0       |1.0       |
|[0.22397459472064782,0.7760254052793522]|1.0       |1.0       |
|[0.1987896145632484,0.8012103854367516] |1.0       |1.0       |
|[0.18503543175783838,0.8149645682421617]|1.0       |1.0       |
|[0.30849060803324585,0.6915093919667542]|1.0       |1.0       |
|[0.2472540013472057,0.7527459986527943] |1.0       |1.0       |
+----------------------------------------+----------+----------+
only showing top 20 rows
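The probability column holds [P(class 0), P(class 1)]. Logistic regression obtains P(class 1) by passing the linear score (the margin) through the sigmoid, so the two entries always sum to 1. A small sketch with a made-up margin:

```python
import math

def sigmoid(m):
    # maps a raw linear score to the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-m))

margin = 1.25          # hypothetical linear score w.x + b
p1 = sigmoid(margin)   # P(class 1)
p0 = 1.0 - p1          # P(class 0)
```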



Finally, a simple evaluation of the model:


 
         
 
         
from pyspark.mllib.evaluation import MulticlassMetrics

# note: x[1] and x[2] are both the 'prediction' column, so the confusion
# matrix below compares the predictions with themselves and is necessarily
# diagonal; selecting 'labels' instead of the second 'prediction' would
# evaluate against the ground truth
predictionAndLabels = results.select('probability', 'prediction', 'prediction').map(lambda x: (x[1], x[2]))
metrics = MulticlassMetrics(predictionAndLabels)
metrics.confusionMatrix().toArray()


Output:


array([[  40.,    0.],
       [   0.,  257.]])
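Overall accuracy can be read straight off a confusion matrix of this shape (rows index one label sequence, columns the other; agreeing pairs fall on the diagonal). Keep in mind that the snippet above paired the 'prediction' column with itself, so a perfectly diagonal matrix is guaranteed here; pairing 'prediction' with 'labels' would give the real test accuracy.

```python
# Confusion matrix as printed above
cm = [[40.0, 0.0],
      [0.0, 257.0]]

total = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1]   # diagonal = agreeing pairs
accuracy = correct / total
print(accuracy)  # 1.0
```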




