TF-IDF Values and Text Vectorization


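Using the extracted feature words, compute each feature's value, i.e. its TF-IDF weight. Each document is represented as a vector with the vector space model (VSM), and the documents are then written out in the .arff format that WEKA can process.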
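Before the full script, a minimal sketch of the weighting may help. It assumes the same formula the script uses: the term count divided by the highest term count in the document, multiplied by a base-10 IDF. The function names `tf_idf` and `vectorize` and the toy numbers are hypothetical and chosen only for illustration.

```python
import math


def tf_idf(term_count, max_count, doc_freq, n_docs):
    """Max-normalized term frequency times base-10 inverse document frequency."""
    tf = float(term_count) / float(max_count)
    idf = math.log(float(n_docs) / float(doc_freq), 10)
    return tf * idf


def vectorize(doc_counts, max_count, feature_words, doc_freqs, n_docs):
    """Return one TF-IDF weight per feature word, in feature-list order.

    Words that do not occur in the document get a count of 0.
    """
    return [
        tf_idf(doc_counts.get(word, 0), max_count, doc_freqs[word], n_docs)
        for word in feature_words
    ]


# Toy data (hypothetical): 1000 documents, two feature words.
features = ["apple", "bank"]
doc_freqs = {"apple": 10, "bank": 100}

# A document where "apple" occurs 5 times and the most frequent word 10 times.
vec = vectorize({"apple": 5}, 10, features, doc_freqs, 1000)
print(vec)
```

Every document ends up with a vector of the same length, one position per feature word, which is what lets the vectors be written as rows of a single table.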

Here is the code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import math

# Feature word lists
feture_word = []        # feature words, in file order
feture_word_dic = {}    # document frequency (DF) of each feature word
feture_word_dic2 = {}   # IDF of each feature word

# Each line of ni.txt is read as: <feature word> <DF>.
# 4205 is presumably the total number of documents in the corpus.
f = codecs.open('/Users/Administrator/Desktop/ni.txt', 'r', encoding='utf-8')
for line in f:
    line = line.split()
    IDF = math.log(4205 / float(line[1]), 10)
    feture_word.append(line[0])
    feture_word_dic[line[0]] = line[1]
    feture_word_dic2[line[0]] = IDF
f.close()

# Read the word-frequency file of every document:
# category folders 1..9 (j), files 10.txt..509.txt (i)
alltext = []
for j in range(1, 10):
    for i in range(10, 510):
        dic = {}
        try:
            with codecs.open('/Users/Administrator/Desktop/wordsfrequence2/%d/%d.txt' % (j, i), 'r', encoding='utf-8') as f:
                # The second field of the first line is used as tmax,
                # the value every term count is divided by
                p = f.readline().split()
                dic['【tmax】'] = p[1]
                # Remaining lines: <word> <count>
                for line in f:
                    line = line.split()
                    dic[line[0]] = line[1]
            dic['【type】'] = j
            alltext.append(dic)
        except (IOError, IndexError, ValueError):
            # Missing or malformed file: report it and move on
            print(u'Problem document %d %d' % (j, i))
            continue

# Build one TF-IDF vector per document; the class label is the last element
alltext_vector = []
for dic in alltext:
    vector = []
    for word in feture_word:
        t = dic.get(word, 0)
        tf_idf = (float(t) / float(dic['【tmax】'])) * feture_word_dic2[word]
        vector.append(tf_idf)
    vector.append(dic['【type】'])
    alltext_vector.append(vector)

# Write the vectors in .arff format ('w' truncates any existing file).
# The file is opened once; the original reopened it for every value.
data = codecs.open('/Users/Administrator/Desktop/data.arff', 'w', encoding='utf-8')
data.write(u'@relation sougoucorpus\n\n')
for everyword in feture_word:
    data.write(u'@attribute ' + everyword + u' numeric\n')
data.write(u'@attribute 【type】 {1,2,3,4,5,6,7,8,9}\n\n@data\n')
for vector in alltext_vector:
    for value in vector[:-1]:
        data.write(str(value) + ',')
    data.write(str(vector[-1]) + '\n')
data.close()
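To show the shape of the file the script produces, here is a small sketch that builds ARFF text in memory instead of writing to disk. The helper name `to_arff`, the attribute name `class`, and the sample rows are hypothetical. The layout follows the same pattern as the script: a `@relation` line, one `@attribute ... numeric` line per feature, a nominal class attribute, then comma-separated rows under `@data`.

```python
def to_arff(relation, feature_names, class_labels, rows):
    """Build ARFF text: numeric features plus a nominal class as the last column."""
    lines = ["@relation %s" % relation, ""]
    for name in feature_names:
        lines.append("@attribute %s numeric" % name)
    labels = ",".join(str(label) for label in class_labels)
    lines.append("@attribute class {%s}" % labels)
    lines.append("")
    lines.append("@data")
    for row in rows:
        # Each row holds the feature values followed by the class label.
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Two toy documents with two features each and class labels 1 and 2.
arff_text = to_arff(
    "demo",
    ["w1", "w2"],
    [1, 2],
    [[0.5, 0.0, 1], [0.0, 0.25, 2]],
)
print(arff_text)
```

This sketch does no quoting, so it assumes attribute names contain no spaces or commas. Feature words with such characters would need to be quoted or cleaned before being written as attribute names.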

Result of the run: (screenshot not preserved)


