Decomposing Korean (Hangul) text in Python (based on Python 3.4)


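    Let's get straight to the point. One of my teachers needed to split Korean text into its component letters with Python for his research and asked me to write the program below for him. I am sharing it here.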

    First, see the article "Unicode range of Korean and its application" (韩文的unicode范围及应用).

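    Second, the method in that article only works for modern Korean; very old Korean characters cannot be split that way. The decompositions of old Korean characters are therefore kept separately in a txt file. Each line holds one old Korean character followed by whitespace and its decomposition.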
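    The file format described above can be parsed with a few lines of Python. This is only a minimal sketch: the function name `load_decompositions` is hypothetical, and the sample entries are placeholders, not real old-Korean data.

```python
import io


def load_decompositions(lines):
    """Build {character: decomposition} from 'char<whitespace>parts' lines."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        character, parts = fields
        table[character] = parts
    return table


# Placeholder entries (assumption): "X" and "Y" stand in for old Korean characters.
sample_file = io.StringIO("X abc\n\nY de\n")
print(load_decompositions(sample_file))  # {'X': 'abc', 'Y': 'de'}
```

    With a real file, the same function can be given the object returned by `open(path, encoding="utf-8")`.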

    

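    The program works as follows:

        1. Read every old Korean character and its decomposition from the old-Korean txt file and store them in a dictionary.

           The file is opened as utf-8 because that is the encoding I used when I saved the old Korean characters and their decompositions. The code for this step is the dictionary-loading loop in the full listing.

        2. Read all the utf-16 encoded txt files in a folder.

        3. Read a file line by line and process each character of the line.

            If the character is old Korean, look it up in the dictionary built in step 1 and return its decomposition. If it is a modern Korean syllable, decompose it with the arithmetic from the reference article http://www.ch2ko.com/hanguoyu/hanwen-unicode/. Any other character is returned unchanged.

        4. Append the decomposition of each character to a temporary line temp_line (initially empty). When a line is finished, append temp_line to the per-file string temp_file_string (initially empty).

        5. When a file is finished, save the result as a new file in which each decomposed line is followed by the corresponding original line.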
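    Steps 3 to 5 can be sketched on an in-memory string as follows. This is an illustration under assumptions rather than the program itself: the names `decompose_char` and `decompose_text` are hypothetical, the jamo tables are copied from the listing, and an empty dictionary stands in for the old-Korean table.

```python
HANGUL_BASE = 0xAC00  # 44032, the first modern syllable
HANGUL_LAST = 0xD7A3  # the last modern syllable

LEADS = ("ㄱ", "ㄲ", "ㄴ", "ㄷ", "ㄸ", "ㄹ", "ㅁ", "ㅂ", "ㅃ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅉ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ")
VOWELS = ("ㅏ", "ㅐ", "ㅑ", "ㅒ", "ㅓ", "ㅔ", "ㅕ", "ㅖ", "ㅗ", "ㅗㅏ", "ㅗㅐ", "ㅗㅣ", "ㅛ", "ㅜ", "ㅜㅓ", "ㅜㅔ", "ㅜㅣ", "ㅠ", "ㅡ", "ㅡㅣ", "ㅣ")
TAILS = ("", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ")


def decompose_char(ch, old_table):
    """Step 3: old Korean via lookup, modern syllables via arithmetic."""
    if ch in old_table:
        return old_table[ch]
    if HANGUL_BASE <= ord(ch) <= HANGUL_LAST:
        # 588 = 21 vowels * 28 tails; 28 = number of tails (including none)
        lead, rest = divmod(ord(ch) - HANGUL_BASE, 588)
        vowel, tail = divmod(rest, 28)
        return LEADS[lead] + VOWELS[vowel] + TAILS[tail]
    return ch  # any other character is returned unchanged


def decompose_text(text, old_table):
    """Steps 4 and 5: emit each decomposed line followed by its original."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are skipped, as in the program
        out.append("".join(decompose_char(c, old_table) for c in line))
        out.append(line)
    return "".join(item + "\n" for item in out)


print(decompose_char("한", {}))  # ㅎㅏㄴ
print(decompose_text("한글 ok\n", {}), end="")
```

    For "한" (code point 54620) the offset is 54620 - 44032 = 10588, so the lead index is 10588 // 588 = 18, the remainder is 4, the vowel index is 4 // 28 = 0 and the tail index is 4 % 28 = 4.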

            


    The program code is as follows:

    

#dividing_test.py
#coding:utf-8

import glob
import os

# Jamo tables indexed by the values computed in divide_korean().
first_parts = ("ㄱ", "ㄲ", "ㄴ", "ㄷ", "ㄸ", "ㄹ", "ㅁ", "ㅂ", "ㅃ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅉ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ")
# Compound vowels are written as two letters (for example "ㅗㅏ").
second_parts = ("ㅏ", "ㅐ", "ㅑ", "ㅒ", "ㅓ", "ㅔ", "ㅕ", "ㅖ", "ㅗ", "ㅗㅏ", "ㅗㅐ", "ㅗㅣ", "ㅛ", "ㅜ", "ㅜㅓ", "ㅜㅔ", "ㅜㅣ", "ㅠ", "ㅡ", "ㅡㅣ", "ㅣ")
third_parts = ("", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ")


def divide_korean(temp_string):
    """Split one modern Hangul syllable (U+AC00..U+D7A3) into its letters."""
    temp_string_value = ord(temp_string)
    part_1 = (temp_string_value - 44032) // 588
    part_2 = (temp_string_value - 44032 - part_1 * 588) // 28
    part_3 = (temp_string_value - 44032) % 28
    return first_parts[part_1] + second_parts[part_2] + third_parts[part_3]


# load the old Korean decompositions (the file was saved as utf-8)
old_korean_dictionary = {}
with open("old_korean_dictionary.txt", 'r', encoding="utf-8") as read_file:
    for each_line in read_file:
        if each_line.strip() == "":
            continue
        old_korean, dividing_parts = each_line.split()
        old_korean_dictionary[old_korean] = dividing_parts


# write string to txt file
def write_string_to_file(temp_str, file_name):
    # the output uses utf-16, the same encoding as the input files
    with open(file_name, 'w', encoding="utf-16") as file_object:
        file_object.write(temp_str)


data_files = glob.glob(os.path.join(os.getcwd(), "test_data", "*.txt"))
print("the result files are saved in " + os.getcwd())
for each_file in data_files:
    print(each_file + "-"*5 + ">dealing")  # begin to deal with the file
    with open(each_file, 'r', encoding="utf-16") as read_file:
        temp_file_string = ""
        for each_line in read_file:
            if each_line.strip() == "":
                continue
            original_line = each_line.rstrip("\r\n")
            temp_line = ""
            for each_char in original_line:
                if each_char in old_korean_dictionary:
                    temp_line = temp_line + old_korean_dictionary[each_char]
                # modern syllables end at U+D7A3; U+D7A4..U+D7AF are unassigned
                elif u'\uAC00' <= each_char <= u'\uD7A3':
                    temp_line = temp_line + divide_korean(each_char)
                else:
                    temp_line = temp_line + each_char
            # one decomposed line, then the original line
            temp_file_string = temp_file_string + temp_line + "\n" + original_line + "\n"
    print(each_file + "-"*5 + ">finished")  # finish
    # the new file that stores the result is created in the current dir
    new_filename = os.path.splitext(os.path.basename(each_file))[0] + "_result.txt"
    # write to the file
    write_string_to_file(temp_file_string, new_filename)
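
    One way to gain confidence in the 588 / 28 arithmetic is a round-trip check: split every modern syllable into its three indices and rebuild the code point from them. The sketch below is an addition for illustration, and the helper names `split_indices` and `join_indices` are hypothetical. It also shows why the valid range has to end at U+D7A3: the next code point already yields an initial index of 19, which is outside the 19-entry table of initials.

```python
HANGUL_BASE = 0xAC00  # 44032
HANGUL_LAST = 0xD7A3  # last modern syllable


def split_indices(ch):
    """Return (initial, vowel, final) table indices for one syllable."""
    lead, rest = divmod(ord(ch) - HANGUL_BASE, 588)
    vowel, tail = divmod(rest, 28)
    return lead, vowel, tail


def join_indices(lead, vowel, tail):
    """Inverse of split_indices: rebuild the syllable from its indices."""
    return chr(HANGUL_BASE + lead * 588 + vowel * 28 + tail)


# Every syllable in the block must survive the round trip.
for code_point in range(HANGUL_BASE, HANGUL_LAST + 1):
    syllable = chr(code_point)
    assert join_indices(*split_indices(syllable)) == syllable

print(split_indices("한"))                   # (18, 0, 4)
print(split_indices(chr(HANGUL_LAST)))       # (18, 20, 27)
print(split_indices(chr(HANGUL_LAST + 1)))   # (19, 0, 0): index 19 is out of range
```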




    Finally, the source code and the test files can be downloaded from my network drive: http://pan.baidu.com/s/1gdCfjTD