Removing accents and special characters [duplicate]


Possible Duplicate:
What is the best way to remove accents in a python unicode string?
Python and character normalization

I would like to remove accents, turn all characters to lowercase, and delete any numbers and special characters.

Example:

Frédér8ic@ --> frederic

Proposal:

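import unicodedata

def remove_accents(data):
    # Decompose accented characters, keep only letters (category L*), then lowercase.
    decomposed = unicodedata.normalize('NFKD', data)
    return ''.join(x for x in decomposed if unicodedata.category(x)[0] == 'L').lower()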

Is there any better way to do this?

2 answers

#1


14  

A possible solution would be

import string
import unicodedata

def remove_accents(data):
    return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()

AFAIK, NFKD is the standard way to normalize Unicode into its compatibility-decomposed form, which splits an accented letter into a base letter plus combining marks. To remove the remaining special characters, digits, and the combining marks produced by normalization, you can simply compare each character against string.ascii_letters and drop any character that is not in that set.
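
To make the decomposition step concrete, here is a minimal sketch. The helper names explain_decomposition and keep_ascii_letters are hypothetical and not part of the answer; only unicodedata and string from the standard library are used.

```python
import string
import unicodedata

def explain_decomposition(text):
    """Return (character, Unicode name) pairs for the NFKD form of text."""
    decomposed = unicodedata.normalize('NFKD', text)
    return [(ch, unicodedata.name(ch, 'UNKNOWN')) for ch in decomposed]

def keep_ascii_letters(text):
    """Drop everything except ASCII letters after NFKD, then lowercase."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if ch in string.ascii_letters).lower()

# 'é' becomes a plain 'e' followed by a combining acute accent,
# and the filter then discards the combining mark.
print(explain_decomposition('é'))
print(keep_ascii_letters('Frédér8ic@'))
```

One caveat worth keeping in mind: letters that have no decomposition, such as 'ø' or 'ß', are not turned into an ASCII base letter by NFKD, so this filter removes them entirely instead of transliterating them.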

#2


1  

Can you convert the string into HTML entities? If so, you can then use a simple regular expression.

The following replacement would work in PHP/PCRE (see my other answer for an example):

'~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i' => '$1'

Then simply convert back from HTML entities and remove any character outside a-z and A-Z (demo @ CodePad).

Sorry, I don't know Python well enough to provide a Pythonic answer.
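
For readers who want to try the entity-based steps in Python, the following is a rough sketch, not the answerer's code. It assumes the standard-library html.entities.codepoint2name table as the way to produce named entities; the function names to_entities and strip_accents_via_entities are hypothetical. The regular expression is the one from the answer, with the PCRE delimiters and the i flag expressed through re.IGNORECASE.

```python
import html
import re
from html.entities import codepoint2name

# Same pattern as the PHP/PCRE one above: capture the base letter(s)
# that precede a known accent suffix in a named entity.
ENTITY_RE = re.compile(
    r'&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);',
    re.IGNORECASE,
)

def to_entities(text):
    """Replace non-ASCII characters that have a named HTML entity."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append('&%s;' % name)
        else:
            out.append(ch)
    return ''.join(out)

def strip_accents_via_entities(text):
    """Entity-encode, strip accent suffixes, decode, keep only a-z."""
    encoded = to_entities(text)
    base_letters = ENTITY_RE.sub(r'\1', encoded)
    decoded = html.unescape(base_letters)
    return re.sub(r'[^a-z]', '', decoded.lower())

print(strip_accents_via_entities('Frédér8ic@'))
```

This only handles characters that have a named HTML entity, so it covers fewer cases than Unicode normalization; it is shown mainly to illustrate the idea.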
