Regex to match all Hangul (Korean) characters and syllable blocks


I'm trying to validate user input (in Python) and see if the right language is being used, Korean in this case. Let's take the Korean word for email address: 이메일 주소

I can check each character like so:

import unicodedata as ud

for ch in u'이메일 주소':
    if 'HANGUL' in ud.name(ch):
        print("Yep, that's a Korean character.")

But that seems highly inefficient, especially for longer texts. Of course, I could create a static dictionary containing all Korean syllable blocks, but that dictionary would contain some 25,000 characters and, again, checking against it would be inefficient. I also need a solution for Japanese and Chinese, which may contain even more characters.

Therefore, I'd like to use a Regex pattern covering all Unicode characters for Hangul syllable blocks. But I have no clue if there is a range for that or where to find it.

As an example, this regex pattern covers all Latin based characters, including brackets and other commonly used symbols:

import re
LATIN_CHARACTERS = re.compile(u'[\x00-\x7F\x80-\xFF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]')

Can somebody translate this regex to match Korean Hangul syllable blocks? Or can you show me a table or reference where I can look up such ranges myself?

A pattern to match Chinese and Japanese would also be very helpful. Or one regex to match all CJK characters at once. I wouldn't need to distinguish between Japanese and Korean.

Here's a Python library for that task, but it works with incredibly huge dictionaries: https://github.com/EliFinkelshteyn/alphabet-detector I cannot imagine that being efficient for large texts and lots of user input.

Thanks!

1 Answer

#1


You are aware of how Unicode is broken into blocks, and how each block represents a contiguous range of code points? I.e., there's a much more efficient solution than a regular expression.

There is a single code block for Hangul Jamo, with additional characters in the CJK block, a compatibility block, Hangul Syllables, etc.

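As a sketch, the blocks in question can be written down as explicit code-point ranges (my reading of the Unicode block charts; `block_of` is a hypothetical helper, not part of any library):

```python
# Unicode blocks that contain Hangul characters (ranges taken from the
# Unicode block charts; the halfwidth variants live inside the larger
# "Halfwidth and Fullwidth Forms" block).
HANGUL_BLOCKS = (
    (0x1100, 0x11FF),  # Hangul Jamo
    (0x3130, 0x318F),  # Hangul Compatibility Jamo
    (0xA960, 0xA97F),  # Hangul Jamo Extended-A
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xD7B0, 0xD7FF),  # Hangul Jamo Extended-B
    (0xFFA0, 0xFFDC),  # Halfwidth Hangul variants
)

def block_of(char):
    '''Return the (start, end) of the Hangul block containing char, or None.'''
    cp = ord(char)
    for start, end in HANGUL_BLOCKS:
        if start <= cp <= end:
            return (start, end)
    return None
```

For example, `block_of(u'이')` returns the Hangul Syllables range, while `block_of('a')` returns `None`.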

The most efficient way is to check if each character is within the acceptable range, using if/then statements. You could almost certainly speed this up using a C-extension.

For example, if I were just checking the Hangul block (insufficient, but merely a simple starting place), I would check each character in a string with the following code:

def is_hangul_character(char):
    '''Check if character is in the Hangul Jamo block (U+1100-U+11FF)'''

    value = ord(char)
    return 0x1100 <= value <= 0x11FF


def is_hangul(string):
    '''Check if all characters are in the Hangul Jamo block'''

    return all(is_hangul_character(i) for i in string)

It would be easy to extend this for the 8 or so blocks that contain Hangul characters. No table lookups, no regex compilation. Just fast range checks based on the block of the Unicode character.

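Extending the snippet above to several blocks could look like the following sketch (Python 3; the block list is my reading of the Unicode charts, and whitespace is deliberately skipped here, which the snippet above does not do). Flattening the block boundaries into one sorted list lets `bisect` do the range check with a single binary search:

```python
from bisect import bisect_right

# Ranges of the Hangul-bearing blocks, per the Unicode block charts.
_HANGUL_RANGES = [
    (0x1100, 0x11FF),  # Hangul Jamo
    (0x3130, 0x318F),  # Hangul Compatibility Jamo
    (0xA960, 0xA97F),  # Hangul Jamo Extended-A
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xD7B0, 0xD7FF),  # Hangul Jamo Extended-B
    (0xFFA0, 0xFFDC),  # Halfwidth Hangul variants
]

# Flatten into sorted boundaries: an odd insertion point means the
# code point falls inside one of the ranges.
_BOUNDS = []
for _start, _end in _HANGUL_RANGES:
    _BOUNDS.extend((_start, _end + 1))

def is_hangul_character(char):
    '''Check if char is in any Hangul block (one binary search).'''
    return bisect_right(_BOUNDS, ord(char)) % 2 == 1

def is_hangul(string):
    '''Check if all non-space characters are Hangul.'''
    return all(is_hangul_character(ch) for ch in string if not ch.isspace())
```

With this, `is_hangul(u'이메일 주소')` is true, while a Latin string fails the check.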

In C, this would be very easy as well (if you would like a significant performance boost, to match a fully-optimized library with little work):

#include <stddef.h>
#include <stdint.h>

// Return 0 if a code point is in the Hangul Jamo block, -1 otherwise.
// Note: the input must be a decoded Unicode code point, not a raw UTF-8
// byte, so the parameter is a 32-bit integer rather than a char (which
// cannot hold values as large as 4352).
int is_hangul_character(uint32_t c)
{
    if (c >= 4352 && c <= 4607) {
        return 0;
    }
    return -1;
}


// Return 0 if all code points are in the Hangul Jamo block, -1 otherwise
int is_hangul(const uint32_t* string, size_t length)
{
    size_t i;
    for (i = 0; i < length; ++i) {
        if (is_hangul_character(string[i]) < 0) {
            return -1;
        }
    }
    return 0;
}

Edit: A cursory glance at the CPython implementation shows that CPython uses this exact approach for the unicodedata module. I.e., it's efficient despite being relatively easy to implement on your own. It is still worth implementing, since you don't have to allocate any intermediate strings or perform superfluous string comparisons (which are likely the primary cost of the unicodedata module).

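For completeness, if a regex is still preferred (as in the original question), the same block ranges collapse into one character class; a sketch, where \uAC00-\uD7A3 covers the assigned Hangul syllables:

```python
import re

# One character class covering the Hangul blocks: Jamo, Compatibility
# Jamo, Jamo Extended-A, Syllables, Jamo Extended-B, halfwidth variants.
HANGUL_RE = re.compile(
    u'[\u1100-\u11FF\u3130-\u318F\uA960-\uA97F'
    u'\uAC00-\uD7A3\uD7B0-\uD7FF\uFFA0-\uFFDC]'
)
```

For example, `HANGUL_RE.sub(u'', text)` strips all Hangul from a string; this is strictly slower than the plain range checks above, but convenient for filtering.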

