How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, their utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4; and, someday in the future, utf8 might support it as well.

But my server is not ready to upgrade to MySQL 5.5, and thus I'm limited to UTF-8 characters that take 3 bytes or less.

My question is: How to filter (or replace) unicode characters that would take more than 3 bytes?

I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.

In other words, I want a behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want a behavior similar to encode(), but I don't want to actually encode the string. I want to still have a unicode string after filtering.

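To make the desired contract concrete, here is a minimal sketch (filter_non_bmp is a hypothetical placeholder; the answers below offer real implementations, and the assertions assume a wide Python 2 build, where a non-BMP character is a single code point):

def filter_non_bmp(unicode_string):
    # placeholder implementation; see the answers below for faster ones
    return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)

s = u'foo\U0001d41fbar'              # contains one 4-byte (non-BMP) character
result = filter_non_bmp(s)
assert isinstance(result, unicode)   # still a unicode string, not bytes
assert result == u'foo\ufffdbar'     # offending character replaced by U+FFFD
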
I DON'T want to escape the character before storing it in MySQL, because that would mean I would need to unescape all strings I get from the database, which is very annoying and infeasible.

[EDIT] Added tests of the proposed solutions

So I've got good answers so far. Thanks, people! Now, in order to choose one of them, I ran a quick test to find the simplest and fastest one.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
        unichr(random.randrange(32,
            0x10ffff if random.randrange(100) > normal_chars else 0x0fff
        )) for i in xrange(string_size) )

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')

The results:

  • filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds in the sub() built-in)
  • filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds in the join() call and 1.900 CPU seconds evaluating the generator expression)
  • I didn't test the itertools solution because... well... that solution, although interesting, was quite big and complex.

Conclusion

The RegEx solution was, by far, the fastest one.

6 Solutions

#1


33  

Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF have 3-byte (or shorter) encodings in UTF-8. The \uD800-\uDFFF range is reserved for the surrogates used by multi-byte UTF-16. I do not know Python, but you should be able to set up a regular expression to match outside those ranges.

pattern = re.compile("[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile("[^\u0000-\uFFFF]", re.UNICODE)

Edit: adding Python from Denilson Sá's script in the question body:

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)    
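
A quick sanity check of the pattern (assuming a wide Python 2 build; on a narrow build a non-BMP character appears as two surrogate code units, each of which matches the class, so it becomes two U+FFFD characters):

print repr(re_pattern.sub(u'\ufffd', u'abc\U0001d41fdef'))   # -> u'abc\ufffddef'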

#2


5  

You can skip the decoding and encoding steps and directly detect the value of the first byte of each character in the 8-bit (UTF-8 encoded) string. According to UTF-8:

#1-byte characters have the following format: 0xxxxxxx
#2-byte characters have the following format: 110xxxxx 10xxxxxx
#3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
#4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

According to that, you only need to check the value of the first byte of each character to filter out 4-byte characters:

def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert
    # the immutable string
    # to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        # this is a 1-byte character, skip to the next byte
        if k <= 127:
            i += 1
        # this is a 2-byte character, skip ahead by 2 bytes
        elif k < 224:
            i += 2
        # this is a 3-byte character, skip ahead by 3 bytes
        elif k < 240:
            i += 3
        # this is a 4-byte character, remove it and update
        # the length of the string we need to check
        else:
            s[i:i+4] = []
            j -= 4
    return ''.join(s)
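
For example (a small usage sketch; note that filter_4byte_chars operates on an already-encoded UTF-8 byte string, not on a unicode object, and it assumes the input is valid UTF-8):

raw = u'abc\U0001d41fdef'.encode('utf-8')        # 4-byte sequence in the middle
print repr(filter_4byte_chars(raw))              # -> 'abcdef'
print repr(filter_4byte_chars(raw).decode('utf-8'))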

Skipping the decoding and encoding parts will save you some time, and for smaller strings that mostly contain 1-byte characters this could even be faster than the regular-expression filtering.

#3


1  

And just for the fun of it, an itertools monstrosity :)

import functools as ft, itertools as it, operator as op

def max3bytes(unicode_string):

    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs= it.izip(unicode_string, it.repeat(u'\ufffd'))

    # is 65535 strictly less than the ordinal, i.e. is this a non-BMP character?
    selector= ft.partial(op.lt, 65535)

    # using the character ordinals, return 0 or 1 based on `selector`
    indexer= it.imap(selector, it.imap(ord, unicode_string))

    # now pick the correct item for all pairs
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
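
For example (this assumes a wide Python 2 build; on a narrow build a non-BMP character is stored as a surrogate pair whose individual ordinals are below 65536, so it would pass through unchanged):

print repr(max3bytes(u'A\U0001d41fB'))   # -> u'A\ufffdB'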

#4


1  

Encode as UTF-16, then reencode as UTF-8.

>>> import struct
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'

Note that you can't encode after joining, since the surrogate pairs may be decoded before reencoding.
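
To see why (an illustrative sketch: joining first puts the code units back into one unicode string, and on a narrow Python 2 build the UTF-8 codec then combines each surrogate pair into a single, proper 4-byte sequence):

>>> u''.join(unichr(x) for x in struct.unpack('<' + 'H' * (len(e) // 2), e)).encode('utf-8')
'\xf0\x9d\x90\x9f\xf0\x9d\x90\xa8\xf0\x9d\x90\xa8'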

EDIT:

MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:

mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)

  ...

>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'𝐟𝐨𝐨'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨

#5


1  

According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.

Note that the Unicode standard 5.2, chapter 3, actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence ... see for example page 93: "Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed." However, this proscription is, as far as I know, largely unknown or ignored.

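Python 2 itself illustrates how widely this rule is ignored: its UTF-8 codec will happily encode a lone surrogate (which is also what makes the encode-via-UTF-16 trick in answer #4 work):

>>> u'\ud835'.encode('utf-8')   # a lone high surrogate: ill-formed per Unicode 5.2
'\xed\xa0\xb5'
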
It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check:

all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)

and this code will replace any "nasties" with u'\ufffd':

u''.join(
    uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
    for uc in unicode_string
    )

#6


0  

I'm guessing it's not the fastest, but it's quite straightforward ("pythonic" :)):

def max3bytes(unicode_string):
    return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)
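
For example (assuming a wide Python 2 build):

print repr(max3bytes(u'abc\U0001d41fdef'))   # -> u'abc\ufffddef'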

NB: this code does not take into account the fact that Unicode has surrogate characters in the range U+D800-U+DFFF.
