How do I get the "visible" length of a combining Unicode string in Python?


If I have a Python Unicode string that contains combining characters, len reports a value that does not correspond to the number of characters "seen".

For example, if I have a string with combining overlines and underlines such as u'A\u0332\u0305BC', len(u'A\u0332\u0305BC') reports 5; but the displayed string is only 3 characters long.

How do I get the "visible" length of a Unicode string containing combining glyphs in Python, that is, the number of distinct positions occupied by the string the user sees?

3 answers

#1


4  

The unicodedata module has a function, combining, that returns the canonical combining class of a single character. If it returns 0, the character is not a combining character and you can count it.

import unicodedata
len(u''.join(ch for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0))

or, slightly simpler:

sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)
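The same counting approach can be wrapped in a small helper. This is only a sketch: the name visible_len is made up for illustration, and it assumes Python 3, where every str is already Unicode.

```python
import unicodedata


def visible_len(text):
    """Count the characters whose canonical combining class is 0.

    visible_len is a hypothetical helper name, not part of unicodedata.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(visible_len('A\u0332\u0305BC'))  # prints 3
print(visible_len('e\u0301'))          # 'e' plus a combining acute accent: prints 1
```

Only characters with a non-zero combining class are skipped, so other zero-width characters are still counted.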

#2


4  

If you have a regex flavor that supports matching grapheme clusters, you can use \X.

While the default Python re module does not support \X, Matthew Barnett's regex module does:

>>> import regex
>>> len(regex.findall(r'\X', u'A\u0332\u0305BC'))
3

On Python 2, the pattern itself must be a Unicode string, so give it a u prefix:

>>> regex.findall(u'\\X', u'A\u0332\u0305BC')
[u'A\u0332\u0305', u'B', u'C']
>>> len(regex.findall(u'\\X', u'A\u0332\u0305BC'))
3
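If installing the third-party regex module is not an option, the grouping can be roughly approximated with the standard library alone. The sketch below attaches each combining character to the character before it; split_clusters is a hypothetical name, and this is not the full grapheme cluster algorithm that \X implements, so results will differ for many inputs (emoji sequences, Hangul jamo, and so on).

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    A rough stand-in for \\X, not a full grapheme cluster segmenter.
    split_clusters is a hypothetical helper name.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the previous cluster.
            clusters[-1] += ch
        else:
            # Base character (or a leading combining mark): start a new cluster.
            clusters.append(ch)
    return clusters


clusters = split_clusters('A\u0332\u0305BC')
print(len(clusters))  # prints 3
```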

#3


2  

Combining characters are not the only zero-width characters:

>>> import unicodedata
>>> sum(1 for ch in u'\u200c' if unicodedata.combining(ch) == 0)
1

(U+200C, written u'\u200c', is the zero-width non-joiner; it's a non-printing character.)

In this case the regex module does not work either:

>>> len(regex.findall(r'\X', u'\u200c'))
1

I found that wcwidth handles the above case correctly:

>>> from wcwidth import wcswidth
>>> wcswidth(u'A\u0332\u0305BC')
3
>>> wcswidth(u'\u200c')
0

But it still doesn't seem to work with user 596219's example:

>>> wcswidth('각')
4
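For comparison, a rough width estimate can also be built from the standard library. This is a sketch under stated assumptions, not how wcwidth itself is implemented: approx_width is a hypothetical name, it treats the categories Mn, Me and Cf as zero-width, counts East Asian wide and fullwidth characters as two columns, and counts everything else (including control characters) as one. It makes no attempt to handle conjoining Hangul jamo.

```python
import unicodedata

# Assumption: nonspacing marks, enclosing marks and format characters
# (which include U+200C) occupy no columns.
ZERO_WIDTH_CATEGORIES = ('Mn', 'Me', 'Cf')


def approx_width(text):
    """Estimate the number of terminal columns a string occupies.

    approx_width is a hypothetical helper, not part of wcwidth.
    """
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            width += 2  # wide or fullwidth character
        else:
            width += 1
    return width


print(approx_width('A\u0332\u0305BC'))  # prints 3
print(approx_width('\u200c'))           # prints 0
```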
