Regex to remove repeated character pattern in a string


I have a string that may have a repeated character pattern, e.g.

'xyzzyxxyzzyxxyzzyx'

I need to write a regex that would replace such a string with its smallest repeated pattern:

'xyzzyxxyzzyxxyzzyx' becomes 'xyzzyx',

'abcbaccbaabcbaccbaabcbaccba' becomes 'abcbaccba'

3 answers

#1


5  

Use the following:

> re.sub(r'(.+?)\1+', r'\1', 'xyzzyxxyzzyxxyzzyx')
'xyzzyx'
> re.sub(r'(.+?)\1+', r'\1', 'abcbaccbaabcbaccbaabcbaccba')
'abcbaccba'
> re.sub(r'(.+?)\1+', r'\1', 'iiiiiiiiiiiiiiiiii')
'i'

It matches a pattern that repeats itself, (.+?)\1+, and replaces the whole match with a single copy of the repeating unit, which is captured in the first group, \1. Also note that the reluctant quantifier used here, i.e. +?, will make the regex backtrack quite a lot.
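Because this pattern has no anchors, it also collapses repeats that occur in the middle of a longer string. A minimal sketch of that behaviour follows; the helper name `collapse_repeats` is made up for illustration and is not part of the original answer.

```python
import re

# Unanchored pattern from the answer: a lazily matched unit (group 1)
# followed by one or more further copies of itself.
REPEAT = re.compile(r'(.+?)\1+')

def collapse_repeats(text):
    """Replace each run of a repeated unit in text with a single copy."""
    return REPEAT.sub(r'\1', text)

print(collapse_repeats('xyzzyxxyzzyxxyzzyx'))  # xyzzyx
print(collapse_repeats('hello'))               # helo (the 'll' is collapsed too)
```

If strings such as 'hello' should be left untouched, the pattern needs to be anchored to the whole string.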

#2


4  

Since you want the smallest repeating pattern, something like the following should work for you:

re.sub(r'^(.+?)\1+$', r'\1', input_string)

The ^ and $ anchors make sure the pattern only matches when the whole string is a repetition, so you don't get matches in the middle of the string. By using .+? instead of just .+ you get the shortest pattern (compare the results on a string like 'aaaaaaaaaa').
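The comparison suggested above can be tried with a short script. This is only a sketch: the function names `smallest_unit` and `largest_unit` are invented here to label the lazy and greedy variants.

```python
import re

def smallest_unit(text):
    """Anchored, lazy version: keeps the shortest repeating unit."""
    return re.sub(r'^(.+?)\1+$', r'\1', text)

def largest_unit(text):
    """Anchored, greedy version: keeps the longest repeating unit."""
    return re.sub(r'^(.+)\1+$', r'\1', text)

print(smallest_unit('aaaaaaaaaa'))     # a
print(largest_unit('aaaaaaaaaa'))      # aaaaa
# The anchors require the whole string to be a repetition,
# so this input does not match and comes back unchanged.
print(smallest_unit('xyzzyxxyzzyx!'))  # xyzzyxxyzzyx!
```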

#3


2  

Try this regex pattern and capture the first group:

^(.+?)\1+$
  • ^ anchor for the beginning of the string/line
  • . any character except newlines
  • + quantifier meaning at least 1 occurrence
  • ? makes the + lazy instead of greedy, hence giving you the shortest pattern
  • () capturing group
  • \1+ backreference with a quantifier, meaning the captured pattern must repeat at least once more
  • $ anchor for the end of the string/line
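The breakdown above can be written directly into the pattern using Python's re.VERBOSE flag, and the first group can be read from the match object instead of doing a substitution. A small sketch, assuming Python's re module; the name `repeating_unit` is hypothetical.

```python
import re

PATTERN = re.compile(r'''
    ^        # start of the string
    (.+?)    # group 1: shortest candidate unit, at least one character
    \1+      # one or more further copies of group 1
    $        # end of the string
''', re.VERBOSE)

def repeating_unit(text):
    """Return the smallest repeating unit, or None if text is not a pure repetition."""
    match = PATTERN.match(text)
    return match.group(1) if match else None

print(repeating_unit('abcbaccbaabcbaccbaabcbaccba'))  # abcbaccba
print(repeating_unit('abc'))                          # None
```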

Test it here: Rubular


The above solution does a lot of backtracking, which affects performance. If you know which characters are not allowed in these strings, you can use a negated character set, which can cut down on the backtracking. Note that the + in the pattern below is greedy, so it captures the longest repeating unit rather than the shortest. For example, if whitespace is not allowed:

^([^\s]+)\1+$
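One thing worth checking before relying on this pattern: its + is greedy, so when several unit lengths fit, it returns the longest one. The sketch below compares it with a lazy variant of the same character class; the lazy variant and both function names are additions for illustration, not part of the original answer.

```python
import re

def unit_greedy(text):
    """Pattern from the answer: negated class with a greedy quantifier."""
    return re.sub(r'^([^\s]+)\1+$', r'\1', text)

def unit_lazy(text):
    """Same character class with a lazy quantifier (illustrative variant)."""
    return re.sub(r'^([^\s]+?)\1+$', r'\1', text)

print(unit_greedy('xyzzyxxyzzyxxyzzyx'))  # xyzzyx (only one unit length fits)
print(unit_greedy('abababab'))            # abab   (longest unit)
print(unit_lazy('abababab'))              # ab     (shortest unit)
```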
