Problem with regex for text parsing (similar to textile)


I'm banging my head against the wall trying to figure out a (regexp?) based parser rule for the following problem. I'm developing a text markup parser similar to Textile (using PHP), but I don't know how to get the inline formatting rules right. I also noticed that the Textile parsers I found cannot format the following text the way I would like:

-*deleted* -- text- and -more deleted text-

The result I want to have is:

<del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>

What I do not want is:

<del><strong>deleted</strong> </del>- text- and <del>more deleted text</del>

Any ideas are much appreciated. Thanks very much!

UPDATE

I think I should have mentioned that '-' should still be a valid character (hyphen) :) -- for example, the following should be possible:

-american-football player-

expected result:

<del>american-football player</del>

5 answers

#1


2  

Based on the RedCloth library's parser description, with some modification for the double dash.

@
  (?<!\S)               # Start of string, or after space or newline
  -                     # Opening dash
  (                     # Capture group 1
    (?:                 #   : (see note 1)
      [^-\s]+           #   :
      [-\s]+            #   :
    )*?                 #   :
    [^-\s]+?            #   :
  )                     # End
  -                     # Closing dash
  (?![^\s!"\#$%&',\-./:;=?\\^`|~[\]()<])  # (see note 2)
@x
  • Note 1: This should match lazily up to the next closing dash, consuming along the way any dashes that cannot close the span: double dashes, and single dashes surrounded by whitespace.
  • Note 2: The closing dash must be followed by a space, punctuation, a line break, or the end of the string.

Or compacted:

@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&',\-./:;=?\\^`|~[\]()<])@

A few examples:

$regex = '@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&\',\-./:;=?\\\^`|~[\]()<])@';
$replacement = '<del>\1</del>';

echo preg_replace($regex, $replacement, '-*deleted* -- text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-*deleted*--text- and -more deleted text-'), "\n";
echo preg_replace($regex, $replacement, '-american-football player-'), "\n";

Will output:

<del>*deleted* -- text</del> and <del>more deleted text</del>
<del>*deleted*</del>-text- and <del>more deleted text</del>
<del>american-football player</del>

In the second example, it will match just -*deleted*-, since there are no spaces before the --. -text- will not be matched, because the initial - is not preceded by a space.
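
As a rough sketch of how this rule might be combined with a bold rule to reach the output asked for in the question: the helper names `markup_del`, `markup_strong` and `markup_inline` are made up for illustration, and the `*...*` pattern is a simple non-greedy rule assumed here, not something taken from RedCloth.

```php
<?php
// Hypothetical helpers for illustration only.

// Wrap -spans- in <del>, using the pattern from this answer unchanged.
function markup_del(string $text): string
{
    $regex = '@(?<!\S)-((?:[^-\s]+[-\s]+)*?[^-\s]+?)-(?![^\s!"#$%&\',\-./:;=?\\\^`|~[\]()<])@';
    return preg_replace($regex, '<del>$1</del>', $text);
}

// Wrap *spans* in <strong>; an assumed, deliberately simple rule.
function markup_strong(string $text): string
{
    return preg_replace('~[*](.+?)[*]~', '<strong>$1</strong>', $text);
}

// Apply the <del> rule first, then the <strong> rule on its result.
function markup_inline(string $text): string
{
    return markup_strong(markup_del($text));
}

echo markup_inline('-*deleted* -- text- and -more deleted text-'), "\n";
// <del><strong>deleted</strong> -- text</del> and <del>more deleted text</del>
```

Running the `<del>` rule before the `<strong>` rule keeps the dash pattern working on the raw text, so the inserted tags do not interfere with its lookbehind.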

#2


1  

The strong tag is easy:

$string = preg_replace('~[*](.+?)[*]~', '<strong>$1</strong>',  $string);

Working on the others.


Shameless hack for the del tag:

$string = preg_replace('~-(.+?)-~', '<del>$1</del>', $string);
$string = str_replace('<del></del>', '--', $string);

#3


1  

For a single token, you can simply match:

-((?:[^-]|--)*)-

and replace with:

<del>$1</del>

and similarly for \*((?:[^*]|\*{2,})*)\* and <strong>$1</strong>.

The regex is quite simple: a literal - at both ends. In the middle, in a capturing group, we allow anything that isn't a hyphen, or two hyphens in a row.

To also allow single dashes within words, as in objective-c, this can work, by accepting dashes that have a word character (letter, digit or underscore) on both sides:

-((?:[^-]|--|\b-\b)*)-
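
A small PHP sketch of the two patterns above, to show where they differ. The function names `del_simple` and `del_with_hyphens` are hypothetical, and the `~` delimiters are just one possible choice for `preg_replace`.

```php
<?php
// Hypothetical helpers wrapping the two patterns from this answer.

// Allows anything except a lone hyphen inside the span ("--" is fine).
function del_simple(string $text): string
{
    return preg_replace('~-((?:[^-]|--)*)-~', '<del>$1</del>', $text);
}

// Additionally allows a single hyphen between two word characters.
function del_with_hyphens(string $text): string
{
    return preg_replace('~-((?:[^-]|--|\b-\b)*)-~', '<del>$1</del>', $text);
}

echo del_simple('-*deleted* -- text- and -more deleted text-'), "\n";
// <del>*deleted* -- text</del> and <del>more deleted text</del>

echo del_simple('-american-football player-'), "\n";
// <del>american</del>football player-

echo del_with_hyphens('-american-football player-'), "\n";
// <del>american-football player</del>
```

The second call shows why the first pattern is not enough for the hyphenated example from the question's update: the hyphen inside the word is taken as the closing delimiter.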

#4


0  

You could try something like:

'/-.*?[^-]-\b/'

Where the ending hyphen must be at a word boundary and preceded by something that is not a hyphen.

#5


0  

I think you should read this warning first: you can't parse [X]HTML with regex.

Perhaps you should try googling for a PHP HTML library.

