Perl 6 Grammar doesn't match like I think it should

I'm doing Advent of Code day 9:


You sit for a while and record part of the stream (your puzzle input). The characters represent groups - sequences that begin with { and end with }. Within a group, there are zero or more other things, separated by commas: either another group or garbage. Since groups can contain other groups, a } only closes the most-recently-opened unclosed group - that is, they are nestable. Your puzzle input represents a single, large group which itself contains many smaller ones.

Sometimes, instead of a group, you will find garbage. Garbage begins with < and ends with >. Between those angle brackets, almost any character can appear, including { and }. Within garbage, < has no special meaning.

In a futile attempt to clean up the garbage, some program has canceled some of the characters within it using !: inside garbage, any character that comes after ! should be ignored, including <, >, and even another !.


Of course, this screams out for a Perl 6 Grammar...

grammar Stream
{
    rule TOP { ^ <group> $ }

    rule group { '{' [ <group> || <garbage> ]* % ',' '}' }
    rule garbage { '<' [ <garbchar> | <garbignore> ]* '>' }

    token garbignore { '!' . }
    token garbchar { <-[ !> ]> }
}

This seems to work fine on simple examples, but it goes wrong with two garbchars in a row:


say Stream.parse('{<aa>}');

gives Nil.


Grammar::Tracer is no help:

|  group
|  |  group
|  |  * FAIL
|  |  garbage
|  |  |  garbchar
|  |  |  * MATCH "a"
|  |  * FAIL
|  * FAIL

Multiple garbignores are no problem:


say Stream.parse('{<!!a!a>}');



 group => 「{<!!a!a>}」
  garbage => 「<!!a!a>」
   garbignore => 「!!」
   garbchar => 「a」
   garbignore => 「!a」

Any ideas?


2 Answers



UPD: Given that the Advent of Code problem doesn't mention whitespace, you shouldn't be using the rule construct at all. Just switch all the rules to tokens and you should be set. In general, follow Brad's advice -- use token unless you know you need a rule (discussed below) or a regex (if you need backtracking).
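
As a minimal sketch of that change, here is the grammar from the question with every rule turned into a token and nothing else altered. The third test string is an extra input made up for illustration, not one from the question.

```perl6
# Same patterns as in the question; only `rule` has become `token`,
# so whitespace inside the patterns no longer implies a call to <.ws>.
grammar Stream {
    token TOP        { ^ <group> $ }
    token group      { '{' [ <group> || <garbage> ]* % ',' '}' }
    token garbage    { '<' [ <garbchar> | <garbignore> ]* '>' }
    token garbignore { '!' . }
    token garbchar   { <-[ !> ]> }
}

say so Stream.parse('{<aa>}');       # True
say so Stream.parse('{<!!a!a>}');    # True
say so Stream.parse('{{<a>},{}}');   # True (made-up nested input)
```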

My original answer below explored why the rules didn't work. I'll leave it in for now.


TL;DR <garbchar> | contains a space. Whitespace that directly follows any atom in a rule indicates a tokenizing break. You can simply remove this inappropriate space, i.e. write <garbchar>| instead (or better still, <.garbchar>| if you don't need to capture the garbage) to get the result you seek.
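
The capturing point is separate from the whitespace point, and it can be sketched on its own. The two grammars below are hypothetical cut-down versions of the garbage pattern, written with tokens so that whitespace plays no part; the only difference between them is the leading dot.

```perl6
grammar Capturing {
    token TOP      { '<' <garbchar>* '>' }    # each garbchar is captured
    token garbchar { <-[ !> ]> }
}

grammar NonCapturing {
    token TOP      { '<' <.garbchar>* '>' }   # leading dot: match without capturing
    token garbchar { <-[ !> ]> }
}

say Capturing.parse('<aa>')<garbchar>.elems;        # 2
say NonCapturing.parse('<aa>')<garbchar>.defined;   # False
say ~NonCapturing.parse('<aa>');                    # <aa>
```

Both grammars accept the same strings; only the contents of the resulting match object differ.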

As your original question allowed, this isn't a bug, it's just that your mental model is off.


Your answer correctly identifies the issue: tokenization.


So what we're left with is your follow-up question, which is about your mental model of tokenization, or at least how Perl 6 tokenizes by default:

why ... my second example ... goes wrong with two garbchars in a row:



Simplifying, the issue is how to tokenize this:

aa


The simple high level answer is that, in parsing vernacular, aa will ordinarily be treated as one token, not two, and, by default, Perl 6 assumes this ordinary definition. This is the issue you're encountering.
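
This default can be tried out directly with plain regexes. The sketch below assumes the built-in ws has not been overridden; the documentation describes it as equivalent to `<!ww> \s*`, that is, optional whitespace that refuses to match in the middle of a word. The three test strings are made up for illustration.

```perl6
# <.ws> is the call that a rule inserts wherever whitespace follows an atom.
say so 'aa'  ~~ / ^ a <.ws> a   $ /;   # False: the boundary falls inside the word "aa"
say so 'a a' ~~ / ^ a <.ws> a   $ /;   # True:  the two a's are separated by whitespace
say so 'a!'  ~~ / ^ a <.ws> '!' $ /;   # True:  "!" is punctuation, so no word is split
```

The last line is consistent with the `!!a!a` input from the question parsing even under rule: every boundary in it touches a `!`, `<` or `>`, never two adjacent word characters.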

You can overrule this ordinary definition to get any tokenizing result you care to achieve. But it's seldom necessary to do so and it certainly isn't in simple cases like this.


I'll provide two redundant paths that I hope might lead folk to the correct mental model:


  • For those who prefer diving straight into nitty gritty detail, there's a reddit comment I wrote recently about tokenization in Perl 6.

  • The rest of this SO answer provides a high level discussion that complements the low level explanation in my reddit comment.


Excerpting from the "Obstacles" section of the Wikipedia page on tokenization, and interleaving the excerpts with P6 specific discussion:


Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example:


  • Punctuation and whitespace may or may not be included in the resulting list of tokens.

In Perl 6 you control what gets included or not in the parse tree using capturing features that are orthogonal to tokenizing.

  • All contiguous strings of alphabetic characters are part of one token; likewise with numbers.


  • Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.


By default, the Perl 6 design embodies an equivalent of these two heuristics.

The key thing to get is that it's the rule construct that handles a string of tokens, plural. The token construct is used to define a single token per call.
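
A rough sketch of that division of labour, using a hypothetical Words grammar that is not from the question: word defines one token, and TOP strings several of them together. TOP is written as a token with explicit `<.ws>` calls, to spell out approximately what a rule would add implicitly after each atom.

```perl6
grammar Words {
    # One or more words, each followed by a token boundary (<.ws>).
    token TOP  { ^ [ <word> <.ws> ]+ $ }
    # A single token: one contiguous run of word characters.
    token word { \w+ }
}

say Words.parse('foo bar baz')<word>.elems;   # 3
say Words.parse('foobar')<word>.elems;        # 1
```

The second call shows the first heuristic at work: a contiguous run of word characters comes back as one token, not six.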


I think I'll end my answer here because it's already getting pretty long. Please use the comments to help us improve this answer. I hope what I've written so far helps.




A partial answer to my own question: Change all the rules to tokens and it works. It makes sense, because the difference is :sigspace, which we don't need or want here. What I don't understand, though, is why it did work for some input, like my second example.


The resulting code is here, if you're interested.



