Perl 6 Grammar doesn't match like I think it should


I'm doing Advent of Code day 9:

You sit for a while and record part of the stream (your puzzle input). The characters represent groups - sequences that begin with { and end with }. Within a group, there are zero or more other things, separated by commas: either another group or garbage. Since groups can contain other groups, a } only closes the most-recently-opened unclosed group - that is, they are nestable. Your puzzle input represents a single, large group which itself contains many smaller ones.

Sometimes, instead of a group, you will find garbage. Garbage begins with < and ends with >. Between those angle brackets, almost any character can appear, including { and }. Within garbage, < has no special meaning.

In a futile attempt to clean up the garbage, some program has canceled some of the characters within it using !: inside garbage, any character that comes after ! should be ignored, including <, >, and even another !.

Of course, this screams out for a Perl 6 Grammar...

grammar Stream
{
    rule TOP { ^ <group> $ }

    rule group { '{' [ <group> || <garbage> ]* % ',' '}' }
    rule garbage { '<' [ <garbchar> | <garbignore> ]* '>' }

    token garbignore { '!' . }
    token garbchar { <-[ !> ]> }
}

This seems to work fine on simple examples, but it goes wrong with two garbchars in a row:

say Stream.parse('{<aa>}');

gives Nil.

Grammar::Tracer is no help:

TOP
|  group
|  |  group
|  |  * FAIL
|  |  garbage
|  |  |  garbchar
|  |  |  * MATCH "a"
|  |  * FAIL
|  * FAIL
* FAIL
Nil

Multiple garbignores are no problem:

say Stream.parse('{<!!a!a>}');

gives:

「{<!!a!a>}」
 group => 「{<!!a!a>}」
  garbage => 「<!!a!a>」
   garbignore => 「!!」
   garbchar => 「a」
   garbignore => 「!a」

Any ideas?

2 answers

#1


6  

UPD: Given that the Advent of Code problem doesn't mention whitespace, you shouldn't be using the rule construct at all. Just switch all the rules to tokens and you should be set. In general, follow Brad's advice -- use token unless you know you need a rule (discussed below) or a regex (if you need backtracking).
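
A minimal sketch of that advice: the grammar from the question with every rule changed to token and nothing else touched. Apart from the two strings taken from the question, the test input is my own illustration, not the puzzle input.

```raku
# The question's grammar, with `rule` replaced by `token` throughout.
# A token does not turn whitespace in the pattern into implicit <.ws> calls.
grammar Stream {
    token TOP        { ^ <group> $ }

    token group      { '{' [ <group> || <garbage> ]* % ',' '}' }
    token garbage    { '<' [ <garbchar> | <garbignore> ]* '>' }

    token garbignore { '!' . }
    token garbchar   { <-[ !> ]> }
}

# `so` collapses the Match (or Nil) to a Bool for compact output.
say so Stream.parse('{<aa>}');         # True (this input gave Nil with rules)
say so Stream.parse('{<!!a!a>}');      # True
say so Stream.parse('{{<a!>>},{}}');   # True
```

If you want the parse tree rather than a Bool, drop the `so`.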

My original answer below explored why the rules didn't work. I'll leave it in for now.

TL;DR The <garbchar> | in your garbage rule contains a space. Whitespace that directly follows any atom in a rule indicates a tokenizing break: it becomes an implicit <.ws> call, which by default will not match between two word characters such as the two a's in aa. You can simply remove this inappropriate space, i.e. write <garbchar>| instead (or better still, <.garbchar>| if you don't need to capture the garbage) to get the result you seek.
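
To see where the space bites, here is a rough hand expansion of the question's garbage rule into a token with the implicit whitespace calls written out. The expansion is my reading of how significant whitespace works, not compiler output, and the grammar name Expanded is made up for the sketch.

```raku
grammar Expanded {
    # Hand-expanded approximation of:
    #   rule garbage { '<' [ <garbchar> | <garbignore> ]* '>' }
    token TOP {
        '<' <.ws>
        [ <garbchar> <.ws> | <garbignore> <.ws> ]*
        <.ws> '>' <.ws>
    }
    token garbignore { '!' . }
    token garbchar   { <-[ !> ]> }
}

# The default ws is not overridden here, so it refuses to match inside a word.
say so Expanded.parse('<aa>');       # False: <.ws> fails between "a" and "a"
say so Expanded.parse('<!!a!a>');    # True: every "a" has a non-word neighbour
```

This is also why the second example in the question happened to work: none of its token boundaries falls between two word characters.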

As your original question allowed, this isn't a bug, it's just that your mental model is off.

Your answer correctly identifies the issue: tokenization.

So what we're left with is your follow-up question, which is about your mental model of tokenization, or at least how Perl 6 tokenizes by default:

why ... my second example ... goes wrong with two garbchars in a row:

'{<aa>}'

Simplifying, the issue is how to tokenize this:

aa

The simple high level answer is that, in parsing vernacular, aa will ordinarily be treated as one token, not two, and, by default, Perl 6 assumes this ordinary definition. This is the issue you're encountering.
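
That default lives in the built-in ws rule, which is what significant whitespace in a rule turns into. A small sketch, using plain regexes and made-up test strings, of how ws behaves at different positions:

```raku
# <.ws> matches optional whitespace, but never in the middle of a word.
say so 'aa'  ~~ / ^ a <.ws> a $ /;      # False: both neighbours are word characters
say so 'a a' ~~ / ^ a <.ws> a $ /;      # True: the space separates two tokens
say so 'a!'  ~~ / ^ a <.ws> '!' $ /;    # True: "!" is not a word character
```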

You can overrule this ordinary definition to get any tokenizing result you care to achieve. But it's seldom necessary to do so and it certainly isn't in simple cases like this.

I'll provide two redundant paths that I hope might lead folk to the correct mental model:

  • For those who prefer diving straight into nitty gritty detail, there's a reddit comment I wrote recently about tokenization in Perl 6.

  • The rest of this SO answer provides a high level discussion that complements the low level explanation in my reddit comment.

Excerpting from the "Obstacles" section of the Wikipedia page on tokenization, and interleaving the excerpts with P6-specific discussion:

Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example:

  • Punctuation and whitespace may or may not be included in the resulting list of tokens.

In Perl 6 you control what gets included or not in the parse tree using capturing features that are orthogonal to tokenizing.
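
For instance, calling a subrule with a leading dot still matches it but leaves it out of the parse tree. A small sketch based on the question's garbage rule (the grammar name Quiet and the test string are invented for the example):

```raku
grammar Quiet {
    # <.garbchar> matches without capturing; <garbignore> still captures.
    token TOP        { '<' [ <.garbchar> | <garbignore> ]* '>' }
    token garbignore { '!' . }
    token garbchar   { <-[ !> ]> }
}

my $m = Quiet.parse('<ab!>c>');
say $m<garbignore>.elems;      # 1: only the cancelled pair "!>" was captured
say $m<garbchar>.defined;      # False: nothing was stored under that name
```

What was matched is the same either way; only what is recorded changes.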

  • All contiguous strings of alphabetic characters are part of one token; likewise with numbers.

  • Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.

By default, the Perl 6 design embodies an equivalent of these two heuristics.

The key thing to get is that it's the rule construct that handles a string of tokens, plural. The token construct is used to define a single token per call.
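
A small contrast, with invented names (Phrase, AsRule, AsToken, word), between a rule that sequences tokens and a token that defines one:

```raku
grammar Phrase {
    # A rule: spaces after atoms become <.ws>, so it matches a series of tokens.
    rule  AsRule  { <word> <word> }
    # A token: spaces in the pattern are ignored, so the two words must be adjacent.
    token AsToken { <word> <word> }
    # One token per call: a maximal run of word characters, with no backtracking.
    token word    { \w+ }
}

say so Phrase.parse('foo bar', :rule<AsRule>);     # True
say so Phrase.parse('foo bar', :rule<AsToken>);    # False: nothing consumes the space
say so Phrase.parse('foobar',  :rule<AsRule>);     # False: "foobar" is a single token
```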

I think I'll end my answer here because it's already getting pretty long. Please use the comments to help us improve this answer. I hope what I've written so far helps.

#2


3  

A partial answer to my own question: Change all the rules to tokens and it works. It makes sense, because the difference is :sigspace, which we don't need or want here. What I don't understand, though, is why it did work for some input, like my second example.

The resulting code is here, if you're interested.
