Implementing Google's DiffMatchPatch API for Python 2/3


I want to write a simple diff application in Python using Google's Diff Match Patch API. I'm quite new to Python, so I'd like an example of how to use the API to semantically compare two paragraphs of text. I'm not sure how to use the diff_match_patch.py file or what to import from it. Any help will be much appreciated!

Additionally, I've tried using difflib, but I found it ineffective for comparing sentences that differ widely. I'm using Ubuntu 12.04 x64.

1 Answer

Google's diff-match-patch API is the same in every language it is implemented in (Java, JavaScript, Dart, C++, C#, Objective C, Lua, and Python 2.x or Python 3.x). You can therefore usually rely on sample snippets written in a language other than your target language to work out which API calls are needed for a given diff/match/patch task.

For a simple "semantic" comparison, this is what you need:

import diff_match_patch

textA = "the cat in the red hat"
textB = "the feline in the blue hat"

# Create a diff_match_patch object
dmp = diff_match_patch.diff_match_patch()

# Depending on the overall length and complexity of the text you work
# with, you may want to extend (or, as here, disable) the time limit
# that Diff_Timeout places on the diff computation
dmp.Diff_Timeout = 0   # 0 means no time limit; the default is 1.0 second

# All 'diff' jobs start with a call to diff_main()
diffs = dmp.diff_main(textA, textB)

# diff_cleanupSemantic() rewrites the diffs list in place to make it more human-readable
dmp.diff_cleanupSemantic(diffs)

# And if you want the result as a ready-to-display HTML snippet
htmlSnippet = dmp.diff_prettyHtml(diffs)
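If HTML is not what you need, the diffs list is easy to walk by hand, since each entry is an (op, text) tuple with op equal to -1 (delete), 1 (insert) or 0 (equal). The following is only a minimal sketch: render_plain and its [-...-] / {+...+} markers are hypothetical names and conventions chosen for illustration, not part of diff-match-patch, and the sample list is the cleaned-up result reported in this answer for textA and textB.

```python
# Operation codes as they appear in the diff tuples shown in this answer.
DIFF_DELETE, DIFF_INSERT, DIFF_EQUAL = -1, 1, 0


def render_plain(diffs):
    """Render (op, text) tuples as one string with inline markers.

    Hypothetical helper: deletions become [-text-], insertions {+text+},
    and unchanged text is copied through as-is.
    """
    parts = []
    for op, text in diffs:
        if op == DIFF_DELETE:
            parts.append("[-" + text + "-]")
        elif op == DIFF_INSERT:
            parts.append("{+" + text + "+}")
        else:
            parts.append(text)
    return "".join(parts)


# Cleaned-up diff list reported in this answer for textA and textB.
diffs = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
         (-1, 'red'), (1, 'blue'), (0, ' hat')]

print(render_plain(diffs))
# the [-cat-]{+feline+} in the [-red-]{+blue+} hat
```

The same loop can be adapted to any output format (terminal colors, Markdown, and so on) by changing the strings appended for each operation.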


A word on "semantic" processing by diff-match-patch
Such processing is useful when presenting differences to a human reader, because it tends to produce a shorter list of differences by avoiding irrelevant resynchronization of the texts (for example, when two distinct words happen to share letters in the middle). The results are far from perfect, however: the processing relies on simple heuristics based on the length of the differences, surface patterns and the like, rather than on actual NLP based on lexicons and other semantic-level devices.
For example, the textA and textB values used above produce the following diffs lists before and after diff_cleanupSemantic():

[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'red'), (1, 'blue'), (0, ' hat')]

Nice! The letter 'e', common to 'red' and 'blue', leads diff_main() to break this area of the text into four pieces, but diff_cleanupSemantic() reduces it to just two edits, neatly singling out the two differing words, 'red' and 'blue'.
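To make the tuple format concrete: both lists above describe exactly the same pair of texts, and the cleanup only changes how the edits are grouped. A small sketch that checks this by rebuilding both texts from each list follows. rebuild_texts is a hypothetical helper written for this illustration; the library itself offers diff_text1() and diff_text2() for the same purpose.

```python
def rebuild_texts(diffs):
    """Return (old_text, new_text) reconstructed from (op, text) tuples.

    The old text is everything except insertions (op == 1);
    the new text is everything except deletions (op == -1).
    """
    old_text = "".join(text for op, text in diffs if op != 1)
    new_text = "".join(text for op, text in diffs if op != -1)
    return old_text, new_text


# The two lists reported above, before and after diff_cleanupSemantic().
before = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
          (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
after = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
         (-1, 'red'), (1, 'blue'), (0, ' hat')]

print(rebuild_texts(before))
# ('the cat in the red hat', 'the feline in the blue hat')
print(rebuild_texts(before) == rebuild_texts(after))
# True
```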

However, if we have, for example:

textA = "stackoverflow is cool"
textB = "so is very cool"

The before/after arrays produced are:

[(0, 's'), (-1, 'tack'), (0, 'o'), (-1, 'verflow'), (0, ' is'), (1, ' very'), (0, ' cool')]
[(0, 's'), (-1, 'tackoverflow is'), (1, 'o is very'), (0, ' cool')]

This shows that the supposedly improved "after" list can end up more convoluted than the "before" one. Note, for example, how the leading 's' is kept as a match, and how the added word 'very' is lumped together with part of the 'is cool' expression. Ideally, we'd probably expect something like:

[(-1, 'stackoverflow'), (1, 'so'), (0, ' is'), (1, ' very'), (0, ' cool')]
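One way to get closer to a word-aligned result is to diff word tokens instead of characters. The sketch below does this with the standard library's difflib.SequenceMatcher rather than diff-match-patch, so it is a swapped-in technique and not what the library does internally. word_diff and its whitespace-based tokenization are assumptions made for illustration; the output mimics the (op, text) tuple format used in this answer.

```python
import difflib
import re


def word_diff(old_text, new_text):
    """Diff two strings token by token, returning (op, text) tuples.

    Hypothetical helper: tokens are runs of whitespace or runs of
    non-whitespace, so an edit can never split a word in two.
    """
    old_tokens = re.findall(r"\s+|\S+", old_text)
    new_tokens = re.findall(r"\s+|\S+", new_text)
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens,
                                      autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            diffs.append((0, "".join(old_tokens[i1:i2])))
            continue
        # 'replace', 'delete' and 'insert' map to a deletion and/or an insertion.
        if i2 > i1:
            diffs.append((-1, "".join(old_tokens[i1:i2])))
        if j2 > j1:
            diffs.append((1, "".join(new_tokens[j1:j2])))
    return diffs


for op, text in word_diff("stackoverflow is cool", "so is very cool"):
    print(op, repr(text))
```

Whole words are now the unit of comparison, so 'stackoverflow' and 'so' come out as one deletion and one insertion instead of sharing a stray 's'. The trade-off is that a small change inside a long word is reported as a replacement of the entire word.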
