I am trying to normalize some scores from a .txt file by dividing each score for each possible sense (e.g. take#v#2; referred to as $tokpossense in my code) by the sum of all scores for a wordtype (e.g. take#v; referred to as $tokpos). The difficulty is in grouping the wordtypes together when processing each line of the file, so that the normalized scores are printed upon finding a new wordtype/$tokpos. I used two hashes and an if block to achieve this.
Currently, the problem seems to be that $tokpos is undefined as a key in $SumHash{$tokpos} at line 20, resulting in a division by zero. However, I believe $tokpos is properly defined within the scope of this block. What exactly is the problem, and how would I best solve it? I would also gladly hear alternative approaches to this problem.
Here's an example input file:
i#CL take#v#17 my#CL checks#n#1 to#CL the#CL bank#n#2 .#IT
Context: i#CL <target>take#v</target> my#CL checks#n to#CL the#CL bank#n
Scores for take#v
    take#v#1: 17
    take#v#10: 158
    take#v#17: 174
Winning score: 174
Context: i#CL take#v my#CL <target>checks#n</target> to#CL the#CL bank#n .#IT
Scores for checks#n
    check#n#1: 198
    check#n#2: 117
    check#n#3: 42
Winning score: 198
Context: take#v my#CL checks#n to#CL the#CL <target>bank#n</target> .#IT
Scores for bank#n
    bank#n#1: 81
    bank#n#2: 202
    bank#n#3: 68
    bank#n#4: 37
Winning score: 202
My erroneous code:
@files = @ARGV;
foreach $file(@files){
    open(IN, $file);
    @lines=<IN>;
    foreach (@lines){
        chomp;
        #store tokpossense (eg. "take#v#1") and rawscore (eg. 4)
        if (($tokpossense,$rawscore)= /^\s{4}(.+): (\d+)/) {
            #split tokpossense for recombination
            ($tok,$pos,$sensenr)=split(/#/,$tokpossense);
            #tokpos (eg. take#v) will be a unique identifier when calculating normalized score
            $tokpos="$tok\#$pos";
            #block for when new tokpos(word) is found in inputfile
            if (defined($prevtokpos) and
                ($tokpos ne $prevtokpos)) {
                # normalize hash: THE PROBLEM LIES IN $SumHash{$tokpos} which is returned as zero > WHY?
                foreach (keys %ScoreHash) {
                    $normscore=$ScoreHash{$_}/$SumHash{$tokpos};
                    #print the results to a file
                    print "$_\t$ScoreHash{$_}\t$normscore\n";
                }
                #empty hashes
                undef %ScoreHash;
                undef %SumHash;
            }
            #prevtokpos is assigned to tokpos for condition above
            $prevtokpos = $tokpos;
            #store the sum of scores for a tokpos identifier for normalization
            $SumHash{$tokpos}+=$rawscore;
            #store the scores for a tokpossense identifier for normalization
            $ScoreHash{$tokpossense}=$rawscore;
        }
        #skip the irrelevant lines of inputfile
        else {next;}
    }
}
Extra info: I am doing Word Sense Disambiguation using Pedersen's Wordnet WSD tool which uses Wordnet::Similarity::AllWords. The output file is generated by this package and the found scores have to be normalized for implementation in our toolset.
You don't assign anything to $tokpos. The assignment is part of a comment; syntax highlighting in your editor should have told you. `use strict` would have told you, too.
Also, you should probably use $prevtokpos in the division: $tokpos is the new value that you haven't met before, so nothing has been summed for it yet. To get the output for the last token, you have to process it outside the loop, as there's no following $tokpos to trigger it. To avoid code repetition, use a subroutine to do that:
#!/usr/bin/perl
use warnings;
use strict;

my %SumHash;
my %ScoreHash;

# Print the normalized scores collected for $token, then reset both hashes.
sub output {
    my $token = shift;
    for (keys %ScoreHash) {
        my $normscore = $ScoreHash{$_} / $SumHash{$token};
        print "$_\t$ScoreHash{$_}\t$normscore\n";
    }
    undef %ScoreHash;
    undef %SumHash;
}

my $prevtokpos;
while (<DATA>){
    chomp;
    if (my ($tokpossense,$rawscore) = /^\s{4}(.+): (\d+)/) {
        my ($tok, $pos, $sensenr) = split /#/, $tokpossense;
        my $tokpos = "$tok\#$pos";
        # A new word type starts: flush the previous, now complete, one.
        if (defined $prevtokpos && $tokpos ne $prevtokpos) {
            output($prevtokpos);
        }
        $prevtokpos = $tokpos;
        $SumHash{$tokpos} += $rawscore;
        $ScoreHash{$tokpossense} = $rawscore;
    }
}
# The last word type has no successor to trigger the output.
output($prevtokpos);
__DATA__
i#CL take#v#17 my#CL checks#n#1 to#CL the#CL bank#n#2 .#IT
Context: i#CL <target>take#v</target> my#CL checks#n to#CL the#CL bank#n
Scores for take#v
    take#v#1: 17
    take#v#10: 158
    take#v#17: 174
Winning score: 174
Context: i#CL take#v my#CL <target>checks#n</target> to#CL the#CL bank#n .#IT
Scores for checks#n
    check#n#1: 198
    check#n#2: 117
    check#n#3: 42
Winning score: 198
Context: take#v my#CL checks#n to#CL the#CL <target>bank#n</target> .#IT
Scores for bank#n
    bank#n#1: 81
    bank#n#2: 202
    bank#n#3: 68
    bank#n#4: 37
Winning score: 202
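One thing to keep in mind with this approach: `keys` returns hash keys in no particular order, so the senses within a word type may be printed in any order. If a stable order matters, the rows could be sorted by sense number first. A minimal sketch, assuming a hypothetical helper `normalized_rows` (not part of the code above) that receives the per-sense scores and their sum:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the normalized rows for one word type,
# sorted by numeric sense number instead of hash order.
sub normalized_rows {
    my ($scores, $total) = @_;    # hash ref (sense label => raw score), sum of scores
    my @rows;
    # Sort on the number after the last '#', so take#v#2 would precede take#v#10.
    for my $sense (sort { ($a =~ /#(\d+)$/)[0] <=> ($b =~ /#(\d+)$/)[0] } keys %$scores) {
        push @rows, sprintf "%s\t%d\t%.4f",
            $sense, $scores->{$sense}, $scores->{$sense} / $total;
    }
    return @rows;
}

my %scores = ('take#v#1' => 17, 'take#v#10' => 158, 'take#v#17' => 174);
print "$_\n" for normalized_rows(\%scores, 349);
```

The numeric comparison is the point here: a plain string sort would place take#v#10 before take#v#2.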
You're confusing yourself by trying to print the results as soon as $tokpos changes. For one thing, it's the values for $prevtokpos that are complete, but you're trying to output the data for $tokpos; and also you're never going to display the last block of data, because you require a change in $tokpos to trigger the output.
It's far easier to accumulate all the data for a given file and then print it when the end of file is reached. This program works by keeping the three values $tokpos, $sense, and $rawscore for each line of the output in the array @results, together with the total score for each value of $tokpos in %totals. Then it's simply a matter of dumping the contents of @results with an extra column that divides each value by the corresponding total.
use strict;
use warnings;
use 5.014; # For non-destructive substitution

for my $file ( @ARGV ) {
    open my $fh, '<', $file or die $!;
    my (@results, %totals);
    while ( <$fh> ) {
        chomp;
        # Capture "word#pos", the sense number, and the raw score.
        next unless my ($tokpos, $sense, $rawscore) =
            / ^ \s{4} ( [^#]+ \# [^#]+ ) \# (\d+) : \s+ (\d+) /x;
        push @results, [ $tokpos, $sense, $rawscore ];
        $totals{$tokpos} += $rawscore;
    }
    print "** $file **\n";
    for my $item ( @results ) {
        my ($tokpos, $sense, $rawscore) = @$item;
        # Rejoin with '#', which the regex consumed between the two captures.
        printf "%s\t%s\t%6.4f\n", "$tokpos#$sense", $rawscore, $rawscore / $totals{$tokpos};
    }
    print "\n";
}
Output:
** tokpos.txt **
take#v#1 17 0.0487
take#v#10 158 0.4527
take#v#17 174 0.4986
check#n#1 198 0.5546
check#n#2 117 0.3277
check#n#3 42 0.1176
bank#n#1 81 0.2088
bank#n#2 202 0.5206
bank#n#3 68 0.1753
bank#n#4 37 0.0954
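If every score for a word type happened to be zero, the division would still die with "Illegal division by zero". A minimal sketch of the same accumulate-then-print idea with a guard for that case, wrapped in a hypothetical subroutine `normalize_lines` (the name, and the choice to report 0 for an all-zero word type, are assumptions rather than anything from the answers above):

```perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of one input file and returns
# [label, raw score, normalized score] triples in input order.
sub normalize_lines {
    my @lines = @_;
    my (@rows, %total);
    for my $line (@lines) {
        # Same line shape as above: four leading spaces, "word#pos#sense: score".
        next unless $line =~ /^\s{4}([^#]+#[^#]+)#(\d+):\s+(\d+)/;
        my ($tokpos, $sense, $raw) = ($1, $2, $3);
        push @rows, [ $tokpos, "$tokpos#$sense", $raw ];
        $total{$tokpos} += $raw;
    }
    # Assumption: a word type whose scores sum to zero gets normalized score 0.
    return map {
        my ($tokpos, $label, $raw) = @$_;
        [ $label, $raw, $total{$tokpos} ? $raw / $total{$tokpos} : 0 ];
    } @rows;
}

printf "%s\t%d\t%.4f\n", @$_
    for normalize_lines("    bank#n#1: 0", "    bank#n#2: 0",
                        "    take#v#1: 17", "    take#v#10: 158");
```

Returning the rows instead of printing them keeps the parsing separate from the output format, which may make it easier to feed the normalized scores into another tool.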