How to remove accents and tilde in a C++ std::string


I have a problem with a string in C++ that contains several words in Spanish, so it has a lot of letters with accents and tildes. I want to replace them with their unaccented counterparts. For example, I want to replace the word "había" with "habia". I tried to do it directly with the replace method of the string class, but I could not get that to work.

I'm using this code:

for (it= dictionary.begin(); it != dictionary.end(); it++)
{
    strMine=(it->first);
    found=toReplace.find_first_of(strMine);
    while (found!=std::string::npos)
    {
        strAux=(it->second);
        toReplace.erase(found,strMine.length());
        toReplace.insert(found,strAux);
        found=toReplace.find_first_of(strMine,found+1);
    }
}

Where dictionary is a map like this (with more entries):

dictionary.insert ( std::pair<std::string,std::string>("á","a") );
dictionary.insert ( std::pair<std::string,std::string>("é","e") );
dictionary.insert ( std::pair<std::string,std::string>("í","i") );
dictionary.insert ( std::pair<std::string,std::string>("ó","o") );
dictionary.insert ( std::pair<std::string,std::string>("ú","u") );
dictionary.insert ( std::pair<std::string,std::string>("ñ","n") );

and the toReplace string is:

std::string toReplace="á-é-í-ó-ú-ñ-á-é-í-ó-ú-ñ";

I must obviously be missing something, but I can't figure it out. Is there any library I can use?

Thanks,

12 answers

#1


17  

First, this is a really bad idea: you’re mangling somebody’s language by removing letters. Although the extra dots in words like “naïve” seem superfluous to people who only speak English, there are literally thousands of writing systems in the world in which such distinctions are very important. Writing software to mutilate someone’s speech puts you squarely on the wrong side of the tension between using computers as means to broaden the realm of human expression vs. tools of oppression.

What is the reason you’re trying to do this? Is something further down the line choking on the accents? Many people would love to help you solve that.

That said, libicu can do this for you. Open the transform demo; copy and paste your Spanish text into the “Input” box; enter

NFD; [:M:] remove; NFC

as “Compound 1” and click transform.

(With help from slide 9 of Unicode Transforms in ICU. Slides 29-30 show how to use the API.)

#2


23  

I disagree with the currently "approved" answer. The question makes perfect sense when you are indexing text. Like case-insensitive search, accent-insensitive search is a good idea. "naïve" matches "Naïve" matches "naive" matches "NAİVE" (you do know that an uppercase i is İ in Turkish? That's why you ignore accents)

Now, the best algorithm is hinted at in the approved answer: use NFKD (compatibility decomposition) to decompose accented letters into the base letter and a separate accent, and then remove all accents.

There is little point in the re-composition afterwards, though. You removed most sequences which would change, and the others are for all intents and purposes identical anyway. What's the difference between æ in NFKC and æ in NFKD?
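
To illustrate the "remove all accents" step, here is a minimal sketch. It assumes the text is UTF-8 and has already been decomposed by something else (ICU, for instance). The name strip_combining_marks is made up for this example, and it only covers the Combining Diacritical Marks block (U+0300 to U+036F), not the whole Unicode M category, so treat it as a rough approximation rather than a full solution.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: drops UTF-8 encoded code points U+0300..U+036F
// (Combining Diacritical Marks) from text that is ALREADY decomposed.
// It does not decompose anything itself.
std::string strip_combining_marks(const std::string& utf8)
{
    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if ((lead == 0xCC || lead == 0xCD) && i + 1 < utf8.size()) {
            const unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
            // U+0300..U+033F encode as CC 80..BF, U+0340..U+036F as CD 80..AF.
            const bool is_mark = (lead == 0xCC && next >= 0x80 && next <= 0xBF) ||
                                 (lead == 0xCD && next >= 0x80 && next <= 0xAF);
            if (is_mark) {
                ++i;  // skip the continuation byte as well
                continue;
            }
        }
        out.push_back(utf8[i]);
    }
    return out;
}
```

Given the decomposed form of "había" (an "i" followed by U+0301), this returns "habia". A precomposed "á" (U+00E1) passes through untouched, which is why the decomposition step has to come first.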

#3


2  

I definitely think you should look into the root of the problem. That is, look for a solution that will allow you to support characters encoded in Unicode or for the user's locale.

That being said, your problem is that you're dealing with multi-byte characters. There is std::wstring, but I'm not sure I'd use that. For one thing, wide characters aren't meant to handle variable-width encodings. This hole goes deep, so I'll leave it at that.
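
A small sketch of what "multi-byte" means here, assuming the strings are UTF-8 (the question does not say which encoding is in use). The two helper functions are hypothetical names that only wrap std::string::find_first_of and std::string::find to make the difference visible.

```cpp
#include <cstddef>
#include <string>

// "á" and "é" written as explicit UTF-8 bytes so the example does not
// depend on the encoding of the source file (assumption: the text is UTF-8).
const std::string a_acute = "\xC3\xA1";  // á
const std::string e_acute = "\xC3\xA9";  // é

// Position of the first byte in `text` that equals ANY byte of `needle`.
std::size_t first_matching_byte(const std::string& text, const std::string& needle)
{
    return text.find_first_of(needle);
}

// Position of the first occurrence of `needle` as a whole byte sequence.
std::size_t first_matching_sequence(const std::string& text, const std::string& needle)
{
    return text.find(needle);
}
```

With text = e_acute + "-" + a_acute, first_matching_byte(text, a_acute) returns 0, because "é" shares the lead byte 0xC3 with "á", while first_matching_sequence(text, a_acute) returns 3. So a loop built on find_first_of will hit the wrong character.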

Now, as for the rest of your code, it is error-prone because you mix the looping logic with the translation logic. Thus, at least two kinds of bugs can occur: translation bugs and looping bugs. Do use the STL; it can help you a lot with the looping part.

The following is a rough solution for replacing characters in a string.

main.cpp:

#include <iostream>
#include <string>
#include <iterator>
#include <algorithm>
#include "translate_characters.h"

using namespace std;

int main()
{
    string text;
    cin.unsetf(ios::skipws);
    transform(istream_iterator<char>(cin), istream_iterator<char>(),
              inserter(text, text.end()), translate_characters());
    cout << text << endl;
    return 0;
}

translate_characters.h:

#ifndef TRANSLATE_CHARACTERS_H
#define TRANSLATE_CHARACTERS_H

#include <map>

// std::unary_function was deprecated in C++11 and removed in C++17; a plain functor is enough.
class translate_characters {
public:
    translate_characters();
    char operator()(const char c);

private:
    std::map<char, char> characters_map;
};

#endif // TRANSLATE_CHARACTERS_H

translate_characters.cpp:

#include "translate_characters.h"

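#include <utility>  // std::make_pair

using namespace std;

translate_characters::translate_characters()
{
    // Placeholder mapping for illustration; only single-byte characters fit in a char.
    characters_map.insert(make_pair('e', 'a'));
}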

char translate_characters::operator()(const char c)
{
    map<char, char>::const_iterator translation_pos(characters_map.find(c));
    if( translation_pos == characters_map.end() )
        return c;
    return translation_pos->second;
}

#4


1  

I'm surprised some people say you shouldn't deaccentuate characters. Having accents on characters in filenames can get you into a lot of problems when using programs manifestly written by programmers who didn't allow for this.

#5


1  

I'm totally 100% in favour of using Unicode and not losing important information such as accents, but sometimes you need to do something like this. It's best not to second-guess people's reasons for wanting a particular function. In my case, I'm looking to do this for the purposes of searching for "similar" texts (which often means texts written - incorrectly - without accents).

Someone will always have a valid reason.

#6


0  

You might want to check out the boost (http://www.boost.org/) library.

It has a regexp library, which you could use. In addition it has a specific library that has some functions for string manipulation (link) including replace.

#7


0  

I'm using Unix (I forgot to mention that), and I ran tr like this:

$ tr áéíóú aeiou
á-é-í-ó-ú
ue-uo-uu-uu-uu

It does not work as expected. I think it has to do with Unicode and the string class.

#8


0  

The thing is that I am developing an application for university that is due in 5 days. It's a program that will index the text inside the tag in HTML pages (I also can't use Apache Lucene to create the index). However, I won't be indexing all the words: I must remove all stopwords, use stemming, and make all the text lowercase. At our teacher's request, we must eliminate accents and tildes in the words. I hope this makes things a little clearer.

Saludos,

#9


0  

Try using std::wstring instead of std::string. UTF-16 should work (as opposed to ASCII).

#10


0  

If you can (if you're running Unix), I suggest using the tr facility for this: it's custom-built for this purpose. Remember, no code == no buggy code. :-)

Edit: Sorry, you're right, tr doesn't seem to work. How about sed? It's a pretty stupid script I've written, but it works for me.

#!/bin/sed -f
s/á/a/g;
s/é/e/g;
s/í/i/g;
s/ó/o/g;
s/ú/u/g;
s/ñ/n/g;

#11


0  

I could not link the ICU libraries, but I still think it's the best solution. As I need this program to be functional as soon as possible, I made a little program (that I have to improve) and I'm going to use that. Thank you all for the suggestions and answers.

Here's the code I'm gonna use:

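for (it= dictionary.begin(); it != dictionary.end(); it++)
{
    strMine=(it->first);
    found=toReplace.find(strMine);
    while (found != std::string::npos)
    {
        strAux=(it->second);
        // Erase the whole multi-byte key rather than a hard-coded 2 bytes.
        toReplace.erase(found,strMine.length());
        toReplace.insert(found,strAux);
        // Continue searching after the text just inserted.
        found=toReplace.find(strMine,found+strAux.length());
    }
}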

I will change it next time I have to turn my program in for correction (in about 6 weeks).
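
For anyone who wants to try the same idea in isolation, here is a self-contained sketch of that loop wrapped in a function. The name replace_all and its signature are made up for this example; it treats every key as a whole byte sequence, so it works for UTF-8 input as long as the dictionary keys use the same encoding as the text.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: replaces every occurrence of each key of `dictionary`
// in `text` with the mapped value, treating keys as whole byte sequences.
std::string replace_all(std::string text,
                        const std::map<std::string, std::string>& dictionary)
{
    for (std::map<std::string, std::string>::const_iterator it = dictionary.begin();
         it != dictionary.end(); ++it) {
        const std::string& from = it->first;
        const std::string& to = it->second;
        if (from.empty())
            continue;  // an empty key would match everywhere and never advance
        std::size_t found = text.find(from);
        while (found != std::string::npos) {
            text.replace(found, from.length(), to);
            // Resume after the inserted text so replacements are not rescanned.
            found = text.find(from, found + to.length());
        }
    }
    return text;
}
```

Filling the map with the same pairs as the dictionary in the question and calling replace_all(toReplace, dictionary) should give the unaccented string. It only knows the characters listed in the map, so it is no substitute for a real Unicode library.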

#12


0  

    /// <summary>
    /// 
    /// Replace any accent and foreign character by their ASCII equivalent.
    /// In other words, convert a string to an ASCII-compliant string.
    /// 
    /// This also gets rid of special hidden characters, like EOF, NUL, TAB and other '\0', except \n\r
    /// 
    /// Tests with accents and foreign characters:
    /// Before: "äæǽaeöœoeüueÄAeÜUeÖOeÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶАAàáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặаaБBбbÇĆĈĊČCçćĉċčcДDдdÐĎĐΔDjðďđδdjÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭEèéêëēĕėęěέεẽẻẹềếễểệеэeФFфfĜĞĠĢΓГҐGĝğġģγгґgĤĦHĥħhÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫIìíîïĩīĭǐįıηήίιϊỉịиыїiĴJĵjĶΚКKķκкkĹĻĽĿŁΛЛLĺļľŀłλлlМMмmÑŃŅŇΝНNñńņňʼnνнnÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢОOòóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợоoПPпpŔŖŘΡРRŕŗřρрrŚŜŞȘŠΣСSśŝşșšſσςсsȚŢŤŦτТTțţťŧтtÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУUùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựуuÝŸŶΥΎΫỲỸỶỴЙYýÿŷỳỹỷỵйyВVвvŴWŵwŹŻŽΖЗZźżžζзzÆǼAEßssIJIJijijŒOEƒf'ξksπpβvμmψpsЁYoёyoЄYeєyeЇYiЖZhжzhХKhхkhЦTsцtsЧChчchШShшshЩShchщshchЪъЬьЮYuюyuЯYaяya"
    /// After:  "aaeooeuueAAeUUeOOeAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaaaaaaaaaaaaaaaaBbCCCCCCccccccDdDDjddjEEEEEEEEEEEEEEEEEEeeeeeeeeeeeeeeeeeeFfGGGGGgggggHHhhIIIIIIIIIIIIIiiiiiiiiiiiiJJjjKKkkLLLLllllMmNNNNNnnnnnOOOOOOOOOOOOOOOOOOOOOOooooooooooooooooooooooPpRRRRrrrrSSSSSSssssssTTTTttttUUUUUUUUUUUUUUUUUUUUUUUUuuuuuuuuuuuuuuuuuuuuuuuYYYYYYYYyyyyyyyyVvWWwwZZZZzzzzAEssIJijOEf'kspvmpsYoyoYeyeYiZhzhKhkhTstsChchShshShchshchYuyuYaya"
    /// 
    /// Tests with invalid 'special hidden characters':
    /// Before: "\0\0\000\0000Bj��rk�\'\"\\\0\a\b\f\n\r\t\v\u0020���oacu\'\\\'te�"
    /// After:  "00000Bjrk'\"\\\n\r oacu'\\'te"
    /// 
    /// </summary>
    private string Normalize(string StringToClean)
    {
        string normalizedString = StringToClean.Normalize(NormalizationForm.FormD);
        StringBuilder Buffer = new StringBuilder(StringToClean.Length);

        for (int i = 0; i < normalizedString.Length; i++)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(normalizedString[i]) != UnicodeCategory.NonSpacingMark)
            {
                Buffer.Append(normalizedString[i]);
            }
        }

        string PreAsciiCompliant = Buffer.ToString().Normalize(NormalizationForm.FormC);
        StringBuilder AsciiComplient = new StringBuilder(PreAsciiCompliant.Length);

        foreach (char character in PreAsciiCompliant)
        {
            //Reject all special characters except \n\r (Carriage-Return and Line-Feed). 
            //Get rid of special hidden character, like EOF, NUL, TAB and other '\0'
            if (((int)character >= 32 && (int)character < 127) || ((int)character == 10 || (int)character == 13)) 
            {
                AsciiComplient.Append(character);
            }
        }
        return AsciiComplient.ToString().Trim(); // Remove spaces at start and end of string if any
    }
