A faster way of doing multiple string replacements


I need to do the following:

    static string[] pats = { "å", "Å", "æ", "Æ", "ä", "Ä", "ö", "Ö", "ø", "Ø" ,"è", "È", "à", "À", "ì", "Ì", "õ", "Õ", "ï", "Ï" };
    static string[] repl = { "a", "A", "a", "A", "a", "A", "o", "O", "o", "O", "e", "E", "a", "A", "i", "I", "o", "O", "i", "I" };
    static int i = pats.Length;
    int j;

     // function for the replacement(s)
     public string DoRepl(string Inp) {
      string tmp = Inp;
        for( j = 0; j < i; j++ ) {
            tmp = Regex.Replace(tmp,pats[j],repl[j]);
        }
        return tmp.ToString();            
    }
    /* Main flow processes about 45000 lines of input */

Each line has 6 elements that go through DoRepl. Approximately 300,000 function calls. Each does 20 Regex.Replace, totalling ~6 million replaces.

Is there any more elegant way to do this in fewer passes?

8 answers

#1


21  

static Dictionary<char, char> repl = new Dictionary<char, char>() { { 'å', 'a' }, { 'ø', 'o' } }; // etc...
public string DoRepl(string Inp)
{
    var tmp = Inp.Select(c =>
    {
        char r;
        if (repl.TryGetValue(c, out r))
            return r;
        return c;
    });
    return new string(tmp.ToArray());
}

Each char is checked only once against a dictionary and replaced if found in the dictionary.

#2


12  

How about this "trick"?

string conv = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(input));

#3


10  

Without regex it might be way faster.

    for( j = 0; j < i; j++ ) 
    {
        tmp = tmp.Replace(pats[j], repl[j]);
    }

Edit

Another way using Zip and a StringBuilder:

StringBuilder result = new StringBuilder(input);
// Pair each pattern with its replacement, then apply them in order
foreach (var zipped in patterns.Zip(replacements, (p, r) => new {p, r}))
{
  result = result.Replace(zipped.p, zipped.r);
}
return result.ToString();

#4


2  

The problem with your original regex is that you're not using it to its fullest potential. Remember, a regex pattern can have alternations. You will still need a dictionary, but you can do it in one pass without looping through each character.

This would be achieved as follows:

string[] pats = { "å", "Å", "æ", "Æ", "ä", "Ä", "ö", "Ö", "ø", "Ø" ,"è", "È", "à", "À", "ì", "Ì", "õ", "Õ", "ï", "Ï" };
string[] repl = { "a", "A", "a", "A", "a", "A", "o", "O", "o", "O", "e", "E", "a", "A", "i", "I", "o", "O", "i", "I" };
// using Zip as a shortcut, otherwise setup dictionary differently as others have shown
var dict = pats.Zip(repl, (k,v) => new { Key = k, Value = v }).ToDictionary(o => o.Key, o => o.Value);

string input = "åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ";
string pattern = String.Join("|", dict.Keys.Select(k => k)); // use ToArray() for .NET 3.5
string result = Regex.Replace(input, pattern, m => dict[m.Value]);

Console.WriteLine("Pattern: " + pattern);
Console.WriteLine("Input: " + input);
Console.WriteLine("Result: " + result);

Of course, you should always escape your pattern using Regex.Escape. In this case this is not needed since we know the finite set of characters and they don't need to be escaped.
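
For the general case where the keys are not known to be safe, the escaping step could look like the sketch below. It is only an illustration under a few assumptions: the class name `AccentReplacer` and the shortened mapping are hypothetical, and the pattern is built once and reused rather than rebuilt on every call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AccentReplacer
{
    // Hypothetical, shortened mapping for illustration; extend it with the remaining pairs.
    private static readonly Dictionary<string, string> Map = new Dictionary<string, string>
    {
        { "å", "a" }, { "Å", "A" }, { "æ", "a" }, { "Æ", "A" },
        { "ø", "o" }, { "Ø", "O" }, { "è", "e" }, { "È", "E" }
    };

    // Escape every key so that metacharacters such as '.' or '+' would match literally,
    // then join the keys into one alternation. Built once and reused for every call.
    private static readonly Regex Pattern = new Regex(
        string.Join("|", Map.Keys.Select(k => Regex.Escape(k)).ToArray()),
        RegexOptions.Compiled);

    public static string DoRepl(string input)
    {
        // Single pass over the input; each match is looked up in the dictionary.
        return Pattern.Replace(input, m => Map[m.Value]);
    }
}
```

Calling `AccentReplacer.DoRepl(line)` for each element then costs one regex pass instead of twenty.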

#5


2  

First, I would use a StringBuilder to perform the translation inside a buffer and avoid creating new strings all over the place.

Next, ideally we'd like something akin to XPath's translate(), so we can work with strings instead of arrays or mappings. Let's do that in an extension method:

public static StringBuilder Translate(this StringBuilder builder,
    string inChars, string outChars)
{
    int length = Math.Min(inChars.Length, outChars.Length);
    for (int i = 0; i < length; ++i) {
        builder.Replace(inChars[i], outChars[i]);
    }
    return builder;
}

Then use it:

StringBuilder builder = new StringBuilder(yourString);
yourString = builder.Translate("åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ",
    "aAaAaAoOoOeEaAiIoOiI").ToString();

#6


1  

If you want to remove accents, then the solution in "How do I remove diacritics (accents) from a string in .NET?" may be helpful.

Otherwise I would do this in a single pass:

Dictionary<char, char> replacements = new Dictionary<char, char>();
...
StringBuilder result = new StringBuilder(str.Length);
foreach (char c in str)
{
  char rc;
  // Fall back to the original character when no replacement is defined
  if (!replacements.TryGetValue(c, out rc))
  {
    rc = c;
  }
  result.Append(rc);
}
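
The diacritics-removal approach mentioned at the start of this answer is commonly written with Unicode normalization. The following is a minimal sketch of that idea, not necessarily the exact code from the linked question; the class and method names are hypothetical. Note that letters without a canonical decomposition, such as 'æ' and 'ø', pass through unchanged, so they would still need an explicit mapping like the dictionary above.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsRemover
{
    public static string RemoveDiacritics(string text)
    {
        // FormD splits a letter such as 'å' into 'a' followed by a combining mark.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Drop the combining marks and keep every other character.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```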

#7


1  

The fastest (IMHO) way (compared even with the dictionary) in the special case of one-to-one character replacement would be a full character map:

public class Converter
{
    private readonly char[] _map;

    public Converter()
    {
        // char is a 16-bit unsigned value, so the map needs char.MaxValue + 1 entries
        _map = new char[char.MaxValue + 1];

        for (int i = 0; i < _map.Length; i++)
            _map[i] = (char)i;

        _map['å'] = 'a';  // Note that 'å' is used as an integer index into the array.
        _map['Å'] = 'A';
        _map['æ'] = 'a';
        // ... the rest of overriding map
    }

    public string Convert(string source)
    {
        if (string.IsNullOrEmpty(source))
            return source;

        var result = new char[source.Length];

        for (int i = 0; i < source.Length; i++)
            result[i] = _map[source[i]]; // convert using the map

        return new string(result);
    }
}

To further speed up this code, you might want to use the "unsafe" keyword and use pointers. This way, traversing the string could be done faster and without bounds checks (which in theory would be optimized away by the VM, but might not be).

#8


0  

I'm not familiar with the Regex class, but most regular expression engines have a transliterate operation that would work well here. Then you would only need one call per line.
