How do I prevent PHP's DOMDocument from encoding HTML entities?


I have a function that replaces the href attribute of anchors in a string using PHP's DOMDocument. Here's a snippet:

$doc        = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($text);
$anchors    = $doc->getElementsByTagName('a');

foreach($anchors as $a) {
    $a->setAttribute('href', 'http://google.com');
}

return $doc->saveHTML();

The problem is that loadHTML($text) wraps $text in doctype, html, body, and similar tags. I tried to work around this by building the document like this instead of calling loadHTML():

$doc        = new DOMDocument('1.0', 'UTF-8');
$node       = $doc->createTextNode($text);
$doc->appendChild($node);
...

Unfortunately, this encodes all the entities (anchors included). Does anyone know how to turn this off? I've already thoroughly looked through the docs and tried hacking it, but can't figure it out.

Thanks! :)

4 answers

#1


$text is a translated string with place-holder anchor tags

If these placeholders have a strict, well-defined format, a simple preg_replace or preg_replace_callback might do the trick.
I do not suggest manipulating HTML documents with regular expressions in general, but for a small, well-defined subset they are suitable.
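
As a rough sketch of that idea, assuming every placeholder anchor has exactly the form `<a href="...">` with no other attributes (the function name `replacePlaceholderHrefs` is made up for illustration and does not come from the thread):

```php
<?php
// Hypothetical helper: swaps the href of simple placeholder anchors.
// Assumes each anchor looks exactly like <a href="..."> with nothing else in the tag.
function replacePlaceholderHrefs(string $text, string $url): string
{
    // Escape the URL so it is safe inside a double-quoted attribute.
    $safeUrl = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');

    $result = preg_replace_callback(
        '/<a\s+href="[^"]*"\s*>/i',
        function (array $match) use ($safeUrl): string {
            return '<a href="' . $safeUrl . '">';
        },
        $text
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo replacePlaceholderHrefs('You can <a href="#">click here</a>.', 'http://google.com'), "\n";
```

Anchors that carry extra attributes would not match this pattern and would be left untouched, which is the trade-off of keeping the expression strict.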

#2


XML has only a few predefined entities. All your HTML entities are defined somewhere else. When you use loadHTML(), these entity definitions are loaded automatically; with loadXML() (or no load call at all) they are not.
createTextNode() does exactly what the name suggests. Everything you pass as the value is treated as text content, not as markup. That is, if you pass something that has a special meaning in markup (<, >, ...), it is encoded so that a parser can distinguish the text from the actual markup (&lt;, &gt;, ...).
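
A small sketch of that behaviour (the helper name `textNodeAsXml` is hypothetical and only serves the illustration): the string is stored as text, so the serializer escapes the angle brackets when the node is written out.

```php
<?php
// Hypothetical helper: puts $text into a text node inside <p> and serializes that element.
function textNodeAsXml(string $text): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $p   = $doc->createElement('p');

    // The value is treated as character data, never parsed as markup.
    $p->appendChild($doc->createTextNode($text));
    $doc->appendChild($p);

    // Passing a node to saveXML() serializes just that node, without the XML declaration.
    return $doc->saveXML($p);
}

echo textNodeAsXml('<b>bold</b>'), "\n";
// prints: <p>&lt;b&gt;bold&lt;/b&gt;</p>
```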

Where does $text come from? Can't you do the replacement within the actual html document?

#3


I ended up hacking this in a fragile way that depends on fixed character offsets, changing:

return $doc->saveHTML();

into:

$text       = $doc->saveHTML();
return mb_substr($text, 122, -19);

This cuts out all the unnecessary garbage, changing this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd"> <html><body><p>
You can <a href="http://www.google.com">click here</a> to visit Google.</p>
</body></html> 

into this:

You can <a href="http://www.google.com">click here</a> to visit Google.

Can anyone figure out something better?
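
One possible alternative to fixed offsets, offered only as a sketch: wrap the fragment in a container element before loading it, then serialize just that container's children with saveHTML($node). This assumes PHP 5.3.6 or later (where saveHTML() accepts a node argument) and ASCII or otherwise correctly declared input; the function name `replaceHrefs` and the wrapping `<div>` are choices made for this example, not something from the original post.

```php
<?php
// Hypothetical helper: rewrites every anchor's href and returns only the fragment,
// without the doctype, html, and body wrappers that loadHTML() adds.
function replaceHrefs(string $text, string $url): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');

    // Wrap the fragment so the parser does not put its own <p> around bare text.
    $doc->loadHTML('<div>' . $text . '</div>');

    foreach ($doc->getElementsByTagName('a') as $a) {
        $a->setAttribute('href', $url);
    }

    // The wrapper is the first <div> in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    $html = '';
    foreach ($wrapper->childNodes as $child) {
        // Serialize each child node individually.
        $html .= $doc->saveHTML($child);
    }

    return $html;
}

echo replaceHrefs(
    'You can <a href="http://www.google.com">click here</a> to visit Google.',
    'http://google.com'
), "\n";
```

Because the output is built from the parsed nodes rather than from character positions, it does not break when the length of the doctype or the surrounding tags changes.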

#4


OK, here's the final solution I ended up with. Decided to go with VolkerK's suggestion.

public static function ReplaceAnchors($text, array $attributeSets)
{
    $expression = '/(<a)([\s\w\d:\/=_&\[\]\+%".?])*(>)/';

    if (empty($attributeSets) || !is_array($attributeSets)) {
        // no attributes to set. Set href="#".
        return preg_replace($expression, '$1 href="#"$3', $text);
    }

    $attributeStrs  = array();
    foreach ($attributeSets as $attributeKeyVal) {
        // loop thru attributes and set the anchor
        $attributePairs = array();
        foreach ($attributeKeyVal as $name => $value) {
            if (!is_string($value) && !is_int($value)) {
                continue; // skip
            }

            $name               = htmlspecialchars($name);
            $value              = htmlspecialchars($value);
            $attributePairs[]   = "$name=\"$value\"";
        }
        $attributeStrs[]    = implode(' ', $attributePairs);
    }

    $i      = -1;
    $pieces = preg_split($expression, $text);
    foreach ($pieces as &$piece) {
        if ($i === -1) {
            // skip the first token
            ++$i;
            continue;
        }

        // figure out which attribute string to use
        if (isset($attributeStrs[$i])) {
            // pick the parallel attribute string
            $attributeStr   = $attributeStrs[$i];
        } else {
            // pick the last attribute string if we don't have enough
            $attributeStr   = $attributeStrs[count($attributeStrs) - 1];
        }

        // build a new opening anchor for this token.
        $piece  = '<a '.$attributeStr.'>'.preg_replace($expression, '$1 href="#"$3', $piece);
        ++$i;
    }

    unset($piece); // break the reference left over from the foreach

    return implode('', $pieces);
}

This allows one to call the function with a set of different anchor attributes.

