How can I replace strings NOT within a link tag?

I am working on this PHP function. The idea is to wrap certain words occurring in a string in certain tags (both the words and the tags are given in an array). It works, but when those words occur inside link text or in the link's URL attribute, the link is broken and stuffed with tags, or tags that should not be inside a link are generated. This is what I have now:

function replace() {
  // Each word maps to the tag that should wrap it.
  $terminos = array (
  "beneficios" => "h3",
  "valoracion" => "h2",
  "empresarios" => "h2",
  "tecnologias" => "h2",
  "...and so on..." => "etc",
  );

  $body = "string where the word empresarios should be replaced; but the word <a href='http://www.empresarios.com'>empresarios</a> should not be replaced inside <a> tags nor in the URL of their 'src' attribute.";
  $result = $body;

  foreach ($terminos as $key => $value)
  {
  $tagged = "<".$value.">".$key."</".$value.">";
  // str_replace() matches everywhere, including link text and URLs.
  $result = str_replace($key, $tagged, $result);
  }

  return $result;
}

The function, in this example, should return "string where the word <h2>empresarios</h2> should be replaced; but the word <a href='http://www.empresarios.com'>empresarios</a> should not be replaced inside <a> tags nor in the URL of their 'src' attribute."

I’d like this replacement function to work throughout the string, but not inside tags or their attributes!

(I’d like to do what is described in the following thread, except that I need it in PHP rather than JavaScript: /questions/1666790/how-to-replace-text-not-within-a-specific-tag-in-javascript)

Here are the solutions:

Solution 1

Use the DOM and only modify text nodes:

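$s = "foo <a href='http://test.com'>foo</a> lorem bar ipsum foo. <a>bar</a> not a test";
echo htmlentities($s) . '<hr>';

$d = new DOMDocument;
$d->loadHTML($s);

// Select every text node in the document.
$x = new DOMXPath($d);
$t = $x->evaluate("//text()");

// Word => name of the tag to wrap it in.
$wrap = array(
    'foo' => 'h1',
    'bar' => 'h2'
);

$preg_find = '/\b(' . implode('|', array_keys($wrap)) . ')\b/';

foreach($t as $textNode) {
    // Leave text that sits directly inside a link untouched.
    if( $textNode->parentNode->tagName == "a" ) {
        continue;
    }

    // Split on the words and keep the matched words in the result.
    // The limit is -1 (no limit); passing null is deprecated since PHP 8.1.
    $sections = preg_split( $preg_find, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);

    $parentNode = $textNode->parentNode;

    foreach($sections as $section) {
        // Plain text between matches is re-inserted as a text node.
        if( !isset($wrap[$section]) ) {
            $parentNode->insertBefore( $d->createTextNode($section), $textNode );
            continue;
        }

        // A matched word becomes an element with the configured tag name.
        $tagName = $wrap[$section];
        $parentNode->insertBefore( $d->createElement( $tagName, $section ), $textNode );
    }

    // The original text node has been replaced by the nodes inserted above.
    $parentNode->removeChild( $textNode );
}

echo htmlentities($d->saveHTML());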

Edit: each DOMText node is now replaced with a sequence of DOMText and DOMElement nodes as necessary.
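One limitation of the check above is that it only looks at the direct parent, so text nested deeper inside a link (for example <a><b>foo</b></a>) would still be wrapped. A minimal sketch of one way around that, assuming an XPath ancestor test is acceptable: the function name wrapOutsideLinks and the <div> wrapper used to read the result back are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: wraps words only in text nodes that have no <a> ancestor.
function wrapOutsideLinks(string $html, array $wrap): string
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Assumption: a <div> wrapper gives a known node to read the result from.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    // Escape the words so regex metacharacters are matched literally.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, array_keys($wrap));
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/';

    $xpath = new DOMXPath($doc);
    // ancestor::a excludes text at any depth inside a link.
    // Copy the result to an array before modifying the tree.
    $textNodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($textNodes as $textNode) {
        $parts = preg_split(
            $pattern,
            $textNode->nodeValue,
            -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
        );
        $parent = $textNode->parentNode;
        foreach ($parts as $part) {
            if (isset($wrap[$part])) {
                // Matched word: element with a text child (escapes safely).
                $node = $doc->createElement($wrap[$part]);
                $node->appendChild($doc->createTextNode($part));
            } else {
                $node = $doc->createTextNode($part);
            }
            $parent->insertBefore($node, $textNode);
        }
        $parent->removeChild($textNode);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo wrapOutsideLinks(
    "foo <a href='http://test.com'><b>foo</b></a> lorem bar",
    ['foo' => 'h1', 'bar' => 'h2']
);
```

With this input the first "foo" and "bar" should be wrapped, while the "foo" inside <a><b>…</b></a> is left alone. The sketch only handles ASCII input as written; encoding concerns are a separate matter.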

Solution 2

The JavaScript answer you pointed to works basically the same way in PHP. You just have to write the pattern as a string.

// PHP's PCRE has no "g" modifier; preg_replace() already replaces every match.
$regexp = '/(<pre>(?:[^<](?!\/pre))*<\/pre>)|(\:\-\))/i';

Also note that you may need another preg_replace call, or a case-insensitive pattern (the i modifier), to replace the word 'empresarios' when it is capitalized (Empresarios) or written in mixed case (EmPreSAriOS).
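As a rough sketch of how that alternation idea could be applied to links instead of <pre> blocks: the first alternative consumes a whole <a>…</a> element (or any other single tag) and returns it unchanged, and the second alternative matches the target words. The function name wrapWordsWithRegex is hypothetical, the array keys are assumed to be lowercase, and a regex like this is not a full HTML parser (nested links or a ">" inside an attribute value would confuse it).

```php
<?php
// Hypothetical helper; assumes the keys of $wrap are lowercase ASCII words.
function wrapWordsWithRegex(string $html, array $wrap): string
{
    $words = array_map(function ($word) {
        return preg_quote($word, '~');
    }, array_keys($wrap));

    // Group 1: a whole <a>...</a> element, or any other single tag.
    // Group 2: one of the target words, matched case-insensitively.
    $pattern = '~(<a\b[^>]*>.*?</a>|<[^>]+>)|\b(' . implode('|', $words) . ')\b~is';

    return preg_replace_callback($pattern, function ($m) use ($wrap) {
        if ($m[1] !== '') {
            // Markup was matched: return it untouched.
            return $m[1];
        }
        // A word was matched: look up its tag by the lowercased word.
        $tag = $wrap[strtolower($m[2])];
        return "<$tag>{$m[2]}</$tag>";
    }, $html);
}

$body = "the word empresarios and <a href='http://www.empresarios.com'>empresarios</a> here";
echo wrapWordsWithRegex($body, ['empresarios' => 'h2']);
// the word <h2>empresarios</h2> and <a href='http://www.empresarios.com'>empresarios</a> here
```

Because the i modifier is set, Empresarios and EmPreSAriOS are matched too, and the original casing is kept inside the new tag.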

Also take care with your HTML. <h2> elements are block-level, so the text may be rendered differently. Before the replacement:

string where the word empresarios
should be replaced;

And after the replacement:

string where the word

empresarios

should be replaced;

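Maybe what you’ll need is an inline tag such as <big> (note that <big> is obsolete in HTML5, so a styled <span> or <strong> may be a better fit).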

Solution 3

Definitely use a DOM parser to isolate the qualifying text nodes before attempting to replace with a regex pattern that respects word boundaries, case-insensitivity, and Unicode characters. If you are planning to specifically target words with Unicode characters, then you will need to add mb_ to some of the string functions.
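To illustrate the mb_ point, here is a small sketch that works on the value of a single, already isolated text node. It assumes UTF-8 source text, the mbstring extension, and lowercase lookup keys; the function name wrapUnicodeWords and the accented sample words are made up for the example.

```php
<?php
// Hypothetical helper; $lookup keys are assumed to be lowercase UTF-8 words.
function wrapUnicodeWords(string $text, array $lookup): string
{
    $needles = array_map(function ($word) {
        return preg_quote($word, '~');
    }, array_keys($lookup));

    // i = case-insensitive, u = treat pattern and subject as UTF-8.
    $pattern = '~\b(' . implode('|', $needles) . ')\b~iu';

    return preg_replace_callback($pattern, function ($m) use ($lookup) {
        // strtolower() would leave "Ó" untouched; mb_strtolower() folds it to "ó".
        $key = mb_strtolower($m[1], 'UTF-8');
        $tag = $lookup[$key];
        return "<$tag>{$m[1]}</$tag>";
    }, $text);
}

echo wrapUnicodeWords(
    'La VALORACIÓN de las Tecnologías',
    ['valoración' => 'h2', 'tecnologías' => 'h3']
);
// La <h2>VALORACIÓN</h2> de las <h3>Tecnologías</h3>
```

With plain strtolower() the lookup for "VALORACIÓN" would miss the key, because the accented capital is not an ASCII letter.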

After leveraging the following insights, I tailored a solution for your scenario.

Code:

$html = <<<HTML
foo <a href='http://test.com'>fóo</a> lórem
bár ipsum bar food foo bark. <a>bar</a> not á test
HTML;

$lookup = [
    'foo' => 'h3',
    'bar' => 'h2'
];

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

$regexNeedles = [];
foreach ($lookup as $word => $tagName) {
    $regexNeedles[] = preg_quote($word, '~');
}
$pattern = '~\b(' . implode('|', $regexNeedles) . ')\b~iu' ;

foreach($xpath->query('//*[not(self::a)]/text()') as $textNode) {
    $newNodes = [];
    $hasReplacement = false;
    foreach (preg_split($pattern, $textNode->nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $fragment) {
        $fragmentLower = strtolower($fragment);
        if (isset($lookup[$fragmentLower])) {
            $hasReplacement = true;
            $a = $dom->createElement($lookup[$fragmentLower]);
            $a->nodeValue = $fragment;
            $newNodes[] = $a;
        } else {
            $newNodes[] = $dom->createTextNode($fragment);
        }
    }
    if ($hasReplacement) {
        $newFragment = $dom->createDocumentFragment();
        foreach ($newNodes as $newNode) {
            $newFragment->appendChild($newNode);
        }
        $textNode->parentNode->replaceChild($newFragment, $textNode);
    }
}
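// utf8_decode() is deprecated as of PHP 8.2; this mbstring call performs the same conversion.
echo substr(trim(mb_convert_encoding($dom->saveHTML($dom->documentElement), 'ISO-8859-1', 'UTF-8')), 3, -4);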

Output:

<h3>foo</h3> <a href="http://test.com">fóo</a> lórem
bár ipsum <h2>bar</h2> food <h3>foo</h3> bark. <a>bar</a> not á test

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.