Regex to match words or phrases in string but NOT match if part of a URL or inside <a> </a> tags. (php)

I am aware that regex is not ideal for use with HTML strings and I have looked at the PHP Simple HTML DOM Parser but still believe this is the way to go. All the HTML tags will be generated by my forum software so they will be consistent and valid HTML.

What I am trying to do is make a plugin that will find a list of keywords (or phrases) in a string of HTML and replace them with a link I specify. For example if someone types:

I use Amazon for that.

it would replace it with:

I use <a href="http://www.amazon.com">Amazon</a> for that.

The problem, of course, is that if “amazon” appears in the URL it also gets replaced. I solved that issue with a callback function found on this site, slightly modified.

But I still have an issue: it still replaces words between opening and closing tags.

<a href="http://www.amazon.com">My Amazon Link</a>

It will match the “Amazon” in “My Amazon Link”.

What I really need is a regex to match, say, “amazon” anywhere except between <a href and </a>.

Any ideas?

Solutions:

Solution 1

Using the DOM would certainly be preferable.

However, you might get away with this:

$result = preg_replace('%Amazon(?![^<]*</a>)%i', '<a href="http://www.amazon.com">Amazon</a>', $subject);

It matches Amazon only if

  1. it’s not followed by a closing </a> tag,
  2. it’s not itself part of a tag,
  3. there are no intervening tags, i.e. it will be thrown off if other tags can be nested inside <a> tags.

It will therefore change this:

I use Amazon for that.
I use <a href="http://www.amazon.com">Amazon</a> for that.
<a href="http://www.amazon.com">My Amazon Link</a>
It will match the "Amazon" in "My Amazon Link"

into this:

I use <a href="http://www.amazon.com">Amazon</a> for that.
I use <a href="http://www.amazon.com">Amazon</a> for that.
<a href="http://www.amazon.com">My Amazon Link</a>
It will match the "<a href="http://www.amazon.com">Amazon</a>" in "My <a href="http://www.amazon.com">Amazon</a> Link"
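Since the question asks about a list of keywords, the same lookahead can be applied to several keywords in one pass. The following is only a sketch under assumptions: the function name linkKeywords and the keyword-to-URL array are hypothetical, and it inherits every limitation listed above (it breaks as soon as other tags are nested inside <a> tags).

```php
<?php
// Hypothetical helper: link every keyword unless the next tag after it is </a>.
// $links maps keyword => URL (both made up for this example).
function linkKeywords(string $html, array $links): string
{
    if ($links === []) {
        return $html;
    }

    // Escape each keyword so regex metacharacters and the % delimiter are literal.
    $alternatives = [];
    foreach (array_keys($links) as $keyword) {
        $alternatives[] = preg_quote((string) $keyword, '%');
    }
    $pattern = '%(?:' . implode('|', $alternatives) . ')(?![^<]*</a>)%i';

    // Case-insensitive lookup from the matched text back to its URL.
    $lookup = array_change_key_case($links, CASE_LOWER);

    return preg_replace_callback($pattern, function (array $m) use ($lookup) {
        $url = $lookup[strtolower($m[0])];
        // $m[0] preserves the casing used in the original text.
        return '<a href="' . htmlspecialchars($url) . '">' . $m[0] . '</a>';
    }, $html);
}

$subject = 'I use Amazon for that. <a href="http://www.amazon.com">My Amazon Link</a>';
$linked  = linkKeywords($subject, ['Amazon' => 'http://www.amazon.com']);
echo $linked, "\n";
```

Doing the replacement in a single pass means the links inserted for one keyword are never rescanned for another. If keywords overlap (for example one is a prefix of another), the longer one probably needs to come first in the array.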

Solution 2

Don’t do this. You cannot reliably do this with Regex, no matter how consistent your HTML is.

Something like this should work, however:

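<?php
// Load the markup to process (from a file, as in this example).
$dom = new DOMDocument;
$dom->load('test.xml');
$x = new DOMXPath($dom);

// Select text nodes that contain the keyword and are not inside an <a> element.
$nodes = $x->query("//text()[contains(., 'Amazon')][not(ancestor::a)]");

foreach ($nodes as $node) {
    // A text node may contain the keyword more than once.
    while (false !== strpos($node->nodeValue, 'Amazon')) {
        // Split the text node into: text before, the keyword, text after.
        $word = $node->splitText(strpos($node->nodeValue, 'Amazon'));
        $after = $word->splitText(strlen('Amazon'));

        // Wrap the keyword's text node in a new <a> element.
        $link = $dom->createElement('a');
        $link->setAttribute('href', 'http://www.amazon.com');

        $word->parentNode->replaceChild($link, $word);
        $link->appendChild($word);

        // Continue searching in the remaining text.
        $node = $after;
    }
}

$html = $dom->saveHTML();
echo $html;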

It’s verbose, but it will actually work.
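For a forum plugin the input is usually an HTML string rather than a file, so the same idea might be wrapped in a function along these lines. This is a sketch, not a definitive implementation: the function name linkKeyword and its parameters are hypothetical, it assumes ASCII-only text (strpos-style offsets are bytes, while splitText counts characters), and it assumes the fragment is valid HTML.

```php
<?php
// Hypothetical helper: wrap $keyword in a link wherever it appears in a text
// node that has no <a> ancestor. Assumes ASCII text and a valid HTML fragment.
function linkKeyword(string $html, string $keyword, string $url): string
{
    if ($keyword === '') {
        return $html; // an empty needle would never stop matching
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//body//text()[not(ancestor::a)]');

    foreach ($nodes as $node) {
        // Case-insensitive search; the node may hold several occurrences.
        while (($pos = stripos($node->nodeValue, $keyword)) !== false) {
            $word  = $node->splitText($pos);
            $after = $word->splitText(strlen($keyword));

            $link = $dom->createElement('a');
            $link->setAttribute('href', $url);
            $word->parentNode->replaceChild($link, $word);
            $link->appendChild($word);

            $node = $after; // keep searching in the remaining text
        }
    }

    // Serialize only the children of <body>, dropping the wrapper added above.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo linkKeyword(
    '<p>I use Amazon for that. <a href="http://www.amazon.com">My Amazon Link</a></p>',
    'Amazon',
    'http://www.amazon.com'
), "\n";
```

Because only text nodes are inspected, keywords inside attribute values such as href are never touched, and nested tags inside a link are handled by the ancestor test rather than by pattern matching.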

Solution 3

Try this here

Amazon(?![^<]*</a>)

This searches for Amazon, and the negative lookahead ensures that no closing </a> tag follows it. The lookahead only skips over characters that are not <, so it cannot accidentally read past an opening tag.

http://regexr.com
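To see what the lookahead accepts and rejects, a small check like the following may help. The sample strings and the example.com URL are made up for illustration; the third sample shows the known weak spot, where a tag nested inside the link hides the closing </a> from the lookahead.

```php
<?php
$pattern = '%Amazon(?![^<]*</a>)%i';

// Made-up samples: label => HTML fragment.
$samples = [
    'plain text'         => 'I use Amazon for that.',
    'link text'          => '<a href="http://example.com/">My Amazon Link</a>',
    'nested tag in link' => '<a href="http://example.com/">My Amazon <b>Link</b></a>',
];

$counts = [];
foreach ($samples as $label => $html) {
    // preg_match_all returns the number of full matches.
    $counts[$label] = preg_match_all($pattern, $html);
    printf("%-20s %d match(es)\n", $label, $counts[$label]);
}
// plain text           1 match(es)
// link text            0 match(es)
// nested tag in link   1 match(es)   <- false positive: <b> comes before </a>
```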

Solution 4

Unfortunately I think the logic you need is still more complex than text pattern matching :-/

I know it’s not the answer you want to hear, but you’ll probably get better results with a DOM model.

Here’s a discussion of this topic elsewhere: http://coderzone.org/forum/index.php?topic=84.0

Is it possible to just run the filter once, so you don’t end up with dupes? Or could the original corpus also include links?

Solution 5

Joe, resurrecting this question because it had a simple solution that wasn’t mentioned. (Found your question while doing some research for a general question about how to exclude patterns in regex.)

With all the disclaimers about using regex to parse html, here is a simple way to do it.

Here’s our simple regex:

<a.*?</a>(*SKIP)(*F)|amazon

The left side of the alternation matches complete <a... </a> tags, then deliberately fails. The right side matches amazon, and we know this is the right amazon because it was not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):

<?php
$target = "word1 <a stuff amazon> </a> word2 amazon";
$regex = "~(?i)<a.*?</a>(*SKIP)(*F)|amazon~";
$repl = '<a href="http://www.amazon.com">Amazon</a>';
$new = preg_replace($regex, $repl, $target);
echo htmlentities($new);
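The same skip-and-fail idea could be wrapped in a reusable function. This sketch is an assumption-laden variant, not the answer's exact code: the function name linkOutsideAnchors is hypothetical, and it adds \b after <a (so that tags such as <abbr> are not treated as links) and the s flag (so that a link spanning several lines is still skipped as a whole).

```php
<?php
// Hypothetical helper built on the (*SKIP)(*F) pattern from this answer.
function linkOutsideAnchors(string $html, string $keyword, string $url): string
{
    // Left branch: consume a whole <a ...>...</a> element, then force a failure
    // and skip past it. Right branch: the keyword, matched only outside links.
    $pattern = '~<a\b.*?</a>(*SKIP)(*F)|' . preg_quote($keyword, '~') . '~is';

    return preg_replace_callback($pattern, function (array $m) use ($url) {
        // $m[0] keeps the casing that appeared in the text.
        return '<a href="' . htmlspecialchars($url) . '">' . $m[0] . '</a>';
    }, $html);
}

$target = 'word1 <a stuff amazon> </a> word2 amazon';
$new = linkOutsideAnchors($target, 'amazon', 'http://www.amazon.com');
echo htmlentities($new), "\n";
```

Using a callback instead of a fixed replacement string keeps whatever capitalization the poster typed, which a hard-coded "Amazon" would overwrite.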

Reference

How to match (or replace) a pattern except in situations s1, s2, s3…

Solution 6

Use this code:

$p = '~((<a\s)(?(2)[^>]*?>))?(amazon)~smi';

$str = '<a href="http://www.amazon.com">Amazon</a>';

$s = preg_replace($p, "$1My $3 Link", $str);
var_dump($s);

OUTPUT

string(50) "<a href="http://www.amazon.com">My Amazon Link</a>"

Solution 7

An improvement: this links only the whole word “Amazon”, not words like AmazonWorld.

$result = preg_replace('%\bAmazon(?![^<]*</a>)\b%i', '<a href="http://www.amazon.com">Amazon</a>', $subject);
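A quick check of the word-boundary behaviour, using a made-up subject string. The replacement here uses $0 (the matched text) instead of a fixed "Amazon" so that the original casing is kept; that is a small deviation from the line above, not part of the original answer.

```php
<?php
// \b on both sides restricts the match to the whole word.
$pattern = '%\bAmazon\b(?![^<]*</a>)%i';

// Made-up input: a bare keyword, a longer word containing it, and an existing link.
$subject = 'Amazon, AmazonWorld and <a href="http://www.amazon.com">My Amazon Link</a>';

$result = preg_replace($pattern, '<a href="http://www.amazon.com">$0</a>', $subject);
echo $result, "\n";
// <a href="http://www.amazon.com">Amazon</a>, AmazonWorld and <a href="http://www.amazon.com">My Amazon Link</a>
```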

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.