Finding substring whilst ignoring HTML tags

I need to match parts of a string whilst ignoring HTML tags. For example, suppose the user wants to find the string “foo and foo1” in this source:

Two strings, <u>foo</u> and foo1

They would not get a match, because of the tags.

I’ve tried regex, but since the tags may or may not be present, it seems rather too complicated.

It’s not a server-side script; it would be an application run from the console.

To be more specific: this is for syntax highlighting. The user wants “foo and foo1” to be italic, but part of it is already underlined, so a plain search wouldn’t match. That’s why I can’t simply strip the tags from the string.

Solutions:

Solution 1

Use the PHP function strip_tags to remove the HTML tags from the text. Then do your search.

http://php.net/manual/en/function.strip-tags.php
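
As a minimal sketch of this approach, the snippet below strips the tags and then searches the plain text. The helper name findInStrippedText is hypothetical and not part of the original answer. Note that the offset it returns refers to the stripped text, not to the original markup, which is the limitation the question raises for highlighting.

```php
<?php
// Hypothetical helper: returns the offset of $query in the tag-free text,
// or null when the query does not occur.
function findInStrippedText(string $html, string $query): ?int
{
    $plain = strip_tags($html);      // remove HTML tags
    $pos = strpos($plain, $query);   // plain substring search
    return $pos === false ? null : $pos;
}

$source = 'Two strings, <u>foo</u> and foo1';

echo strip_tags($source), PHP_EOL;                          // Two strings, foo and foo1
echo findInStrippedText($source, 'foo and foo1'), PHP_EOL;  // 13
```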

Solution 2

Use strip_tags as you have been advised; it really is the best way. However, if you want to have fun, or experiment and benchmark your regex engine 🙂, you can insert (?:<\/?[^>]+>)? after each character of the query, and also at the very beginning of the query (otherwise an opening tag in front of the match won’t be captured). The resulting pattern matches even when a tag sits between two characters of the query.

Here is an example for “foo and foo1”:

(?:<\/?[^>]+>)?f(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)? (?:<\/?[^>]+>)?a(?:<\/?[^>]+>)?n(?:<\/?[^>]+>)?d(?:<\/?[^>]+>)? (?:<\/?[^>]+>)?f(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)?o(?:<\/?[^>]+>)?1(?:<\/?[^>]+>)?

This will match <u>foo</u> and foo1.

https://regex101.com/r/aF8fJ8/4
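
Writing such a pattern by hand is tedious, so it could be generated from the query. The following is only a sketch: the function name buildTagTolerantPattern is hypothetical, and it assumes at most one tag between any two characters, as in the pattern above. preg_quote escapes query characters that are special in a regex.

```php
<?php
// Hypothetical helper (not from the original answer): puts one optional
// HTML tag before the query and after each of its characters.
function buildTagTolerantPattern(string $query): string
{
    $tag = '(?:<\/?[^>]+>)?';
    // Split into characters (UTF-8 aware).
    $chars = preg_split('//u', $query, -1, PREG_SPLIT_NO_EMPTY);

    $pattern = $tag;
    foreach ($chars as $char) {
        $pattern .= preg_quote($char, '/') . $tag;
    }
    return '/' . $pattern . '/u';
}

$source  = 'Two strings, <u>foo</u> and foo1';
$pattern = buildTagTolerantPattern('foo and foo1');

if (preg_match($pattern, $source, $m)) {
    echo $m[0], PHP_EOL;   // <u>foo</u> and foo1
}
```

Because the match includes the tags, it can be wrapped for highlighting without stripping the source. If a match begins or ends in the middle of a tagged region, the wrapped result may be improperly nested, so check that case before relying on it.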

Solution 3

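This regex is meant to skip the angle brackets and slashes of HTML tags and extract only the words. Note that the lookahead only prevents a match from starting at <, so tag names such as the u in <u> are still matched.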

(?!</?[^>]+>)([a-zA-Z]+)

Just replace [a-zA-Z]+ with whatever you want to match.
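
To see what this pattern actually extracts, here is a small sketch that runs it against the sample string. The helper extractWords and the second pattern are assumptions for illustration, not part of the original answer. The variant rejects a word when a > follows it before any <, which assumes well-formed markup with no stray > in the text.

```php
<?php
// Hypothetical helper: returns every capture of group 1 for a pattern.
function extractWords(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[1];
}

$source = 'Two strings, <u>foo</u> and foo1';

// Pattern from the answer; "~" is the delimiter so "/" needs no escaping.
$original = '~(?!</?[^>]+>)([a-zA-Z]+)~';
echo implode(' ', extractWords($original, $source)), PHP_EOL;
// Two strings u foo u and foo

// Assumed variant: skip a word if ">" comes next before any "<",
// which means the word sits inside a tag.
$variant = '~(?![^<>]*>)([a-zA-Z]+)~';
echo implode(' ', extractWords($variant, $source)), PHP_EOL;
// Two strings foo and foo
```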

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.