Regex: extracting top-level bracketed contents from a string

Please help, my regular expression skills fail me. I have the following string:

username|email_address|phone_numbers[number]profile[title|addresses[id]]

I want to be able to extract any data between square brackets, but not where that data is a subset of an already extracted set. So any nestings should be left as part of the parent’s extracted string.

In the above example I’d have extracted two parts:

"number"
"title|addresses[id]"

Note how the [id] isn’t extracted as it’s part of a lower level dataset.

I’ve been attempting to do this with preg_match, but think I may have to resort to iterating over each character in the string.

Solution 1

Here’s a regex solution:

$subject = 'username|email_address|phone_numbers[number]profile[title|addresses[id]]';
preg_match_all(
    '/(?<=\[)     # Assert that the previous character is a [
      (?:         # Match either...
       [^[\]]*    # any number of characters except brackets
      |           # or
       \[         # an opening bracket
       (?R)       # containing a match of this very regex
       \]         # followed by a closing bracket
      )*          # Repeat as needed
      (?=\])      # Assert that the next character is a ]/x',
    $subject, $result, PREG_PATTERN_ORDER);
$result = $result[0];  // array("number", "title|addresses[id]")
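
Since preg_match_all() returns false when PCRE gives up (for example when a backtracking or recursion limit is hit on a long or unbalanced input), it may be worth checking the return value. Below is a minimal sketch that wraps the same pattern; the function name extractTopLevel is a hypothetical helper and not part of the original answer.

```php
<?php
// Hypothetical wrapper (name and signature are assumptions for illustration).
// It uses the same recursive pattern as above, written without the /x comments.
function extractTopLevel(string $subject): array
{
    $pattern = '/(?<=\[)(?:[^[\]]*|\[(?R)\])*(?=\])/';
    $count = preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);
    if ($count === false) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $matches[0];  // full matches only: the contents of each top-level pair
}
```

Calling extractTopLevel() on the string from the question should give the two parts listed there, "number" and "title|addresses[id]".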

Solution 2

A sad truth is that regular expressions in the formal sense cannot handle arbitrarily nested bracket matching, because they have no memory: they are equivalent to a DFA, which cannot keep track of nesting depth. (Recursive patterns such as (?R) in Solution 1 are a PCRE extension that goes beyond that formal model.)

To achieve what you want without such extensions, you'll have to write a small parser yourself (I think). Using a stack can solve the problem 😉

The basic idea is this: every time you see a [ you push onto the stack, and every time you see a ] you pop the stack and retrieve the string you have collected since the matching [.
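
The push/pop idea above can be sketched roughly as follows. This is only an illustration under a few assumptions: the function name topLevelBracketContents is made up, the stack holds the offsets of the unclosed [ characters, a group is recorded only when the stack becomes empty again (so nested groups stay inside their parent), and a stray ] with nothing open is simply ignored.

```php
<?php
// Hypothetical helper; name, signature and error handling are assumptions.
function topLevelBracketContents(string $input): array
{
    $stack = [];    // offsets of '[' characters that are still open
    $results = [];
    $length = strlen($input);

    for ($i = 0; $i < $length; $i++) {
        if ($input[$i] === '[') {
            $stack[] = $i;                   // push the offset of this '['
        } elseif ($input[$i] === ']' && $stack) {
            $start = array_pop($stack);      // pop the matching '['
            if (!$stack) {
                // The stack is empty again, so a top-level pair just closed.
                // Keep what lies between the brackets, nested groups included.
                $results[] = substr($input, $start + 1, $i - $start - 1);
            }
        }
    }
    return $results;
}

print_r(topLevelBracketContents('username|email_address|phone_numbers[number]profile[title|addresses[id]]'));
```

For the string from the question this prints an array with the two elements number and title|addresses[id].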

Hope this helps 😉

Solution 3

I wrote a small parser to achieve the desired results:

Code:

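$data = 'username|email_address|phone_numbers[number]profile[title|addresses[id]wut]aaa[another test] aaand another one [which is [more] c[omplexer]t[h[an]] the others]';
print_r(parse($data));

// Returns every top-level $s1...$s2 group, delimiters included.
// Assumes the delimiters in $string are balanced.
function parse($string, $s1 = '[', $s2 = ']') {
    $c1 = $c2 = 0;  // number of opening and closing delimiters seen so far
    $s = 1;         // index in $array[0] of the opener that starts the current group
    $l = strlen($string);
    $array = array(array(), array());  // [0]: offsets of openers, [1]: offsets of closers
    $results = array();                // initialized so input without brackets yields an empty array
    for ($i = 0; $i < $l; $i++) {
        if ($string[$i] == $s1) {
            $c1++;
            $array[0][$c1] = $i;
        } elseif ($string[$i] == $s2) {
            $c2++;
            $array[1][$c2] = $i;
            if ($c1 == $c2) {
                // Equal counts mean the current top-level group has just closed.
                $results[] = substr($string, $array[0][$s], $array[1][$c2] - $array[0][$s] + 1);
                $s = $c1 + 1;  // the next opener starts a new group
            }
        }
    }
    return $results;
}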

Output:

Array
(
    [0] => [number]
    [1] => [title|addresses[id]wut]
    [2] => [another test]
    [3] => [which is [more] c[omplexer]t[h[an]] the others]
)
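
Note that this output keeps the enclosing brackets, whereas the question asks for the contents only. One possible way to drop them, shown here as a small sketch on a literal copy of the array above (the variable names are illustrative), is to cut the first and last character off each entry:

```php
<?php
// Literal copy of the parser output shown above (illustrative variable names).
$withBrackets = array(
    '[number]',
    '[title|addresses[id]wut]',
    '[another test]',
    '[which is [more] c[omplexer]t[h[an]] the others]',
);

// substr($s, 1, -1) drops the first and the last character of each entry.
$contents = array_map(function ($s) {
    return substr($s, 1, -1);
}, $withBrackets);

print_r($contents);
```

The first two entries then become number and title|addresses[id]wut, with the nested [id] left in place.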


All solutions were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
