Split a text into sentences

How can I split a text into an array of sentences?

Example text:

Fry me a Beaver. Fry me a Beaver! Fry me a Beaver?
Fry me Beaver no. 4?! Fry me many Beavers... End

Should output:

0 => Fry me a Beaver.
1 => Fry me a Beaver!
2 => Fry me a Beaver?
3 => Fry me Beaver no. 4?!
4 => Fry me many Beavers...
5 => End

I tried some solutions that I’ve found on SO through search, but they all fail, especially at the 4th sentence.

/(?<=[!?.])./

/\.|\?|!/

/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])/

/(?<=[.!?]|[.!?][\'"])\s+/    // <- closest one

Solutions:

There are several ways to solve this problem. We recommend the first solution below, which we have tested.

Solution 1

Since you want to “split” sentences, why are you trying to match them?

For this case let’s use preg_split().

Code:

$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';
$sentences = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);
print_r($sentences);

Output:

Array
(
    [0] => Fry me a Beaver.
    [1] => Fry me a Beaver!
    [2] => Fry me a Beaver?
    [3] => Fry me Beaver no. 4?!
    [4] => Fry me many Beavers...
    [5] => End
)

Explanation:

Put simply, we split on runs of whitespace (\s+), but only where two conditions hold:

  1. (?<=[.?!]) Positive lookbehind assertion: the whitespace must be preceded by a period, question mark, or exclamation mark.

  2. (?=[a-z]) Positive lookahead assertion: the whitespace must be followed by a letter (the i flag makes this match uppercase and lowercase alike). This is a workaround for the "no. 4" problem, since a digit after the space does not qualify.
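
To see what the lookahead contributes, here is a small sketch that runs the same split with and without it. The shortened input string and the variable names are only for illustration.

```php
<?php
// Shortened version of the sample text, keeping the tricky "no. 4" part.
$str = 'Fry me Beaver no. 4?! Fry me many Beavers... End';

// Lookbehind only: any whitespace after . ? or ! is a split point,
// so "no." and "4?!" end up in separate elements.
$withoutLookahead = preg_split('/(?<=[.?!])\s+/', $str);

// Lookbehind plus lookahead: the whitespace must also be followed by a letter,
// so the space before "4" is no longer a split point.
$withLookahead = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);

print_r($withoutLookahead);
print_r($withLookahead);
```

The first call returns four elements with "no." and "4?!" separated; the second returns three, keeping "Fry me Beaver no. 4?!" intact. Note that because of the i flag, an abbreviation followed by a lowercase word (for example "e.g. apples") would still be split.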

Solution 2

I recommend searching for your delimiting punctuation without a lookbehind, then releasing those matched characters (with \K), then matching the whitespace, then looking ahead for an uppercase letter that marks the start of the next sentence.

Code:

$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';

var_export(
    preg_split('~[.?!]+\K\s+(?=[A-Z])~', $str, 0, PREG_SPLIT_NO_EMPTY)
);

Output:

array (
  0 => 'Fry me a Beaver.',
  1 => 'Fry me a Beaver!',
  2 => 'Fry me a Beaver?',
  3 => 'Fry me Beaver no. 4?!',
  4 => 'Fry me many Beavers...',
  5 => 'End',
)

Though not necessary for the sample string, PREG_SPLIT_NO_EMPTY is a safeguard: it drops any empty elements, such as a trailing one that would appear if a split occurred at the very end of a string that ends with punctuation.
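
As a rough illustration of what the flag does, the sketch below uses a deliberately simplified pattern that consumes the punctuation as part of the delimiter. That pattern and the input string are assumptions made for this demo; they are not the pattern from the answer above.

```php
<?php
// Hypothetical input that ends with punctuation.
$str = 'One. Two.';

// Simplified demo pattern (not Solution 2's pattern): the punctuation
// and any following whitespace are consumed as the delimiter.
$pattern = '~[.?!]+\s*~';

// Default flags: the delimiter at the very end leaves an empty last element.
$withEmpty = preg_split($pattern, $str);

// PREG_SPLIT_NO_EMPTY drops that empty element.
$withoutEmpty = preg_split($pattern, $str, -1, PREG_SPLIT_NO_EMPTY);

var_export($withEmpty);
var_export($withoutEmpty);
```

Without the flag the result has three elements, the last one an empty string; with the flag only 'One' and 'Two' remain.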

Using \K in my answer requires less backtracking, which lets the regex engine "step" through the string more efficiently. In Hamza’s answer (Solution 1), the regex engine starts matching at every space; once the space is matched, it has to look backward to check for punctuation and, if that qualifies, look ahead for a letter.

In my approach, the regex engine only begins considering a match when it encounters one of the listed punctuation symbols, and it never looks back. There are many spaces to match, but far fewer qualifying symbols. For these reasons, on the sample input string, my pattern splits the string in 40 steps and Hamza’s pattern splits it in 74 steps.

This efficiency is not really worth bragging about for relatively small strings, but if you are parsing large texts, efficiency and minimal backtracking become more important.
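
If you want to check this on your own data, a rough timing sketch like the one below may help. It assumes PHP 7.3+ for hrtime(); the repetition count and the labels are arbitrary. Wall-clock time is not the same thing as the step counts quoted above, and results will vary with the PHP version and PCRE JIT settings, so treat the numbers as indicative only.

```php
<?php
// Build a larger input by repeating the sample text (1000 copies is arbitrary).
$sample = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';
$text = str_repeat($sample . ' ', 1000);

// The two patterns from the solutions above.
$patterns = [
    'Solution 1 (lookbehind)' => '/(?<=[.?!])\s+(?=[a-z])/i',
    'Solution 2 (\K)'         => '~[.?!]+\K\s+(?=[A-Z])~',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    $start = hrtime(true); // monotonic clock, in nanoseconds
    $results[$label] = preg_split($pattern, $text, 0, PREG_SPLIT_NO_EMPTY);
    $elapsedMs = (hrtime(true) - $start) / 1e6;
    printf("%-24s %5d pieces in %.3f ms\n", $label, count($results[$label]), $elapsedMs);
}
```

Both patterns should produce the same pieces for this input, so any timing difference comes from how the engine reaches each split point.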

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
