Finding text strings in JavaScript

I have a large valid JavaScript file (utf-8), from which I need to extract all text strings automatically.

For simplicity, the file doesn’t contain any comment blocks in it, only valid ES6 JavaScript code.

Once I find an occurrence of ' or " or `, I’m supposed to scan for the end of the text block, which is where I got stuck, given all the possible variations, like "'", '"', "\'", '\"', '", `\", etc.

Is there a known and/or reusable algorithm for detecting the end of a valid ES6 JavaScript text block?

UPDATE-1: My JavaScript file isn’t just large; I also have to process it as a stream, in chunks, so a regex over the whole file is not usable. I didn’t want to complicate my question by mentioning how chunks of code are joined. I will figure that out myself, if I have an algorithm that works for a single piece of code that’s in memory.

UPDATE-2: I got this working initially, thanks to the advice given here, but then I got stuck again because of regular expressions.

Examples of Regular Expressions that break any of the text detection techniques suggested so far:

/'/
/"/
/\`/

Having studied the matter more closely, by reading this: How does JavaScript detect regular expressions?, I’m afraid that detecting regular expressions in JavaScript is a whole new ball game, worth a separate question, or else it gets too complicated. But I would appreciate it very much if somebody could point me in the right direction with this issue…

UPDATE-3: After much research I found, with regret, that I cannot come up with an algorithm that would work in my case, because the presence of regular expressions makes the task far more complicated than I initially thought. According to the following: When parsing Javascript, what determines the meaning of a slash?, determining the beginning and end of regular expressions in JavaScript is one of the most complex and convoluted tasks. And without it we cannot figure out whether the symbols ', " and ` are opening a text block or are inside a regular expression.

Solutions:


Solution 1

The only way to parse JavaScript is with a JavaScript parser. Even if you were able to use regular expressions, at the end of the day they are not powerful enough to do what you are trying to do here.

You could either use one of several existing parsers, which are very easy to use, or you could write your own, simplified to focus on the string extraction problem. I can hardly imagine you want to write your own parser, even a simplified one. You will spend much more time writing it and maintaining it than you might think.

For instance, an existing parser will handle something like the following without breaking a sweat.

`foo${"bar"+`baz`}`

The obvious candidates for parsers to use are esprima and babel.
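To give a feel for how quickly a hand-written scanner grows, here is a minimal sketch that handles only quotes, backslash escapes, and nested `${...}` in template literals. The function names (`skipString`, `skipExpression`, `extractStrings`) are made up for this illustration, and the code assumes the input has no comments and no regular expression literals, so it is not a substitute for a real parser.

```javascript
// Hypothetical helper: src[start] is an opening quote; returns the index
// just past the matching closing quote.
function skipString(src, start) {
  const quote = src[start];
  let i = start + 1;
  while (i < src.length) {
    const ch = src[i];
    if (ch === "\\") {
      i += 2; // skip the backslash and the character it escapes
    } else if (ch === quote) {
      return i + 1;
    } else if (quote === "`" && ch === "$" && src[i + 1] === "{") {
      i = skipExpression(src, i + 2); // code nested inside ${ ... }
    } else {
      i++;
    }
  }
  throw new SyntaxError("Unterminated string starting at index " + start);
}

// Hypothetical helper: start is the index just after "${"; returns the
// index just past the matching "}".
function skipExpression(src, start) {
  let depth = 1;
  let i = start;
  while (i < src.length) {
    const ch = src[i];
    if (ch === "'" || ch === '"' || ch === "`") {
      i = skipString(src, i); // strings may nest inside the expression
    } else {
      if (ch === "{") depth++;
      else if (ch === "}" && --depth === 0) return i + 1;
      i++;
    }
  }
  throw new SyntaxError("Unterminated ${ before index " + start);
}

// Collects every top-level string literal exactly as written, quotes included.
// Assumption: no comments and no regex literals in src.
function extractStrings(src) {
  const found = [];
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    if (ch === "'" || ch === '"' || ch === "`") {
      const end = skipString(src, i);
      found.push(src.slice(i, end));
      i = end;
    } else {
      i++;
    }
  }
  return found;
}

console.log(extractStrings('var a = `foo${"bar"+`baz`}`;'));
```

Even this much code still gets `/'/` wrong, which is the point: deciding whether a slash starts a regular expression needs knowledge of the surrounding tokens, and that is what a real parser provides.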

By the way, what are you planning to do with these strings once you extract them?

Solution 2

If you only need an approximate answer, or if you want to get the string literals exactly as they appear in the source code, then a regular expression can do the job.

Given the string literal "\n", do you expect a single-character string containing a newline or the two characters backslash and n?

  • In the former case you need to interpret escape sequences exactly like a JavaScript interpreter does. What you need is a lexer for JavaScript, and many people have already programmed this piece of code.
  • In the latter case the regular expression has to recognize escape sequences like \x40 and \u2026, so even in that case you should copy the code from an existing JavaScript lexer.

See https://github.com/douglascrockford/JSLint/blob/master/jslint.js, function tokenize.
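As a rough illustration of the "exactly as they appear in the source" case, the sketch below uses a regular expression that accepts any backslash escape inside single- or double-quoted literals. The names `STRING_LITERAL` and `rawLiterals` are invented for this example; it deliberately ignores template literals, and it assumes there are no comments or regex literals in the input.

```javascript
// Matches "..." and '...' literals, treating a backslash plus any following
// character as one escaped unit. Unescaped newlines end the match attempt.
const STRING_LITERAL = /"(?:[^"\\\n]|\\[\s\S])*"|'(?:[^'\\\n]|\\[\s\S])*'/g;

// Hypothetical helper: returns each literal as written, quotes included.
function rawLiterals(src) {
  return src.match(STRING_LITERAL) || [];
}

// String.raw keeps the backslashes, so this is the text a source file would hold.
const source = String.raw`var a = "\n"; var b = 'it\'s';`;
const raw = rawLiterals(source);
console.log(raw);

// The raw literal "\n" is four characters long: quote, backslash, n, quote.
console.log(raw[0].length);

// Interpreting escapes is a separate step. JSON.parse is used here only to
// show the difference; it understands the JSON subset of escapes, not \x40
// or single-quoted strings, so it is not a general JavaScript unescaper.
console.log(JSON.parse(raw[0]).length);
```

A regex literal such as `/'/` still confuses this pattern, because the quote inside it is taken as the start of a string.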

Solution 3

Try the code below. It is a naive scan: it ignores escape sequences, ${...} expressions, and regular expression literals.

    const txt = "var z,b \n;z=10;\n b='321`1123`321321';\n c='321`321`312`3123`';";

    function fetchStrings(txt) {
      const result = [];
      for (let i = 0; i < txt.length; i++) {
        const quote = txt[i];
        // Possible string start characters
        if (quote === "'" || quote === '"' || quote === "`") {
          // Find the next occurrence of the same quote character
          const end = txt.indexOf(quote, i + 1);
          if (end === -1) break; // unterminated string: stop scanning
          result.push(txt.slice(i + 1, end));
          // Jump to the end of the fetched string
          i = end;
        }
      }
      return result;
    }

    console.log(fetchStrings(txt));


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
