How to tokenize markdown using Node.js?

I'm building an iOS app with a view whose content comes from markdown.

My idea is to parse markdown stored in MongoDB into a JSON object that looks something like:

{
    "h1": "This is the heading",
    "p" : "Here's the first paragraph",
    "link": {
        "text": "Text for link",
        "url": "http://exampledomain.com"
    }
}

On the server I am running Node.js, and I was looking at the module marked, which seems to be the most popular one out there. It gives me access to the Lexer, which tokenizes the markdown into a custom object. But when I look at that object, the link is not tokenized. If I go ahead and parse the markdown to HTML, the link is detected and the HTML looks correct.

After looking into some more modules without success, I thought that maybe I could do this on the client instead and found MMMarkdown, which seemed promising. But then again, it worked fine when parsing directly to HTML, but when stepping in between and just parsing the markdown to the so-called MMDocument, the result did not contain any MMElement of type Link.

So, is there anything fundamental about markdown parsing that I am missing? Is the lexing of the inline links supposed to be done in a second pass, or something? I can't get my head around it.

If nothing else works, I might just use a UIWebView filled with the HTML from the parsed markdown, but then we would have to design the whole thing again with CSS, and we are running out of time, so we can't really afford the double work.

Solutions:


Solution 1

Did you look at https://github.com/evilstreak/markdown-js ?

It seems to give you access to the syntax tree.

For example:

var md = require( "markdown" ).markdown,
    text = "Header\n---------------\n\n" +
           "This is a paragraph\n\n" +
           "This is [an example](http://example.com/ \"Title\") inline link.";

// parse the markdown into a syntax tree (nested arrays), inline links included
var tree = md.parse( text );

console.log(JSON.stringify(tree));

produces (pretty-printed here):

[
    "markdown",
    [
        "header",
        {
            "level": 2
        },
        "Header"
    ],
    [
        "para",
        "This is a paragraph"
    ],
    [
        "para",
        "This is ",
        [
            "link",
            {
                "href": "http://example.com/",
                "title": "Title"
            },
            "an example"
        ],
        " inline link."
    ]
]
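
To get from a tree like this to the flat shape sketched in the question, one option is to walk the nested arrays. The following is only a minimal sketch: flattenTree and isAttrs are hypothetical helper names (not part of markdown-js), the sample tree is copied from the output above, and link labels are assumed to be plain strings.

```javascript
// Sample tree copied from the output above, so the sketch runs without the library.
var sampleTree = [
    "markdown",
    ["header", { "level": 2 }, "Header"],
    ["para", "This is a paragraph"],
    ["para", "This is ",
        ["link", { "href": "http://example.com/", "title": "Title" }, "an example"],
        " inline link."]
];

// Hypothetical helper: true when a value is a plain attributes object.
function isAttrs(value) {
    return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Hypothetical helper: turn the tree into a list of { hN }, { p } and { link } objects.
function flattenTree(tree) {
    var content = [];

    // tree[0] is the string "markdown"; the remaining entries are block nodes.
    tree.slice(1).forEach(function (block) {
        var type = block[0];
        // The second entry is an attributes object only for some node types.
        var hasAttrs = isAttrs(block[1]);
        var attrs = hasAttrs ? block[1] : {};
        var children = block.slice(hasAttrs ? 2 : 1);
        var text = '';
        var links = [];

        children.forEach(function (child) {
            if (typeof child === 'string') {
                text += child;
            } else if (child[0] === 'link') {
                // Assumption: the link label consists of plain strings only.
                var label = child.slice(2).join('');
                links.push({ text: label, url: child[1].href });
                text += label;
            }
            // Other inline nodes (emphasis etc.) are ignored in this sketch.
        });

        if (type === 'header') {
            var entry = {};
            entry['h' + attrs.level] = text;
            content.push(entry);
        } else if (type === 'para') {
            content.push({ p: text });
        }
        // Links are emitted after the block they were found in.
        links.forEach(function (link) {
            content.push({ link: link });
        });
    });

    return content;
}

var result = flattenTree(sampleTree);
console.log(JSON.stringify(result, null, 4));
```

For the sample tree this yields an h2 entry, two p entries, and one link entry with the text "an example" and the URL "http://example.com/". Emitting links as separate entries after their paragraph is a design choice made to match the question's JSON shape; a client that needs the link position inside the text would have to keep the nesting instead.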

Solution 2

Although this question is already quite a few years old, I wanted to give a little update.

I found the combination of unified and remark-parse a good fit for my situation.
After installing those packages (with npm, yarn, pnpm or your favourite JS package manager), I wrote a little test script as follows:

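// require() works with the CommonJS releases (unified 9 and remark-parse 9 or earlier);
// later releases are ESM-only and export `unified` by name.
const unified = require('unified');
const markdown = require('remark-parse');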

const tokens = unified()
  .use(markdown)
  .parse('# Hello world');

console.log(tokens);

This generates a syntax tree (in the mdast format) rather than a flat token list, so it needs further processing.
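
As a rough idea of what that further processing could look like, here is a minimal sketch that walks such a tree and collects the links. The sample object is hand-written in the mdast node shape (position information omitted) so that the snippet runs without installing anything; textOf and collectLinks are hypothetical helper names, not part of unified or remark-parse.

```javascript
// Hand-written stand-in for a parsed tree of '# Hello [world](http://example.com)'.
const sampleRoot = {
  type: 'root',
  children: [
    {
      type: 'heading',
      depth: 1,
      children: [
        { type: 'text', value: 'Hello ' },
        {
          type: 'link',
          url: 'http://example.com',
          title: null,
          children: [{ type: 'text', value: 'world' }]
        }
      ]
    }
  ]
};

// Hypothetical helper: concatenate the text of a node and all of its descendants.
function textOf(node) {
  if (node.type === 'text') return node.value;
  return (node.children || []).map(textOf).join('');
}

// Hypothetical helper: depth-first walk that collects every link node.
function collectLinks(node, found = []) {
  if (node.type === 'link') {
    found.push({ text: textOf(node), url: node.url });
  }
  (node.children || []).forEach((child) => collectLinks(child, found));
  return found;
}

const links = collectLinks(sampleRoot);
console.log(links);
```

With a real tree, the object returned by parse would be passed to collectLinks in place of sampleRoot. This sketch only reads text nodes, so other value-carrying nodes such as inline code would need their own case.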

Maybe this is useful for someone else who stumbles upon this question.

Solution 3

Here’s the code that I ended up using instead.

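// Split on both Windows (\r\n) and Unix (\n) line endings
var nodes = markdownText.split(/\r?\n/);
var content = [];

nodes.forEach(function(node) {

    // Heading 2 (checked before heading 1, since '##' also starts with '#')
    if (node.indexOf('##') === 0) {
        content.push({
            h2: node.replace('##','').trim()
        });
    }

    // Heading 1
    else if (node.indexOf('#') === 0) {
        content.push({
            h1: node.replace('#','').trim()
        });
    }

    // Link (Text + URL); only lines that start with '[' are considered
    else if (node.indexOf('[') === 0) {
        var matches = node.match(/\[(.*)\]\((.*)\)/);
        if (matches) {
            content.push({
                link: {
                    text: matches[1],
                    url: matches[2]
                }
            });
        } else {
            // Not a well-formed link, so keep the line as a paragraph
            content.push({
                p: node
            });
        }
    }

    // Paragraph
    else if (node.length > 0) {
        content.push({
            p: node
        });
    }

});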

I know this matching is very unforgiving, but in our case it works fine.
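
If links can also appear in the middle of a paragraph, as in the original question, a line-by-line approach like this one needs an extra step. The following is a minimal sketch of one way to do it with a global regular expression. extractLinks is a hypothetical helper name, and the pattern is deliberately simple: it does not handle URLs containing parentheses and does not tell images (![alt](src)) apart from links.

```javascript
// Hypothetical helper: find every inline link in one line of markdown.
function extractLinks(line) {
    // Matches [text](url) and [text](url "title"); the title is not captured.
    var pattern = /\[([^\]]+)\]\((\S+?)(?:\s+"[^"]*")?\)/g;
    var links = [];
    var match;

    // exec() with the g flag resumes after the previous match on each call.
    while ((match = pattern.exec(line)) !== null) {
        links.push({ text: match[1], url: match[2] });
    }
    return links;
}

var found = extractLinks('This is [an example](http://example.com/ "Title") inline link.');
console.log(found);
// prints [ { text: 'an example', url: 'http://example.com/' } ]
```

In the paragraph branch of the code above, such a helper could be called on each line and its results pushed as additional link entries.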


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
