Get all links with XPath in Puppeteer (pausing or not working)?

I need to use XPath to select all links on a page so that my Puppeteer app can then click into each one and perform some actions. The method below sometimes gets stuck, which pauses my crawler. Is there a better or different way of getting all links from an XPath? Or is something in my code incorrect that could be stalling the app's progress?

try {
  links = await this.getLinksFromXPathSelector(state);
} catch (e) {
  console.log("error getting links");
  return {...state, error: e};
}

Which calls:

async getLinksFromXPathSelector(state) {
  const newPage = state.page;
  // console.log('links selector');
  const links = await newPage.evaluate((mySelector) => {
    let results = [];
    let query = document.evaluate(mySelector,
      document,
      null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
    for (let i = 0, length = query.snapshotLength; i < length; ++i) {
      results.push(query.snapshotItem(i).href);
    }
    return results;
  }, state.linksSelector);
  return links;
}

The XPath is in state.linksSelector.

Solution 1

You can use page.$x() to evaluate an XPath expression and obtain an array of ElementHandles. It may be appropriate to call page.waitForXPath() beforehand to ensure that the elements matched by the XPath expression have been added to the DOM.

Then you can pass the ElementHandle array elements to the page context via page.evaluate() and return an array containing the href attribute values for each element.

const xpath_expression = '//a[@href]';
await page.waitForXPath(xpath_expression);
const links = await page.$x(xpath_expression);
const link_urls = await page.evaluate((...links) => {
  return links.map(e => e.href);
}, ...links);

console.log(link_urls);
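
Applied to the method from the question, the same approach might look like the sketch below. It assumes state.page is a Puppeteer Page and state.linksSelector holds the XPath string, as in the question. The timeoutMs parameter and its 5000 ms default are hypothetical additions, not part of the original code; they make the wait reject quickly when nothing matches instead of sitting on Puppeteer's default timeout. page.$x() and page.waitForXPath() are the calls used in the answer above; recent Puppeteer releases have deprecated or removed them, so treat their availability as version-dependent.

```javascript
// Sketch: standalone variant of getLinksFromXPathSelector built on page.$x().
// `timeoutMs` is a hypothetical parameter introduced for illustration.
async function getLinksFromXPathSelector(state, timeoutMs = 5000) {
  const page = state.page;

  // Wait until at least one matching element is in the DOM.
  // This rejects if nothing matches within timeoutMs.
  await page.waitForXPath(state.linksSelector, { timeout: timeoutMs });

  // Collect an ElementHandle for every node matched by the XPath.
  const handles = await page.$x(state.linksSelector);

  // Read each element's resolved href inside the page context.
  const urls = await Promise.all(
    handles.map((handle) => page.evaluate((el) => el.href, handle))
  );

  // Release the handles once the URLs have been extracted.
  await Promise.all(handles.map((handle) => handle.dispose()));

  return urls;
}
```

Because the wait can now reject, the existing try/catch around the call in the question would receive the timeout error and return it in state.error rather than leaving the crawler waiting.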

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.