What I Wish I Knew Before Writing My Own Scraper With Puppeteer

#dev #puppeteer #javascript

There is hardly a developer who has not written (or at least tried to write) a scraping bot.
Here is my list of what I wish I had known back then.

Best Practices for Writing a Scraper with Puppeteer

always start by prompting #ai to create a script skeleton for you (I usually start right inside the #cursorai IDE.)

there are several ways to click on an element; probably the easiest one is the following:

await page.click('.selector')

or via callback:

await page.$eval('.selector', (el) => el.click())

or additionally via a querySelector:

await page.evaluate(() => { document.querySelector('.selector').click() })

page.evaluate(() => {...}) is executed inside the browser context, so window is available there, but with a caveat: don't expect console.log(...) to show up in your #nodejs script output
there is a way to forward console.log output from the page.evaluate(() => {...}) context: add this code at the beginning of your scraping script:

page.on('console', (msg) => {
  for (let i = 0; i < msg.args().length; ++i) console.log(`${i}: ${msg.args()[i]}`)
})
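
A slightly more readable variant of the handler above, as a minimal sketch: it relies on msg.type() and msg.text() from Puppeteer's ConsoleMessage instead of iterating over the argument handles. The helper name forwardBrowserConsole and the injectable log parameter are my own assumptions, not part of Puppeteer.

```javascript
// Hypothetical helper: forwards browser-side console messages to the Node.js output.
// `page` is expected to be a Puppeteer Page; `log` is an assumed, injectable logger.
const forwardBrowserConsole = (page, log = console.log) => {
  page.on('console', (msg) => {
    // msg.type() is the console method used in the page (for example 'log' or 'error'),
    // msg.text() is the message text as a plain string
    log(`[browser:${msg.type()}] ${msg.text()}`)
  })
}
```

Call forwardBrowserConsole(page) once, right after the page is created, so that messages logged during the first navigation are not missed.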

An easy way to debug the state of the page you are scraping is to take a screenshot, which can be done with the following script:

const createScreenshot = async (page) => {
  const path = `screenshot-${Date.now()}.png`
  console.log('--creating snapshot', path)
  await page.screenshot({
    type: 'png', // can also be "jpeg" or "webp" (recommended)
    path,
    fullPage: true, // will scroll down to capture everything if true
  })
}

but an even better way to debug what is happening during scraping is via the browser itself; turn off headless mode to watch it live:

const browser = await puppeteer.launch({ headless: false, dumpio: false })

(don't forget to disable the command that closes the browser, so that if an unhandled error happens, you can inspect the state of the page yourself!)
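
One way to keep the browser open only when something goes wrong, as a rough sketch: close it on success, and leave it alone on failure. The names runScraper, launch, scrape and keepOpenOnError are hypothetical and only serve the illustration; launch is assumed to wrap a call such as puppeteer.launch({ headless: false }).

```javascript
// Hypothetical wrapper: `launch` returns a browser, `scrape` does the actual work.
const runScraper = async ({ launch, scrape, keepOpenOnError = true }) => {
  const browser = await launch()
  try {
    const result = await scrape(browser)
    // Success: nothing left to inspect, so the browser can be closed.
    await browser.close()
    return result
  } catch (err) {
    // Failure: leave the window open for inspection unless told otherwise.
    if (!keepOpenOnError) await browser.close()
    throw err
  }
}

// Assumed usage (requires puppeteer to be installed):
// runScraper({ launch: () => puppeteer.launch({ headless: false }), scrape: myScrape })
```

Keep in mind that an open browser keeps the Node.js process alive, so this is meant for debugging sessions, not for unattended runs.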
don't try to "simulate user behaviour"; do what is most effective for Puppeteer, e.g. avoid using the back button, and instead open URLs in new tabs and close them afterwards
raise the navigation timeout with page.setDefaultNavigationTimeout() to at least 2 minutes, because the default value is 30 seconds [^1]
prefer the built-in document.querySelector over cheerio (avoid this: page.$('body')) [^1]
don't use screenshots for infinite scrolls; at some point the page will be too big to capture in a screenshot
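
The "new tab instead of back button" and "longer timeout" tips above can be combined roughly like this. It is a minimal sketch, not a definitive implementation: scrapeInNewTab, NAVIGATION_TIMEOUT_MS and the extract callback are names I made up, while browser.newPage(), page.setDefaultNavigationTimeout(), page.goto() and page.close() are regular Puppeteer calls.

```javascript
// 2 minutes, expressed in milliseconds as setDefaultNavigationTimeout expects
const NAVIGATION_TIMEOUT_MS = 2 * 60 * 1000

// Hypothetical helper: opens `url` in a fresh tab, runs `extract(page)`, then closes the tab.
const scrapeInNewTab = async (browser, url, extract) => {
  const page = await browser.newPage()
  page.setDefaultNavigationTimeout(NAVIGATION_TIMEOUT_MS)
  try {
    await page.goto(url)
    return await extract(page)
  } finally {
    // Close the tab even when navigation or extraction fails, so tabs do not pile up.
    await page.close()
  }
}
```

The extract callback is where the document.querySelector work would go, for example (page) => page.evaluate(() => document.querySelector('h1').textContent), with 'h1' standing in for whatever selector your target page needs.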

TLDR

To be honest, I do not think Puppeteer is the right tool for this task 🤷‍♂️

[^1]: VIDEO - Industrial-scale Web Scraping with AI & Proxy Networks

This article was originally published on https://craftengineer.com/. It was written by a human and polished using grammar tools for clarity.

Follow me on X (formerly Twitter) or Bluesky.