What I Wish I Knew Before Writing My Own Scraper With Puppeteer
There is no developer who has not written (or at least tried to write) at least one scraping bot.
Here is my list of what I wish I had known back then.
Best Practices for Writing a Scraper with Puppeteer
- always start by prompting #ai to create a script skeleton for you (I usually start right inside the #cursorai IDE.)
- there are several ways to click on an element; probably the easiest one is the following:
await page.click('.selector')
or via a callback:
await page.$eval('.selector', (el) => el.click())
or additionally via a querySelector:
await page.evaluate(() => { document.querySelector('.selector').click() })
- page.evaluate(() => {...}) is executed inside the browser context, therefore window is available there, however with caveats - don't expect console.log(...) to show in your #nodejs script output
- there is a way to output console.log logs from the page.evaluate(() => {...}) context, by adding this code at the beginning of your scraping script:
page.on('console', (msg) => {
  // print every argument that was passed to console.log(...) inside the page, one per line
  for (let i = 0; i < msg.args().length; ++i) console.log(`${i}: ${msg.args()[i]}`)
})
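If you only need the text of each message rather than its individual arguments, a shorter variant might be enough. This is a minimal sketch that assumes the `type()` and `text()` methods of Puppeteer's console message object; the helper name `forwardConsole` and the `[browser:...]` prefix are my own choices for illustration.

```javascript
// Hypothetical helper: forwards browser console messages to the Node.js output.
// `page` is assumed to be a Puppeteer Page; `log` defaults to console.log.
const forwardConsole = (page, log = console.log) => {
  page.on('console', (msg) => {
    // msg.type() is e.g. "log" or "error"; msg.text() is the message text
    log(`[browser:${msg.type()}] ${msg.text()}`)
  })
}

// Usage (assumes an existing Puppeteer `page`):
// forwardConsole(page)
// await page.evaluate(() => console.log('hello from the page'))
```

Passing `log` as a parameter is only there so the output can be redirected, for example into a file logger.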
- an easy way to debug the state of the page you are scraping is with a screenshot, which can be done with the following script:
const createScreenshot = async (page) => {
  const path = `screenshot-${Date.now()}.png`
  console.log('--creating snapshot', path)
  await page.screenshot({
    type: 'png', // can also be "jpeg" or "webp" (recommended)
    path,
    fullPage: true, // will scroll down to capture everything if true
  })
}
- but an even better way to debug what is happening during scraping is via the browser itself; turn off headless mode to see what is happening:
const browser = await puppeteer.launch({ headless: false, dumpio: false })
- → (don't forget to disable the command that closes the browser, so that if an unhandled error happens, you can check the page state yourself!)
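One way to keep the window around after a failure is to close the browser only when the scrape succeeds. This is a rough sketch, not a fixed recipe: `runAndKeepOpenOnError` and `scrape` are hypothetical names, and `browser` is assumed to be the object returned by `puppeteer.launch(...)`. The error is logged and returned rather than rethrown because, as far as I can tell, an uncaught error ends the Node.js process and takes the launched browser with it.

```javascript
// Hypothetical helper: runs `scrape(browser)` and closes the browser only on success.
// On failure the browser is left open so the page can be inspected by hand.
const runAndKeepOpenOnError = async (browser, scrape) => {
  let result
  try {
    result = await scrape(browser)
  } catch (error) {
    // Deliberately no browser.close() here
    console.error('--scrape failed, browser left open', error)
    return { ok: false, error }
  }
  await browser.close()
  return { ok: true, result }
}

// Usage (assumes `browser` came from puppeteer.launch({ headless: false })):
// const outcome = await runAndKeepOpenOnError(browser, async (b) => {
//   const page = await b.newPage()
//   await page.goto('https://example.com')
//   return page.title()
// })
```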
- don't try to "simulate user behaviour"; do what is most effective for Puppeteer - for example, avoid using the back button, and instead open URLs in new tabs and close them afterwards
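The new-tab approach can be sketched roughly like this. `withNewTab` and `work` are hypothetical names I made up for illustration; the sketch only assumes the Puppeteer calls `browser.newPage()`, `page.goto()` and `page.close()`.

```javascript
// Hypothetical helper: opens `url` in a fresh tab, runs `work(page)`,
// and always closes the tab afterwards.
const withNewTab = async (browser, url, work) => {
  const page = await browser.newPage()
  try {
    await page.goto(url)
    return await work(page)
  } finally {
    // Runs on success and on error, so tabs do not pile up
    await page.close()
  }
}

// Usage (assumes an existing `browser` and an array `urls` of detail pages):
// for (const url of urls) {
//   const title = await withNewTab(browser, url, (page) => page.title())
//   console.log(url, title)
// }
```

The listing page stays untouched in its own tab, so there is no need to navigate back and wait for it to reload.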
- change page.setDefaultNavigationTimeout() to at least 2 minutes, because the default value is 30 seconds[^1]
- prefer the built-in document querySelector over cheerio → (avoid this: page.$('body')) [^1]
- don't use screenshots for infinite scrolls; at some point the page is going to be too big to create a screenshot
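For the timeout tip above, keep in mind that the value is given in milliseconds. A minimal sketch, assuming a Puppeteer `page`; the names `TWO_MINUTES_MS` and `useLongNavigationTimeout` are hypothetical and only there for readability.

```javascript
// 2 minutes expressed in milliseconds
const TWO_MINUTES_MS = 2 * 60 * 1000

// Hypothetical helper: raises the navigation timeout on a Puppeteer Page
// and returns the same page so calls can be chained.
const useLongNavigationTimeout = (page, ms = TWO_MINUTES_MS) => {
  page.setDefaultNavigationTimeout(ms)
  return page
}

// Usage (assumes an existing `browser`):
// const page = useLongNavigationTimeout(await browser.newPage())
// await page.goto('https://example.com')
```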
TLDR
To be honest, I do not think Puppeteer is the right tool for this task 🤷♂️
This article was originally published on https://craftengineer.com/. It was written by a human and polished using grammar tools for clarity.
Follow me on X (formerly Twitter) or Bluesky.