Proxies for Puppeteer: prevent getting blocked while scraping!

Does your IP get blacklisted when crawling or scraping with Puppeteer? Are you tired of getting blocked by reCAPTCHA? No need to worry anymore!

In this post we will show you how to avoid being detected as a bot when using Puppeteer, and the best IP solution for avoiding blacklists and blocks while scraping with Puppeteer!

What is Puppeteer?

Puppeteer is a tool for web developers built by Google. It is a Node library that provides a high-level API to control Chrome and Chromium, in both headless and non-headless (headful) mode.

A web browser without a user interface is called a headless browser, and it lets you automate control of a web page. Because the automation runs on top of a real browser, you no longer need to execute JavaScript, render pages, or follow redirects yourself; the browser handles all of that for you.

This makes it possible to reliably and accurately access target websites that detect bots by inspecting the cookies and headers of their visitors.
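To make this concrete, here is a minimal Puppeteer sketch, assuming the puppeteer package is installed via npm and using example.com as a placeholder target site:

const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless Chromium instance (headless is the default mode)
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // The real browser runs JavaScript, renders the page and follows redirects for us
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();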


Why use a headless browser like Puppeteer for testing or scraping?

The main benefit of using a headless browser is that you can automate your testing and scraping operations. A headless browser like Puppeteer lacks Flash Player and other kinds of software that leak information about the user to target websites. With these parameters out of the way, you can easily increase your success rate.

Puppeteer is an easy-to-use automation tool compared to other headless browsers that require more technical expertise. Built for the Chrome browser, Puppeteer is used for testing and automating web applications by simulating real-user behavior. It lets you test a site's user interface to make sure it behaves as developers expect.

With Puppeteer there is no need to open a visible browser: you can generate a screenshot of the final destination with little to no effort. Puppeteer also lets you use incognito mode, providing an entirely neutral environment with no cookies, no cache, and no device fingerprints. Every time you open a browser, it is as if you were on an entirely new machine!
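As a rough sketch, an isolated incognito context can be created with the createIncognitoBrowserContext API available in the Puppeteer versions current at the time of writing (example.com is again just a placeholder target):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();

  // Each incognito context gets its own cookies and cache,
  // isolated from every other context in the same browser
  const context = await browser.createIncognitoBrowserContext();
  const page = await context.newPage();

  await page.goto('https://example.com');

  await context.close();   // discards the context's session data
  await browser.close();
})();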




Why are Puppeteer proxies needed?

Working with an automation tool like Puppeteer lets you code every aspect of the environment, but the one thing that cannot be coded is your IP address.

There is no doubt that websites can easily detect web-scraping activity based on the IP address. Even during normal browsing you are sometimes asked to complete a CAPTCHA because Google has flagged you as a bot, right?

Proxies are needed to test your application from a different country or city. They are also a must if you need to scrape multiple pages. A proxy network not only lets you simulate a real user in the location you require, it also keeps you anonymous and provides the real-time, accurate data you need.

With the use of a Puppeteer proxy, you can run multiple browsers simultaneously, each from a unique IP, and test both the performance and the speed of the site or application.
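As an illustration, one way to launch several browsers in parallel, each behind a different proxy, is sketched below; the proxy endpoints and credentials are placeholders you would replace with your own:

const puppeteer = require('puppeteer');

// Placeholder proxy endpoints; one browser instance is launched per entry
const proxies = ['proxy1.example.com:24000', 'proxy2.example.com:24000'];

(async () => {
  await Promise.all(proxies.map(async (proxy, i) => {
    // Each browser routes its traffic through a different proxy
    const browser = await puppeteer.launch({
      args: [`--proxy-server=${proxy}`],
    });
    const page = await browser.newPage();
    await page.authenticate({ username: 'USERNAME', password: 'PASSWORD' });

    // lumtest.com/myip.json echoes the IP address the site sees
    await page.goto('http://lumtest.com/myip.json');
    await page.screenshot({ path: `ip-${i}.png` });

    await browser.close();
  }));
})();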


Why do I recommend Luminati with Puppeteer?

Luminati offers four separate proxy networks, all with country and city targeting. With its many product offerings and more than 11 IP types, it has everything you need to successfully scrape and test the sites you require.

Luminati also offers a free, open-source proxy manager which allows you to control your proxies and their parameters with the ease of a simple drop-down menu.

Within the Luminati Proxy Manager, you can choose your preferred user-agent or employ random user agents on each request. The software also supports custom user agents and headers.
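If you prefer to set these in Puppeteer itself rather than in the Proxy Manager, a hedged sketch using page.setUserAgent and page.setExtraHTTPHeaders (with illustrative user-agent strings you would swap for your own list) could look like this:

const puppeteer = require('puppeteer');

// Illustrative user-agent strings; replace with whatever list you need
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Pick a random user agent and add a custom header for this page
  await page.setUserAgent(userAgents[Math.floor(Math.random() * userAgents.length)]);
  await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US,en;q=0.9' });

  await page.goto('https://example.com');
  await browser.close();
})();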

By automating your browser with the use of proxies you can quickly and easily test your applications, generate screenshots and ensure the user experience you desire!

Puppeteer with Luminati

How to connect Puppeteer with Luminati’s Super Proxies

  • Begin by going to your Luminati Dashboard and clicking ‘create a zone’.
  • Choose ‘Network type’ and click save.
  • In Puppeteer, fill in the proxy IP:port as the ‘proxy-server’ value, for example zproxy.lum-superproxy.io:22225.
  • Under ‘page.authenticate’, enter your Luminati account ID and proxy zone name as the ‘username’ value (for example lum-customer-CUSTOMER-zone-YOURZONE), and enter your zone password, found in the zone settings, as the ‘password’ value.

For example:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    // Route all browser traffic through Luminati's Super Proxy
    args: ['--proxy-server=zproxy.lum-superproxy.io:22225'],
  });

  const page = await browser.newPage();

  // Authenticate against the Super Proxy with your zone credentials
  await page.authenticate({
    username: 'lum-customer-USERNAME-zone-YOURZONE',
    password: 'PASSWORD',
  });

  await page.goto('http://lumtest.com/myip.json');
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();

How to connect Puppeteer with Luminati’s Proxy Manager

  • Create a zone with the network, IP type and number of IPs you wish to use.
  • Install the Luminati Proxy Manager.
  • Click ‘add new proxy’ and choose the zone and settings you require, click ‘save’.
  • In Puppeteer, under the ‘proxy-server’ value, input your local IP and Proxy Manager port (e.g. 127.0.0.1:24000)
    • The local host IP is 127.0.0.1
    • The port created in the Luminati Proxy Manager is 24XXX, for example 24000
  • Leave the username and password empty (no ‘page.authenticate’ call is needed), as the Luminati Proxy Manager has already been authenticated with the Super Proxy.

For example:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    // Point the browser at the local Luminati Proxy Manager port
    args: ['--proxy-server=127.0.0.1:24000'],
  });

  const page = await browser.newPage();

  // No page.authenticate() call is needed: the Proxy Manager
  // already holds the credentials for the Super Proxy
  await page.goto('http://lumtest.com/myip.json');
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();


Using the headless browser Puppeteer in tandem with the Luminati proxy service allows you to automate your operations with ease.

By combining the two you can manipulate every request you send and see how the site or application responds. This allows for the most accurate web-data extraction and a true look at the user experience of the applications you need to test.
