NodeJS is a server-side runtime for JavaScript. Since its release, it has made JavaScript development on the server practical and pleasant. NodeJS ships with NPM, which opens up endless possibilities through plug-and-play libraries and modules. NPM hosts a wide variety of packages that you can use in your NodeJS applications; these libraries are stable and add capabilities the runtime does not provide on its own.
With NodeJS, you can build your own APIs or call third-party APIs. Web scraping takes this one step further: you request third-party webpages, parse the responses, and extract the information you need from them. If you are new to web scraping but not new to the JavaScript ecosystem, you will be glad to know that NodeJS handles it well.
As we advance, we will see how NodeJS and a few NPM libraries can help you scrape almost any type of webpage. We will also go through a hands-on section where we scrape a Wikipedia page to get details about US presidents. But first, let's understand why web scraping is needed.
Why is web scraping required?
Web scraping is a way to get data from websites in an automated fashion. With it, you can pull details from almost any website, whether or not that site exposes an API. The legality of web scraping varies by jurisdiction and by site, so it is best to get consent from the site owners before scraping any content, or at least to scrape in a way that does not interfere with the site's primary purpose.
Web scraping is used on many fronts: price monitoring, market research, gathering financial data, verifying and republishing news, automating data collection, and more.
The primary objective of web scraping is to collect as much data as possible. When you scrape, you are not bound by the call limits or quotas an official API would impose.
Now that you know why web scraping is useful, let's look at what you need to start scraping with NodeJS.
Requirements For Web Scraping With NodeJS
NodeJS
One of the primary requirements for web scraping with NodeJS is NodeJS itself. You need to install it before anything else. To do so, head over to the nodejs.org website and download the LTS version of the runtime. The installer also includes NPM (Node Package Manager), which is essential for NodeJS applications: it lets you install and manage the other libraries your project uses.
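If you want a quick sanity check that the runtime is set up correctly, you can run a one-line script (a hypothetical check.js, shown here as a minimal sketch) with node check.js and confirm that the printed version matches what you downloaded; running npm -v in the same terminal confirms NPM came along with it.

// check.js – prints the installed Node version and platform
console.log(`Node ${process.version} on ${process.platform}`);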
Request-promise
Once you have installed NodeJS and NPM, it is time to install the request-promise module. As the name suggests, request-promise is an NPM library for sending HTTP requests that return Promises. It is the foundation of the scraping scripts in this article: without it, we cannot request a webpage and receive its response.
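As a minimal sketch of how request-promise is typically used (the URL below is just a placeholder), a request resolves with the response body, which you can await like any other Promise. Note that the underlying request library has since been deprecated on npm, so pin your dependency versions if you follow along.

const rp = require('request-promise');

async function fetchPage(url) {
  try {
    const html = await rp(url);                    // resolves with the page's HTML body
    console.log(`Fetched ${html.length} characters from ${url}`);
    return html;
  } catch (err) {
    console.error('Request failed:', err.message); // network errors and non-2xx statuses land here
  }
}

fetchPage('https://example.com');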
CheerioJS
After installing the request-promise module, you need to set up CheerioJS. It is another NPM library, used in most web scraping projects. Cheerio parses the webpage responses we receive after sending a request through request-promise. Once the page is parsed, we can traverse the resulting structure with a jQuery-like syntax and quickly find the elements that matter to us.
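To give a feel for the API before the hands-on section, here is a minimal sketch of cheerio parsing a small HTML string (the markup is made up for illustration). The load function returns a jQuery-like $ that you can query with CSS selectors.

const cheerio = require('cheerio');

const html = '<ul><li><a href="/first">First</a></li><li><a href="/second">Second</a></li></ul>';
const $ = cheerio.load(html);

// select every <a> that is a direct child of an <li> and print its text and href
$('li > a').each((i, el) => {
  console.log($(el).text(), '->', $(el).attr('href'));
});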
Puppeteer
Puppeteer is the library that helps us automate the browser itself: it gives us programmatic control over a headless Chromium instance through scripts. We will barely use it in the walkthrough example, since we are only doing simple scraping, but you will need Puppeteer for larger projects, especially sites that render their content with JavaScript.
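All three libraries install with a single command, for example npm install request-promise cheerio puppeteer. As a minimal sketch of what Puppeteer looks like (the URL is a placeholder), you launch a headless browser, navigate to a page, and read whatever you need once it has loaded.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();                          // starts a headless Chromium
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  console.log('Page title:', await page.title());
  await browser.close();
})();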
By now, you know the tools required for web scraping with NodeJS, so why not use them and create a sample web scraping project with NodeJS? Let’s move to our hands-on web scraping section.
Rotating proxies are becoming increasingly popular with web scrapers, who rely on them to scrape the web anonymously and to change their IP address periodically so they do not get blocked. These proxies let you gather web data without being detected by site owners who have set up IP blocks; a minimal usage sketch follows the provider list below.
- Rotating residential proxy choices: Soax, Brightdata, and Smartproxy
- Rotating datacenter proxy choices: Rayobyte, Webshare, and Oxylabs
- Scraping API choices: Apify and ScraperAPI
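Here is a hedged sketch of how rotation might look with request-promise; the proxy URLs are placeholders, and the real endpoints, credentials, and rotation rules come from whichever provider you choose. Many residential providers rotate the exit IP for you behind a single gateway, in which case a plain proxy option pointing at that gateway is enough.

const rp = require('request-promise');

// placeholder proxy endpoints – substitute the ones your provider gives you
const proxies = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
];

let current = 0;
function nextProxy() {
  const proxy = proxies[current % proxies.length]; // simple round-robin rotation
  current += 1;
  return proxy;
}

function fetchThroughProxy(url) {
  // request-promise forwards the proxy option to the underlying request library
  return rp({ uri: url, proxy: nextProxy() });
}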
Hands-on Web Scraping With NodeJS
In this example, we will scrape a Wikipedia page containing information about US presidents. It is quite a simple scraping script: we will use request-promise and cheerio to get the data, and Puppeteer can come in afterwards for deeper follow-up scraping.
The first step is creating a new JS file named firstScraper.js, writing the code below into it, and saving it.
const rp = require('request-promise');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';

rp(url)
  .then(function (html) {
    // success!
    console.log(html);
  })
  .catch(function (err) {
    // handle error
    console.error(err);
  });
Here we import the request-promise module, define the URL to be scraped, and call request-promise with that URL; it hits the page, waits for the response, and prints the response to the console.
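To try it, save the file and run node firstScraper.js from your project directory (after installing the dependency with npm install request-promise); the raw HTML of the Wikipedia page should scroll past in your terminal.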
To find the correct details, we need to inspect and see the data once in the web browser. To do that, open your web browser, go to the URL, and open developer tools. From there, just inspect the page elements, and note down the HTML structure, class names, and IDs which store your desired information. This was simple, and I hope you’ve not faced any errors.
Once you have decided which details you need and understood the class names, IDs, and HTML structure, you can use cheerio to parse the page and pull out the relevant pieces. Since we want the names and URLs of US presidents, we can find each of them inside an <a> tag that is wrapped in a <big> tag.
Replace the previous contents of your firstScraper.js file with the code below.
const rp = require('request-promise');
const $ = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';

rp(url)
  .then(function (html) {
    // success!
    console.log($('big > a', html).length);
    console.log($('big > a', html));
  })
  .catch(function (err) {
    // handle error
    console.error(err);
  });
In the above code, we parse the HTML page with cheerio and log the elements that match our selector. The expression $('big > a', html) finds every <a> tag that is a direct child of a <big> tag in the response and returns the matches as an array-like collection of element objects.
The returned collection should have a length of 45, the number of US presidents at the time of writing. There can be hidden <big> tags elsewhere on the page, so to keep only the relevant data we restrict the loop to the first 45 matches. Copy the code below into firstScraper.js, replacing what is there.
const rp = require('request-promise');
const $ = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';

rp(url)
  .then(function (html) {
    // success!
    const wikiUrls = [];
    for (let i = 0; i < 45; i++) {
      // collect the href of each president's Wikipedia page
      wikiUrls.push($('big > a', html)[i].attribs.href);
    }
    console.log(wikiUrls);
  })
  .catch(function (err) {
    // handle error
    console.error(err);
  });
The above code ensures that we only collect data for the 45 US presidents and store it in an array. The array holds the URL of each president's Wikipedia page.
You have successfully scraped, parsed, and stored a webpage with NodeJS. Now you can feed these stored URLs to Puppeteer and automatically scrape more details about each president, such as their birthdate, term in office, and so on.
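As a hedged sketch of that follow-up step (the infobox selector is an assumption about Wikipedia's markup and may need adjusting), you could pass the wikiUrls array into Puppeteer, visit each page, and read the "Born" row from the infobox.

const puppeteer = require('puppeteer');

async function scrapePresidentDetails(wikiUrls) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  for (const href of wikiUrls) {
    const url = 'https://en.wikipedia.org' + href;   // the scraped hrefs are relative paths
    await page.goto(url, { waitUntil: 'domcontentloaded' });

    // look for the infobox table row whose header reads "Born"
    const born = await page.evaluate(() => {
      const rows = Array.from(document.querySelectorAll('.infobox tr'));
      const match = rows.find(row => row.querySelector('th') &&
                                     row.querySelector('th').textContent.trim().startsWith('Born'));
      return match ? match.querySelector('td').textContent.trim() : null;
    });

    console.log(url, '->', born);
  }

  await browser.close();
}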
Conclusion
That brings us to the end: you now know the basics of web scraping with NodeJS. Leverage this knowledge to scrape websites legally and gather as much relevant data as you need. Keep experimenting, and you will soon master web scraping with NodeJS.