Build a Web Crawler using Node.js
This blog post is about building a quick web crawler using Node.js and is aimed at people new to Node. This post was written by Vijay Kumar.
Web Crawlers
A web crawler is a program that browses the World Wide Web and builds an index of the data it finds.
Scraping, which means extracting data directly from a page's HTML, is helpful when a website does not provide an API.
In this post, we'll see how to use Node.js and its libraries to build a web crawler.
Node.js is a cross-platform JavaScript runtime environment that uses the V8 engine to execute JavaScript outside the browser, typically on the server.
You can download and install Node.js from this link: https://nodejs.org/en/download/
Node.js libraries:
request-promise - A simplified HTTP request client with Promise support. The syntax is similar to fetch, which is supported by major browsers.
cheerio - Parses HTML markup and provides an API for traversing/manipulating the resulting data structure. It has a jQuery-like syntax.
cli-table - A utility to render tables on the command line from your Node.js scripts.
Kicking off the project:
Quotes to Scrape (http://quotes.toscrape.com/) is the website we are going to scrape for practice. For each quote, we will extract the text, the author, and the tags.
To start the project, create a folder in your workspace. Let's call the folder scraper.
In the terminal, type yarn init or npm init and enter the project details when prompted. This creates a package.json for our project.
package.json contains the metadata for our project: the name, version, repository URL, and the list of dependencies, which in our case will include request-promise, cheerio, and cli-table.
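To install the dependencies, type yarn add cheerio request-promise request cli-table or npm i -S cheerio request-promise request cli-table. This installs the libraries and saves them to the package.json file under dependencies. Note the capital -S, which is short for --save; a lowercase -s only silences npm's output.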
Create a file named scraper.js in the project folder and open it in your favorite code editor.
const rp = require('request-promise');
const cheerio = require('cheerio');
const Table = require('cli-table');

const url = 'http://quotes.toscrape.com/';
const options = {
  uri: url,
  // Parse the response body with cheerio before the promise resolves.
  transform: (html) => cheerio.load(html)
};

rp(options)
  .then((data) => console.log(data))
  .catch((error) => console.log(error));
In the above code snippet, we import the libraries we will use. In the options variable, we pass the URL to request and a transform function that hands the response HTML to cheerio.load, so the promise resolves with a parsed document rather than a raw string.
We use the request-promise library to make the request, and the result is printed in the terminal, assuming there was no issue.
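The catch handler above logs whatever it receives. As a rough sketch, the helper below turns the error into a one-line message. It assumes request-promise rejects with an error named StatusCodeError (carrying a statusCode property) for non-2xx responses and RequestError for network failures; describeError is a hypothetical name, not part of the library.

```javascript
// describeError is a hypothetical helper, not part of request-promise.
function describeError(error) {
  if (error && error.name === 'StatusCodeError') {
    // The server answered, but with a non-2xx status code.
    return `Server responded with status ${error.statusCode}`;
  }
  if (error && error.name === 'RequestError') {
    // The request never completed (DNS failure, refused connection, ...).
    return `Request failed: ${error.message}`;
  }
  // Anything else, e.g. an exception thrown inside the transform function.
  return `Unexpected error: ${error && error.message ? error.message : String(error)}`;
}

// Stand-in error objects so the helper can be tried without a network call.
const notFound = describeError({ name: 'StatusCodeError', statusCode: 404 });
const offline = describeError({ name: 'RequestError', message: 'getaddrinfo ENOTFOUND' });

console.log(notFound);
console.log(offline);
```

In the script it could be wired in as .catch((error) => console.error(describeError(error))).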
To run the script, type node scraper.js in your terminal and hit Enter. Voilà, the parsed cheerio object is printed:
{ [Function: initialize]
fn:
initialize {
constructor: [Circular],
_originalRoot:
{ type: 'root',
name: 'root',
namespace: 'http://www.w3.org/1999/xhtml',
attribs: {},
'x-attribsNamespace': {},
'x-attribsPrefix': {},
children: [Array],
parent: null,
prev: null,
next: null } },
load: [Function],
html: [Function],
xml: [Function],
text: [Function],
parseHTML: [Function],
root: [Function],
contains: [Function],
merge: [Function],
_root:
{ type: 'root',
name: 'root',
namespace: 'http://www.w3.org/1999/xhtml',
attribs: {},
'x-attribsNamespace': {},
'x-attribsPrefix': {},
children: [ [Object], [Object] ],
parent: null,
prev: null,
next: null },
_options:
{ withDomLvl1: true,
normalizeWhitespace: false,
xml: false,
decodeEntities: true } }
Before proceeding, let's use cli-table to create a table to hold the values.
const table = new Table({
  head: ['Quote', 'Author', 'Tags'],
  colWidths: [100, 30, 30]
});
The next step is to traverse the HTML parsed by cheerio.
To get the quote, tags, and author, we need to traverse the DOM tree of the page. We select each quote element by its class or id, which we can find by inspecting the page with Chrome dev tools (assuming you are using Google Chrome).
rp(options)
  .then((data) => {
    const quote = data('.quote');
    quote.each(function(index, element) {
      // Wrap the element with the loaded selector function (data) rather than
      // calling cheerio() directly, which newer cheerio releases do not support.
      const quoteText = data(element).children('.text').text();
      const quoteTags = data(element).children('.tags').text();
      const quoteAuthor = data(element).children('span').children('.author').text();
      table.push([quoteText, quoteAuthor, quoteTags]);
    });
    console.log(table.toString());
  })
  .catch((error) => console.log(error));
In the snippet above, data is the function returned by cheerio.load. We pass it the selector .quote, which we identified through Chrome dev tools.
We traverse this class to get the children of the quote object that contains the quote, the author, and the tags related to the quote.
Using cheerio’s each method, we iterate over the matched elements by passing in a callback function that receives the index and the element as arguments.
In the function block, we wrap the element in a cheerio object so that it is easy to traverse. We use jQuery-like selectors to get the quote, tags, and author, and assign them to the relevant variables. The text values are obtained using the text() function.
Finally, we push the values to the table we created earlier and use console.log() to print the table and verify the scraped information.
The complete code for the scraper:
const rp = require('request-promise');
const cheerio = require('cheerio');
const Table = require('cli-table');

const url = 'http://quotes.toscrape.com/';

// Table that will hold one row per scraped quote.
const table = new Table({
  head: ['Quote', 'Author', 'Tags'],
  colWidths: [100, 30, 30]
});

const options = {
  uri: url,
  // Parse the response body with cheerio before the promise resolves.
  transform: (html) => cheerio.load(html)
};

rp(options)
  .then((data) => {
    const quote = data('.quote');
    quote.each(function(index, element) {
      // Wrap the element with the loaded selector function (data) rather than
      // calling cheerio() directly, which newer cheerio releases do not support.
      const quoteText = data(element).children('.text').text();
      const quoteTags = data(element).children('.tags').text();
      const quoteAuthor = data(element).children('span').children('.author').text();
      table.push([quoteText, quoteAuthor, quoteTags]);
    });
    console.log(table.toString());
  })
  .catch((error) => console.log(error));
To see the output, run node scraper.js in your terminal.
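The script reads only the first page. A crawler in the sense described at the top of the post would also follow links to further pages. As a minimal sketch of one ingredient, the helper below turns a link found on a page into an absolute URL and skips links that leave the site or were already visited. nextUrlToVisit is a hypothetical name, and the /page/2/ path is an illustrative assumption rather than something confirmed from the site.

```javascript
// Uses the WHATWG URL class, which is a global in current Node.js releases.

// Hypothetical helper: returns the absolute URL to crawl next, or null to stop.
function nextUrlToVisit(currentUrl, href, visited) {
  if (!href) {
    return null; // no link found on the page
  }
  // Resolve relative links such as "/page/2/" against the current page.
  const resolved = new URL(href, currentUrl);
  // Stay on the same site and never fetch the same page twice.
  if (resolved.origin !== new URL(currentUrl).origin || visited.has(resolved.href)) {
    return null;
  }
  visited.add(resolved.href);
  return resolved.href;
}

const visited = new Set(['http://quotes.toscrape.com/']);
const second = nextUrlToVisit('http://quotes.toscrape.com/', '/page/2/', visited);
const repeat = nextUrlToVisit('http://quotes.toscrape.com/', '/page/2/', visited);
const external = nextUrlToVisit('http://quotes.toscrape.com/', 'http://example.com/', visited);

console.log(second, repeat, external);
```

The href argument would come from cheerio, for example the attr('href') of whatever "next page" link the site exposes.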
That’s it, folks. Hope this helped you get started with scraping data. Keep watching this space for more DIY starter tutorials.