Data Sci Adventures - part 4: Web scraping in JavaScript and Python

Published on 2021-07-15 17:56

Recently I've been working on two small projects which required web scraping. The Python project is there to satisfy my curiosity, and the JavaScript project is homework for a potential job.

For those interested in my job search: this position won't work out. My gut tells me to drop out of the hiring process.

The only thing you need to know for now is CSS selectors.

Getting the data

In both cases we need to get the data first. As neither language offers an easy built-in way to fetch data, external libraries are the way to go. In Python there's the requests library for making HTTP requests easily. In JavaScript, one of the most used is axios.

npm install axios # installs axios library for JavaScript
pip install requests # installs requests library for Python

Probably the biggest difference in usage between axios and requests is their approach to asynchronicity. requests works synchronously by default, which means that a call blocks execution until it succeeds or fails. axios, on the other hand, returns a promise: unless the await keyword is used, the following statements run as usual without waiting for the response.

In Python one uses requests:

import requests

data = requests.get("url").text  # .text holds the response body (the HTML) as a string
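Since the call blocks until it succeeds or fails, it can be worth setting a timeout and catching errors. Here is a minimal sketch of how that might look. The helper name `fetch_html` and the 10-second timeout are my own assumptions for illustration, not something the projects above prescribe.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        # Blocks here until the server answers, the timeout expires, or an error occurs.
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions instead of returning an error page.
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        print(f"Request failed: {err}")
        return None
    return response.text
```

Calling `fetch_html("url")` with a real address would then return either the HTML string or `None`, so the rest of the script can decide what to do when a page is unreachable.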

In JavaScript:

const axios = require("axios");

// inside an async function
const { data } = await axios.get("url");

// Or

axios.get("url")
  .then((res) => { /* process res.data */ })
  .catch((err) => { /* deal with errors */ });
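Python has the same await idea in its asyncio module, so the blocking versus non-blocking difference can be sketched without leaving Python. This is only an illustration under assumptions: `fake_request` is a made-up stand-in that sleeps instead of touching the network, and the 0.2 second delay is arbitrary.

```python
import asyncio
import time


async def fake_request(name, delay=0.2):
    # Hypothetical stand-in for a network call: it only waits `delay` seconds.
    await asyncio.sleep(delay)
    return f"{name} done"


async def one_by_one():
    # Awaiting each call before starting the next behaves like blocking code.
    first = await fake_request("a")
    second = await fake_request("b")
    return [first, second]


async def overlapping():
    # Starting both calls before waiting lets the two waits overlap.
    return await asyncio.gather(fake_request("a"), fake_request("b"))


def timed(coroutine_function):
    # Run a coroutine function to completion and measure how long it took.
    start = time.perf_counter()
    result = asyncio.run(coroutine_function())
    return result, time.perf_counter() - start


sequential_result, sequential_time = timed(one_by_one)
overlapping_result, overlapping_time = timed(overlapping)

print(f"one by one:  {sequential_time:.2f}s")
print(f"overlapping: {overlapping_time:.2f}s")
```

Both versions return the same results. The overlapping one should finish in roughly the time of a single wait, which is the benefit of not blocking on every request.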

Working with data

As for working with the data, it's the same, except different. In both cases one needs to install one library. There are a couple to choose from in both languages. In JavaScript I chose JSDOM because of my comfort with writing querySelectors. As my Python skills aren't on the same level as my JS skills, I picked BeautifulSoup because of a well-written tutorial.

In both cases we need to import the library first. In Python:

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

In JavaScript:

const { JSDOM } = require("jsdom");

const dom = new JSDOM(data, {
  runScripts: "dangerously", // runs the page's own scripts; only use on pages you trust
  resources: "usable"
});

A basic query in JSDOM is not that different from vanilla JS in the browser, once the document is taken from the dom object. The next command returns the first div that can be found.

const { document } = dom.window;
document.querySelector("div")

When one wants to get all the divs on the page, the command is:

document.querySelectorAll("div")

One thing to remember is that it returns a NodeList and not an array. To be able to use methods like map(), it needs to be converted to an array first, for example with Array.from() or the spread syntax.

BeautifulSoup offers two ways to get elements. One is through find() and find_all(). They both take the same arguments; the difference is how many elements they return. find() returns only the first match, while find_all() returns all of them. The next example returns the first div:

soup.find("div")

For more specificity we can filter by several arguments. One of them is class_ (with a trailing underscore, because class is a reserved word in Python). The next example returns the first div with the class square:

soup.find("div", class_="square")
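To see find() and find_all() side by side, here is a small sketch run against an inline HTML snippet. The snippet and the variable names are made up for illustration; nothing is fetched from the network.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<div id="first">one</div>
<div class="square">two</div>
<div class="square big">three</div>
<p class="square">not a div</p>
"""

soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div")                         # first <div> in the document
all_divs = soup.find_all("div")                      # every <div>, as a list-like ResultSet
first_square = soup.find("div", class_="square")     # first <div> carrying the class "square"
all_squares = soup.find_all("div", class_="square")  # also matches class="square big"
missing = soup.find("span")                          # find() returns None when nothing matches

print(first_div.get_text())
print(len(all_divs))
print([div.get_text() for div in all_squares])
```

Because find() returns None when nothing matches, it's worth checking the result before calling methods such as get_text() on it.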

BeautifulSoup also lets one use CSS selectors through the select() method, which is the equivalent of JSDOM's querySelectorAll(). Its sibling select_one() corresponds to querySelector().

soup.select("div") # same as document.querySelectorAll("div")
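For anyone coming from querySelector, here is a short sketch of select() and select_one() with slightly more involved selectors. The HTML is again a made-up inline snippet, and the selectors are just examples.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<ul id="menu">
  <li class="item"><a href="/home">Home</a></li>
  <li class="item active"><a href="/blog">Blog</a></li>
</ul>
<a href="/outside">Outside</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Like document.querySelectorAll("#menu li.item > a"): returns a list of matches.
menu_links = soup.select("#menu li.item > a")
hrefs = [link["href"] for link in menu_links]

# Like document.querySelector("li.active a"): returns the first match or None.
active_link = soup.select_one("li.active a")
no_table = soup.select_one("table")

print(hrefs)
print(active_link.get_text())
```

Unlike a NodeList, the result of select() is a regular Python list, so list comprehensions work on it directly.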
