Empty result scraping site with Fetch and Cheerio

Out of curiosity, I decided to collect some data from the site for myself (name, price per night, rating) and ran into a problem: I get nothing in the output. I have tried rewriting this with other libraries, but I was told this one is better.

const cheerio = require("cheerio"); 
let fs = require('fs');
const base = "https://ostrovok.ru/hotel/russia/adler/";

(async () => {
  let url = "?page=1";
  const data = [];

  for (let i = 0; i < 176; i++) {
    try {
      console.log(base + url);
      const res = await fetch(base + url);

      if (!res.ok) {
        break;
      }

      const $ = cheerio.load(await res.text());
      const chunk = [...$("")].map(e =>
        $(e).text().trim()
      );
      data.push(chunk);
      url = $("#__next > div > div:nth-child(2) > div > div > div.Layout_content__9ap_g > div:nth-child(3) > div > div.HotelCard_headerArea__hlQPk > div > div.HotelCard_mainInfo__pNKYU > div.HotelCard_wrapTitle__t742O > h2 > a").attr("TEXT");
    }
    catch (err) {
      console.error(err);
      break;
    }
  }

  console.log(JSON.stringify(data, null, 2));

  fs.writeFile('numbers.txt', data.join('\n'), function(err) {
    if (err) {
        console.log(err);
    }
});

})();

I was expecting to see a list of data, but I got [].

Solution:

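base + url never advances through the pages: url starts as ?page=1 and is never set to the next page number. Try interpolating the loop index into the URL instead: ${base}?page=${i}, with i starting at 1.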
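As a minimal sketch of that idea, the page URL can be built from the loop index on every iteration. The pageUrl helper below is hypothetical (it is not in the original code), and it assumes the site paginates through a page query parameter that starts at 1, as the ?page=1 in the question suggests:

```javascript
// Hypothetical helper (not part of the original code): builds the URL for one page.
// Assumes pagination is driven by a "page" query parameter.
function pageUrl(base, page) {
  const url = new URL(base);
  url.searchParams.set("page", String(page));
  return url.toString();
}

const base = "https://ostrovok.ru/hotel/russia/adler/";
const urls = [];

// Start at 1 rather than 0, since the first page is ?page=1.
for (let i = 1; i <= 3; i++) {
  urls.push(pageUrl(base, i));
}

console.log(urls);
```

Using the URL API rather than string concatenation also keeps things correct if the base URL already carries other query parameters.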

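.attr("TEXT") looks incorrect: .attr() reads an HTML attribute, and the element almost certainly has no attribute named TEXT, so the call returns undefined. I assume you want all 20 hotel names on each page, so use [...$("...")].map(e => $(e).text()) to collect each name as a separate array element.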

As for the selector, long, browser-generated ultra-rigid selectors are prone to error. If any assumption in that chain changes, the whole thing breaks. Safer to use ".HotelCard_title__cpfvk", which is all that’s needed to identify the element you want, and nothing more or less.

!res.ok isn’t enough to determine when the pagination ends. Break when the result list is empty.

Putting it together:

const cheerio = require("cheerio"); // ^1.0.0-rc.12
const {writeFile} = require("node:fs/promises");

const url = "<Your URL>";

(async () => {
  const data = [];

  for (let i = 1; i <= 1000; i++) {
    const res = await fetch(`${url}?page=${i}`);

    if (!res.ok) {
      break;
    }
    
    const $ = cheerio.load(await res.text());
    const names = [...$(".HotelCard_title__cpfvk")]
      .map(e => $(e).text());

    if (!names.length) {
      break;
    }

    data.push(...names);
  }

  console.log(data);
  await writeFile("numbers.txt", JSON.stringify(data));
})();

This takes a while to run because the pages are requested one at a time, so you could parallelize requests (at the risk of angering the server), or simply add some logs to ensure each chunk is coming through OK.
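One way to parallelize without firing every request at once is to fetch the pages in small batches and log each batch as it arrives. This is only a sketch under stated assumptions: collectInBatches and fetchPage are hypothetical names, and fetchPage(pageNumber) is assumed to resolve to an array of names that is empty once the pagination has ended. The stub at the bottom stands in for the real network call so the example runs on its own.

```javascript
// Hypothetical helper: fetchPage(pageNumber) is assumed to resolve to an array
// of names, and to an empty array for a page past the end of the results.
async function collectInBatches(fetchPage, batchSize = 5, maxPages = 1000) {
  const data = [];

  for (let start = 1; start <= maxPages; start += batchSize) {
    const end = Math.min(start + batchSize - 1, maxPages);
    const pages = [];
    for (let p = start; p <= end; p++) {
      pages.push(p);
    }

    // Request the whole batch concurrently; Promise.all keeps page order.
    const chunks = await Promise.all(pages.map(p => fetchPage(p)));
    console.log(
      `pages ${start}-${end}: ${chunks.map(c => c.length).join(", ")} items`
    );

    for (const chunk of chunks) {
      if (!chunk.length) {
        return data; // the first empty page marks the end of the pagination
      }
      data.push(...chunk);
    }
  }

  return data;
}

// Stub standing in for the real request: pretends the site has 7 pages.
const fakeFetchPage = async page =>
  page <= 7 ? [`hotel-${page}a`, `hotel-${page}b`] : [];

collectInBatches(fakeFetchPage, 3).then(names => {
  console.log(`collected ${names.length} names`);
});
```

In the real script, fetchPage would wrap the fetch, cheerio.load and selector steps from the loop above. The trade-off is that the last batch may request a few pages past the end, and a larger batchSize puts more load on the server.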

Disclosure: I’m the author of the linked blog post.
