What I do:
I'm crawling the web. I have a list of website links and create a promise from each one (each promise is basically a crawler). Currently I run them in sequence: with 10 links, I crawl the first link, wait for it to finish, crawl the second link, and so on.
What I need:
I want to group my promises. The promises within a group should run in parallel, but the groups themselves should run in sequence.
For example, with 10 links I create 10 promises.
I then split the promises into groups of at most 3 promises each.
It should then crawl the first 3 (the first group), wait for them to finish, then run the 4th, 5th, and 6th (the second group), and so on.
What I tried:
I created a method to split promises:
export function splitPromises<T>(promises: Promise<T>[], maxPerItem: number): Promise<T>[][] {
  const splitPromisesList: Promise<T>[][] = [];
  let currentSplit: Promise<T>[] = [];
  for (let i = 0; i < promises.length; i++) {
    currentSplit.push(promises[i]);
    if (currentSplit.length === maxPerItem || i === promises.length - 1) {
      splitPromisesList.push(currentSplit);
      currentSplit = [];
    }
  }
  return splitPromisesList;
}
Then I wrote a method that uses that splitting and awaits the promises group by group:
async function crawler(links: string[], page: Page): Promise<MyData[]> {
  const list: MyData[] = [];
  const crawlPromises = links.map(async (link, index) => {
    try {
      const newPage = await page.browser().newPage();
      const detail = await crawlLink(link, newPage);
      await newPage.close();
      return detail;
    } catch (e) {
      console.log(e);
      return null as MyData;
    }
  });
  const groupedPromises = splitPromises<MyData>(crawlPromises, 3);
  let results: MyData[] = [];
  for (const group of groupedPromises) {
    results = await Promise.all(group);
    const filteredResults: MyData[] = results.filter((detail) => detail !== null) as MyData[];
    list.push(...filteredResults);
  }
  return list;
}
What are my issues:
I'm not sure what I'm doing wrong, but it executes all the promises at once instead of group by group.
>Solution :
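Once a promise has been created, the work is already in flight. In your code, links.map(async ...) calls the async callback for every link right away, so all the crawls start before the first group is awaited. Awaiting the promises in batches won't delay the work. Instead, you need to batch the creation of the promises.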
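To make this concrete, here is a minimal sketch showing that calling an async function starts its work immediately, before anything is awaited. The names startCrawl, started, and startedBeforeAwait are hypothetical and only serve the illustration; the timer stands in for a real crawl.

```typescript
// Hypothetical log of which "crawls" have started.
const started: number[] = [];

// Hypothetical stand-in for a crawl. The body of an async function runs
// synchronously up to its first `await`, so the push happens at call time.
async function startCrawl(id: number): Promise<number> {
  started.push(id);
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  return id;
}

// Creating the promises already starts all three crawls.
const eagerPromises = [1, 2, 3].map((id) => startCrawl(id));

// Nothing has been awaited yet, but every crawl has already begun.
const startedBeforeAwait = [...started];
```

Splitting eagerPromises into groups afterwards would only change when the results are collected, not when the work runs.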
A function for splitting an array into chunks is still useful, but you need to make it work on an array of strings, not just an array of promises:
export function splitArray<T>(array: T[], maxPerItem: number): T[][] {
  const splitList: T[][] = [];
  // Same result as the loop in splitPromises; slice() just keeps it shorter.
  for (let i = 0; i < array.length; i += maxPerItem) {
    splitList.push(array.slice(i, i + maxPerItem));
  }
  return splitList;
}
And then you need to chunk the links and create your array of promises from just that one chunk. You’ll then wait for the chunk to finish before moving on to the next chunk.
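async function crawler(links: string[], page: Page): Promise<MyData[]> {
  const list: MyData[] = [];
  const chunks = splitArray(links, 3);
  for (const chunk of chunks) {
    // The promises are created here, so only this chunk's crawls are in flight.
    const crawlPromises = chunk.map(async (link) => {
      try {
        const newPage = await page.browser().newPage();
        const detail = await crawlLink(link, newPage);
        await newPage.close();
        return detail;
      } catch (e) {
        console.log(e);
        return null as MyData;
      }
    });
    // Wait for the whole chunk before the next iteration starts more crawls.
    const results = await Promise.all(crawlPromises);
    const filteredResults: MyData[] = results.filter((detail) => detail !== null) as MyData[];
    list.push(...filteredResults);
  }
  return list;
}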
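If you want to check the behaviour without a browser, the same pattern can be sketched as a generic helper. Everything below is hypothetical and not part of your code: runInChunks, trackedCrawl, and the counters are made-up names, and the timer stands in for crawlLink. The counters record how many tasks run at the same time.

```typescript
// Hypothetical generic helper: runs `task` over `items`, one chunk at a time.
async function runInChunks<T, R>(
  items: T[],
  chunkSize: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    // The tasks for this chunk are only started here, inside the loop.
    const chunkResults = await Promise.all(chunk.map((item) => task(item)));
    results.push(...chunkResults);
  }
  return results;
}

// Stand-in for a crawl that tracks how many tasks are running concurrently.
let running = 0;
let maxRunning = 0;
async function trackedCrawl(link: string): Promise<string> {
  running++;
  maxRunning = Math.max(maxRunning, running);
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  running--;
  return link.toUpperCase();
}

const demoLinks = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"];
const allDone = runInChunks(demoLinks, 3, trackedCrawl);
```

Once allDone resolves, maxRunning never exceeds the chunk size, and the results come back in the same order as the input. Note that each chunk waits for its slowest task, so one slow link holds up the start of the next group.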