I'm scraping a product page with the following script:
from requests_html import HTMLSession
import re

s = HTMLSession()
link = "https://www.kaufland.de/product/358005366/"

def get_products(link):
    r = s.get(link)
    title = r.html.find('h1', first=True).text
    price = r.html.find('div.rd-buybox__price', first=True).text.replace(' €', '').replace(',', '.')
    descriptiontable = r.html.find('div.rd-attribute-table', first=True).text
    print(title, price, descriptiontable)

get_products(link)
The area I'm trying to scrape (containing the manufacturer, EAN, et cetera) doesn't seem to be scrapable, unlike the price and title. What am I doing wrong?
>Solution :
It looks like the product details table you're after is populated by JavaScript after the page loads, so it's not in the HTML retrieved by r = s.get(link). As explained in rayt's answer, this is why find() returns None for the div.rd-attribute-table selector, and calling .text on that None is what fails.
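If you want the script to keep running when an element is missing from the initial HTML, a small guard helps. This is only a sketch: safe_text is a hypothetical helper name (not part of requests_html), and it assumes the object you pass in behaves like r.html, that is, find(selector, first=True) returns None when nothing matches.

```python
def safe_text(html, selector):
    """Return the text of the first element matching selector, or None if absent.

    `html` is assumed to behave like requests_html's `r.html`:
    `find(selector, first=True)` yields a single element or None.
    """
    element = html.find(selector, first=True)
    # Only touch `.text` when an element was actually found.
    return element.text if element is not None else None
```

With this, safe_text(r.html, 'div.rd-attribute-table') would give you None for the JavaScript-rendered table instead of raising an AttributeError.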
However, the data that the table contains is on the page, inside a <script> tag near the bottom:
<script> window.__NUXT__ = (function(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, _, $, aa, ab, ac, ad, ae, af, ag, ah, ai, aj, ak, al, am, an, ao, ap, aq, ar, as, at, au, av, aw, ax, ay, az, aA, aB, aC, aD, aE, aF, aG, aH, aI, aJ, aK, aL, aM, aN, aO, aP, aQ, aR, aS, aT, aU, aV, aW, aX, aY, aZ, a_, a$, ba, bb, bc, bd, be, bf, bg, bh, bi, bj, bk, bl, bm, bn, bo, bp, bq, br, bs, bt, bu, bv, bw, bx, by, bz, bA, bB, bC, bD, bE, bF, bG, bH, bI, bJ, bK, bL, bM, bN, bO, bP, bQ, bR, bS, bT, bU, bV, bW, bX, bY, bZ, b_, b$, ca, cb, cc, cd, ce, cf, cg, ch, ci, cj, ck, cl, cm, cn, co, cp, cq, cr, cs, ct, cu, cv, cw, cx, cy, cz, cA, cB, cC, cD, cE, cF, cG, cH, cI, cJ, cK, cL, cM, cN, cO, cP, cQ, cR, cS, cT, cU, cV, cW, cX, cY, cZ, c_, c$, da, db, dc, dd, de, df, dg, dh, di, dj, dk, dl, dm, dn, do0, dp, dq, dr, ds, dt, du, dv, dw, dx, dy, dz, dA, dB, dC, dD, dE, dF, dG, dH, dI, dJ, dK, dL, dM, dN, dO, dP, dQ, dR, dS, dT, dU, dV, dW, dX, dY, dZ, d_, d$, ea, eb, ec, ed, ee, ef, eg, eh, ei, ej, ek, el, em, en, eo, ep, eq, er, es, et, eu, ev, ew, ex, ey, ez, eA, eB, eC, eD, eE, eF, eG, eH, eI, eJ, eK, eL, eM, eN, eO, eP, eQ, eR, eS, eT, eU, eV, eW, eX, eY, eZ, e_, e$, fa, fb, fc, fd, fe, ff, fg, fh, fi, fj, fk, fl, fm, fn, fo, fp, fq, fr, fs, ft, fu, fv, fw, fx, fy, fz, fA, fB, fC, fD, fE, fF, fG, fH, fI, fJ, fK, fL, fM, fN, fO, fP, fQ, fR, fS, fT, fU, fV, fW, fX, fY, fZ, f_, f$, ga, gb, gc, gd, ge, gf, gg, gh, gi, gj, gk, gl, gm, gn, go, gp, gq, gr, gs, gt, gu, gv, gw, gx, gy, gz, gA, gB, gC, gD, gE, gF, gG, gH, gI, gJ, gK, gL, gM, gN, gO, gP, gQ, gR, gS, gT, gU, gV, gW, gX, gY, gZ, g_, g$, ha, hb, hc, hd, he, hf, hg, hh, hi, hj, hk, hl, hm) {
return {
layout: cG,
data: [{}],
fetch: {},
...
},
description$: {
descriptionHtml: "\u003Cp\u003E\u003Cb\u003EIm System sind folgende komponenten verbaut:\u003C\u002Fb\u003E\u003C\u002Fp\u003E\u003Cul\u003E\u003Cli\u003E\u003Cb\u003EGehäuse:\u003C\u002Fb\u003E Systemtreff Mini Tower Nero ST-401\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EProzessor: \u003C\u002Fb\u003EIntel Core i5-10400F 6 x 2.9 GHz (bei Bedarf bis zu 4.3 GHz Turbotakt durch Intel Turbo-Boost Technik)\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EArbeitsspeicher:\u003C\u002Fb\u003E 16 GB DDR4 2666 MHz \u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EMainboard:\u003C\u002Fb\u003E Gigabyte H510M H, Intel Sockel 1200 (1 x PCIe 4.0\u002F3.0 x16 (x16 mode), 1 x PCIe 3.0 x1, 1 x PS\u002F2 keyboard \u002F PS\u002F2 mouse, 1 x VGA 1 x HDMI, 1 x LAN (RJ45), 2 x USB 3.2, 4 x USB 2.0, 1 x M.2 (Key M), 4xSATA) - max. 64 GB DDR4 - 3200 MHz\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ENetzwerk:\u003C\u002Fb\u003E 1 x Gigabit LAN Controller(s)\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ESound:\u003C\u002Fb\u003E Realtek® ALC887 8-Channel High Definition Audio CODEC\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EFestplatte:\u003C\u002Fb\u003E 512GB M.2 SSD SATA III\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EGrafik:\u003C\u002Fb\u003E NVIDIA GeForce GT 730 mit 2048 MB \u002F 2GB RAM \u003Cul\u003E\u003Cli\u003ETechnik: ( GDDR3 \u002F DirectX 11 \u002F PCI Express 2.0 \u002F ) \u003C\u002Fli\u003E \u003Cli\u003EGeeignet für Heimvideos - Blu-ray FULL HD - Videobearbeitung \u002F World of Warcraft, Spore oder Sims3, sowie die Anschlussmöglichkeiten von bis zu 2 Monitore\u003C\u002Fli\u003E\u003C\u002Ful\u003E\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ENetzteil:\u003C\u002Fb\u003E 400-500Watt Marken Netzteil\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ELaufwerk:\u003C\u002Fb\u003E Kein Laufwerk 
verbaut\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EBetriebssystem:\u003C\u002Fb\u003E Windows 10 Pro\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ESKU:\u003C\u002Fb\u003E 20192420\u003C\u002Fli\u003E\u003Cli\u003EMarkennamen - Markenlogos sind registrierte Handelsmarken, deren Nutzung hier nur zur Produktbeschreibung eingesetzt werden - das Eigentumsrecht liegt beim jeweiligen Markeninhaber.\u003C\u002Fli\u003E\u003C\u002Ful\u003E",
attributes: {
default: [{
name: "Hersteller",
id: "manufacturer",
values: [{
text: "SYSTEMTREFF",
link: "\u002Fmanufacturer\u002F1428338\u002F",
isMasked: a
}],
isCategoryRelevant: d,
isDefaultRelevant: d
}, {
name: "Betriebssystem",
id: "operating_system",
values: [{
text: "Windows 10 Pro",
link: "\u002Fcategory\u002F39251\u002Fref-381=1388287\u002F",
isMasked: a
}],
isCategoryRelevant: d,
isDefaultRelevant: a
}, {
name: cJ,
id: cK,
values: [{
text: cL,
link: cM,
isMasked: a
}],
isCategoryRelevant: a,
isDefaultRelevant: a
}, {
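Note that string values such as descriptionHtml are stored with \uXXXX escapes (\u003C for <, \u002F for /). As a rough sketch, you could decode one of these literals by treating it as a JSON string. decode_js_string is a hypothetical helper name, and this assumes the literal only uses escapes that JSON also understands, which seems to hold for the excerpt above but is not guaranteed for every JavaScript string.

```python
import json


def decode_js_string(literal_body):
    """Decode the body of a double-quoted JS string literal (without the quotes).

    Assumption: the literal only contains JSON-compatible escapes such as
    \\uXXXX. strict=False lets raw line breaks inside the string through.
    """
    return json.loads('"' + literal_body + '"', strict=False)


if __name__ == '__main__':
    raw = r"\u003Cb\u003ESKU:\u003C\u002Fb\u003E 20192420"
    print(decode_js_string(raw))
```

The decoded result is ordinary HTML, so you can hand it to BeautifulSoup or requests_html to strip the tags.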
I hope you'll forgive my use of BeautifulSoup in this example, as I'm more familiar with it than with requests_html. Here's how you might fetch the <script> tag content:
import requests
from bs4 import BeautifulSoup

def get_products(link):
    r = requests.get(link)
    html = r.text
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1').text.strip()
    price = soup.find('div', {'class': 'rd-buybox__price'}).text.strip().replace(' €', '').replace(',', '.')
    descriptiontable = extract_description(soup)
    print(title, price, descriptiontable)

def extract_description(soup):
    # 3rd script tag; the index depends on the page layout and may change
    product_data = soup.find_all('script')[2]
    # Keep only the object literal that the NUXT function returns
    product_data = str(product_data).partition('return {')[-1]
    product_data = '{' + product_data.split('}(')[0] + '}'
    # You'll need to parse this content here to find the bits you need
    return product_data

if __name__ == '__main__':
    link = "https://www.kaufland.de/product/358005366/"
    get_products(link)
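For the parsing step, one possible starting point is a regular expression over the script text. This is a sketch under assumptions, not a complete parser: extract_literal_attributes and ATTRIBUTE_PATTERN are hypothetical names, and the pattern assumes each attribute entry is laid out as name, id, values, text in that order, as in the excerpt above. It only picks up entries whose name and text are written as string literals. Entries that use minified variable references (name: cJ, text: cL) are skipped, because resolving those would mean matching the function's parameters to the arguments it is called with.

```python
import re

# A double-quoted JS string body, allowing backslash escapes inside it.
_STRING = r'"((?:[^"\\]|\\.)*)"'

# Assumed layout per attribute: name, id, values: [{ text: ... (as in the excerpt).
ATTRIBUTE_PATTERN = re.compile(
    r'name:\s*' + _STRING + r',\s*'
    r'id:\s*' + _STRING + r',\s*'
    r'values:\s*\[\{\s*text:\s*' + _STRING
)


def extract_literal_attributes(script_text):
    """Map attribute name to its first value text, for literal entries only."""
    return {
        match.group(1): match.group(3)
        for match in ATTRIBUTE_PATTERN.finditer(script_text)
    }


if __name__ == '__main__':
    sample = '''
    name: "Hersteller", id: "manufacturer", values: [{ text: "SYSTEMTREFF",
    link: "x", isMasked: a }], isCategoryRelevant: d }, {
    name: cJ, id: cK, values: [{ text: cL, link: cM, isMasked: a }]
    '''
    print(extract_literal_attributes(sample))
```

You could call extract_literal_attributes(product_data) on the string returned by extract_description. If you need the variable-backed entries as well, running the script in a JavaScript engine is likely to be more reliable than extending the regex.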