Webscraping – Python – "Nonetype Object has no attribute text"

I'm scraping a product page with the following script:

from requests_html import HTMLSession
import re



s = HTMLSession()

link = "https://www.kaufland.de/product/358005366/"
def get_products(link):
    r = s.get(link)
    title = r.html.find('h1', first=True).text
    price = r.html.find('div.rd-buybox__price', first=True).text.replace(' €', '').replace(',', '.')
    descriptiontable = r.html.find('div.rd-attribute-table', first=True).text
    print(title, price, descriptiontable)
get_products(link)

The area I'm trying to scrape (containing the producer, EAN, et cetera) doesn't seem to be scrapable, unlike the price and title. What am I doing wrong?

Solution:

It looks like the product details table you're after is populated by JavaScript after the page loads, so it isn't in the HTML retrieved by r = s.get(link). As explained in rayt's answer, this is why find(..., first=True) returns None for that selector, and calling .text on None raises the error in the title.
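To make that failure mode explicit rather than letting it crash, a small guard can be wrapped around each lookup. This is only a sketch: text_or_default is a hypothetical helper name, and SimpleNamespace stands in here for the element that r.html.find(..., first=True) would return, so the example runs without a network request.

```python
from types import SimpleNamespace


def text_or_default(element, default=None):
    """Return element.text, or `default` when the selector matched nothing."""
    if element is None:
        return default
    return element.text


# Stand-ins for what r.html.find(selector, first=True) can return:
found = SimpleNamespace(text="Some product title")  # selector matched
missing = None                                      # selector matched nothing

print(text_or_default(found))
print(text_or_default(missing, default="n/a"))
```

In the original script this would be used as text_or_default(r.html.find('div.rd-attribute-table', first=True)), which yields None for the missing table instead of raising AttributeError.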

However, the data that the table contains is on the page, inside a <script> tag near the bottom:

<script> window.__NUXT__ = (function(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, _, $, aa, ab, ac, ad, ae, af, ag, ah, ai, aj, ak, al, am, an, ao, ap, aq, ar, as, at, au, av, aw, ax, ay, az, aA, aB, aC, aD, aE, aF, aG, aH, aI, aJ, aK, aL, aM, aN, aO, aP, aQ, aR, aS, aT, aU, aV, aW, aX, aY, aZ, a_, a$, ba, bb, bc, bd, be, bf, bg, bh, bi, bj, bk, bl, bm, bn, bo, bp, bq, br, bs, bt, bu, bv, bw, bx, by, bz, bA, bB, bC, bD, bE, bF, bG, bH, bI, bJ, bK, bL, bM, bN, bO, bP, bQ, bR, bS, bT, bU, bV, bW, bX, bY, bZ, b_, b$, ca, cb, cc, cd, ce, cf, cg, ch, ci, cj, ck, cl, cm, cn, co, cp, cq, cr, cs, ct, cu, cv, cw, cx, cy, cz, cA, cB, cC, cD, cE, cF, cG, cH, cI, cJ, cK, cL, cM, cN, cO, cP, cQ, cR, cS, cT, cU, cV, cW, cX, cY, cZ, c_, c$, da, db, dc, dd, de, df, dg, dh, di, dj, dk, dl, dm, dn, do0, dp, dq, dr, ds, dt, du, dv, dw, dx, dy, dz, dA, dB, dC, dD, dE, dF, dG, dH, dI, dJ, dK, dL, dM, dN, dO, dP, dQ, dR, dS, dT, dU, dV, dW, dX, dY, dZ, d_, d$, ea, eb, ec, ed, ee, ef, eg, eh, ei, ej, ek, el, em, en, eo, ep, eq, er, es, et, eu, ev, ew, ex, ey, ez, eA, eB, eC, eD, eE, eF, eG, eH, eI, eJ, eK, eL, eM, eN, eO, eP, eQ, eR, eS, eT, eU, eV, eW, eX, eY, eZ, e_, e$, fa, fb, fc, fd, fe, ff, fg, fh, fi, fj, fk, fl, fm, fn, fo, fp, fq, fr, fs, ft, fu, fv, fw, fx, fy, fz, fA, fB, fC, fD, fE, fF, fG, fH, fI, fJ, fK, fL, fM, fN, fO, fP, fQ, fR, fS, fT, fU, fV, fW, fX, fY, fZ, f_, f$, ga, gb, gc, gd, ge, gf, gg, gh, gi, gj, gk, gl, gm, gn, go, gp, gq, gr, gs, gt, gu, gv, gw, gx, gy, gz, gA, gB, gC, gD, gE, gF, gG, gH, gI, gJ, gK, gL, gM, gN, gO, gP, gQ, gR, gS, gT, gU, gV, gW, gX, gY, gZ, g_, g$, ha, hb, hc, hd, he, hf, hg, hh, hi, hj, hk, hl, hm) {
    return {
        layout: cG,
        data: [{}],
        fetch: {},

            ...

                },
                description$: {
                    descriptionHtml: "\u003Cp\u003E\u003Cb\u003EIm System sind folgende komponenten verbaut:\u003C\u002Fb\u003E\u003C\u002Fp\u003E\u003Cul\u003E\u003Cli\u003E\u003Cb\u003EGehäuse:\u003C\u002Fb\u003E Systemtreff Mini Tower Nero ST-401\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EProzessor: \u003C\u002Fb\u003EIntel Core i5-10400F 6 x 2.9 GHz (bei Bedarf bis zu 4.3 GHz Turbotakt durch Intel Turbo-Boost Technik)\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EArbeitsspeicher:\u003C\u002Fb\u003E 16 GB DDR4 2666 MHz \u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EMainboard:\u003C\u002Fb\u003E Gigabyte H510M H, Intel Sockel 1200 (1 x PCIe 4.0\u002F3.0 x16 (x16 mode), 1 x PCIe 3.0 x1, 1 x PS\u002F2 keyboard \u002F PS\u002F2 mouse, 1 x VGA 1 x HDMI,  1 x LAN (RJ45), 2 x USB 3.2, 4 x USB 2.0, 1 x M.2 (Key M), 4xSATA) - max. 64 GB DDR4 - 3200 MHz\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ENetzwerk:\u003C\u002Fb\u003E 1 x Gigabit LAN Controller(s)\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ESound:\u003C\u002Fb\u003E Realtek® ALC887 8-Channel High Definition Audio CODEC\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EFestplatte:\u003C\u002Fb\u003E 512GB M.2 SSD SATA III\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EGrafik:\u003C\u002Fb\u003E NVIDIA GeForce GT 730 mit 2048 MB \u002F 2GB RAM \u003Cul\u003E\u003Cli\u003ETechnik: ( GDDR3 \u002F DirectX 11 \u002F PCI Express 2.0 \u002F ) \u003C\u002Fli\u003E \u003Cli\u003EGeeignet für Heimvideos - Blu-ray FULL HD - Videobearbeitung \u002F World of Warcraft, Spore oder Sims3, sowie die Anschlussmöglichkeiten von bis zu 2 Monitore\u003C\u002Fli\u003E\u003C\u002Ful\u003E\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ENetzteil:\u003C\u002Fb\u003E 400-500Watt Marken Netzteil\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ELaufwerk:\u003C\u002Fb\u003E Kein Laufwerk 
verbaut\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003EBetriebssystem:\u003C\u002Fb\u003E Windows 10 Pro\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cbr\u003E\u003C\u002Fli\u003E\u003Cli\u003E\u003Cb\u003ESKU:\u003C\u002Fb\u003E 20192420\u003C\u002Fli\u003E\u003Cli\u003EMarkennamen -  Markenlogos sind registrierte Handelsmarken, deren Nutzung hier nur zur Produktbeschreibung eingesetzt werden - das Eigentumsrecht liegt beim jeweiligen Markeninhaber.\u003C\u002Fli\u003E\u003C\u002Ful\u003E",
                    attributes: {
                        default: [{
                            name: "Hersteller",
                            id: "manufacturer",
                            values: [{
                                text: "SYSTEMTREFF",
                                link: "\u002Fmanufacturer\u002F1428338\u002F",
                                isMasked: a
                            }],
                            isCategoryRelevant: d,
                            isDefaultRelevant: d
                        }, {
                            name: "Betriebssystem",
                            id: "operating_system",
                            values: [{
                                text: "Windows 10 Pro",
                                link: "\u002Fcategory\u002F39251\u002Fref-381=1388287\u002F",
                                isMasked: a
                            }],
                            isCategoryRelevant: d,
                            isDefaultRelevant: a
                        }, {
                            name: cJ,
                            id: cK,
                            values: [{
                                text: cL,
                                link: cM,
                                isMasked: a
                            }],
                            isCategoryRelevant: a,
                            isDefaultRelevant: a
                        }, {

I hope you'll forgive my use of BeautifulSoup in this example; I'm more familiar with it than requests_html. Here's how you might fetch the <script> tag content:

import requests
from bs4 import BeautifulSoup

def get_products(link):
    r = requests.get(link)
    html = r.text
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('h1').text.strip()
    price = soup.find('div', {'class':'rd-buybox__price'}).text.strip().replace(' €', '').replace(',', '.')
    descriptiontable = extract_description(soup)
    print(title, price, descriptiontable)

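def extract_description(soup):
    # Look for the <script> tag that holds window.__NUXT__ instead of relying on its position
    scripts = [s for s in soup.find_all('script') if 'window.__NUXT__' in (s.string or '')]
    if not scripts:
        return None  # the page layout changed or the request was blocked
    product_data = scripts[0].string.partition('return {')[-1]
    product_data = '{' + product_data.split('}(')[0] + '}'
    # product_data is a JavaScript object literal, not JSON; you'll still need to parse it to find the bits you need
    return product_data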


if __name__ == '__main__':
    link = "https://www.kaufland.de/product/358005366/"
    get_products(link)
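
For the parsing step left open above, one possible starting point is a regular expression over the script text. This is a rough sketch, not a full parser: extract_literal_attributes is a hypothetical name, the sample is abbreviated from the snippet shown earlier, and it assumes the string literals use only escapes that are also valid in JSON (such as \u002F), so json.loads can decode them.

```python
import json
import re

# Matches entries whose name and first value are plain string literals, e.g.
#   name: "Hersteller", id: "manufacturer", values: [{ text: "SYSTEMTREFF",
STRING = r'"(?:[^"\\]|\\.)*"'
ATTRIBUTE_RE = re.compile(
    r'name:\s*(' + STRING + r'),\s*'
    r'id:\s*' + STRING + r',\s*'
    r'values:\s*\[\{\s*text:\s*(' + STRING + r')'
)


def extract_literal_attributes(script_text):
    """Return {name: text} for attributes written as string literals."""
    attributes = {}
    for name, text in ATTRIBUTE_RE.findall(script_text):
        # json.loads strips the quotes and decodes \uXXXX escapes
        attributes[json.loads(name)] = json.loads(text)
    return attributes


# Abbreviated sample based on the <script> content shown above
sample = r'''
default: [{
    name: "Hersteller",
    id: "manufacturer",
    values: [{
        text: "SYSTEMTREFF",
        link: "\u002Fmanufacturer\u002F1428338\u002F",
        isMasked: a
    }]
}, {
    name: "Betriebssystem",
    id: "operating_system",
    values: [{
        text: "Windows 10 Pro",
        isMasked: a
    }]
}, {
    name: cJ,
    id: cK,
    values: [{
        text: cL,
        isMasked: a
    }]
}]
'''

print(extract_literal_attributes(sample))
```

Entries such as name: cJ are skipped, because cJ is a parameter of the wrapping function and its value only appears in the argument list at the end of the script. To recover those as well, you would probably need to evaluate the script with a JavaScript engine or render the page (requests_html offers r.html.render() for that).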