Beautiful Soup parser limits

I’m trying to scrape the links to the 400 models listed on this website: https://www.printables.com/model?category=14&fileType=fff&includeUserGcodes=1, which I refer to as webpage in my code below. However, when I run my code, I get no links.

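import requests
from bs4 import BeautifulSoup

webpage = 'https://www.printables.com/model?category=14&fileType=fff&includeUserGcodes=1'
User_agent = {'User-agent': 'Mozilla/5.0 (X11; CrOS i686 4319.74.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36'}

r = requests.get(webpage, headers = User_agent).text
soup = BeautifulSoup(r,'html5lib')

# href=True skips anchors without an href attribute, which would raise KeyError
for link in soup.find_all('a', href=True):
    print(link['href'])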

So I checked whether the links are present at all with print(soup.prettify()), and none of the desired links appear in the HTML either. This led me to assume that the website doesn’t allow scraping, but the response’s status_code is 200, so the request itself succeeds.

Is there a different approach I could take? Where else would these links be stored? Thank you.

Solution:

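The data is loaded from an external URL via JavaScript, so BeautifulSoup doesn’t see it: requests downloads only the initial HTML and never runs the page’s scripts. To get information about all items, you can query the site’s GraphQL endpoint directly, as in the following example: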

import json
import requests


url = "https://www.printables.com/graphql/"

payload = {
    "operationName": "PrintList",
    "query": "query PrintList($limit: Int!, $cursor: String, $categoryId: ID, $materialIds: [Int], $userId: ID, $printerIds: [Int], $licenses: [ID], $ordering: String, $hasModel: Boolean, $filesType: [FilterPrintFilesTypeEnum], $includeUserGcodes: Boolean, $nozzleDiameters: [Float], $weight: IntervalObject, $printDuration: IntervalObject, $publishedDateLimitDays: Int, $featured: Boolean, $featuredNow: Boolean, $usedMaterial: IntervalObject, $hasMake: Boolean, $competitionAwarded: Boolean, $onlyFollowing: Boolean, $collectedByMe: Boolean, $madeByMe: Boolean, $likedByMe: Boolean) {\n  morePrints(\n    limit: $limit\n    cursor: $cursor\n    categoryId: $categoryId\n    materialIds: $materialIds\n    printerIds: $printerIds\n    licenses: $licenses\n    userId: $userId\n    ordering: $ordering\n    hasModel: $hasModel\n    filesType: $filesType\n    nozzleDiameters: $nozzleDiameters\n    includeUserGcodes: $includeUserGcodes\n    weight: $weight\n    printDuration: $printDuration\n    publishedDateLimitDays: $publishedDateLimitDays\n    featured: $featured\n    featuredNow: $featuredNow\n    usedMaterial: $usedMaterial\n    hasMake: $hasMake\n    onlyFollowing: $onlyFollowing\n    competitionAwarded: $competitionAwarded\n    collectedByMe: $collectedByMe\n    madeByMe: $madeByMe\n    liked: $likedByMe\n  ) {\n    cursor\n    items {\n      ...PrintListFragment\n      printer {\n        id\n        __typename\n      }\n      user {\n        rating\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n\nfragment PrintListFragment on PrintType {\n  id\n  name\n  slug\n  ratingAvg\n  ratingCount\n  likesCount\n  liked\n  datePublished\n  dateFeatured\n  firstPublish\n  downloadCount\n  displayCount\n  inMyCollections\n  foundInUserGcodes\n  userGcodeCount\n  userGcodesCount\n  materials {\n    id\n    __typename\n  }\n  category {\n    id\n    path {\n      id\n      name\n      __typename\n    }\n    __typename\n  }\n  modified\n  images {\n    ...ImageSimpleFragment\n    __typename\n  }\n  filesType\n  hasModel\n  user {\n    ...AvatarUserFragment\n    __typename\n  }\n  ...LatestCompetitionResult\n  __typename\n}\n\nfragment AvatarUserFragment on UserType {\n  id\n  publicUsername\n  avatarFilePath\n  slug\n  badgesProfileLevel {\n    profileLevel\n    __typename\n  }\n  __typename\n}\n\nfragment LatestCompetitionResult on PrintType {\n  latestCompetitionResult {\n    placement\n    competitionId\n    __typename\n  }\n  __typename\n}\n\nfragment ImageSimpleFragment on PrintImageType {\n  id\n  filePath\n  rotation\n  __typename\n}\n",
    "variables": {
        "categoryId": "14",
        "collectedByMe": False,
        "competitionAwarded": False,
        "cursor": "",
        "featured": False,
        "filesType": ["GCODE"],
        "hasMake": False,
        "includeUserGcodes": True,
        "likedByMe": False,
        "limit": 36,
        "madeByMe": False,
        "materialIds": None,
        "nozzleDiameters": None,
        "ordering": "-first_publish",
        "printDuration": None,
        "printerIds": None,
        "publishedDateLimitDays": None,
        "weight": None,
    },
}

cnt = 0
while True:
    # each POST returns one page of items plus a cursor pointing at the next page
    data = requests.post(url, json=payload).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for i in data["data"]["morePrints"]["items"]:
        cnt += 1
        print(
            cnt,
            i["name"],
            "https://www.printables.com/model/{}-{}".format(i["id"], i["slug"]),
        )

    # an empty cursor means there are no more pages
    if not data["data"]["morePrints"]["cursor"]:
        break

    # request the next page on the following iteration
    payload["variables"]["cursor"] = data["data"]["morePrints"]["cursor"]

Prints:

1 White Spiral Vase https://www.printables.com/model/189114-white-spiral-vase
2 Calibrating Before Battle - 3DPN Mr. Print-It - Superhero Remix https://www.printables.com/model/188733-calibrating-before-battle-3dpn-mr-print-it-superhe
3 twitter 3d bird https://www.printables.com/model/187083-twitter-3d-bird
4 Welcome To Rapture plaque https://www.printables.com/model/186669-welcome-to-rapture-plaque

...
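
If you want the links in a list rather than printed, the same cursor loop can be wrapped in a small helper. The following is only a sketch: model_url, iter_items, collect_links and the fetch_page callable are hypothetical names that do not come from the site or the answer, and the code assumes the response keeps the data / morePrints / items / cursor shape shown above. Passing the fetch function in as an argument lets you try the pagination logic on canned responses before sending real requests.

```python
from typing import Callable, Dict, Iterator, List, Optional


def model_url(item: Dict) -> str:
    """Build a model page URL from an item's id and slug (same format as the loop above)."""
    return "https://www.printables.com/model/{}-{}".format(item["id"], item["slug"])


def iter_items(
    fetch_page: Callable[[str], Dict],
    max_pages: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield items page by page until the cursor comes back empty.

    fetch_page(cursor) is a hypothetical callable that returns the decoded
    JSON response for that cursor. max_pages is an optional safety limit.
    """
    cursor = ""
    pages = 0
    while max_pages is None or pages < max_pages:
        more = fetch_page(cursor)["data"]["morePrints"]
        yield from more["items"]
        pages += 1
        cursor = more["cursor"]
        if not cursor:
            # an empty or missing cursor marks the last page
            break


def collect_links(
    fetch_page: Callable[[str], Dict],
    max_pages: Optional[int] = None,
) -> List[str]:
    """Return the model URLs of every item produced by iter_items."""
    return [model_url(item) for item in iter_items(fetch_page, max_pages)]
```

To use it with the code above, fetch_page would set payload["variables"]["cursor"] to the cursor it receives, send requests.post(url, json=payload), and return the result of .json(). Calling raise_for_status() on the response before decoding is a reasonable extra guard against error pages.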
