I’m practicing with BeautifulSoup, and HTTP requests in general, for the first time. The goal of the programme is to load a webpage and its HTML, then search through the page (in this case a recipe) to get a substring containing its ingredients. I’ve managed to get it working with the following code:
    import requests

    url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
    result = requests.get(url)
    myHTML = result.text

    # Slice the raw HTML between the two marker strings
    index1 = myHTML.find("recipeIngredient")
    index2 = myHTML.find("recipeInstructions")
    ingredients = myHTML[index1:index2]
But when I try to use BeautifulSoup here, it doesn’t work:
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
    result = requests.get(url)
    doc = BeautifulSoup(result.text, "html.parser")
    ingredients = doc.find(text="recipeIngredient")
    print(ingredients)
I understand that the code above (even if I could get it working) would produce a different output of just ["recipeIngredient"], but that’s all I’m focused on for now whilst I get to grips with BS. Instead, the code above just outputs None. I printed "doc" to the terminal and it only showed what appears to be the second half of the HTML (or at least not all of it), whereas the text file does contain all of the HTML. I assume that’s where the problem lies, but I’m not sure how to fix it.
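find(text="recipeIngredient") only returns a text node whose entire content is exactly that string, so an occurrence inside a larger block of text does not match and you get None. Rather than searching for that string, select the ingredients list by its class and read its list items. You need to use: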
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"

    # Parse the page and narrow down to the ingredients container
    doc = (
        BeautifulSoup(requests.get(url).text, "html.parser")
        .find(class_="recipe__ingredients")
    )

    # Each ingredient is an <li>; join their text, one per line
    ingredients = "\n".join(
        ingredient.get_text() for ingredient in doc.find_all("li")
    )
    print(ingredients)
Output:

    1 large onion , chopped
    4 large garlic cloves
    thumb-sized piece of ginger
    2 tbsp rapeseed oil
    4 small skinless chicken breasts, cut into chunks
    2 tbsp tikka spice powder
    1 tsp cayenne pepper
    400g can chopped tomatoes
    40g ground almonds
    200g spinach
    3 tbsp fat-free natural yogurt
    ½ small bunch of coriander , chopped
    brown basmati rice , to serve
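If it helps to see why the original find call came back empty, here is a small offline sketch. The HTML snippet is a made-up miniature of a recipe page (the real site's markup may well differ), and I'm assuming the string "recipeIngredient" lives inside an embedded JSON block, which would explain why the plain string search found it. It contrasts an exact string match with a regex (substring) match, then pulls the list items the same way as above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature page for illustration only; not the real site's markup.
SAMPLE_HTML = """
<html><body>
<script type="application/ld+json">{"recipeIngredient": ["1 large onion", "2 tbsp rapeseed oil"]}</script>
<section class="recipe__ingredients">
  <ul>
    <li>1 large onion</li>
    <li>2 tbsp rapeseed oil</li>
  </ul>
</section>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A plain string must equal a whole text node, so nothing matches here.
exact = soup.find(string="recipeIngredient")

# A compiled regex is searched within each text node, so the script body matches.
partial = soup.find(string=re.compile("recipeIngredient"))

# Selecting by class and reading the <li> elements gives the clean list.
container = soup.find(class_="recipe__ingredients")
items = [li.get_text(strip=True) for li in container.find_all("li")]

print(exact)
print(partial is not None)
print(items)
```

Note that string= is the newer name for the text= argument used in the question; both behave the same way for this purpose.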