Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Extracting data from child element using Beautifulsoup

I am using the following script to extract data from real estate website:

for publish in soup.find('ul', {'class': 'list-view real-estates'}).find_all('span', {'class': 're-offer-type'}):
    publish_value = publish.get_text().strip()
    publisher.append(publish_value)

I need to extract the agency that is offering the property (look at image):

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

The using the above code extract the data very well, the problem is that another has a class = re-offer-type. Please look at image below:

This is a link to the website page:
https://www.imoti.net/bg/obiavi/r/prodava/bulgaria/?page=1&sid=iXMpXe

To the problem.
Look at image:

I need the text in red, but I somethimes, I will stress on that somethimes I get the text in purple. Therefore I need to me more speicific, however I don’t know how to specify the second span tag and why I sometimes get the value that I want and sometimes I get the purple value.

Any suggestions?
This is the result I get based on code above:

['продава', 'ЦКБ АД', 'продава', 'ЦКБ АД', 'продава', 'ЦКБ АД', 'продава', 'частно лице', 'продава', 'ЦКБ АД', 'продава', 'Константинов Реал Естейт', 'продава', 'Вариант', 'продава', 'Luximmo Finest Estates', 'продава', 'Тийм Визия', 'продава', 'частно лице', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'частно лице', 'продава', 'Dekris', 'продава', 'частно лице', 'продава', 'ЦКБ АД', 'продава', 'ЦКБ АД', 'продава', 'Нерро недвижими имоти', 'продава', 'Имот Експрес 99', 'продава', 'АВАНГАРД НЕДВИЖИМИ ИМОТИ', 'продава', 'Premier Estates', 'продава', 'частно лице', 'продава', 'Premier Estates', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'АВАНГАРД НЕДВИЖИМИ ИМОТИ', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'BULGARIAN PROPERTIES', 'продава', 'Нерро недвижими имоти', 'продава', 'АВАНГАРД НЕДВИЖИМИ ИМОТИ', 'продава', 'BULGARIAN PROPERTIES']

The word ‘продава’ means selling and should not be present. As you can see there are a lot of ‘продава’ strings, but not all of them are wrong which is very strange.

Full code:

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import re
import numpy as np

s = HTMLSession()
url = 'https://www.imoti.net/bg/obiavi/r/prodava/bulgaria/?page=1&sid=iXMpXe'

r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

prices = []
type_of_property = []
sqm_area = []
locations =[]
publisher = []

def get_prices(urls):
    for price in soup.find('ul', {'class': 'list-view real-estates'}).find_all('strong', {'class': 'price'}):
        price_text = price.get_text()
        price_arr = re.findall('[0-9]+', price_text)
        final_price = ''
        for each_sub_price in price_arr:
            final_price += each_sub_price
        prices.append(final_price)
    for property_type in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
        property_type_value = ' '.join(property_type.get_text().split(',')[0].split()[1:3])
        type_of_property.append(property_type_value)
    for sqm in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
        sqm_value = sqm.get_text().split(',')[1].split()[0]
        sqm_area.append(sqm_value)
    for location in soup.find('ul', {'class': 'list-view real-estates'}).find_all('div', {'class': 'inline-group'}):
        location_value = location.get_text().split(',')[-1].strip()
        locations.append(location_value)
    for publish in soup.find('ul', {'class': 'list-view real-estates'}).find_all('span', {'class': 're-offer-type'}):
        publish_value = publish.get_text().strip()
        publisher.append(publish_value)
    return prices, type_of_property, sqm_area, locations, publisher

print(get_prices(url))

>Solution :

Since you need to get only every second span tag, you can use slice notation to get a list with every second element ([1::2], start from the second and go to next one in steps of two) so in your code it could look something like this (I also moved part of the finder to a separate line so that the line is more readable and not that long):

real_estates = soup.find('ul', {'class': 'list-view real-estates'})
for publish in real_estates.find_all('span', {'class': 're-offer-type'})[1::2]:
    publish_value = publish.get_text().strip()
    publisher.append(publish_value)

Also seemingly you can just place the real_estates somewhere before the loop and then instead of rewriting the soup.find('ul'...) just use real_estates as shown below:

def get_prices(urls):
    real_estates = soup.find('ul', {'class': 'list-view real-estates'})
    for price in real_estates.find_all('strong', {'class': 'price'}):
        ...
    for property_type in real_estates.find_all('div', {'class': 'inline-group'}):
        ...
    ...

Useful:

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading