Conditional arguments to extract data from HTML

I have some HTML from which I'm trying to extract specific information. However, it has repeating elements, and I have an idea of how to account for them. I'm trying to implement conditional logic that goes as follows:

  1. Extract the player name from the first href tag.
  2. Search for the next tag with the class flaggenrahmen and extract the data in alt.
  3. If flaggenrahmen repeats, skip it.
  4. Repeat these steps for the next player.

What I have tried:

from collections import defaultdict
from bs4 import BeautifulSoup

player_dict = defaultdict(list)
soup = BeautifulSoup(html)
player_id = soup.select('*[href]')
nation = soup.select('.flaggenrahmen')
for l,k in zip(player_id, nation):
    player_dict[l.get_text(strip=True)].append(k['alt'])

However, I cannot get it to skip when flaggenrahmen repeats, and therefore I get more than one country per player.
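
To make the mismatch concrete, here is a minimal sketch on a trimmed-down table. The markup, "Player A", "Player B" and the club names are hypothetical placeholders with the same shape as the real rows, and html.parser is used only to avoid an extra dependency. It shows that the two selectors return lists of different lengths, so zip pairs items that do not belong to the same row.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup shaped like the real rows: each row has a
# portrait link, a profile link, one or two flags, and a club link.
sample = '''
<table>
<tr>
  <td><a href="#"><img alt="Player A"/></a></td>
  <td><a class="spielprofil_tooltip" href="/a">Player A</a></td>
  <td><img alt="Morocco" class="flaggenrahmen"/><br/><img alt="Spain" class="flaggenrahmen"/></td>
  <td><a class="vereinprofil_tooltip" href="/club-a"><img alt="Club A"/></a></td>
</tr>
<tr>
  <td><a href="#"><img alt="Player B"/></a></td>
  <td><a class="spielprofil_tooltip" href="/b">Player B</a></td>
  <td><img alt="England" class="flaggenrahmen"/></td>
  <td><a class="vereinprofil_tooltip" href="/club-b"><img alt="Club B"/></a></td>
</tr>
</table>
'''

soup = BeautifulSoup(sample, 'html.parser')
links = soup.select('*[href]')          # every element with an href, not just player links
flags = soup.select('.flaggenrahmen')   # every flag, including second nationalities

print(len(links), len(flags))  # 6 3

# zip pairs by position only, so names and flags from different rows get mixed.
pairs = [(a.get_text(strip=True), f['alt']) for a, f in zip(links, flags)]
print(pairs)  # [('', 'Morocco'), ('Player A', 'Spain'), ('', 'England')]
```

The empty-string names come from the portrait and club links, which wrap an image and contain no text.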

Produced output:

defaultdict(list,
            {'': ['England', 'Spain', 'Portugal'],
             'Trent Alexander-Arnold': ['Morocco'],
             'Achraf Hakimi': ['England']})

Expected output:

{'Trent Alexander-Arnold': ['England'],
 'Achraf Hakimi': ['Morocco'],
 'João Cancelo': ['Portugal'],
 'Reece James': ['England']}

Here's the HTML data:

html='''<tbody>
<tr class="odd">
<td class="zentriert">1</td><td class=""><table class="inline-table"><tr><td rowspan="2"><a href="#"><img alt="Trent Alexander-Arnold" class="bilderrahmen-fixed" src="https://img.a.transfermarkt.technology/portrait/small/314353-1559826986.jpg?lm=1" title="Trent Alexander-Arnold"/></a></td><td class="hauptlink"><a class="spielprofil_tooltip" href="/trent-alexander-arnold/profil/spieler/314353" id="314353" title="Trent Alexander-Arnold">Trent Alexander-Arnold</a></td></tr><tr><td>Right-Back</td></tr></table></td><td class="zentriert">23</td><td class="zentriert"><img alt="England" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569" title="England"/></td><td class="zentriert"><a class="vereinprofil_tooltip" href="/fc-liverpool/startseite/verein/31" id="31"><img alt="Liverpool FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/31.png?lm=1456567819" title=" "/></a></td><td class="rechts hauptlink"><b>£67.50m</b><span class="icons_sprite red-arrow-ten" title="£90.00m"> </span></td></tr>
<tr class="even">
<td class="zentriert">2</td><td class=""><table class="inline-table"><tr><td rowspan="2"><a href="#"><img alt="Achraf Hakimi" class="bilderrahmen-fixed" src="https://img.a.transfermarkt.technology/portrait/small/398073-1633679363.jpg?lm=1" title="Achraf Hakimi"/></a></td><td class="hauptlink"><a class="spielprofil_tooltip" href="/achraf-hakimi/profil/spieler/398073" id="398073" title="Achraf Hakimi">Achraf Hakimi</a></td></tr><tr><td>Right-Back</td></tr></table></td><td class="zentriert">22</td><td class="zentriert"><img alt="Morocco" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/107.png?lm=1520611569" title="Morocco"/><br/><img alt="Spain" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/157.png?lm=1520611569" title="Spain"/></td><td class="zentriert"><a class="vereinprofil_tooltip" href="/fc-paris-saint-germain/startseite/verein/583" id="583"><img alt="Paris Saint-Germain" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/583.png?lm=1522312728" title=" "/></a></td><td class="rechts hauptlink"><b>£63.00m</b><span class="icons_sprite green-arrow-ten" title="£54.00m"> </span></td></tr>
<tr class="odd">
<td class="zentriert">3</td><td class=""><table class="inline-table"><tr><td rowspan="2"><a href="#"><img alt="João Cancelo" class="bilderrahmen-fixed" src="https://img.a.transfermarkt.technology/portrait/small/182712-1615221629.jpg?lm=1" title="João Cancelo"/></a></td><td class="hauptlink"><a class="spielprofil_tooltip" href="/joao-cancelo/profil/spieler/182712" id="182712" title="João Cancelo">João Cancelo</a></td></tr><tr><td>Right-Back</td></tr></table></td><td class="zentriert">27</td><td class="zentriert"><img alt="Portugal" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/136.png?lm=1520611569" title="Portugal"/></td><td class="zentriert"><a class="vereinprofil_tooltip" href="/manchester-city/startseite/verein/281" id="281"><img alt="Manchester City" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/281.png?lm=1467356331" title=" "/></a></td><td class="rechts hauptlink"><b>£49.50m</b><span class="icons_sprite green-arrow-ten" title="£45.00m"> </span></td></tr>
<tr class="even">
<td class="zentriert">4</td><td class=""><table class="inline-table"><tr><td rowspan="2"><a href="#"><img alt="Reece James" class="bilderrahmen-fixed" src="https://img.a.transfermarkt.technology/portrait/small/472423-1569484519.png?lm=1" title="Reece James"/></a></td><td class="hauptlink"><a class="spielprofil_tooltip" href="/reece-james/profil/spieler/472423" id="472423" title="Reece James">Reece James</a></td></tr><tr><td>Right-Back</td></tr></table></td><td class="zentriert">21</td><td class="zentriert"><img alt="England" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569" title="England"/></td><td class="zentriert"><a class="vereinprofil_tooltip" href="/fc-chelsea/startseite/verein/631" id="631"><img alt="Chelsea FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/631.png?lm=1628160548" title=" "/></a></td><td class="rechts hauptlink"><b>£40.50m</b><span class="icons_sprite green-arrow-ten" title="£36.00m"> </span></td></tr>
<tr class="odd">
<tbody>'''.replace('< ', '<')

Solution:

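This should do it. Process one table row at a time: select_one returns only the first match, so a second flaggenrahmen image in the same row is skipped.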

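from bs4 import BeautifulSoup

players = {}
soup = BeautifulSoup(html, 'lxml')
# Walk the direct children of the outer <tbody>, so each player row is visited once.
for el in soup.tbody.children:
    # Skip whitespace text nodes and anything that is not a table row.
    if el.name != 'tr':
        continue
    name = el.select_one('.spielprofil_tooltip')
    # select_one returns only the first flag in the row.
    country = el.select_one('.flaggenrahmen')
    if name and country:
        players[name.text] = [country['title']]
print(players)
# {'Trent Alexander-Arnold': ['England'], 'Achraf Hakimi': ['Morocco'], 'João Cancelo': ['Portugal'], 'Reece James': ['England']}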
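
As an alternative that follows the steps in the question more literally, one could start from each player link and take the next flag in document order with find_next. This is only a sketch on hypothetical trimmed-down markup ("Player A" and "Player B" are placeholders, and html.parser is an assumption made to avoid the lxml dependency). It also assumes every player row has at least one flag; if a row had none, find_next would pick up the flag from a later row.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup shaped like the rows in the question.
sample = '''
<table>
<tr>
  <td><a href="#"><img alt="Player A"/></a></td>
  <td><a class="spielprofil_tooltip" href="/a">Player A</a></td>
  <td><img alt="Morocco" class="flaggenrahmen"/><br/><img alt="Spain" class="flaggenrahmen"/></td>
</tr>
<tr>
  <td><a href="#"><img alt="Player B"/></a></td>
  <td><a class="spielprofil_tooltip" href="/b">Player B</a></td>
  <td><img alt="England" class="flaggenrahmen"/></td>
</tr>
</table>
'''

soup = BeautifulSoup(sample, 'html.parser')
players = {}
# Only the profile links carry the player name as text.
for link in soup.select('a.spielprofil_tooltip'):
    # find_next returns the first matching element after this link in document
    # order, so any further flags in the same row are never looked at.
    flag = link.find_next('img', class_='flaggenrahmen')
    if flag is not None:
        players[link.get_text(strip=True)] = [flag['alt']]

print(players)  # {'Player A': ['Morocco'], 'Player B': ['England']}
```

The row-based loop is the safer choice when some rows may lack a flag, because it never looks beyond the current row.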