Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Python: retrieving co-ordinates from html code

I’m currently using basic image mapping (for shelf tests if anyone is into market research). I upload an image to an image mapping app and then manually insert boxes around particular products which generates co-ordinates. I have an excel sheet with a macro that once these co-ordinates are pasted into will create code that I can put directly into my surveys. Initially, the co-ordinates are embedded in html code like the following

    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="10,32,202,115" style="outline:none;" target="_self"     />
    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="206,25,346,124" style="outline:none;" target="_self"     />
    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="340,27,448,121" style="outline:none;" target="_self"     />
    <area  alt="" title="" href="http://www.image-maps.com/" shape="rect" 
    coords="446,2,559,119" style="outline:none;" target="_self"     />
    <area shape="rect" coords="998,567,1000,569" alt="Image Map" style="outline:none;" 
    title="Image Map"/>

These can get very long I was just wondering how I could extract just the co-ordinates with each set printed on a new line. All I have currently is the following which simply returns all the numbers and only works if the pasted html code is all on one line

    import re
    html = '#paste html block onto one line'
    coords = re.findall("\d+", html)
    print(coords)

My ideal output here would be

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

10,32,202,115

206,25,346,124

340,27,448,121

446,2,559,119

998,567,1000,569

Any suggestions ?

>Solution :

Beautiful Soup package for Python can do this very easily.

You can install it with pip install beautifulsoup4

Using the XML you provided (I saved it in a file called "stuff.xml"), the following code will get the coordinates for you:

from bs4 import BeautifulSoup

with open("stuff.xml") as xml_file:
    soup = BeautifulSoup(xml_file.read(), "html.parser")

coords = [area["coords"] for area in soup.find_all('area')]

print(coords)

# -> ['10,32,202,115', '206,25,346,124', '340,27,448,121', '446,2,559,119', '998,567,1000,569']

Using regular expressions on XML is gonna cause problems sooner rather than later, better to use an actual XML parser.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading