How do I split a string to extract only an uppercase word, or an uppercase word followed by a version number?


I am using Selenium with Python to scrape some file information. I would like to extract only the file type and, if available, the version number, e.g. GML 3.1.1. I’m looking for a split-style function to do this. My current output is a list that looks like this:

ESRI Shapefile, (50.7 kB)
GML 3.1.1, (124.9 kB)
Google Earth KML 2.1, (126.5 kB)
MapInfo MIF, (53.5 kB)

The script section is as follows:

for file in files:
    file_format = file.text
    print(file_format)

I’m looking for a function, along the lines of strip(), that keeps only the part before the comma that is an uppercase word, or an uppercase word followed by a version number. This is the output I’m looking for:

ESRI
GML 3.1.1
KML 2.1
MIF

Solution:

A regex that matches all-uppercase words, optionally followed by a whitespace character and a run of digits and dots, would work here:

s='''ESRI Shapefile, (50.7 kB)
GML 3.1.1, (124.9 kB)
Google Earth KML 2.1, (126.5 kB)
MapInfo MIF, (53.5 kB)'''

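import re

# \b[A-Z]+\b       a whole word made only of uppercase letters
# (?:\s[\d.]+)?    optionally, one whitespace character plus digits and dots
print(re.findall(r'\b[A-Z]+\b(?:\s[\d.]+)?', s))
# ['ESRI', 'GML 3.1.1', 'KML 2.1', 'MIF']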
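Since the question's loop handles one element at a time, the same pattern could also be applied per string. The following is only a minimal sketch: `extract_format` and `FORMAT_RE` are hypothetical names, and the `texts` list stands in for the `file.text` values returned by Selenium. It searches only the part before the first comma, on the assumption that the size field might contain an uppercase unit such as `MB`, which the pattern would otherwise also match.

```python
import re

# Uppercase word, optionally followed by whitespace and a dotted version number.
# FORMAT_RE and extract_format are hypothetical names used for illustration.
FORMAT_RE = re.compile(r'\b[A-Z]+\b(?:\s[\d.]+)?')

def extract_format(text):
    """Return the first uppercase token (plus version, if any) found
    before the first comma in `text`, or None when nothing matches."""
    name_part = text.partition(',')[0]  # drop the ", (50.7 kB)" size suffix
    match = FORMAT_RE.search(name_part)
    return match.group(0) if match else None

# Stand-in for the `file.text` values collected in the Selenium loop.
texts = [
    'ESRI Shapefile, (50.7 kB)',
    'GML 3.1.1, (124.9 kB)',
    'Google Earth KML 2.1, (126.5 kB)',
    'MapInfo MIF, (53.5 kB)',
]

for text in texts:
    print(extract_format(text))
```

In the original loop, this would amount to calling `extract_format(file.text)` in place of printing `file.text` directly.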
