Finding whether a string is present in a pandas data frame column, and creating a column with that string if it is

I have a list of 6000 files and a pandas data frame that contains a list of URLs. Some of those URLs match the names of those 6000 files. While I am iterating through the list of the files for some other purpose (extracting text), I am also looking for matching names in the URLs column. If there is a match, I write the matching file path in a new column.

It does not sound complicated, except that my code does not work:

import glob
import os

import pandas as pd

files = glob.glob("materials/*.html")
data = pd.read_csv("file.csv")

def match_name(row):
    if filename in row['URL']:
        return file

for file in files:
    filename = os.path.basename(f'{file[:-5]}')
    extractor = open(file, 'rb')
    ...
    full = [p_text, os.path.basename(file)]
    df_full = pd.DataFrame(full)

    data['Path'] = data.apply(lambda x: match_name(x), axis=1)

However, it does not work: every value in the new column is null. I also tried:

data['Path'] = data.apply(lambda x: file if filename in x else None, axis=1)

The columns of the data frame look like this:

|Name | Value | URL                         |
|-----|-------|-----------------------------|
|Name1|Value1 |http://example.com/LALAC.html|
|Name2|Value2 |http://example.com/ABASW.html|
|Name3|Value3 |http://example.com/4421C.html|

The files are LALAC.txt, SDDSA1.txt, 4421C.html, etc. The output that I want to get is:

|Name | Value | URL                         |Path               |
|-----|-------|-----------------------------|-------------------|
|Name1|Value1 |http://example.com/LALAC.html|materials/LALAC.txt|
|Name2|Value2 |http://example.com/ABASW.html|None               |
|Name3|Value3 |http://example.com/4421C.html|materials/4421C.txt|

The path does exist in the folder, but I am missing the reason why I keep getting None. Any ideas?

>Solution :

If you have all of the file names in a set, and all of the URLs in a dataframe, you can do:

import pandas as pd
filenames = {"LALAC", "ABASW", "4421C"}

df = pd.DataFrame({'URL': [
"http://example.com/LALAC.html",
"http://example.com/ABASW.html",
"http://example.com/4421C.html",
"http://example.com/12345.html"
]})

df["Path"] = "materials/" + df["URL"].str.findall('|'.join(filenames)).str[0] + ".txt"

Result:

                             URL                 Path
0  http://example.com/LALAC.html  materials/LALAC.txt
1  http://example.com/ABASW.html  materials/ABASW.txt
2  http://example.com/4421C.html  materials/4421C.txt
3  http://example.com/12345.html                  NaN
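
As for why the original loop fills the column with None: assigning data['Path'] inside the loop replaces the whole column on every iteration, so only matches for the last file survive. In the second attempt, `filename in x` tests the row's index labels (the column names), not its values. A minimal sketch of one way around both problems is to build a stem-to-path dictionary once and look up each URL's stem in it. The file list below is a hypothetical stand-in for the glob.glob(...) result, and the names stem_to_path and url_stem are illustrative only.

```python
import os

import pandas as pd

# Hypothetical file list; in the question this would come from
# glob.glob("materials/*.html").
files = ["materials/LALAC.html", "materials/SDDSA1.html", "materials/4421C.html"]

data = pd.DataFrame({
    "Name": ["Name1", "Name2", "Name3"],
    "Value": ["Value1", "Value2", "Value3"],
    "URL": [
        "http://example.com/LALAC.html",
        "http://example.com/ABASW.html",
        "http://example.com/4421C.html",
    ],
})

# Map each file's stem (name without directory or extension) to its path.
stem_to_path = {os.path.splitext(os.path.basename(f))[0]: f for f in files}

# Take the last URL segment, drop its extension, and look it up in the mapping.
# Stems with no matching file become NaN.
url_stem = data["URL"].str.rsplit("/", n=1).str[-1].str.rsplit(".", n=1).str[0]
data["Path"] = url_stem.map(stem_to_path)

print(data)
```

Unlike the regex alternation above, this compares whole file names rather than substrings, so a short name cannot match inside a longer one, and names containing regex metacharacters need no escaping. The column is also assigned once, outside any loop over the files.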