I have a pandas DataFrame with a single column of strings. I want to apply a function to each row that splits the string into sentences and replaces that row with one row per sentence.
Example dataframe:
import pandas as pd
df = pd.DataFrame(["A sentence. Another sentence. More sentences here.", "Another line of text"])
Output of df.head():
0
0 A sentence. Another sentence. More sentences h...
1 Another line of text
I have tried using the apply() method as follows:
import re
def get_sentence(row):
    # split the row's text on literal dots and wrap the pieces in a new DataFrame
    return pd.DataFrame(re.split(r'\.', row[0]))
df.apply(get_sentence, axis=1)
But the result is one nested DataFrame per row instead of a flat column:
0 0
0 A sentenc...
1 0
0 Another line of text
I want the output as:
0
0 A sentence
1 Another sentence
2 More sentences here
3 Another line of text
What is the correct way to do this?
Solution:
You can use
df[0].str.split(r'\.(?!$)').explode().reset_index(drop=True).str.rstrip('.')
Output:
0 A sentence
1 Another sentence
2 More sentences here
3 Another line of text
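The \.(?!$) regex matches a dot that is not at the end of the string, so the split leaves the final dot attached to the last piece. .explode() turns each list element into its own row, .reset_index(drop=True) renumbers the rows from 0, and .str.rstrip('.') removes the remaining trailing dot. Note that every piece after the first keeps the space that followed the dot; append .str.strip() if that matters.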
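For reference, the same chain can be written out one step per line. This is only a sketch against the example DataFrame from the question: the names `sentences` and `result` are made up here, and the extra `.str.strip()` call is an addition to the one-liner above, included to drop the space left after each dot.

```python
import pandas as pd

df = pd.DataFrame(["A sentence. Another sentence. More sentences here.", "Another line of text"])

sentences = (
    df[0]
    .str.split(r"\.(?!$)")     # split on every dot except one at the very end of the string
    .explode()                 # one list element per row; the original index is repeated
    .str.rstrip(".")           # remove the final dot that the lookahead left in place
    .str.strip()               # assumption: surrounding whitespace is not wanted
    .reset_index(drop=True)    # renumber rows 0..n-1
)

# Back to a one-column DataFrame, matching the shape asked for in the question.
result = sentences.to_frame()
print(result)
```

Calling `.reset_index(drop=True)` last keeps the original row labels available during the intermediate steps, which can help when checking which input row a sentence came from.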
You can also use a Series.str.findall version:
>>> df[0].str.findall(r'[^.]+').explode().reset_index(drop=True)
0 A sentence
1 Another sentence
2 More sentences here
3 Another line of text
where [^.]+ matches one or more characters other than a dot.
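If this is needed in more than one place, the findall approach could be wrapped in a small function. The helper below is hypothetical (the name `explode_sentences` and its `column` parameter are not part of pandas), and it assumes that whitespace-only pieces, such as the one produced by a string ending in ". ", should be discarded.

```python
import pandas as pd

def explode_sentences(frame, column=0):
    """Hypothetical helper: return one row per dot-separated piece of `column`."""
    pieces = (
        frame[column]
        .str.findall(r"[^.]+")   # every run of non-dot characters
        .explode()               # one piece per row
        .str.strip()             # drop the whitespace around each piece
    )
    # Assumption: empty pieces and rows with no match (NaN after explode) are not wanted.
    pieces = pieces[pieces.notna() & (pieces != "")]
    return pieces.reset_index(drop=True).to_frame(name=column)

df = pd.DataFrame([
    "A sentence. Another sentence. More sentences here.",
    "Another line of text",
    "Trailing space. ",
])
out = explode_sentences(df)
print(out)
```

The third input row is not in the original example; it is there to show that the leftover " " after the last dot does not become a row of its own.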