I have a list of sentences, some of which contain items written as a comma-separated list:
| index | sentence |
|---|---|
| 0 | You can get cars, trucks, planes, and boats. |
| 1 | You can get the car, truck, and plane. |
| 2 | You should ignore this sentence. |
I only want to extract elements from sentences that start with "You can get" or "You can get the". I hope to do this with the pandas extractall method, extracting each individual element of the list in those sentences.
Desired output:
| index | match | object |
|---|---|---|
| 0 | 0 | car |
| | 1 | truck |
| | 2 | plane |
| | 3 | boat |
| 1 | 0 | car |
| | 1 | truck |
| | 2 | plane |
I have three main questions:
- How to use the lookbehind `(?<=[Y|y]ou can get )` so it won’t capture "the"
- How to include the lookahead `\w+(?=s)?` so that both plural and singular forms of the elements are captured
- Is it possible to write a capture group that also extracts each word as an individual element, or should I extract the list in the sentence first (e.g. `cars, trucks, planes, and boats`) and then run another regex?
>Solution :
What about using:

    # Filter rows first, then capture each word before a comma or the final period.
    df.loc[df['sentence'].str.startswith('You can get '),
           'sentence'].str.extractall(r'(?P<object>\S+?)s?\b(?:,|\.$)')

Output:
| index | match | object |
|---|---|---|
| 0 | 0 | car |
| | 1 | truck |
| | 2 | plane |
| | 3 | boat |
| 1 | 0 | car |
| | 1 | truck |
| | 2 | plane |
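
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the approach above. The DataFrame is rebuilt from the sample sentences in the question, and the variable names `mask`, `pattern`, and `result` are my own, chosen for illustration. The final period is escaped as `\.` so it only matches a literal dot.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "sentence": [
        "You can get cars, trucks, planes, and boats.",
        "You can get the car, truck, and plane.",
        "You should ignore this sentence.",
    ]
})

# Step 1: keep only the sentences that start with the required prefix.
mask = df["sentence"].str.startswith("You can get ")

# Step 2: capture each word that is directly followed by a comma or by the
# final period, leaving an optional trailing "s" outside the named group.
pattern = r"(?P<object>\S+?)s?\b(?:,|\.$)"
result = df.loc[mask, "sentence"].str.extractall(pattern)

print(result)
```

"the" is skipped without any lookbehind because it is never followed by a comma or the closing period. One caveat, as far as I can tell: the lazy `\S+?` followed by `s?` strips any trailing "s", so a singular word that ends in "s" (for example "bus") would come out as "bu". If that matters for your data, a small post-processing step is probably safer than a more elaborate regex.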