I’m currently trying to find the best solution to this problem:
I import a csv into Python, some of the cells have values and some don’t. I want to loop through the dataframe and if a cell has a value, append a number to it which is incremented each time a new cell has text. These appended values are then saved to a new column
I also save each piece of text to another list as I need to use them in the next step (out). The following code produces the correct output, I just wondered if there is a more efficient way to do it?
I considered np.where() or something similar but can’t increment a count from that.
import pandas as pd
df = pd.DataFrame(
[["", "c", "e"], ["a", "", ""], ["b", "d", "f"]],
columns = ["x", "y", "z"]
)
count = 1
out = []
# Loop over each column
for col in df.columns:
tmp = []
# Loop over each entry, if cell has text, add prefix
for text in df[col]:
if text:
tmp.append(str(count) + "-" + text)
out.append(text)
count += 1
else:
tmp.append("")
# Add prefixed column to df
df["pre-" + col] = tmp
Giving output:
Output DF
This doesn’t necessarily have to be done column-wise, as long as the numbers correspond to the order in out.
>Solution :
If you don’t care about the order of the counter, reshape with stack, increment with cumsum, then unstack:
s = df.replace('', np.nan).stack()
out = (s.notna().cumsum().astype(str)+'-'+s
).unstack().fillna('')
Output:
x y z
0 1-c 2-e
1 3-a
2 4-b 5-d 6-f
If order matters:
s = df.replace('', np.nan).T.stack()
out = (s.notna().cumsum().astype(str)+'-'+s
).unstack().T.fillna('')
Or using numpy:
m = df.ne('').to_numpy()
tmp = (pd.DataFrame(m.T.cumsum(axis=None).reshape(m.shape).T,
index=df.index, columns=df.columns)
.astype(str).add('-')
.where(m, '')
)
out = tmp+df
Output:
x y z
0 3-c 5-e
1 1-a
2 2-b 4-d 6-f
And to combine the output to the original DataFrame, join:
df.join(out.add_prefix('pre-'))
Output:
x y z pre-x pre-y pre-z
0 c e 3-c 5-e
1 a 1-a
2 b d f 2-b 4-d 6-f