Data Restructuring using python in unix

January 31, 2024

I am trying to split the data and rearrange the data in a CSV file. My data looks something like this

1:100011159-T-G,CDD3-597,G,G
1:10002775-GA,CDD3-597,G,G
1:100122796-C-T,CDD3-597,T,T
1:100152282-CAAA-T,CDD3-597,C,C
1:100011159-T-G,CDD3-598,G,G
1:100152282-CAAA-T,CDD3-598,C,C

and I want a table that looks like this:

ID	1:100011159-T-G	1:10002775-GA	1:100122796-C-T	1:100152282-CAAA-T
CDD3-597	GG	GG	TT	CC
CDD3-598	GG			CC

I have written the following code:

import pandas as pd


input_file = "trail_berry.csv"
output_file = "trail_output_result.csv"

# Read the CSV file without header
df = pd.read_csv(input_file, header=None)
print(df[0].str.split(',', n=2, expand=True))
# Extract SNP Name, ID, and Alleles from the data
df[['SNP_Name', 'ID', 'Alleles']] = df[0].str.split(',', n=-1, expand=True)

# Create a new DataFrame with unique SNP_Name values as columns
result_df = pd.DataFrame(columns=df['SNP_Name'].unique(), dtype=str)

# Populate the new DataFrame with ID and Alleles data
for _, row in df.iterrows():
    result_df.at[row['ID'], row['SNP_Name']] = row['Alleles']

# Reset the index
result_df.reset_index(inplace=True)
result_df.rename(columns={'index': 'ID'}, inplace=True)

# Fill NaN values with an appropriate representation (e.g., 'NULL' or '')
result_df = result_df.fillna('NULL')

# Save the result to a new CSV file
result_df.to_csv(output_file, index=False)

# Print a message indicating that the file has been saved
print("Result has been saved to {}".format(output_file))

but this has been giving me the following error:

Traceback (most recent call last):
  File "berry_trail.py", line 11, in <module>
    df[['SNP_Name', 'ID', 'Alleles']] = df[0].str.split(',', n=-1, expand=True)
  File "/nas/longleaf/home/svennam/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 3367, in __setitem__
    self._setitem_array(key, value)
  File "/nas/longleaf/home/svennam/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 3389, in _setitem_array
    raise ValueError('Columns must be same length as key')

Can someone please help, I am having hard time figuring this out.Thanks in advance!
ValueError: Columns must be same length as key

>Solution :

What are you trying to do is called pivoting, so use DataFrame.pivot():

import pandas as pd

df = pd.read_csv("your_file.csv", header=None)

out = (
    df.pivot(index=1, columns=0, values=2)
    .rename_axis(index="ID", columns=None)
    .reset_index()
)
print(out)

Prints:

         ID 1:100011159-T-G 1:10002775-GA 1:100122796-C-T 1:100152282-CAAA-T
0  CDD3-597              GG            GG              TT                 CC
1  CDD3-598              GG           NaN             NaN                 CC

EDIT: With your updated input:

import pandas as pd

df = pd.read_csv("your_file.csv", header=None)

df["tmp"] = df[[2, 3]].agg("".join, axis=1)  # <-- join the two columns together

out = (
    df.pivot(index=1, columns=0, values="tmp")
    .rename_axis(index="ID", columns=None)
    .reset_index()
)
print(out)