Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Data Restructuring using python in unix

I am trying to split the data and rearrange the data in a CSV file. My data looks something like this

1:100011159-T-G,CDD3-597,G,G
1:10002775-GA,CDD3-597,G,G
1:100122796-C-T,CDD3-597,T,T
1:100152282-CAAA-T,CDD3-597,C,C
1:100011159-T-G,CDD3-598,G,G
1:100152282-CAAA-T,CDD3-598,C,C  

and I want a table that looks like this:

ID 1:100011159-T-G 1:10002775-GA 1:100122796-C-T 1:100152282-CAAA-T
CDD3-597 GG GG TT CC
CDD3-598 GG CC

I have written the following code:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

import pandas as pd


input_file = "trail_berry.csv"
output_file = "trail_output_result.csv"

# Read the CSV file without header
df = pd.read_csv(input_file, header=None)
print(df[0].str.split(',', n=2, expand=True))
# Extract SNP Name, ID, and Alleles from the data
df[['SNP_Name', 'ID', 'Alleles']] = df[0].str.split(',', n=-1, expand=True)

# Create a new DataFrame with unique SNP_Name values as columns
result_df = pd.DataFrame(columns=df['SNP_Name'].unique(), dtype=str)

# Populate the new DataFrame with ID and Alleles data
for _, row in df.iterrows():
    result_df.at[row['ID'], row['SNP_Name']] = row['Alleles']

# Reset the index
result_df.reset_index(inplace=True)
result_df.rename(columns={'index': 'ID'}, inplace=True)

# Fill NaN values with an appropriate representation (e.g., 'NULL' or '')
result_df = result_df.fillna('NULL')

# Save the result to a new CSV file
result_df.to_csv(output_file, index=False)

# Print a message indicating that the file has been saved
print("Result has been saved to {}".format(output_file))

but this has been giving me the following error:

Traceback (most recent call last):
  File "berry_trail.py", line 11, in <module>
    df[['SNP_Name', 'ID', 'Alleles']] = df[0].str.split(',', n=-1, expand=True)
  File "/nas/longleaf/home/svennam/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 3367, in __setitem__
    self._setitem_array(key, value)
  File "/nas/longleaf/home/svennam/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 3389, in _setitem_array
    raise ValueError('Columns must be same length as key')

Can someone please help, I am having hard time figuring this out.Thanks in advance!
ValueError: Columns must be same length as key

>Solution :

What are you trying to do is called pivoting, so use DataFrame.pivot():

import pandas as pd

df = pd.read_csv("your_file.csv", header=None)

out = (
    df.pivot(index=1, columns=0, values=2)
    .rename_axis(index="ID", columns=None)
    .reset_index()
)
print(out)

Prints:

         ID 1:100011159-T-G 1:10002775-GA 1:100122796-C-T 1:100152282-CAAA-T
0  CDD3-597              GG            GG              TT                 CC
1  CDD3-598              GG           NaN             NaN                 CC

EDIT: With your updated input:

import pandas as pd

df = pd.read_csv("your_file.csv", header=None)

df["tmp"] = df[[2, 3]].agg("".join, axis=1)  # <-- join the two columns together

out = (
    df.pivot(index=1, columns=0, values="tmp")
    .rename_axis(index="ID", columns=None)
    .reset_index()
)
print(out)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading