Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

KeyError: "None of [['', '']] are in the [columns]" (Pandas Dataframe)

I am trying to write a function that takes in a dataframe, that has some columns that are within the same genre, and some columns are not. An example of the columns would be:

['id', 't_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 
 't_energy1', 't_energy2']

I am tryiong to generate two new dataframes, one with the columns without duplicates, and one with only the duplicate columns with the code below:

# Function that takes in a dataframe and returns new dataframes with all the sub-dataframes

def sub_dataframes(dataframe):

    copy = dataframe.copy()                  # To avoid SettingWithCopyWarning

    # Iterate through all the columns of the df
    for (col_name, col_data) in copy.iteritems():

        temp = str(col_name)
        rest = copy.iloc[:, 1:]
        new_df = [[]]

        # If it's not a duplicate, we just add it to the new df
        if len(temp) < 6:
            new_df[temp] = copy[col_data]

        # If the length of the column name is greater than or equal to 6, we know it's a duplicate
        if len(temp) >= 6:
            stripped = temp.rstrip(temp[2:])

            # Second for-loop to check the next column
            for (col_name2, col_data2) in rest.iteritems():
                temp2 = str(col_name2)
                rest2 = rest.iloc[:, 1:]
                only_dups = [[]]

                if len(temp2) >= 6:
                    stripped2 = temp2.rstrip(temp2[2:])

                    # Compare the two column names (without the integer 0,1, or 2)
                    if stripped[:-1] == stripped2[:-1]:

                        # Create new df of the two columns
                        only_dups[stripped] = col_data
                        only_dups[stripped2] = col_data2

                        # Third for-loop to check the remaining columns
                        for (col_name3, col_data3) in rest2.iteritems():
                            temp3 = str(col_name3)

                            if len(temp3) >= 6:
                                stripped3 = temp3.rstrip(temp3[2:])

                                # Compare the two column names (without the integer 0,1, or 2)
                                if stripped2[:-1] == stripped3[:-1]:
                                    only_dups[stripped3] = col_data3

    print("Original:\n{}\nWithout duplicates:\n{}\nDuplicates:\n{}".format(copy, new_df, only_dups))


sub_dataframes(df)

When I run this code, I get this error:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

KeyError: "None of [Int64Index([ 22352, 106534,  23608,   8655,  49670, 101988,   9136, 
141284,\n             28564,  14262,\n            ...\n             76690, 150965, 
143106, 142370,  68004,  33980, 110832,  14491,\n            123511,   6207],\n           
dtype='int64', length=2833)] are in the [columns]"

I tried looking at other questions here on StackOverflow, to see if I can fix the problem, but all I’ve understood so far, is that I cannot add columns the way I have now, with new_df[temp] = copy[col_data] or only_dups[stripped] = col_data, but I can’t seem to figure out how to properly create new columns. How do I add a new column based on the variables I have now? Is it possible, or do I have to re-write the code so that it doesn’t have so many for-loops?

EDIT

An example of the output I want would be:

Original:
        id    t_dur0    t_dur1    t_dur2    ... 
0      22352  292720  293760.0  292733.0  
1     106534  213760  181000.0  245973.0 
2      23608  157124  130446.0  152450.0  
3       8655  127896  176351.0  166968.0  
4      49670  210320  226253.0  211880.0  
...      ...     ...       ...       ...

Without duplicates:
        id  
0      22352
1     106534
2      23608
3       8655
4      49670
...      ..

Duplicates: 
      t_dur0  t_dur1    t_dur2   
0     292720  293760.0  292733.0  
1     213760  181000.0  245973.0 
2     157124  130446.0  152450.0  
3     127896  176351.0  166968.0  
4     210320  226253.0  211880.0  
...      ...     ...       ... 

>Solution :

IIUC:

def sub_dataframes(dataframe):
  # extract common prefix -> remove trailing digits
  cols = dataframe.columns.str.replace(r'\d*$', '', regex=True) \
                  .to_series().value_counts()

  # split columns
  unq_cols = cols[cols == 1].index
  dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)]

  return (dataframe[unq_cols], dataframe[dup_cols])

df1, df2 = sub_dataframes(df)

Output:

>>> df1
       id
0   22352
1  106534
2   23608
3    8655
4   49670

>>> df2
   t_dur0    t_dur1    t_dur2
0  292720  293760.0  292733.0
1  213760  181000.0  245973.0
2  157124  130446.0  152450.0
3  127896  176351.0  166968.0
4  210320  226253.0  211880.0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading