Python dict: flatten nested dict values to create new key-value pairs in NumPy/pandas

Hi, I have data in the structure below: a dict whose keys are labels and whose values are arrays (some of them arrays of arrays). I want to flatten the values and dynamically append a row index to each key, so that every row becomes its own entry, as shown below. I can iterate over each key-value pair, build a new dict, and add these entries to it to get the expected result, but it's slow. I have around 50M values in the arrays; is there a faster approach in numpy/pandas?

This is what I have

{'user_feature':
array([
[ 1.33677050e-02, -1.45685431e-02],
[-2.30765194e-02,  0.00000000e+00],
[ 0.00000000e+00,  0.00000000e+00],
[ 1.16669689e-04,  1.33677050e-02]]),
'sequence_service_id_list':
array([
[215., 215., 215., ..., 554., 215., 215.],
[215., 215., 215., ..., 215., 215., 215.],
[215., 215., 554., ..., 215., 215., 215.]]),
'target_label':
array([1., 1., 1., ..., 1., 1., 1.])}

Expected:

{'user_feature_1': [ 1.33677050e-02, -1.45685431e-02], 
'user_feature_2': [-2.30765194e-02, 0.00000000e+00],
'user_feature_3': [0.00000000e+00,  0.00000000e+00],
'sequence_service_id_list_1': [215., 215., 215., ..., 554., 215., 215.],
'sequence_service_id_list_2': [215., 215., 215., ..., 215., 215., 215.],
'sequence_service_id_list_3': [215., 215., 554., ..., 215., 215., 215.], 
'target_label_1': 1., 
'target_label_2': 1., 
'target_label_3': 1., 
}

Solution:

This isn’t a vectorized solution to create the dict you want, but a way to access the required rows using keys that follow the new format.

Let’s define a class to wrap this input dictionary. When you try to get a key from an object of this class, the __getitem__ method is invoked, where the key is parsed into its "original key" and "index" components, and the appropriate row of the appropriate value is returned.

class CustomDict:
    def __init__(self, input_dict):
        self.__data = input_dict

    def __getitem__(self, key):
        orig_key, elem_index = key.rsplit("_", 1)
        return self.__data[orig_key][int(elem_index)-1]
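The key parsing used in __getitem__ can be checked in isolation. The sketch below only uses str.rsplit and int; the variable name row_position is introduced here for illustration and does not appear in the class above.

```python
# Split only at the last underscore, so underscores inside the original key survive.
key = "sequence_service_id_list_2"
orig_key, elem_index = key.rsplit("_", 1)

print(orig_key)    # sequence_service_id_list
print(elem_index)  # 2  (still a string at this point)

# Keys are 1-based, NumPy rows are 0-based, hence the "- 1".
row_position = int(elem_index) - 1
print(row_position)  # 1
```

A key without a numeric suffix (or without any underscore) raises a ValueError here, so the wrapper only accepts keys in the new "<original key>_<index>" format.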

Let’s test this:

import numpy as np

array = np.array

inp_dict = {'user_feature': array([[ 1.33677050e-02, -1.45685431e-02], 
                                   [-2.30765194e-02, 0.00000000e+00],
                                   [0.00000000e+00,  0.00000000e+00],  
                                   [1.16669689e-04,  1.33677050e-02]]), 
            'sequence_service_id_list': array([[215., 215., 215., 554., 215., 215.],
                                               [215., 215., 215., 215., 215., 215.],
                                               [215., 215., 554., 215., 215., 215.]]), 
            'target_label': array([1., 1., 1., 1., 1., 1.])}

cus_dict = CustomDict(inp_dict)

print(cus_dict['user_feature_1'])
# [ 0.01336771 -0.01456854]

print(cus_dict['user_feature_2'])
# [-0.02307652  0.        ]

print(cus_dict['user_feature_3'])
# [0. 0.]

Because the wrapper never iterates over the data and only splits the key string at access time, constructing it costs nothing beyond storing a reference to the input dictionary. That should be much faster than building a new dictionary with one entry per row up front.
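If a real dict is needed after all (for example, to iterate over the new keys), one straightforward option is a dict comprehension over the rows. This is only a sketch, not a vectorized solution: it still loops in Python, so its cost grows with the number of rows. The function name flatten_rows and the small sample data are made up for illustration. For 2-D arrays the stored rows are views into the original arrays rather than copies.

```python
import numpy as np

def flatten_rows(input_dict):
    """Build {'<key>_<i>': row} using 1-based indices, one entry per row."""
    return {
        f"{key}_{i}": row
        for key, values in input_dict.items()
        for i, row in enumerate(values, start=1)
    }

# Small made-up sample with the same shape of structure as the question.
data = {
    "user_feature": np.array([[1.0, 2.0], [3.0, 4.0]]),
    "target_label": np.array([1.0, 0.0]),
}

flat = flatten_rows(data)
print(list(flat))
# ['user_feature_1', 'user_feature_2', 'target_label_1', 'target_label_2']

# Rows of a 2-D array are views, so no row data is copied.
print(np.shares_memory(flat["user_feature_1"], data["user_feature"]))  # True
```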

You can also implement a similar __setitem__ method to set elements of the original dictionary:

def __setitem__(self, key, value):
    # Same parsing as __getitem__: the key index is 1-based, the array index 0-based
    orig_key, elem_index = key.rsplit("_", 1)
    self.__data[orig_key][int(elem_index) - 1] = value
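To see both methods working together, here is a minimal self-contained sketch. It assumes __setitem__ is defined inside the class (it needs access to the private __data attribute) and converts the index with int(...) - 1, as in __getitem__. The sample arrays are made up for illustration. Because the wrapper holds a reference to the input dictionary, writes go through to the original arrays.

```python
import numpy as np

class CustomDict:
    def __init__(self, input_dict):
        self.__data = input_dict  # reference to the caller's dict, not a copy

    def __getitem__(self, key):
        orig_key, elem_index = key.rsplit("_", 1)
        return self.__data[orig_key][int(elem_index) - 1]

    def __setitem__(self, key, value):
        orig_key, elem_index = key.rsplit("_", 1)
        self.__data[orig_key][int(elem_index) - 1] = value

# Made-up sample data: a 3x2 array of zeros and a length-3 array of ones.
inp_dict = {"user_feature": np.zeros((3, 2)), "target_label": np.ones(3)}
cus_dict = CustomDict(inp_dict)

cus_dict["user_feature_2"] = [0.5, -0.5]  # overwrite the second row
cus_dict["target_label_3"] = 0.0          # overwrite the third element

print(cus_dict["user_feature_2"])  # [ 0.5 -0.5]
print(inp_dict["target_label"])    # [1. 1. 0.]
```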