Hi, I have data in the structure below: a map with labels as keys and arrays of arrays as values. I want to flatten the values and dynamically append an index to each key, so that every row becomes its own entry, as shown below. I can iterate over each key-value pair, create a new dict, and add the rows to it to get the expected result, but it's slow. I have around 50M values in the arrays. Is there a faster approach in numpy/pandas?
This is what I have:
{'user_feature':
array([
[ 1.33677050e-02, -1.45685431e-02],
[-2.30765194e-02, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00],
[1.16669689e-04, 1.33677050e-02]]),
'sequence_service_id_list':
array([
[215., 215., 215., ..., 554., 215., 215.],
[215., 215., 215., ..., 215., 215., 215.],
[215., 215., 554., ..., 215., 215., 215.]]),
'target_label':
array([
1.,
1.,
1., ..., 1., 1., 1.])}
Expected:
{'user_feature_1': [ 1.33677050e-02, -1.45685431e-02],
'user_feature_2': [-2.30765194e-02, 0.00000000e+00],
'user_feature_3': [0.00000000e+00, 0.00000000e+00],
'sequence_service_id_list_1': [215., 215., 215., ..., 554., 215., 215.],
'sequence_service_id_list_2': [215., 215., 215., ..., 215., 215., 215.],
'sequence_service_id_list_3': [215., 215., 554., ..., 215., 215., 215.],
'target_label_1': 1.,
'target_label_2': 1.,
'target_label_3': 1.,
}
>Solution :
This isn’t a vectorized solution to create the dict you want, but a way to access the required rows using keys that follow the new format.
Let’s define a class to wrap this input dictionary. When you try to get a key from an object of this class, the __getitem__ method is invoked, where the key is parsed into its "original key" and "index" components, and the appropriate row of the appropriate value is returned.
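class CustomDict:
    def __init__(self, input_dict):
        # Keep a reference to the original dict; nothing is copied.
        self.__data = input_dict

    def __getitem__(self, key):
        # Split on the last underscore only: "user_feature_2" -> ("user_feature", "2").
        orig_key, elem_index = key.rsplit("_", 1)
        # Keys are 1-based, array rows are 0-based.
        return self.__data[orig_key][int(elem_index) - 1]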
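The key parsing is the only non-obvious step, so here is a small sketch of what rsplit("_", 1) does with a key whose original name itself contains underscores. The key string is taken from the question; nothing else is assumed.

```python
# Split only on the LAST underscore, so underscores inside the
# original key are left untouched.
key = "sequence_service_id_list_2"
orig_key, elem_index = key.rsplit("_", 1)

print(orig_key)             # sequence_service_id_list
print(elem_index)           # 2 (still a string at this point)
print(int(elem_index) - 1)  # 1 (0-based row index into the array)
```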
Let’s test this:
import numpy as np

array = np.array
inp_dict = {'user_feature': array([[ 1.33677050e-02, -1.45685431e-02],
[-2.30765194e-02, 0.00000000e+00],
[0.00000000e+00, 0.00000000e+00],
[1.16669689e-04, 1.33677050e-02]]),
'sequence_service_id_list': array([[215., 215., 215., 554., 215., 215.],
[215., 215., 215., 215., 215., 215.],
[215., 215., 554., 215., 215., 215.]]),
'target_label': array([1., 1., 1., 1., 1., 1.])}
cus_dict = CustomDict(inp_dict)
print(cus_dict['user_feature_1'])
# [ 0.01336771 -0.01456854]
print(cus_dict['user_feature_2'])
# [-0.02307652 0. ]
print(cus_dict['user_feature_3'])
# [0. 0.]
Since nothing is iterated or copied when the wrapper is created, and splitting the key is a cheap operation that happens only at the time of access, this will be much faster than building a new dictionary up front.
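If downstream code really does need an actual dict, one possible middle ground is a dict comprehension that loops once per row rather than once per value. This is only a sketch: flatten_rows is a hypothetical helper name, the sample data is made up, and the cost still grows with the number of rows. Rows of a 2-D array come out as views, so the underlying numbers are not copied.

```python
import numpy as np

def flatten_rows(data):
    """Hypothetical helper: map 'key_i' (1-based) to row i-1 of data['key']."""
    return {
        f"{key}_{i}": row
        for key, arr in data.items()
        for i, row in enumerate(arr, start=1)
    }

# Made-up sample data in the same shape as the question's input.
data = {
    "user_feature": np.array([[1.0, 2.0], [3.0, 4.0]]),
    "target_label": np.array([1.0, 0.0]),
}
flat = flatten_rows(data)

print(list(flat))
# ['user_feature_1', 'user_feature_2', 'target_label_1', 'target_label_2']

# Each 2-D row is a view into the original array, not a copy.
print(np.shares_memory(flat["user_feature_1"], data["user_feature"]))  # True
```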
You can also implement a similar __setitem__ method to set elements of the original dictionary:
    def __setitem__(self, key, value):
        orig_key, elem_index = key.rsplit("_", 1)
        # Convert to a 0-based integer index, mirroring __getitem__.
        self.__data[orig_key][int(elem_index) - 1] = value
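To see reads and writes together, here is a self-contained sketch that restates the wrapper with both methods. The _parse helper and the small sample arrays are my own additions for illustration, not part of the code above. It shows that assignments through the wrapper land in the original arrays.

```python
import numpy as np

class CustomDict:
    """Wrapper from the answer, restated with both accessors."""

    def __init__(self, input_dict):
        self.__data = input_dict

    @staticmethod
    def _parse(key):
        # Hypothetical helper: "name_3" -> ("name", 2)
        orig_key, elem_index = key.rsplit("_", 1)
        return orig_key, int(elem_index) - 1

    def __getitem__(self, key):
        orig_key, idx = self._parse(key)
        return self.__data[orig_key][idx]

    def __setitem__(self, key, value):
        orig_key, idx = self._parse(key)
        self.__data[orig_key][idx] = value

# Made-up sample data, much smaller than the real input.
inp_dict = {
    "user_feature": np.array([[1.0, 2.0], [3.0, 4.0]]),
    "target_label": np.array([1.0, 1.0]),
}
cus_dict = CustomDict(inp_dict)

cus_dict["user_feature_2"] = [0.0, 0.0]  # overwrite a whole row
cus_dict["target_label_1"] = 0.0         # overwrite a single value

# The writes are visible in the original dict.
print(inp_dict["user_feature"][1])  # [0. 0.]
print(inp_dict["target_label"][0])  # 0.0
```

Note that the wrapper mutates the arrays in place, so anything else holding a reference to inp_dict will see the changes too.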