Reduce the values of a dictionary included as a column in a pandas DataFrame

I have the following Python code that creates a DataFrame with a combination of parameters for a specified clustering algorithm.

The function is called as follows:

fixed_params = {"random_state": 1234} 
param_grid = {"n_clusters": range(2,4), "max_iter": [200, 300]}

dataset = myGridSearch(df, fixed_params, param_grid, "KMeans")
print(dataset)

The function returns the following pandas DataFrame:

| params                                                                                                                                                           | num_cluster  | silhouette |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------ | ---------- |
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 200, 'n_clusters': 2, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 2            | 0.854996   | 
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 2, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 2            | 0.854996   | 
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 200, 'n_clusters': 3, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 3            | 0.742472   | 
| {'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 3, 'n_init': 10, 'random_state': 1234, 'tol': 0.0001, 'verbose': 0}    | 3            | 0.742472   | 

Once this DataFrame is obtained, I would like the 'params' column to contain only the parameters that actually vary, that is, the ones stored in param_grid. The resulting DataFrame would look like this:

| params                                | num_cluster  | silhouette |
| ------------------------------------- | ------------ | ---------- |
| {'max_iter': 200, 'n_clusters': 2}    | 2            | 0.854996   | 
| {'max_iter': 300, 'n_clusters': 2}    | 2            | 0.854996   | 
| {'max_iter': 200, 'n_clusters': 3}    | 3            | 0.742472   | 
| {'max_iter': 300, 'n_clusters': 3}    | 3            | 0.742472   | 

If you need me to post the code for the myGridSearch function, let me know in the comments.

>Solution :

IIUC, you can use pandas.json_normalize to expand "params" into one column per parameter, keep only the columns that take more than one distinct value (using nunique and boolean indexing), and finally convert each row back to a dictionary with to_dict:

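import pandas as pd

# one column per parameter, one row per parameter combination
df2 = pd.json_normalize(dataset['params'])
# keep only the columns with more than one distinct value, then rebuild one dict per row
dataset['params'] = pd.Series(df2.loc[:, df2.nunique().gt(1)]
                                 .to_dict(orient='index'))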

output:

                               params  num_cluster  silhouette
0  {'max_iter': 200, 'n_clusters': 2}            2    0.854996
1  {'max_iter': 300, 'n_clusters': 2}            2    0.854996
2  {'max_iter': 200, 'n_clusters': 3}            3    0.742472
3  {'max_iter': 300, 'n_clusters': 3}            3    0.742472

intermediate:

df2.nunique()

algorithm       1
copy_x          1
init            1
max_iter        2
n_clusters      2
n_init          1
random_state    1
tol             1
verbose         1
dtype: int64
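
For anyone who wants to try this without the myGridSearch function, here is a minimal self-contained sketch. The `fixed` dictionary and the `rows` list are hypothetical stand-ins that rebuild the DataFrame from the values shown in the question; they are not the real output of myGridSearch. The sketch also shows a second option that selects the keys of param_grid directly, which may be closer to what the question literally asks for.

```python
import pandas as pd

# Hypothetical stand-in for the myGridSearch output, using the values from the question.
fixed = {"algorithm": "auto", "copy_x": True, "init": "k-means++",
         "n_init": 10, "random_state": 1234, "tol": 0.0001, "verbose": 0}
param_grid = {"n_clusters": range(2, 4), "max_iter": [200, 300]}

rows = [
    ({**fixed, "max_iter": 200, "n_clusters": 2}, 2, 0.854996),
    ({**fixed, "max_iter": 300, "n_clusters": 2}, 2, 0.854996),
    ({**fixed, "max_iter": 200, "n_clusters": 3}, 3, 0.742472),
    ({**fixed, "max_iter": 300, "n_clusters": 3}, 3, 0.742472),
]
dataset = pd.DataFrame(rows, columns=["params", "num_cluster", "silhouette"])

# Option 1: expand the dicts into columns and keep the columns that vary.
flat = pd.json_normalize(dataset["params"].tolist())
varying = flat.loc[:, flat.nunique().gt(1)]
by_nunique = varying.to_dict(orient="records")  # one dict per row

# Option 2: keep exactly the keys that appear in param_grid.
by_grid = [{key: params[key] for key in param_grid} for params in dataset["params"]]

dataset["params"] = by_nunique
print(dataset)
```

The two options agree here, but they can differ: the nunique filter drops a grid parameter whose list holds a single value, whereas selecting by the keys of param_grid always keeps it.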