Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Merge pandas dataframes while keeping duplicate entries

I have 2 dataframes, A and B, which contain different kinds of data. Dataframe A consists of multiple different measurements of an object which where taken at the same time as well as the corresponding timestamp. So a single timestamp will have multiple entries. Dataframe B contains additional data with timestamps that overlap with A. Similar to A the timestamps here also contain multiple entries for the same timestamp but not the same amount as A. B also has more rows than A.

I am interested in assigning a specific value from B to A where the timestamps match, creating a new column for A. For the same timestamp that value is going to be consistent.

My first instinct was to simply use the merge function, however this creates a much larger dataframe caused by the duplicate timestamps. To do that efficiently I extracted only the timestamps from A and the timestamps as well as the value from B.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

merge = pd.DataFrame(A['time']).merge(B[['time', 'value']], on='time', how='left')
A['value'] = merge['value']

Is there a different method that takes the duplicate timestamps into consideration and handles them in the way I need it to?

EDIT: Here is a simplified example of how the data could look like. In this case I would like to assign Value1 to all occurences of ‘20.08.2023 20:00’ and Value2 to ‘21.08.2023 21:00’ in a new column in A.

Simplified description of the data structure

>Solution :

My assumption is that you have two dataframes, A and B, in which A has the following format:

time measurement_1 measurement_2
1 .5 .4
1 .3 4

and B the format below

time value value_2
1 medium short
1 medium tall
1 medium short

The first thing I would look at from a data science perspective, I do not understand why your B will have anything but a single row assigned to your B timestamp (unless if the format is simply timestamp | value where the values have not been split into segregate columns) nor your dataframe A. But, my suggestion would be as follows:

C = A.merge(B[['time', 'value']].drop_duplicates(), on='time', how='left')

The result is the dataframe below:

time measurement_1 measurement_2 value
1 0.5 0.4 medium
1 0.3 4 medium
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading