Merge pandas dataframes while keeping duplicate entries

I have 2 dataframes, A and B, which contain different kinds of data. Dataframe A consists of multiple different measurements of an object which where taken at the same time as well as the corresponding timestamp. So a single timestamp will have multiple entries. Dataframe B contains additional data with timestamps that overlap with A. Similar to A the timestamps here also contain multiple entries for the same timestamp but not the same amount as A. B also has more rows than A.

I am interested in assigning a specific value from B to A where the timestamps match, creating a new column for A. For the same timestamp that value is going to be consistent.

My first instinct was to simply use the merge function, however this creates a much larger dataframe caused by the duplicate timestamps. To do that efficiently I extracted only the timestamps from A and the timestamps as well as the value from B.

merge = pd.DataFrame(A['time']).merge(B[['time', 'value']], on='time', how='left')
A['value'] = merge['value']

Is there a different method that takes the duplicate timestamps into consideration and handles them in the way I need it to?

EDIT: Here is a simplified example of how the data could look like. In this case I would like to assign Value1 to all occurences of ‘20.08.2023 20:00’ and Value2 to ‘21.08.2023 21:00’ in a new column in A.

Simplified description of the data structure

>Solution :

My assumption is that you have two dataframes, A and B, in which A has the following format:

time measurement_1 measurement_2
1 .5 .4
1 .3 4

and B the format below

time value value_2
1 medium short
1 medium tall
1 medium short

The first thing I would look at from a data science perspective, I do not understand why your B will have anything but a single row assigned to your B timestamp (unless if the format is simply timestamp | value where the values have not been split into segregate columns) nor your dataframe A. But, my suggestion would be as follows:

C = A.merge(B[['time', 'value']].drop_duplicates(), on='time', how='left')

The result is the dataframe below:

time measurement_1 measurement_2 value
1 0.5 0.4 medium
1 0.3 4 medium

Leave a Reply