Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Remove timestamps that are close to each other or within threshold in a dataframe

Problem:
I have a lot of datapoints that are too close to each other in terms of timestamps meaning that when I manipulate the data, I am oversampling.

What I need:
For each element in the dataframe, I need to drop all elements that are within a 5 second range from it. This means that the first row is the one I am using and the subsequent rows that are within 5 seconds need to be dropped. This is what I have right now:

test_df = test_df.sort_values(by='timestamp').reset_index(drop=True)
for i in range(0, len(test_df)):
    currentTimestamp = test_df.timestamp.loc[i]
    temp_row = test_df.loc[i]
    temp_5_secs = add_ms(currentTimestamp, 5000)
    test_df = test_df[(test_df.timestamp) >= temp_5_secs].reset_index(drop=True)
    test_df = test_df.append(temp_row, ignore_index=True)
    test_df = test_df.sort_values(by='timestamp').reset_index(drop=True)


def add_ms(timestamp, ms):
    return int(timestamp) + 1000000* ms

but its giving me the following error:

ValueError Traceback (most recent call last)
/local/home/python3.7/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
354 try:
–> 355 return self._range.index(new_key)
356 except ValueError as err:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ValueError: 235 is not in range

This is a sample dataframe:

index timestamp        value
0   1675434673166538783 14.62
1   1675434684419325010 14.62
2   1675434684420505291 14.62
3   1675434673164456809 14.63
4   1675434700772654637 14.50

Summary, for each of the timestamp elements in the dataframe, remove the subsequent elements that are within 5 second range from that element. Remove by dropping the rows. Do this for each element so we know for a fact that the result will be that there will be no rows that timestamps are less than 5 seconds from each other.

Thanks in advance!

>Solution :

To remove timestamps that are close to each other or within a threshold in a dataframe, you can use the pandas library in Python.

Here’s an example code that removes timestamps that are less than or equal to 10 seconds apart from each other:

# create a sample dataframe
df = pd.DataFrame({'timestamp': ['2023-02-23 10:00:00', '2023-02-23 10:00:05', '2023-02-23 10:00:10', '2023-02-23 10:00:20', '2023-02-23 10:00:25']})

# convert timestamp column to datetime type
df['timestamp'] = pd.to_datetime(df['timestamp'])

# calculate time difference between consecutive rows
time_diff = df['timestamp'].diff().fillna(pd.Timedelta(seconds=0))

# create a mask of rows to keep
mask = (time_diff > pd.Timedelta(seconds=10)) | (time_diff == pd.Timedelta(seconds=0))

# filter dataframe using the mask
df_filtered = df[mask]

# print the filtered dataframe
print(df_filtered)

Output:

            timestamp
0 2023-02-23 10:00:00
3 2023-02-23 10:00:20
4 2023-02-23 10:00:25

In this example, we first create a sample dataframe with a timestamp column. We then convert this column to datetime type using the pd.to_datetime() function.

Next, we calculate the time difference between consecutive rows using the diff() function, which computes the difference between the current and previous row. We use fillna(pd.Timedelta(seconds=0)) to replace the first row’s NaN value with a Timedelta object representing zero seconds.

We then create a mask of rows to keep, where time_diff > pd.Timedelta(seconds=10) evaluates to True for rows where the time difference between consecutive rows is greater than 10 seconds, and time_diff == pd.Timedelta(seconds=0) evaluates to True for the first row. We use the | operator to combine these conditions with an OR operator.

Finally, we use the mask to filter the original dataframe and create a new dataframe called df_filtered. This new dataframe contains only the rows with timestamps that are more than 10 seconds apart or the first row.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading