Sample Pandas Dataframe with equal number based on binary column

April 12, 2022

I have a dataframe with a data column, and a value column, as in the example below. The value column is always binary, 0 or 1:

I need to sample the dataset so that I end up with an equal number of rows for both values. So, if there are fewer rows with value 1, their count should be the reference size. Conversely, if there are fewer rows with value 0, their count should be used instead.

Any clues on how to do this? I’m working in a Jupyter notebook with Python 3.6 (I cannot upgrade to a newer version).

>Solution :

Sample data

import pandas as pd

data = [173,926,634,706,398]
value = [1,0,0,1,0]

df = pd.DataFrame({"data": data, "value": value})

print(df)

#    data  value
# 0   173      1
# 1   926      0
# 2   634      0
# 3   706      1
# 4   398      0

Split into two DataFrames, one per class

ones = df[df['value'] == 1]
zeros = df[df['value'] == 0]

print(ones)
print()
print()
print(zeros)

#    data  value
# 0   173      1
# 3   706      1


#    data  value
# 1   926      0
# 2   634      0
# 4   398      0
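To see which class is the smaller one, the class sizes can be checked directly. The following is only a sketch using the sample data above; the names counts and n are illustrative and not part of the original answer. It assumes both values actually occur in the column.

```python
import pandas as pd

df = pd.DataFrame({"data": [173, 926, 634, 706, 398],
                   "value": [1, 0, 0, 1, 0]})

# Number of rows per class (assumes both 0 and 1 are present)
counts = df["value"].value_counts()

# Size of the smaller class, i.e. how many rows to keep from each class
n = counts.min()

print(counts)
print(n)
```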

Truncate as required

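Find the smaller of the two DataFrames, then truncate the larger one to the same length by keeping its first n rows. Note that this keeps the first rows in order rather than drawing a random sample.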

if len(ones) <= len(zeros):
    zeros = zeros.iloc[:len(ones), :]
else:
    ones = ones.iloc[:len(zeros), :]

print(ones)
print()
print()
print(zeros)

#    data  value
# 0   173      1
# 3   706      1
#
#
#    data  value
# 1   926      0
# 2   634      0
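If the rows should be picked at random instead of taking the first n, one possible variant draws the same number of rows from each class with DataFrame.sample and joins the two parts back together with pd.concat. This is a sketch, not the original answer's method: the helper name balance_binary and its column and seed parameters are hypothetical, and the fixed seed is only there to make the result repeatable.

```python
import pandas as pd


def balance_binary(df, column="value", seed=0):
    """Return a copy of df with equally many rows of 0 and 1 in `column`.

    Hypothetical helper: randomly undersamples the larger class.
    """
    ones = df[df[column] == 1]
    zeros = df[df[column] == 0]

    # Size of the smaller class decides how many rows each class keeps
    n = min(len(ones), len(zeros))

    balanced = pd.concat([
        ones.sample(n=n, random_state=seed),
        zeros.sample(n=n, random_state=seed),
    ])

    # Restore the original row order
    return balanced.sort_index()


df = pd.DataFrame({"data": [173, 926, 634, 706, 398],
                   "value": [1, 0, 0, 1, 0]})

balanced = balance_binary(df)
print(balanced)
```

With the sample data this keeps both rows with value 1 and two of the three rows with value 0; which zeros are kept depends on the seed.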