I have a dataframe with a data column, and a value column, as in the example below. The value column is always binary, 0 or 1:
data,value
173,1
1378,0
926,0
643,0
1279,0
472,0
706,0
1345,0
1167,1
1401,1
1236,0
447,1
1204,1
398,0
714,0
734,0
1732,0
98,0
1696,0
160,0
1611,0
274,1
562,0
625,0
1028,0
1766,0
511,0
1691,0
898,1
I need to sample the dataset so that basically I have an equal number of both values. So, if I originally have less 1 class, I’ll need to use that one as a reference. In turn, if I have less 0 classes, I need to use that.
Any clues on how to do this? I’m working on a jupyter notebook, Python 3.6 (I cannot go up versions).
>Solution :
Sample data
data = [173,926,634,706,398]
value = [1,0,0,1,0]
df = pd.DataFrame({"data": data, "value": value})
print(df)
# data value
# 0 173 1
# 1 926 0
# 2 634 0
# 3 706 1
# 4 398 0
Filter to two DFs
ones = df[df['value'] == 1]
zeros = df[df['value'] == 0]
print(ones)
print()
print()
print(zeros)
# data value
# 0 173 1
# 3 706 1
# data value
# 1 926 0
# 2 634 0
# 4 398 0
Truncate as required
Find the minimum and then truncate it (take n first rows)
if len(ones) <= len(zeros):
zeros = zeros.iloc[:len(ones), :]
else:
ones = ones.iloc[:len(zeros), :]
print(ones)
print()
print()
print(zeros)
# data value
# 0 173 1
# 3 706 1
#
#
# data value
# 1 926 0
# 2 634 0