I’m having problem with python while constructing a loop (or a nested loop) for my calculation. I have a very large data set (about 600k entries) with repeating ID’s and their values. I’d like to form a loop which starts from the first entry, looks if it is repeating and if it is how manty times it does and calculates average of values for the same ID.
For example in below table I’d like it to return Average a = 2, Average b = 4 and Average c = 1.
I tried a nested for loop below but completely lost my track.
ID | Value |
---|---|
a | 1 |
a | 3 |
b | 3 |
b | 4 |
b | 5 |
c | 1 |
for i in range(0,len(ID)):
if ID[i] == ID[i+1]: #if the next ID is equal to the current one
for j in range (i,len(ID)): #look for how many repating values for current ID
if ID[j] != ID[j+1]:
i=j
break
#since it's know how many repeating ID's exist, somehow average them and continue for the second (distinct) ID
>Solution :
This is a snippet of code that does not use any external modules. You are able to use them if you wish, but thought to leave this as it is still an option.
def avg(lst):
return sum(lst) / len(lst)
filename = "mycsvfile.csv"
averages = dict()
all_values = dict()
with open(filename, "r") as f:
# read all lines and save
for line in f.readlines():
split = line.split(",") # split on comma
row_id = split[0]
try:
# "ID is in the first column Value is at fifth." -OP
value = float(split[4])
# check if there is a new ID or not
if row_id in all_values.keys():
all_values[row_id].append(value)
else:
all_values[row_id] = [value]
except ValueError:
pass # ignore header lines
# get averages into dictionary
for k, v in all_values.items():
averages[k] = avg(v)
# print the data to show it is correct
for k, v in averages.items():
print(f"{k}: {v}")
This snippet of code reads in all data from a CSV file and then gets the averages into a dictionary. Key is the ID, value is the average.