I recently finished writing a mass spec data reduction program to replace one written in LabVIEW. However, when I compare my results to LabVIEW's, I'm finding differences on the order of 0.01%–1% in very basic calculations, and they have me scratching my head.
Both programs start by reading the data files and importing the raw intensity at each mass for the analysis. Each batch of raw data is then fit with a linear regression and extrapolated back to t=0. We also calculate things like uncertainties, averages, and standard deviations.
Here’s an example of the raw data files. Note that the raw intensity data has seven significant figures.
He 4420 CB-UV 2023-09-22T10:40:49 time_sec 2.02 3.02 4.00 5.25 40.02
23.295 9.387086E-11 1.674618E-13 -3.206630E-16 3.373704E-14 2.288376E-14
28.895 9.180904E-11 1.690122E-13 1.364953E-15 3.260850E-14 3.299508E-14
2.789 3.386823E-10 1.332874E-10 5.501391E-14 3.523762E-14 1.053051E-11
8.289 3.412776E-10 1.342065E-10 5.657902E-14 3.267960E-14 1.117998E-11
13.789 3.447253E-10 1.349348E-10 6.633267E-14 3.418557E-14 1.199392E-11
19.189 3.458286E-10 1.354857E-10 6.325153E-14 3.507785E-14 1.242906E-11
24.689 3.471647E-10 1.357906E-10 6.458323E-14 3.259249E-14 1.279006E-11
30.289 3.500519E-10 1.367072E-10 6.307386E-14 3.412402E-14 1.334153E-11
35.689 3.489635E-10 1.364054E-10 6.470872E-14 3.464260E-14 1.394222E-11
41.189 3.507102E-10 1.371554E-10 6.159706E-14 3.401083E-14 1.416036E-11
46.689 3.517030E-10 1.372364E-10 6.440496E-14 3.328290E-14 1.531714E-11
52.089 3.526178E-10 1.376700E-10 6.662408E-14 3.173344E-14 1.542191E-11
57.589 3.547731E-10 1.381437E-10 6.997605E-14 3.255263E-14 1.615525E-11
63.089 3.583457E-10 1.384707E-10 7.329222E-14 3.186776E-14 1.576015E-11
68.589 3.553523E-10 1.388127E-10 7.325879E-14 3.174203E-14 1.623024E-11
73.989 3.553024E-10 1.394373E-10 6.629945E-14 3.500163E-14 1.659189E-11
79.489 3.567244E-10 1.395758E-10 6.466660E-14 3.414220E-14 1.709554E-11
85.000 3.558431E-10 1.396115E-10 7.296217E-14 3.356015E-14 1.707882E-11
90.400 3.584943E-10 1.399037E-10 7.224490E-14 3.036364E-14 1.758635E-11
95.900 3.618750E-10 1.404177E-10 7.380499E-14 3.356473E-14 1.764968E-11
He 4420 CB-UV 2023-09-22T10:50:29
The first column is time_sec, the second is the intensity at 2.02 amu, the third the intensity at 3.02 amu, and so on. The first two rows are pre-analysis baseline measurements, followed by the actual analysis data. If you're going to crunch these numbers yourself, please exclude the first two rows.
The data are imported, and the 4/3 ratio is calculated, in the function parse_raw():
import pandas as pd

def parse_raw(file_path):
    # import the data file as df
    df = pd.read_csv(file_path, sep='\t', header=1)
    # check if the data file is empty
    if df.shape[0] <= 3:
        raise ValueError(f"File {file_path} contains no data.")
    # grab unique analysis data from non-header 'headers'
    helium_number = int(df.columns[0].split()[1])
    analysis_label = str(df.columns[1])
    timestamp = pd.to_datetime(df.columns[2], format='%Y-%m-%dT%H:%M:%S')
    # grab the pre-analysis baseline and raw analysis data
    PAB_data = df.iloc[0:2].dropna(axis=1).apply(pd.to_numeric)
    raw_data = df.iloc[3:-1].dropna(axis=1).apply(pd.to_numeric)
    # these data files are a mess, jesus, just rename the headers manually
    data_cols = ['time_sec', '2 amu', '3 amu', '4 amu', '5 amu', '40 amu']
    raw_data.columns = data_cols
    PAB_data.columns = data_cols
    # get timestamps from time_sec
    time_sec = raw_data['time_sec']
    # re-organize raw_data and calculate the 4/3 ratio * 1000
    org_data = {
        "2 amu": raw_data['2 amu'],
        "3 amu": raw_data['3 amu'],
        "4 amu": raw_data['4 amu'],
        "5 amu": raw_data['5 amu'],
        "40 amu": raw_data['40 amu'],
        "4/3 Ratio": raw_data['4 amu'] / raw_data['3 amu'] * 1000,
    }
    return helium_number, analysis_label, timestamp, time_sec, PAB_data, org_data
When it comes time to do the calculations, the code is below. Note that the filters [active_data==1], [m3_active==1], and [m4_active==1] are for removing outlier data; to keep things simple, though, I'm not flagging any outliers in either LabVIEW or Python.
import numpy as np
from scipy.stats import linregress

# Calculates statistics and stores them in the 'filtered_data[analysis]' object;
# only works on one analysis at a time
def calculate_stats(analysis):
    # calculate average of pre-analysis baseline data
    for mass in ['2 amu', '3 amu', '4 amu', '5 amu', '40 amu']:
        analysis.PAB_mean[mass] = np.mean(analysis.PAB_data[mass])
    # calculate t0 intercepts for masses 2, 3, 4, and 40
    for mass in ['2 amu', '3 amu', '4 amu', '40 amu']:
        # exclude any outlier data
        active_data = analysis.data_status[mass].to_numpy().flatten()
        x = analysis.time_sec[active_data == 1]
        y = analysis.raw_data[mass][active_data == 1]
        # linear regression with linregress
        result = linregress(x, y)
        analysis.t0_intercept[mass] = result.intercept
        analysis.t0_uncertainty[mass] = result.intercept_stderr
        analysis.t0_pctuncert[mass] = result.intercept_stderr / result.intercept * 100
        analysis.t0_slope[mass] = result.slope
    # ratio 4/3 intercept, uncertainty, percent uncertainty, and slope
    analysis.ratio43_t0int = analysis.t0_intercept['4 amu'] / analysis.t0_intercept['3 amu'] * 1000
    analysis.ratio43_uncert = analysis.ratio43_t0int * np.sqrt(
        (analysis.t0_uncertainty['3 amu'] / analysis.t0_intercept['3 amu'])**2 +
        (analysis.t0_uncertainty['4 amu'] / analysis.t0_intercept['4 amu'])**2)
    analysis.ratio43_puncert = analysis.ratio43_uncert / analysis.ratio43_t0int * 100
    # 4/3 slope calculation feels asinine but here we go
    # todo: filter by active indices
    fit_43_ratio_data = linregress(analysis.time_sec,
                                   analysis.raw_data['4 amu'] / analysis.raw_data['3 amu'])
    analysis.ratio43_slope = fit_43_ratio_data.slope
    # ratio 4/3 average, SD, %SD
    # (note: pandas aligns the two filtered Series on their index before dividing)
    m4_active = analysis.data_status['4 amu'].to_numpy().flatten()
    m3_active = analysis.data_status['3 amu'].to_numpy().flatten()
    ratio_43 = analysis.raw_data['4 amu'][m4_active == 1] / analysis.raw_data['3 amu'][m3_active == 1]
    analysis.averages['4/3 Ratio avg'] = np.mean(ratio_43)
    analysis.averages['4/3 Ratio std'] = np.std(ratio_43)
    analysis.averages['4/3 Ratio %SD'] = analysis.averages['4/3 Ratio std'] / analysis.averages['4/3 Ratio avg'] * 100
    # calculate averages, SDs, and %SDs for masses 2, 3, 4, 5, and 40
    for mass in ['2 amu', '3 amu', '4 amu', '5 amu', '40 amu']:
        active_data = analysis.data_status[mass].to_numpy().flatten()
        analysis.averages[f'{mass} avg'] = np.mean(analysis.raw_data[mass][active_data == 1])
        analysis.averages[f'{mass} std'] = np.std(analysis.raw_data[mass][active_data == 1])
        analysis.averages[f'{mass} %SD'] = analysis.averages[f'{mass} std'] / analysis.averages[f'{mass} avg'] * 100
That’s pretty much it. The scripts above are the only places in the entire program that do any calculations; that is, there is no hidden script running thousands of arithmetic operations that would let rounding errors accumulate.
Still, when I compare my results with the LabVIEW code, which should be doing the exact same thing, I’m finding differences. Now, there are many methods of estimating the error on the intercept of a line, so I’m not surprised that we’re coming up with different uncertainties. But different averages? Different standard deviations? The fact that those don’t match tells me that something is very wrong.
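One convention difference I do know about, for what it's worth: np.std defaults to the population formula (ddof=0), while Excel's STDEV uses the sample formula (N−1); I'd have to check which weighting the LabVIEW VI is configured for. A toy example (values made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

print(np.std(x))           # population SD, ddof=0 → 1.1180...
print(np.std(x, ddof=1))   # sample SD, ddof=1 (Excel's STDEV) → 1.2910...
```

That alone can shift standard deviations between tools, but it can't explain differing averages.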
For example, the average of mass 2 data (second column, excluding the first two rows of data) comes out as:
Python: 3.5234E-10
LabVIEW: 3.5160E-10
Excel: 3.5158E-10
That’s a 0.10% difference in the average between Python and LabVIEW. I tried to double-check the math in Excel and ended up with a third value, which suggests that my Python math is the one that’s off.
When it comes to the t0 intercepts, I’m getting the following:
          mass 3       mass 4
Python:   1.341E-10    5.980E-14
LabVIEW:  1.339E-10    5.869E-14
Excel:    1.339E-10    5.869E-14
Given that LabVIEW and Excel agree here, it’s pretty clear that Python is the culprit behind these differences.
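For what it's worth, scipy's linregress agrees with numpy.polyfit on synthetic data (made up below, not from my real files), so I don't think the regression routine itself is to blame:

```python
import numpy as np
from scipy.stats import linregress

# synthetic line plus a little noise, just to exercise both fitters
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.5 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

res = linregress(x, y)
slope_pf, intercept_pf = np.polyfit(x, y, 1)

# both are ordinary least squares, so they should match to machine precision
print(abs(res.slope - slope_pf), abs(res.intercept - intercept_pf))
```

If the fit routine checks out, the discrepancy has to be coming from the arrays I'm feeding it.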
I’ve done the math as simply and straightforwardly as possible, so why am I getting inaccurate results, and what can I do to fix them?
> Solution:
The problem is your indexing. When you say this:
PAB_data = df.iloc[0:2].dropna(axis=1).apply(pd.to_numeric)
raw_data = df.iloc[3:-1].dropna(axis=1).apply(pd.to_numeric)
What you really wanted was
PAB_data = df.iloc[0:2].dropna(axis=1).apply(pd.to_numeric)
raw_data = df.iloc[2:-1].dropna(axis=1).apply(pd.to_numeric)
What YOUR code does is drop row #2, the first analysis row, completely. With that one change,
print(raw_data['CB-UV'].mean())
prints
3.515797333333333e-10
as expected.
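To make the off-by-one concrete, here is the mass-2 (second) column from the posted file, typed out by hand: averaging all 18 analysis rows reproduces the LabVIEW/Excel value, while skipping the first analysis row, which is exactly what iloc[3:-1] does, reproduces your Python value.

```python
import numpy as np

# mass-2 intensities from the posted file (the 18 analysis rows)
mass2 = np.array([
    3.386823e-10, 3.412776e-10, 3.447253e-10, 3.458286e-10,
    3.471647e-10, 3.500519e-10, 3.489635e-10, 3.507102e-10,
    3.517030e-10, 3.526178e-10, 3.547731e-10, 3.583457e-10,
    3.553523e-10, 3.553024e-10, 3.567244e-10, 3.558431e-10,
    3.584943e-10, 3.618750e-10,
])

print(f"{mass2.mean():.6e}")      # 3.515797e-10 — matches LabVIEW/Excel
print(f"{mass2[1:].mean():.6e}")  # 3.523384e-10 — matches your Python result
```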