Python returning inaccurate math results

I recently finished writing a mass spec data reduction program that will replace a program written in LabVIEW. However, when I compare my results to those from LabVIEW, I find differences on the order of 0.01%–1% in very basic calculations, which has me scratching my head.

Both programs start by reading data files and importing the raw data at each mass for the analysis. Each batch of raw data is then fit with a linear regression to t=0. We also calculate things like uncertainties, averages, standard deviations, etc.

Here’s an example of the raw data files. Note that the raw intensity data has seven significant figures.


He 4420 CB-UV   2023-09-22T10:40:49 time_sec    2.02    3.02    4.00    5.25    40.02
23.295  9.387086E-11    1.674618E-13    -3.206630E-16   3.373704E-14    2.288376E-14
28.895  9.180904E-11    1.690122E-13    1.364953E-15    3.260850E-14    3.299508E-14
2.789   3.386823E-10    1.332874E-10    5.501391E-14    3.523762E-14    1.053051E-11
8.289   3.412776E-10    1.342065E-10    5.657902E-14    3.267960E-14    1.117998E-11
13.789  3.447253E-10    1.349348E-10    6.633267E-14    3.418557E-14    1.199392E-11
19.189  3.458286E-10    1.354857E-10    6.325153E-14    3.507785E-14    1.242906E-11
24.689  3.471647E-10    1.357906E-10    6.458323E-14    3.259249E-14    1.279006E-11
30.289  3.500519E-10    1.367072E-10    6.307386E-14    3.412402E-14    1.334153E-11
35.689  3.489635E-10    1.364054E-10    6.470872E-14    3.464260E-14    1.394222E-11
41.189  3.507102E-10    1.371554E-10    6.159706E-14    3.401083E-14    1.416036E-11
46.689  3.517030E-10    1.372364E-10    6.440496E-14    3.328290E-14    1.531714E-11
52.089  3.526178E-10    1.376700E-10    6.662408E-14    3.173344E-14    1.542191E-11
57.589  3.547731E-10    1.381437E-10    6.997605E-14    3.255263E-14    1.615525E-11
63.089  3.583457E-10    1.384707E-10    7.329222E-14    3.186776E-14    1.576015E-11
68.589  3.553523E-10    1.388127E-10    7.325879E-14    3.174203E-14    1.623024E-11
73.989  3.553024E-10    1.394373E-10    6.629945E-14    3.500163E-14    1.659189E-11
79.489  3.567244E-10    1.395758E-10    6.466660E-14    3.414220E-14    1.709554E-11
85.000  3.558431E-10    1.396115E-10    7.296217E-14    3.356015E-14    1.707882E-11
90.400  3.584943E-10    1.399037E-10    7.224490E-14    3.036364E-14    1.758635E-11
95.900  3.618750E-10    1.404177E-10    7.380499E-14    3.356473E-14    1.764968E-11
He 4420 CB-UV   2023-09-22T10:50:29

The first column of data is time_sec, the second is intensity at 2.02 amu, third is intensity at 3.02 amu, etc. The first two rows of data are pre-analysis baseline measurements, followed by the actual analysis data. If you are going to crunch these numbers yourself, please exclude the first two rows.

The data are imported, and the 4/3 ratio is calculated, in the function parse_raw():

import pandas as pd

def parse_raw(file_path):
    # import the data file as df
    df = pd.read_csv(file_path, sep='\t', header=1)

    # check if the data file is empty
    if df.shape[0] <= 3:
        raise ValueError(f"File {file_path} contains no data.")
    
    # grab unique analysis data from non-header 'headers'
    helium_number  = int(df.columns[0].split()[1])
    analysis_label = str(df.columns[1])
    timestamp      = pd.to_datetime(df.columns[2], format='%Y-%m-%dT%H:%M:%S')

    # grab the pre-analysis baseline and raw analysis data
    PAB_data = df.iloc[0:2].dropna(axis=1).apply(pd.to_numeric)
    raw_data = df.iloc[3:-1].dropna(axis=1).apply(pd.to_numeric)
    
    # these data files are a mess, jesus, just rename the headers manually
    data_cols        = ['time_sec', '2 amu', '3 amu', '4 amu', '5 amu', '40 amu']
    raw_data.columns = data_cols
    PAB_data.columns = data_cols

    # get timestamps from time_sec
    time_sec = raw_data['time_sec']

    # re-organize raw_data
    org_data = {
        "2 amu":     raw_data['2 amu'],
        "3 amu":     raw_data['3 amu'],
        "4 amu":     raw_data['4 amu'],
        "5 amu":     raw_data['5 amu'],
        "40 amu":    raw_data['40 amu'],
        "4/3 Ratio": pd.DataFrame(columns=['4/3 Ratio'])
    }

    # calculate 4/3 Ratio * 1000
    org_data['4/3 Ratio'] = raw_data['4 amu'] / raw_data['3 amu'] * 1000

    return helium_number, analysis_label, timestamp, time_sec, PAB_data, org_data

When it comes time to do the calculations, the code is below. Note that the filters [active_data==1], [m3_active==1], and [m4_active==1] are for removing outlier data; however, to keep things simple, I'm not flagging any outliers in either LabVIEW or Python.

import numpy as np
from scipy.stats import linregress


# Calculates statistics and stores them in the 'filtered_data[analysis]' object
# only works on one analysis at a time
def calculate_stats(analysis):

    # calculate average of pre-analysis baseline data 
    for mass in ['2 amu', '3 amu', '4 amu', '5 amu', '40 amu']:
        analysis.PAB_mean[mass] = np.mean(analysis.PAB_data[mass])

    # calculate t0 intercepts for masses 2, 3, 4, and 40
    for mass in ['2 amu', '3 amu', '4 amu', '40 amu']:
        # exclude any outlier data
        active_data = analysis.data_status[mass].to_numpy().flatten()
        x           = analysis.time_sec[active_data==1]
        y           = analysis.raw_data[mass][active_data==1]

        # Linear regression with linregress
        result                        = linregress(x, y)
        analysis.t0_intercept[mass]   = result.intercept
        analysis.t0_uncertainty[mass] = result.intercept_stderr
        analysis.t0_pctuncert[mass]   = result.intercept_stderr / result.intercept * 100
        analysis.t0_slope[mass]       = result.slope

    # ratio 4/3 intercept, uncertainty, percent uncertainty, and slope
    analysis.ratio43_t0int  = analysis.t0_intercept['4 amu'] / analysis.t0_intercept['3 amu'] * 1000
    analysis.ratio43_uncert = analysis.ratio43_t0int * np.sqrt(
        (analysis.t0_uncertainty['3 amu'] / analysis.t0_intercept['3 amu'])**2 +
        (analysis.t0_uncertainty['4 amu'] / analysis.t0_intercept['4 amu'])**2)
    analysis.ratio43_puncert = analysis.ratio43_uncert / analysis.ratio43_t0int * 100
    
    # 4/3 slope calculation feels asinine but here we go
    # todo: filter by active indices
    fit_43_ratio_data        = linregress(analysis.time_sec, analysis.raw_data['4 amu'] / analysis.raw_data['3 amu'])
    analysis.ratio43_slope   = fit_43_ratio_data.slope

    # ratio 4/3 average, SD, %SD
    m4_active                          = analysis.data_status['4 amu'].to_numpy().flatten()
    m3_active                          = analysis.data_status['3 amu'].to_numpy().flatten()
    analysis.averages['4/3 Ratio avg'] = np.mean(analysis.raw_data['4 amu'][m4_active==1] / analysis.raw_data['3 amu'][m3_active==1])
    analysis.averages['4/3 Ratio std'] = np.std(analysis.raw_data['4 amu'][m4_active==1] / analysis.raw_data['3 amu'][m3_active==1])
    analysis.averages['4/3 Ratio %SD'] = analysis.averages['4/3 Ratio std'] / analysis.averages['4/3 Ratio avg'] * 100

    # calculate averages for masses 2, 3, 4, 5, and 40
    for mass in ['2 amu', '3 amu', '4 amu', '5 amu', '40 amu']:
        active_data = analysis.data_status[mass].to_numpy().flatten()
        analysis.averages[f'{mass} avg'] = np.mean(analysis.raw_data[mass][active_data==1])
        analysis.averages[f'{mass} std'] = np.std(analysis.raw_data[mass][active_data==1])
        analysis.averages[f'{mass} %SD'] = (analysis.averages[f'{mass} std'] / analysis.averages[f'{mass} avg']) * 100

That’s pretty much it. The above scripts are the only locations in the entire program that do any calculations; that is, there is no secret script running thousands of arithmetic operations that would cause rounding errors to add up.

Still, when I compare my results with the LabVIEW code, which should be doing exactly the same thing, I find differences. Now, there are many methods of estimating the error on the intercept of a line, so I'm not surprised that we come up with different uncertainties. But different averages? Different standard deviations? The fact that those don't match tells me that something is very wrong.

For example, the average of mass 2 data (second column, excluding the first two rows of data) comes out as:

Python:  3.5234E-10
LabVIEW: 3.5160E-10
Excel:   3.5158E-10   

That’s a 0.10% difference in the average value between Python and LabVIEW. I tried to double check the math in Excel and ended up getting a third value which suggests that my Python math is the one that’s off.
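Those reference values can be sanity-checked directly from the raw data posted above. A quick sketch using the eighteen mass-2 intensities (the two baseline rows excluded), transcribed from the file:

```python
import numpy as np

# mass-2 intensities from the data file, baseline rows excluded
m2 = np.array([
    3.386823e-10, 3.412776e-10, 3.447253e-10, 3.458286e-10,
    3.471647e-10, 3.500519e-10, 3.489635e-10, 3.507102e-10,
    3.517030e-10, 3.526178e-10, 3.547731e-10, 3.583457e-10,
    3.553523e-10, 3.553024e-10, 3.567244e-10, 3.558431e-10,
    3.584943e-10, 3.618750e-10,
])

print(np.mean(m2))   # ~3.5158E-10, agreeing with LabVIEW and Excel
```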

When it comes to the t0 intercepts, I’m getting the following:

           mass 3      mass 4
Python:    1.341E-10   5.980E-14
LabVIEW:   1.339E-10   5.869E-14
Excel:     1.339E-10   5.869E-14

Given that LabVIEW and Excel agree here, it’s pretty clear that Python is the culprit behind these differences.

I’ve done the math as simply and straightforwardly as possible, so why am I getting inaccurate results, and what can I do to fix them?

Solution:

The problem is your indexing. When you say this:

    PAB_data = df.iloc[0:2].dropna(axis=1).apply(pd.to_numeric)
    raw_data = df.iloc[3:-1].dropna(axis=1).apply(pd.to_numeric)

What you really wanted was

    PAB_data = df.iloc[0:2].dropna(axis=1).apply(pd.to_numeric)
    raw_data = df.iloc[2:-1].dropna(axis=1).apply(pd.to_numeric)

What your code actually does is skip the first analysis row completely: iloc slices by position, so iloc[3:-1] starts at positional row 3 and the row at position 2 never makes it into raw_data. With that one change,

    print(raw_data['CB-UV'].mean())

prints

3.515797333333333e-10

as expected.
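The fix works because iloc slicing is positional and half-open, so the start index is the first row kept. A toy illustration (synthetic frame standing in for the parsed file, not the real data):

```python
import pandas as pd

# toy frame: 2 baseline rows, 4 analysis rows, 1 trailing footer row
df = pd.DataFrame({'val': [0, 1, 2, 3, 4, 5, 6]})

buggy   = df.iloc[3:-1]   # starts at position 3: first analysis row lost
correct = df.iloc[2:-1]   # starts at position 2: all four analysis rows kept

print(buggy['val'].tolist())    # [3, 4, 5]
print(correct['val'].tolist())  # [2, 3, 4, 5]
```

Dropping one low-valued row out of eighteen is exactly the kind of silent error that shifts a mean by a fraction of a percent while leaving everything else looking plausible.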
