I am trying to fit a linear regression model that passes through the origin. I have tried using the curve_fit function from SciPy and the ols function from Statsmodels to achieve this, but they give different R2 scores despite having the same parameters. I am wondering why this is the case and which approach would be best for fitting a linear regression model that passes through the origin. Below is the code I have tried:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
# create sample data
x = np.linspace(0, 1, 100)
y = x**2 + np.random.normal(scale=0.1, size=100)
# fit linear model with no intercept using statsmodels
data = pd.DataFrame({'x': x, 'y': y})
model = smf.ols('y ~ x + 0', data=data)
results = model.fit()
r2_sm = results.rsquared
# fit linear model with no intercept using curve_fit
def lin_func(A, x):
    return A*x
popt, pcov = curve_fit(lin_func, x, y)
y_fit = lin_func(x, *popt)
r2_scipy = r2_score(y, y_fit)
print("R-squared (statsmodels):", r2_sm)
print("R-squared (curve_fit):", r2_scipy)
Solution:
The discrepancy between the R2 scores is not caused by the fits themselves: curve_fit from SciPy and ols from Statsmodels both minimize the same sum of squared residuals, so they estimate the same slope.
The difference lies in how R2 is computed. When a model has no intercept, statsmodels reports an uncentered R2, with the total sum of squares taken as sum(y**2), whereas sklearn's r2_score always uses the centered total sum of squares sum((y - mean(y))**2). With an intercept the residuals sum to zero and the two definitions coincide; when you force the regression through the origin they no longer do, which is why you get different R2 scores.
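To see where the gap comes from, here is a minimal sketch (my own check, not from the original post) that recomputes both R2 definitions by hand from the closed-form no-intercept slope sum(x*y)/sum(x*x). I seed the random data for reproducibility, which the original example does not:

```python
import numpy as np

# Seeded stand-in for the sample data above (assumption: seed added for reproducibility)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = x**2 + rng.normal(scale=0.1, size=100)

# The OLS slope of a line through the origin has the closed form sum(x*y) / sum(x*x)
slope = np.sum(x * y) / np.sum(x * x)
resid = y - slope * x
ss_res = np.sum(resid**2)

# statsmodels' rsquared for a no-intercept model: uncentered total sum of squares
r2_uncentered = 1 - ss_res / np.sum(y**2)
# sklearn's r2_score: centered total sum of squares
r2_centered = 1 - ss_res / np.sum((y - y.mean())**2)

print(r2_uncentered, r2_centered)
```

Because sum(y**2) is always at least as large as sum((y - y.mean())**2), the uncentered R2 is never smaller than the centered one, which is why the statsmodels number looks better for the same fit.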
If your goal is to fit a linear regression model that passes through the origin, you can use either of the following approaches:
- Continue using ols from Statsmodels with the formula 'y ~ x + 0', but keep in mind that results.rsquared is the uncentered R2 for models without a constant. If you need a score comparable to other tools, compute it from the predictions with r2_score instead.
- Use curve_fit from SciPy and score the fit with r2_score. It finds the same slope as ols, since both minimize the sum of squared residuals.
In your case, either method estimates the parameters correctly; what matters is computing the R2 score the same way for both. Here's a modified version of your code that scores both fits with r2_score:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
# create sample data
x = np.linspace(0, 1, 100)
y = x**2 + np.random.normal(scale=0.1, size=100)
# fit linear model with no intercept using statsmodels
data = pd.DataFrame({'x': x, 'y': y})
model = smf.ols('y ~ x + 0', data=data)
results = model.fit()
y_fit_sm = results.predict(data)
r2_sm = r2_score(y, y_fit_sm)
# fit linear model with no intercept using curve_fit
def lin_func(x, A):
    return A*x
popt, pcov = curve_fit(lin_func, x, y)
y_fit_scipy = lin_func(x, *popt)
r2_scipy = r2_score(y, y_fit_scipy)
print("R-squared (statsmodels):", r2_sm)
print("R-squared (curve_fit):", r2_scipy)
This code gives matching R2 scores for the two methods (up to numerical precision), since both fits produce the same slope and the score is now computed identically. You can proceed with either approach for fitting a linear regression model that passes through the origin.
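As a quick sanity check (a sketch I added, not part of the original answer), you can confirm that curve_fit recovers the same slope as the closed-form no-intercept least-squares solution:

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumption: seeded data in place of the unseeded np.random.normal above
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = x**2 + rng.normal(scale=0.1, size=100)

popt, _ = curve_fit(lambda x, a: a * x, x, y)

# Closed-form OLS slope for a model forced through the origin
slope_closed_form = np.sum(x * y) / np.sum(x * x)

# True: both minimize the same sum of squared residuals
print(np.isclose(popt[0], slope_closed_form))
```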