I’m attempting to perform linear regression on two complex arrays. That is, I’d like to find the line of best fit, w = mz + b, where m and b are both permitted to be complex and where the R^2 value, R^2 = 1 - RSS/TSS, is maximized (equivalently, RSS is minimized). Here RSS and TSS are the residual sum of squares and the total sum of squares.
I know this can be done by creating a design matrix, computing m and b, etc., but out of curiosity, I tried using linregress from scipy.stats, which did return values:
import numpy as np
from scipy import stats
rng = np.random.default_rng()
x = rng.random(10)+1j*rng.random(10)
y = 1.6*x + rng.random(10)+1j*rng.random(10)
res = stats.linregress(x, y)
print(res)
LinregressResult(slope=(1.5814820568268182-0.004143389169974774j),
intercept=(0.37141513243354485+0.4522070413718836j),
rvalue=(0.8607413430092087-0.002255091256570885j),
pvalue=0.00138658952096427,
stderr=(0.3306870298601568+0.0024769249452937106j),
intercept_stderr=(0.16366363994151886+0.12045799398296754j))
What meaning does a non-real, complex-valued rvalue have? Is the modulus of this value the coefficient of determination?
>Solution :
The function stats.linregress from Python’s SciPy library returns a complex R-value for complex input arrays because computing the regression line involves the covariance and the standard deviations of the input arrays. These statistics are commonly defined as:
Covariance = sum((x - mean(x)) * (y - mean(y))) / (n - 1)
Standard deviation = sqrt(sum((x - mean(x)) ** 2) / (n - 1))
If the input arrays contain complex numbers, these formulas generally produce complex results. The covariance is a sum of products of complex deviations, and the standard deviation as written above (squaring each deviation rather than taking its squared modulus) can be complex as well. Since the R-value is the covariance divided by the product of the standard deviations, it is in general complex too.
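To make this concrete, here is a minimal sketch that evaluates the two formulas on complex data, once exactly as written and once with a complex conjugate on one factor (the Hermitian convention). The variable names and the fixed seed are illustrative assumptions, and this is not a claim about what linregress does internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.random(10) + 1j * rng.random(10)
y = 1.6 * x + rng.random(10) + 1j * rng.random(10)

n = x.size
dx = x - x.mean()
dy = y - y.mean()

# Formulas exactly as written above: no complex conjugation anywhere.
cov_plain = np.sum(dx * dy) / (n - 1)
std_x_plain = np.sqrt(np.sum(dx ** 2) / (n - 1))
std_y_plain = np.sqrt(np.sum(dy ** 2) / (n - 1))
r_plain = cov_plain / (std_x_plain * std_y_plain)

# Hermitian variant: conjugate one factor, so the variances are real
# and only the covariance (and hence r) remains complex.
cov_herm = np.sum(np.conj(dx) * dy) / (n - 1)
std_x_herm = np.sqrt(np.sum(np.abs(dx) ** 2) / (n - 1))
std_y_herm = np.sqrt(np.sum(np.abs(dy) ** 2) / (n - 1))
r_herm = cov_herm / (std_x_herm * std_y_herm)

print("r, formulas as written:", r_plain)
print("r, Hermitian variant:  ", r_herm)
```

Under the Hermitian convention the modulus of r is bounded by 1 (Cauchy-Schwarz); the unconjugated version has no such guarantee.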
In general, a complex slope and intercept should not be surprising: the model w = mz + b is linear over the complex numbers, and a complex m encodes both a scaling and a rotation. The interpretation of a complex R-value is less straightforward and should be approached with caution.
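For comparison, the design-matrix approach mentioned in the question can be sketched as follows. This is a sketch under stated assumptions: it uses np.linalg.lstsq, which accepts complex input, and the variable names and the seed are illustrative. It computes R^2 = 1 - RSS/TSS with squared moduli and checks it against the squared modulus of the Hermitian correlation coefficient. Whether the rvalue returned by linregress follows that same convention is an assumption to verify against your SciPy version.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
z = rng.random(10) + 1j * rng.random(10)
w = 1.6 * z + rng.random(10) + 1j * rng.random(10)

# Design matrix [z, 1]; lstsq minimizes sum |w - (m*z + b)|^2 over complex m, b.
A = np.column_stack([z, np.ones_like(z)])
(m, b), *_ = np.linalg.lstsq(A, w, rcond=None)

# Coefficient of determination, using squared moduli of the residuals.
rss = np.sum(np.abs(w - (m * z + b)) ** 2)
tss = np.sum(np.abs(w - w.mean()) ** 2)
r_squared = 1 - rss / tss

# Hermitian correlation coefficient (one factor conjugated).
dz = z - z.mean()
dw = w - w.mean()
r_herm = np.sum(np.conj(dz) * dw) / np.sqrt(
    np.sum(np.abs(dz) ** 2) * np.sum(np.abs(dw) ** 2)
)

print("m =", m, " b =", b)
print("R^2 =", r_squared, " |r|^2 =", abs(r_herm) ** 2)
```

With this definition of r, R^2 equals the squared modulus |r|^2, not the modulus itself, so a complex r only maps onto the coefficient of determination when the covariance is computed with a conjugate.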