One of the crucial assumptions we test while building Ordinary Least Squares (OLS) based models, i.e. linear regressions, is linearity in parameters. Linearity simply means that our dependent variable (Y) can be expressed as a linear function of the explanatory variables (X) we choose to explain the variation in Y. The very name ‘linear regression’ emphasizes the importance of the linearity assumption.
However, when we run a simple linear regression model in Excel, R or Python, the regression output does not automatically include any test statistic that tells us whether the model satisfies this linearity assumption. So there are two separate tests of linearity that I am going to discuss today. Both of them can be carried out in R and Python.
Rainbow Test: The basic idea of the Rainbow test is that even if the true relationship is non-linear, a good linear fit can still be achieved on a subsample in the “middle” of the data. The null hypothesis is rejected whenever the overall fit is significantly worse than the fit on that middle subsample. Under H₀, the test statistic follows an F distribution.
So basically, in simple words: if a plot suggests a non-linear pattern and you zoom into a small part of the curve (a subsample), that part will appear more linear. The null hypothesis is that the fit of the model using the full sample is the same as the fit using a central subset.
How the Rainbow Test Works
- Split the Data:
  - Divide the dataset into two parts:
    - Middle portion (central ~50-70% of observations, ordered by a chosen variable, often the fitted values or a regressor).
    - Outer portion (remaining ~30-50% at the extremes).
- Fit Separate Regressions:
  - Estimate the original model on the full sample and on the middle subsample.
- Compare the Models:
  - Use an F-test to check whether the full-sample fit is significantly worse than the middle-subsample fit (by comparing their residual sums of squares).
  - Null Hypothesis (H₀): The model is linear (no misspecification).
  - Alternative Hypothesis (H₁): The model is nonlinear.
- Interpretation:
  - If the p-value < 0.05, reject H₀ → nonlinearity exists.
  - If the p-value ≥ 0.05, the linear model may be adequate.
R code for Rainbow Test:
# Install the lmtest package if not already installed
install.packages("lmtest")
# Load the lmtest package (raintest() lives here)
library(lmtest)
# Simulate some example linear data
set.seed(123)
x <- 1:100
y <- 5 + 2 * x + rnorm(100, mean = 0, sd = 10)
# Fit a linear model
model <- lm(y ~ x)
# Perform the Rainbow test
rainbow_test <- raintest(model)
# Print the result
print(rainbow_test)
Python Code for Rainbow Test:
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_rainbow
# Example: Simulated linear data
np.random.seed(0)
n = 100
X = np.linspace(0, 10, n)
y = 3 * X + np.random.normal(0, 3, size=n)
# Add constant term for intercept
X_with_const = sm.add_constant(X)
# Fit OLS model
model = sm.OLS(y, X_with_const).fit()
# Rainbow test
rainbow_stat, p_value = linear_rainbow(model)
print(f"Rainbow test statistic: {rainbow_stat:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpretation
if p_value < 0.05:
    print("❌ Evidence of non-linearity (reject linearity assumption)")
else:
    print("✅ No evidence against linearity (linearity assumption holds)")
Output Interpretation:
- A significant result (p < 0.05) suggests nonlinearity.
Harvey Collier Test: An alternative to the Rainbow test is the Harvey-Collier test. This test performs a t-test on the recursive residuals. Recursive residuals are linear transformations of the ordinary residuals and, under the null, are independently and identically distributed. If the true relationship is not linear but convex or concave, the mean of the recursive residuals should differ significantly from zero. A statistically significant result therefore means we can reject the null hypothesis that the true model is linear.
How the Harvey-Collier Test Works
- Recursive Estimation:
  - The model is estimated sequentially, adding one observation at a time.
  - At each step, the next observation is predicted using the model fitted so far.
- Cumulative Prediction Errors:
  - The test compares the actual values of the dependent variable with the recursively predicted values.
  - If the model is truly linear, the prediction errors should be random with mean zero.
  - If nonlinearity exists, the errors will show a systematic trend.
- Test Statistic:
  - The test uses a t-statistic to determine whether the mean prediction error deviates significantly from zero.
  - Null Hypothesis (H₀): The model is linear (no misspecification).
  - Alternative Hypothesis (H₁): The model is nonlinear.
- Interpretation:
  - Reject H₀ (p < 0.05): evidence of nonlinearity.
  - Fail to reject H₀ (p ≥ 0.05): the linear model may be adequate.
R Code for Harvey Collier Test:
# Install the lmtest package if you haven't already
install.packages("lmtest")
# Load the lmtest package (harvtest() lives here)
library(lmtest)
# Simulate some example linear data
set.seed(123)
x <- 1:100
y <- 3 * x + rnorm(100, sd = 10)
# Fit a linear model
model <- lm(y ~ x)
# Perform the Harvey-Collier test
hc_test <- harvtest(model)
# Print the result
print(hc_test)
Python Code for Harvey Collier Test:
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_harvey_collier
# Simulate example linear data
np.random.seed(42)
n = 100
X = np.linspace(0, 10, n)
y = 5 + 2 * X + np.random.normal(0, 2, size=n)
# Add constant term
X_with_const = sm.add_constant(X)
# Fit OLS model
model = sm.OLS(y, X_with_const).fit()
# Harvey-Collier test (the statsmodels function is linear_harvey_collier)
t_stat, p_value = linear_harvey_collier(model)
print(f"Harvey-Collier t-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpretation
if p_value < 0.05:
    print("❌ Evidence of non-linearity (reject linearity assumption)")
else:
    print("✅ No evidence against linearity (linearity assumption holds)")
Output Interpretation:
- A significant p-value (p < 0.05) suggests nonlinearity.