April 5, 2026 12:29 am

When we build regression models to estimate causal relationships, one key assumption of the ordinary least squares (OLS) estimator is that our explanatory variables are exogenous—that is, uncorrelated with the error term. However, in many real-world settings this assumption is violated. Endogeneity, arising from omitted variables, measurement error, or simultaneous causality, can bias OLS estimates and lead to spurious conclusions.

Instrumental variables (IV) methods—most commonly implemented via two-stage least squares (2SLS)—offer a powerful way to recover consistent causal estimates when endogeneity is present. In this post, we’ll explore:

  1. What is endogeneity?
  2. What are valid instruments?
  3. Two-Stage Least Squares (2SLS): mechanics and interpretation
  4. Practical guidance: choosing instruments, testing validity, and reporting results.

1. What Is Endogeneity and Why Does It Matter?

Consider the simple regression model:

$$ y_i = \beta x_i + u_i $$

where y_i is the outcome, x_i is our regressor of interest, and u_i is the unobserved error. When x_i is correlated with u_i, the OLS estimator picks up not only the true effect β but also the spurious correlation between x and the omitted factors in u.
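The size of this spurious component can be written down explicitly. As a sketch, with x_i and y_i measured as deviations from their means, the OLS slope converges to the true effect plus a bias term (a standard textbook result, stated here under that simplifying assumption):

```latex
\hat{\beta}_{OLS}
  = \frac{\sum_i x_i y_i}{\sum_i x_i^2}
  = \beta + \frac{\sum_i x_i u_i}{\sum_i x_i^2}
  \;\xrightarrow{p}\;
  \beta + \frac{\mathrm{Cov}(x_i, u_i)}{\mathrm{Var}(x_i)}
```

The bias term vanishes only when Cov(x_i, u_i) = 0, which is exactly the exogeneity assumption.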

Common Sources of Endogeneity

  • Omitted variables: Important confounders (e.g., innate ability, firm culture) affect both x and y but are absent from the model.
  • Measurement error: Noisy measurement in x induces attenuation bias (estimates biased toward zero).
  • Simultaneity: Reverse causality or feedback loops (e.g., price and quantity determined jointly in a market).

When endogeneity is present, OLS estimates are biased and inconsistent: the bias does not shrink as the sample size grows. IV methods offer a way to isolate variation in x that is “as good as randomly assigned,” enabling consistent estimation of β.
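A small simulation makes the omitted-variable case concrete. This is a minimal sketch using only the Python standard library; the data-generating process (an unobserved `ability` that drives both x and y, with the coefficients shown) is a hypothetical choice for illustration, not taken from any real dataset.

```python
import random

def slope(x, y):
    """OLS slope of y on x (with intercept): Cov(x, y) / Var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
n, beta = 20_000, 2.0                                   # assumed true effect
ability = [random.gauss(0, 1) for _ in range(n)]        # unobserved confounder
x = [a + random.gauss(0, 1) for a in ability]           # x depends on the confounder
y = [beta * xi + 3.0 * a + random.gauss(0, 1)           # so does y
     for xi, a in zip(x, ability)]

beta_ols = slope(x, y)   # ability is omitted, so it sits in the error term
print(f"true beta = {beta}, OLS estimate = {beta_ols:.2f}")
```

In this setup Cov(x, u) = 3 and Var(x) = 2, so the OLS slope settles near 2 + 3/2 = 3.5 rather than the true 2, and collecting more data does not help.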


2. What Makes a Valid Instrument?

An instrument z_i is a variable that helps us purge the endogenous component of x_i. For z_i to be valid, it must satisfy two key conditions:

  1. Relevance: Cov(z_i, x_i) ≠ 0.
    The instrument must be correlated with the endogenous regressor. In practice, you check this with the first-stage regression x_i = π z_i + v_i and require a strong F-statistic (rule of thumb: F > 10).
  2. Exogeneity (Exclusion Restriction): Cov(z_i, u_i) = 0.
    The instrument affects the outcome y_i only through its effect on x_i, not through any other channel. This condition is not testable directly and must be justified on theoretical or institutional grounds.
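The relevance check can be sketched in a few lines. With a single instrument and no controls, the first-stage F-statistic equals the squared t-statistic on the instrument and can be computed from the first-stage R². The helper name `first_stage_f` and the simulated strong and weak instruments below are assumptions made for illustration.

```python
import random

def first_stage_f(z, x):
    """F-statistic for H0: pi = 0 in x = a + pi*z + v (one instrument, so F = t^2)."""
    n = len(z)
    mz, mx = sum(z) / n, sum(x) / n
    szz = sum((a - mz) ** 2 for a in z)
    sxx = sum((b - mx) ** 2 for b in x)
    szx = sum((a - mz) * (b - mx) for a, b in zip(z, x))
    r2 = szx ** 2 / (szz * sxx)          # first-stage R-squared
    return r2 / (1 - r2) * (n - 2)

random.seed(1)
n = 500
z = [random.gauss(0, 1) for _ in range(n)]
x_strong = [0.5 * zi + random.gauss(0, 1) for zi in z]   # z clearly moves x
x_weak = [0.02 * zi + random.gauss(0, 1) for zi in z]    # z barely moves x

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)
print(f"strong instrument F = {f_strong:.1f}, weak instrument F = {f_weak:.1f}")
```

Nothing comparable exists for the exclusion restriction: Cov(z_i, u_i) involves the unobserved error, so no statistic computed from the data can confirm it.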

Examples of Instruments

  • Randomized encouragement designs: e.g., assignment to receive a subsidy encourages take-up of a job training program, but random assignment is orthogonal to unobserved ability.
  • Policy or legal thresholds: e.g., eligibility rules (Medicaid cutoff based on income) that create discontinuous jumps in treatment.
  • Geographic or institutional variation: e.g., distance to a college as an instrument for education level.
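The randomized encouragement example can be made concrete with a simulation. The sketch below assumes a hypothetical take-up rule in which high-ability people always enroll, middle-ability people enroll only if offered the subsidy, and the rest never enroll; the true training effect is set to 2.0. It compares a naive participant versus non-participant contrast with the ratio of the offer's effect on the outcome to its effect on take-up (the Wald estimator, the simplest IV estimate with a binary instrument).

```python
import random

random.seed(3)
n = 50_000
rows = []
for _ in range(n):
    ability = random.gauss(0, 1)          # unobserved by the analyst
    z = random.random() < 0.5             # randomized subsidy offer (the instrument)
    if ability > 1.0:                     # always enrolls
        x = 1
    elif ability > -0.5:                  # enrolls only when offered the subsidy
        x = 1 if z else 0
    else:                                 # never enrolls
        x = 0
    y = 2.0 * x + 1.5 * ability + random.gauss(0, 1)   # assumed true effect: 2.0
    rows.append((z, x, y))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Naive contrast: participants differ from non-participants in ability.
naive = mean(y for _, x, y in rows if x == 1) - mean(y for _, x, y in rows if x == 0)

# Wald estimator: effect of the offer on y divided by its effect on take-up.
itt_y = mean(y for z, _, y in rows if z) - mean(y for z, _, y in rows if not z)
itt_x = mean(x for z, x, _ in rows if z) - mean(x for z, x, _ in rows if not z)
wald = itt_y / itt_x

print(f"naive = {naive:.2f}, Wald/IV = {wald:.2f}")
```

Because the offer is randomized, it is unrelated to ability, so the ratio isolates the effect of training even though enrollment itself is self-selected.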

3. Two-Stage Least Squares (2SLS): Mechanics and Interpretation

With a valid instrument z, 2SLS proceeds in two steps:

Stage 1: Predict the Endogenous Regressor

$$ x_i = \pi z_i + \gamma w_i + v_i $$

where w_i are any exogenous control variables. We estimate this by OLS and obtain the fitted values x̂_i.

  • Check relevance: Examine the F-statistic on z_i. A weak instrument (low F) biases 2SLS toward the OLS estimate in finite samples and makes the usual standard errors unreliable.

Stage 2: Regress the Outcome on Predicted Regressor

$$ y_i = \beta \hat{x}_i + \delta w_i + \varepsilon_i $$

The coefficient β̂_2SLS is a consistent estimate of the causal effect of x on y, provided the instrument is relevant and exogenous.
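The two stages can be run by hand to see the mechanics. The sketch below is a simplified case with one instrument and no controls w_i, on simulated data whose coefficients are assumptions chosen for illustration. It also checks a known identity for this case: the 2SLS slope equals the reduced-form slope (y on z) divided by the first-stage slope (x on z).

```python
import random

def ols(x, y):
    """Simple OLS of y on x with an intercept; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

random.seed(2)
n, beta = 20_000, 2.0                                    # assumed true effect
z = [random.gauss(0, 1) for _ in range(n)]               # instrument
ability = [random.gauss(0, 1) for _ in range(n)]         # unobserved confounder
x = [0.8 * zi + a + random.gauss(0, 1) for zi, a in zip(z, ability)]
y = [beta * xi + 3.0 * a + random.gauss(0, 1) for xi, a in zip(x, ability)]

# Stage 1: regress x on z and keep the fitted values.
pi0, pi1 = ols(z, x)
x_hat = [pi0 + pi1 * zi for zi in z]

# Stage 2: regress y on the fitted values.
_, beta_2sls = ols(x_hat, y)

# Same number via the ratio of reduced-form to first-stage slopes.
_, reduced_form = ols(z, y)
beta_ratio = reduced_form / pi1

_, beta_ols = ols(x, y)                                  # biased benchmark
print(f"OLS = {beta_ols:.2f}, 2SLS = {beta_2sls:.2f}, ratio = {beta_ratio:.2f}")
```

The point estimate from this manual procedure is correct, but the standard errors that an OLS routine would print for the second stage are not, so real analyses should use a dedicated IV routine.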

Interpretation

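  • When the effect of x differs across units, β̂_2SLS recovers the local average treatment effect (LATE) for the compliers, the subpopulation whose value of x changes in response to the instrument. This interpretation also requires monotonicity: the instrument must not push anyone's x in the opposite direction.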
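  • Standard errors must account for the two-stage nature: running the second stage by hand with OLS computes residuals from x̂_i rather than x_i, so its reported standard errors are wrong. Use a dedicated 2SLS routine, and add robust or clustered SEs when heteroskedasticity or within-group correlation is a concern.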
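The standard-error point is easy to get wrong, so here is a minimal sketch of the difference, assuming homoskedastic errors, one instrument, no controls, and a hypothetical simulated dataset. The only change between the two calculations is which regressor is used to form the residuals.

```python
import math
import random

def ols(x, y):
    """Simple OLS of y on x with an intercept; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

random.seed(4)
n = 20_000
z = [random.gauss(0, 1) for _ in range(n)]
ability = [random.gauss(0, 1) for _ in range(n)]         # unobserved confounder
x = [0.8 * zi + a + random.gauss(0, 1) for zi, a in zip(z, ability)]
y = [2.0 * xi + 3.0 * a + random.gauss(0, 1) for xi, a in zip(x, ability)]

pi0, pi1 = ols(z, x)
x_hat = [pi0 + pi1 * zi for zi in z]
b0, b1 = ols(x_hat, y)                                   # 2SLS point estimate

m_hat = sum(x_hat) / n
sxx_hat = sum((v - m_hat) ** 2 for v in x_hat)

def slope_se(regressor):
    """Homoskedastic SE of b1 using residuals y - b0 - b1 * regressor."""
    rss = sum((yi - b0 - b1 * ri) ** 2 for yi, ri in zip(y, regressor))
    return math.sqrt(rss / (n - 2) / sxx_hat)

se_naive = slope_se(x_hat)   # what a plain OLS second stage would report
se_2sls = slope_se(x)        # correct: residuals use the actual x
print(f"naive SE = {se_naive:.4f}, 2SLS SE = {se_2sls:.4f}")
```

In this particular simulation the naive figure is too large; in other settings it can be too small, which is why the correction matters in both directions.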

4. Practical Guidance

4.1 Choosing and Justifying Instruments

  • Economic theory or institutional detail: Rely on domain knowledge to argue that the instrument affects y only through x.
  • Balance and placebo checks: Show that z is uncorrelated with pre-treatment covariates or future outcomes.

4.2 Testing Instrument Strength and Validity

  • First-stage F-statistic: Ensure F>10 on the excluded instrument(s).
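  • Overidentification tests (if there are more instruments than endogenous regressors): e.g., Hansen’s J-test, whose null hypothesis is that all instruments are exogenous. Failing to reject does not prove validity, because the test only detects instruments that disagree with one another.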
  • Endogeneity test: Hausman test comparing OLS and 2SLS estimates; assuming the instruments are valid, a significant difference suggests endogeneity is present and 2SLS is preferable.
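The endogeneity test in the last bullet can be sketched in its classic contrast form for a single regressor: the squared gap between the 2SLS and OLS slopes divided by the difference in their estimated variances, compared with a chi-squared distribution with one degree of freedom (5% critical value 3.84). This sketch assumes homoskedastic errors and a hypothetical simulated dataset; statistical packages typically implement regression-based Durbin-Wu-Hausman variants instead, and the helper `fit` below is not any library's API.

```python
import random

def fit(regressor, x, y):
    """Slope of y on `regressor` (with intercept) and its homoskedastic variance,
    with residuals computed from the actual x. Pass regressor=x for OLS, or
    first-stage fitted values (which share the mean of x) for 2SLS."""
    n = len(y)
    mr, mx, my = sum(regressor) / n, sum(x) / n, sum(y) / n
    srr = sum((r - mr) ** 2 for r in regressor)
    b1 = sum((r - mr) * (b - my) for r, b in zip(regressor, y)) / srr
    b0 = my - b1 * mx
    rss = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    return b1, rss / (n - 2) / srr

random.seed(5)
n = 5_000
z = [random.gauss(0, 1) for _ in range(n)]
ability = [random.gauss(0, 1) for _ in range(n)]         # makes x endogenous
x = [0.8 * zi + a + random.gauss(0, 1) for zi, a in zip(z, ability)]
y = [2.0 * xi + 3.0 * a + random.gauss(0, 1) for xi, a in zip(x, ability)]

# First stage fitted values (intercept chosen so that mean(x_hat) == mean(x)).
mz, mx = sum(z) / n, sum(x) / n
pi1 = (sum((a - mz) * (b - mx) for a, b in zip(z, x))
       / sum((a - mz) ** 2 for a in z))
x_hat = [mx + pi1 * (zi - mz) for zi in z]

b_ols, v_ols = fit(x, x, y)
b_iv, v_iv = fit(x_hat, x, y)

hausman = (b_iv - b_ols) ** 2 / (v_iv - v_ols)
print(f"OLS = {b_ols:.2f}, 2SLS = {b_iv:.2f}, Hausman = {hausman:.1f}")
print("reject exogeneity of x at 5%" if hausman > 3.84 else "no evidence of endogeneity")
```

The contrast form can misbehave in small samples, since the variance difference in the denominator is not guaranteed to be positive, which is one reason the regression-based versions are generally preferred in practice.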

4.3 Reporting 2SLS Results

When you present your IV estimates, include:

  1. First-stage regression: coefficient on z, F-statistic.
  2. Second-stage regression: 2SLS coefficient β̂, standard error, and p-value.
  3. Diagnostic tests: instrument strength, overidentification p-value, endogeneity test.

Conclusion

Instrumental variables methods—most notably two-stage least squares—provide a solution to the pervasive problem of endogeneity in observational data. By leveraging a valid instrument that shifts the endogenous regressor in an exogenous way, 2SLS recovers consistent estimates of causal effects. The key challenges lie in finding credible instruments, testing their strength, and defending the exclusion restriction. When done carefully, IV estimation can transform otherwise biased analyses into robust evidence for policy, finance, and social science.


Further Reading & Resources

  • Angrist, J. & Pischke, J. (2009). Mostly Harmless Econometrics.
  • Wooldridge, J. (2010). Econometric Analysis of Cross Section and Panel Data.
  • Online tutorials with R/Python code: IV Estimation in R | Statsmodels IV in Python
