Censored and truncated regression both deal with situations where we don’t observe the full distribution of the dependent variable, but they differ in how the data is missing or limited.
Censored Regression
- Definition: You observe all data points, but for some, the dependent variable is only partially known.
- Example: Suppose you’re studying incomes, but any income above $100k is just recorded as “$100k+”. You know the person is in the dataset, but not their exact income if it’s above $100k.
- Common model: Tobit regression, which treats a censored value as known only to lie at or beyond the censoring limit.
- Key Point: The individual is still included in the dataset, but with incomplete outcome info.
Truncated Regression
- Definition: You only observe individuals whose dependent variable falls within a certain range—others are completely excluded from the dataset.
- Example: If you’re analyzing incomes but only have data for people making between $30k and $100k, people below or above that range aren’t in your data at all.
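- Effect: Because the sample is selected on the outcome itself, it doesn’t represent the full population, and ordinary least squares fitted to it gives biased estimates (a form of sample selection bias).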
- Key Point: You never see some individuals at all if their outcome is outside the range.
Quick Summary
| Feature | Censored | Truncated |
|---|---|---|
| Data outside bounds | Observed, but with limited info | Not observed at all |
| Out-of-range individuals in dataset | Yes | No |
| Typical model | Tobit regression | Truncated regression model |
| Example | Income > $100k shown as “$100k+” | People earning > $100k not included |
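To make the table concrete, here is a minimal base-R sketch that censors and truncates the same simulated outcome and then fits ordinary least squares to each version. The cutoff of 1, the true slope of 3, and all variable names are assumptions chosen for illustration, not part of any real dataset.

```r
# Simulated data with a known slope of 3 (illustrative values)
set.seed(123)
n <- 1000
d <- data.frame(x = rnorm(n))
d$y <- 2 + 3 * d$x + rnorm(n)

# Censoring: every row is kept, but outcomes below 1 are recorded as 1
d$y_cens <- pmax(d$y, 1)

# Truncation: rows with y <= 1 are dropped from the sample entirely
d_trunc <- d[d$y > 1, ]

# Naive OLS on the full, censored, and truncated data
slope_full  <- coef(lm(y ~ x, data = d))[["x"]]
slope_cens  <- coef(lm(y_cens ~ x, data = d))[["x"]]
slope_trunc <- coef(lm(y ~ x, data = d_trunc))[["x"]]

# Row counts: censoring keeps all n rows, truncation does not
c(full = nrow(d), censored = sum(!is.na(d$y_cens)), truncated = nrow(d_trunc))

# Slopes from the three naive fits
round(c(full = slope_full, censored = slope_cens, truncated = slope_trunc), 2)
```

In a simulation like this, both naive slopes should come out noticeably smaller than the full-data slope, which is the bias that Tobit and truncated regression are meant to correct.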
Here are code samples in R demonstrating both censored and truncated regression using two commonly used packages, AER and truncreg.
1. Censored Regression (Tobit Model)
We’ll use the tobit() function from the AER package.
# Install the package only if it is not already available
if (!requireNamespace("AER", quietly = TRUE)) install.packages("AER")
library(AER)
# Simulate some data
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
# Censor the dependent variable: values below 1 are set to 1
y_censored <- ifelse(y < 1, 1, y)
# Run Tobit regression
model_censored <- tobit(y_censored ~ x, left = 1)
# Summary of the model
summary(model_censored)
Here, the model accounts for the fact that we don’t know the true values when y < 1, just that they are censored at 1.
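To show what "accounting for censoring" means, the sketch below writes out a left-censored normal log-likelihood by hand and maximises it with base R's optim(). It is an illustration under the same simulated setup, not a reproduction of how tobit() is implemented internally; the function name tobit_negloglik and the log(sigma) parameterisation are my own choices.

```r
# Same simulated data as above, left-censored at 1
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
limit <- 1
y_censored  <- pmax(y, limit)
is_censored <- y_censored <= limit

# Negative log-likelihood (hypothetical helper, not part of AER)
tobit_negloglik <- function(par, y, x, is_censored, limit) {
  mu    <- par[1] + par[2] * x
  sigma <- exp(par[3])  # optimise log(sigma) so sigma stays positive
  ll <- ifelse(
    is_censored,
    # Censored rows contribute P(latent y <= limit)
    pnorm(limit, mean = mu, sd = sigma, log.p = TRUE),
    # Uncensored rows contribute the usual normal density
    dnorm(y, mean = mu, sd = sigma, log = TRUE)
  )
  -sum(ll)
}

# Start from the naive OLS fit and maximise the likelihood
start <- c(unname(coef(lm(y_censored ~ x))), log(sd(y_censored)))
fit <- optim(start, tobit_negloglik, y = y_censored, x = x,
             is_censored = is_censored, limit = limit, method = "BFGS")

est <- fit$par
estimates <- c(intercept = est[1], slope = est[2], sigma = exp(est[3]))
round(estimates, 2)
```

The estimates should land close to the values used in the simulation (intercept 2, slope 3, error standard deviation 1) and close to what summary(model_censored) reports, up to optimiser tolerance.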
2. Truncated Regression
We’ll use the truncreg() function from the truncreg package.
# Install the package only if it is not already available
if (!requireNamespace("truncreg", quietly = TRUE)) install.packages("truncreg")
library(truncreg)
# Simulate the same data (n is redefined so this block runs on its own)
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
# Truncate the data: only keep observations where y > 1
keep <- y > 1
x_trunc <- x[keep]
y_trunc <- y[keep]
# Run truncated regression
model_truncated <- truncreg(y_trunc ~ x_trunc, point = 1, direction = "left")
# Summary of the model
summary(model_truncated)
This model is fitted only to cases where y > 1, and it corrects for the fact that cases where y <= 1 never entered the sample at all.
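The correction can be sketched by hand: each observed case contributes a normal density rescaled by the probability of being observed at all, P(y > 1 | x). The base-R code below is an illustration under the same simulated setup, not a description of truncreg's internals; trunc_negloglik and the log(sigma) parameterisation are my own choices.

```r
# Same simulated data as above, truncated from the left at 1
set.seed(123)
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
point <- 1
keep <- y > point
x_trunc <- x[keep]
y_trunc <- y[keep]

# Negative log-likelihood (hypothetical helper, not part of truncreg)
trunc_negloglik <- function(par, y, x, point) {
  mu    <- par[1] + par[2] * x
  sigma <- exp(par[3])  # optimise log(sigma) so sigma stays positive
  # Normal log-density minus log P(y > point | x), the chance of being sampled
  ll <- dnorm(y, mean = mu, sd = sigma, log = TRUE) -
    pnorm(point, mean = mu, sd = sigma, lower.tail = FALSE, log.p = TRUE)
  -sum(ll)
}

# Start from the naive OLS fit on the truncated sample
start <- c(unname(coef(lm(y_trunc ~ x_trunc))), log(sd(y_trunc)))
fit <- optim(start, trunc_negloglik, y = y_trunc, x = x_trunc,
             point = point, method = "BFGS")

est <- fit$par
estimates <- c(intercept = est[1], slope = est[2], sigma = exp(est[3]))
round(estimates, 2)
```

As with the Tobit sketch, the estimates should be close to the simulated values (2, 3, and 1), whereas a plain lm() on the truncated sample would understate the slope.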
Hope you found this post useful 🙂
