Survival analysis is a branch of statistics used to analyze the expected duration of time until one or more events happen, such as death in biological organisms, failure in mechanical systems, or churn in customers. It’s especially useful when the outcome is time-to-event data, and not all subjects experience the event during the study (leading to censored data).
Key Concepts in Survival Analysis
- Event
The occurrence you’re tracking (e.g., death, failure, churn).
- Time-to-event
How long it takes for the event to happen.
- Censoring
Happens when a subject’s exact event time is not observed. Types:
  - Right-censoring: The event had not yet occurred when observation ended.
  - Left-censoring: The event already happened before observation started.
  - Interval-censoring: The event happened within a known time interval.
- Survival Function S(t)
Probability that the event has not occurred by time t: S(t)=P(T>t)
- Hazard Function h(t)
Instantaneous event rate at time t, given survival up to that point.
- Kaplan-Meier Estimator
A non-parametric method to estimate the survival function from observed data.
- Cox Proportional Hazards Model
A semi-parametric regression model that relates covariates to the hazard rate.
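To make the censoring idea concrete, here is a minimal sketch in R. It assumes the survival package is installed, and the five observations are made-up illustration values rather than real data.

```r
library(survival)

# Hypothetical follow-up times (months) and event indicators
# 1 = event observed, 0 = right-censored
time  <- c(2, 3, 5, 6, 8)
event <- c(1, 1, 0, 1, 1)

# Surv() bundles each time with its event status into one response object
s <- Surv(time, event)
print(s)  # right-censored observations are marked with a "+"
```

The resulting object is what the modelling functions in the survival package expect on the left-hand side of a formula.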
Let’s dive deeper into some of these concepts.
1. Hazard Function
The hazard function, often denoted as h(t), is a fundamental concept in survival analysis. It describes the instantaneous rate at which events occur, given that the individual has survived up to time t.
Mathematical Definition
The hazard function h(t) is defined as:
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$
where:
- T is a random variable representing the time until the event occurs.
- P is the probability.
- Δt is a small time interval.
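The limit in the definition can be approximated numerically by plugging in a very small Δt. The sketch below does this for a Weibull distribution; the shape, scale, and time values are arbitrary choices for illustration, and the closed-form Weibull hazard is used only as a reference to compare against.

```r
# Hypothetical Weibull parameters and evaluation time
shape <- 1.5
scale <- 10
t  <- 5
dt <- 1e-6  # small time interval standing in for the limit

# P(t <= T < t + dt | T >= t) / dt
p_interval <- pweibull(t + dt, shape, scale) - pweibull(t, shape, scale)
p_survive  <- pweibull(t, shape, scale, lower.tail = FALSE)
h_approx   <- (p_interval / p_survive) / dt

# Closed-form Weibull hazard for comparison
h_exact <- (shape / scale) * (t / scale)^(shape - 1)

print(c(approx = h_approx, exact = h_exact))
```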
Alternative Expressions
The hazard function can also be expressed in terms of the probability density function (PDF) f(t) and the survival function S(t):
$$h(t) = \frac{f(t)}{S(t)}$$
where:
- S(t)=P(T>t) (the probability of surviving beyond time t).
- f(t)=−dS(t)/dt is the PDF of T, i.e., the negative derivative of S(t) with respect to t.
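As a quick sanity check of the ratio form, the sketch below computes f(t)/S(t) for an exponential distribution using base R's density and distribution functions. The rate value is a hypothetical choice; for the exponential case the ratio should come out equal to that rate at every time point.

```r
# Hypothetical rate parameter and evaluation times
lambda <- 0.5
t <- c(0.5, 1, 2, 4)

f <- dexp(t, rate = lambda)                      # density f(t)
S <- pexp(t, rate = lambda, lower.tail = FALSE)  # survival S(t) = P(T > t)
h <- f / S                                       # hazard h(t) = f(t) / S(t)

print(h)  # every entry should match lambda up to floating-point error
```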
Interpretation
- h(t) gives the instantaneous failure rate at time t.
- A higher hazard function at a given time means a higher risk of the event occurring at that time.
- If h(t) is increasing over time, it indicates aging/wear-out failures (common in mechanical systems).
- If h(t) is decreasing over time, it suggests early failures (e.g., infant mortality in electronics).
- If h(t) is constant, it implies an exponential distribution (memoryless property; often assumed for electronic components during their useful life).
Relationship with Survival Function
The hazard function and survival function are related by:
$$S(t) = \exp\left(-\int_0^t h(u)\,du\right)$$
The integral $H(t) = \int_0^t h(u)\,du$ is called the cumulative hazard function.
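The relationship can be checked numerically: integrate a hazard function to get the cumulative hazard, exponentiate its negative, and compare with the survival function computed directly. This is a sketch with an arbitrarily chosen Weibull hazard; `weibull_hazard` is a helper defined here for illustration, not a library function.

```r
# Hypothetical Weibull parameters
shape <- 1.5
scale <- 10

# Closed-form Weibull hazard (helper defined for this example)
weibull_hazard <- function(u) (shape / scale) * (u / scale)^(shape - 1)

t <- 5
H <- integrate(weibull_hazard, lower = 0, upper = t)$value  # cumulative hazard H(t)

S_from_hazard <- exp(-H)                                    # S(t) = exp(-H(t))
S_direct      <- pweibull(t, shape, scale, lower.tail = FALSE)

print(c(from_hazard = S_from_hazard, direct = S_direct))
```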
Examples
- Exponential Distribution: If h(t)=λ (constant), then the time-to-event follows an exponential distribution.
- Weibull Distribution: The hazard function can be increasing or decreasing depending on its shape parameter.
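The three hazard shapes described above (decreasing, constant, increasing) can all be reproduced with a Weibull hazard by changing only the shape parameter. The sketch below uses illustrative shape values and a unit scale; `weibull_hazard` is again a helper written for this example.

```r
# Closed-form Weibull hazard (helper defined for this example)
weibull_hazard <- function(t, shape, scale = 1) {
  (shape / scale) * (t / scale)^(shape - 1)
}

t <- c(0.5, 1, 2, 4)

h_decreasing <- weibull_hazard(t, shape = 0.5)  # shape < 1: early failures
h_constant   <- weibull_hazard(t, shape = 1)    # shape = 1: exponential case
h_increasing <- weibull_hazard(t, shape = 2)    # shape > 1: wear-out

print(rbind(h_decreasing, h_constant, h_increasing))
```

With shape = 1 the Weibull reduces to the exponential distribution, which is why the middle row is flat.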
Applications
- Medical Research: Modeling time until death or relapse.
- Engineering: Predicting failure times of machines.
- Economics: Analyzing unemployment duration.
2. Kaplan-Meier Estimator
The Kaplan-Meier (KM) estimator is a non-parametric statistic used to estimate the survival function S(t) from time-to-event data, particularly in the presence of censoring (where some subjects have not experienced the event by the end of the study). It is widely used in survival analysis, medical research, and engineering reliability.
Key Concepts
- Survival Function S(t):
  - Probability that an individual survives beyond time t.
  - S(t)=P(T>t), where T is the time until the event.
- Censoring:
  - Right-censoring: Some subjects are lost to follow-up or haven’t experienced the event by the study’s end.
  - The KM estimator accounts for censored data.
Kaplan-Meier Formula
The KM estimator is calculated as a product-limit estimator:
$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$
where:
- t_i = Time at which at least one event occurred.
- d_i = Number of events (e.g., deaths) at time t_i.
- n_i = Number of individuals at risk just before t_i (i.e., those who haven’t yet experienced the event or been censored).
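To see the product-limit formula at work, the sketch below computes it by hand in base R on five toy observations (the same made-up values used in the R example in this section). It is meant as an illustration of the arithmetic, not a replacement for `survfit()`.

```r
# Toy data: follow-up times and event indicators (1 = event, 0 = censored)
time  <- c(2, 3, 5, 6, 8)
event <- c(1, 1, 0, 1, 1)

# Distinct times at which at least one event occurred
event_times <- sort(unique(time[event == 1]))

# Number of events and number at risk at each event time
d <- sapply(event_times, function(ti) sum(time == ti & event == 1))
n <- sapply(event_times, function(ti) sum(time >= ti))

# Product-limit estimate: running product of (1 - d / n)
surv <- cumprod(1 - d / n)

km_table <- data.frame(time = event_times, n_risk = n, n_event = d, survival = surv)
print(km_table)
```

The censored subject at time 5 never contributes an event, but it does count in the risk set at times 2 and 3; that is how censoring enters the estimate. The survival column works out to 0.8, 0.6, 0.3, and 0.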
Advantages
- Handles censored data effectively.
- Non-parametric (no assumptions about the underlying distribution).
- Easy to compute and interpret.
Limitations
- Does not account for covariates (use Cox regression for that).
- Less precise with small sample sizes.
Applications
- Clinical trials: Compare survival between treatment groups.
- Engineering: Estimate time until machine failure.
- Economics: Analyze unemployment duration.
R Code Implementation:
# Install (if missing) and load required packages
for (pkg in c("survival", "ggplot2", "ggsurvfit")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(survival)
library(ggplot2)
library(ggsurvfit)

# (1) Create example dataset
data <- data.frame(
  time  = c(2, 3, 5, 6, 8),  # Time-to-event
  event = c(1, 1, 0, 1, 1),  # 1 = event, 0 = censored
  group = c(1, 1, 2, 2, 1)   # Group variable (optional)
)

# (2) Fit Kaplan-Meier model (overall)
km_fit <- survfit(Surv(time, event) ~ 1, data = data)
print(summary(km_fit))  # View survival table

# (3) Basic KM plot (base R)
plot(km_fit, xlab = "Time (months)", ylab = "Survival Probability",
     main = "Kaplan-Meier Curve", conf.int = TRUE, col = "blue")

# (4) Fancier KM plot (ggsurvfit, which builds on ggplot2)
ggsurvfit(km_fit) +
  labs(x = "Time", y = "Survival Prob.") +
  add_confidence_interval() +
  add_risktable()

# (5) Compare groups (if group variable exists)
if ("group" %in% colnames(data)) {
  km_group <- survfit(Surv(time, event) ~ group, data = data)
  plot(km_group, col = c("red", "blue"), lty = 1:2,
       xlab = "Time", ylab = "Survival Prob.")
  legend("topright", legend = paste("Group", 1:2),
         col = c("red", "blue"), lty = 1:2)
  # Log-rank test for group difference (prints the full test, including the p-value)
  cat("\nLog-rank test:\n")
  print(survdiff(Surv(time, event) ~ group, data = data))
}

# (6) Optional: Use built-in 'lung' dataset example
# 'lung' ships with the survival package and is available once it is loaded.
# Its status column is coded 1 = censored, 2 = dead, which Surv() accepts.
km_lung <- survfit(Surv(time, status) ~ 1, data = lung)
plot(km_lung, xlab = "Days", ylab = "Survival Prob.", main = "Lung Data KM Curve")
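Beyond plotting, a fitted KM object can be queried for survival probabilities at chosen time points and for the median survival time. The following sketch assumes the survival package and its bundled `lung` dataset; the 180- and 365-day cut-offs are arbitrary choices for illustration.

```r
library(survival)

# Refit the overall KM curve so this example is self-contained
km_lung <- survfit(Surv(time, status) ~ 1, data = lung)

# Estimated survival (with confidence limits) at roughly 6 and 12 months
pt <- summary(km_lung, times = c(180, 365))
surv_at_times <- data.frame(time = pt$time, survival = pt$surv,
                            lower = pt$lower, upper = pt$upper)
print(surv_at_times)

# Median survival time with its confidence limits
median_surv <- quantile(km_lung, probs = 0.5)
print(median_surv)
```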
3. Cox Proportional Hazards Model
The Cox Proportional Hazards (PH) Model (also called the Cox regression model) is a semi-parametric statistical method used in survival analysis to examine the effect of predictor variables on the time until an event occurs (e.g., death, failure, relapse).
Key Features
- Models Hazard Rates:
  - It estimates how covariates (e.g., age, treatment) influence the hazard function h(t) (instantaneous risk of the event at time t).
  - The hazard function is defined as:
    $$h(t \mid X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)$$
  - h_0(t): Baseline hazard (unspecified; depends only on time).
  - exp(βX): How covariates multiplicatively shift the baseline hazard.
- Proportional Hazards Assumption:
  - The effect of predictors is constant over time (i.e., the hazard ratio between groups does not change with t).
  - Example: If Treatment A has half the hazard of Treatment B at time t=1, this ratio holds for all t.
- Handles Censored Data:
  - Works with right-censored observations (common in survival data).
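A short worked equation shows why the ratio in the treatment example is the same at every time point. Take two subjects who differ only in a binary covariate X1 (coding X1 = 1 for Treatment A and X1 = 0 for Treatment B is an assumption made here for illustration):

```latex
\frac{h(t \mid X_1 = 1)}{h(t \mid X_1 = 0)}
  = \frac{h_0(t)\,\exp(\beta_1)}{h_0(t)}
  = \exp(\beta_1)
```

The baseline hazard cancels, so the hazard ratio does not depend on t. A hazard ratio of 0.5, as in the example, corresponds to β1 = ln(0.5) ≈ −0.693.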
When to Use the Cox Model?
- Compare survival between groups (e.g., drug vs. placebo).
- Identify risk factors (e.g., how age, smoking affect cancer survival).
- Adjust for confounders (multivariable analysis).
R Code Implementation:
library(survival)
# 'lung' ships with the survival package and is available once it is loaded

# Fit Cox model
cox_model <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
summary(cox_model)  # View hazard ratios and p-values

# Check PH assumption
test_ph <- cox.zph(cox_model)
print(test_ph)  # p > 0.05 gives no evidence against PH (it does not prove PH holds)
plot(test_ph)   # Visual check (smoothed curves should be roughly flat)
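The coefficients of a Cox model are on the log-hazard scale, so exponentiating them gives hazard ratios. The sketch below pulls the hazard ratios and their 95% confidence intervals into one small table; it refits the same model so it can run on its own, and assumes the survival package and its `lung` dataset.

```r
library(survival)

# Same model as above, refit so this example is self-contained
cox_model <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

hr    <- exp(coef(cox_model))     # hazard ratio per one-unit increase
hr_ci <- exp(confint(cox_model))  # 95% confidence limits on the HR scale

hr_table <- cbind(HR = hr, hr_ci)
print(round(hr_table, 3))
```

A hazard ratio below 1 indicates a lower hazard as the covariate increases, and a ratio above 1 indicates a higher hazard, holding the other covariates fixed.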
Advantages
- No need to specify h0(t) (robust to baseline hazard shape).
- Handles both continuous and categorical predictors.
Limitations
- Requires Proportional Hazards (PH) assumption.
- Does not directly estimate absolute survival probabilities; these require an additional estimate of the baseline hazard (e.g., the Breslow estimator).
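In practice, the survival package can combine a fitted Cox model with an estimated baseline hazard to produce predicted survival curves for given covariate values. Here is a hedged sketch of that workflow; the two patient profiles in `new_patients` are hypothetical and chosen only for illustration.

```r
library(survival)

cox_model <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

# Two hypothetical patients who differ only in sex (1 = male, 2 = female in 'lung')
new_patients <- data.frame(age = c(60, 60), sex = c(1, 2), ph.ecog = c(1, 1))

# survfit() on a coxph object estimates the baseline hazard internally
pred <- survfit(cox_model, newdata = new_patients)

# Predicted survival probability at one year for each profile
one_year <- summary(pred, times = 365)$surv
print(one_year)
```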
Extensions
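- Time-Dependent Covariates and Coefficients: For covariates that change during follow-up, or effects that vary over time (non-PH).
- Stratified Cox Model: Allows a separate baseline hazard per stratum when PH holds only within subgroups.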
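As an illustration of the stratified extension, the sketch below moves sex out of the covariates and into a `strata()` term, so that each sex gets its own baseline hazard. Stratifying on sex is an arbitrary choice for demonstration here, not a claim that sex violates PH in the `lung` data.

```r
library(survival)

# Stratified Cox model: separate baseline hazard for each level of sex
cox_strat <- coxph(Surv(time, status) ~ age + ph.ecog + strata(sex), data = lung)

# Only the non-stratified covariates get coefficients
print(coef(cox_strat))
```

The trade-off is that a stratified variable no longer has an estimated hazard ratio, since its effect is absorbed into the stratum-specific baseline hazards.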
