Transitioning from Clinical SAS to R opens up a world of powerful data manipulation tools, especially within the Tidyverse ecosystem. Developed by Hadley Wickham, the Tidyverse is a collection of R packages designed for data science, emphasizing readability, consistency, and efficiency.
If you're coming from SAS, where procedures like PROC SORT, PROC MEANS, and PROC SQL dominate data workflows, you'll find Tidyverse's dplyr and tidyr packages intuitive yet more expressive. This chapter will guide you through essential Tidyverse operations—grouping, summarizing, and merging datasets—with clear examples and comparisons to SAS for easier comprehension.
By the end of this chapter, you'll be able to:
- Group data using group_by()
- Summarize data with summarize()
- Merge datasets using joins like left_join()
- Understand how these operations compare to SAS
Let’s dive in!
Before jumping into specific operations, it's essential to understand the Tidyverse philosophy:
%>%) chain operations, making code readable. filter(), select()). # Install (if not already installed)
install.packages("tidyverse")
# Load the core packages
library(tidyverse)
group_by()In SAS, you'd use PROC SORT followed by BY statements to group data. In R, the equivalent is group_by().
Suppose we have a dataset trial_data with columns:
- patient_id
- treatment (e.g., "Drug A", "Placebo")
- visit_number
- response_value
# Group data by treatment and visit
grouped_data <- trial_data %>%
group_by(treatment, visit_number)
Comparison to SAS:
PROC SORT DATA=trial_data;
BY treatment visit_number;
RUN;
group_by()?summarize()) is applied. summarize()In SAS, PROC MEANS or PROC SUMMARY generates summary statistics. In R, summarize() does this concisely.
summary_stats <- grouped_data %>%
summarize(
mean_response = mean(response_value, na.rm = TRUE),
sd_response = sd(response_value, na.rm = TRUE),
n_patients = n()
)
Comparison to SAS:
PROC MEANS DATA=trial_data MEAN STD N;
CLASS treatment visit_number;
VAR response_value;
RUN;
summarize() collapses grouped data into one row per group. na.rm = TRUE to exclude missing values (like PROC MEANS's default behavior). n() counts rows, mean(), sd(), min(), max(), etc. In SAS, PROC SQL or DATA STEP merges datasets. In R, Tidyverse provides join functions:
| R Function | SAS Equivalent | Description |
|---|---|---|
left_join() |
LEFT JOIN (SQL) |
Keeps all rows from the left table |
right_join() |
RIGHT JOIN (SQL) |
Keeps all rows from the right table |
inner_join() |
INNER JOIN (SQL) |
Keeps only matching rows |
full_join() |
FULL JOIN (SQL) |
Keeps all rows from both tables |
# Patient demographics
demographics <- tibble(
patient_id = c(1, 2, 3),
age = c(45, 60, 38),
sex = c("M", "F", "M")
)
# Lab results
lab_results <- tibble(
patient_id = c(1, 2, 4),
lab_value = c(120, 95, 110)
)
# Left join (keep all patients in demographics)
merged_data <- left_join(demographics, lab_results, by = "patient_id")
Result:
# A tibble: 3 Ă— 4
patient_id age sex lab_value
<dbl> <dbl> <chr> <dbl>
1 1 45 M 120
2 2 60 F 95
3 3 38 M NA
Comparison to SAS:
PROC SQL;
CREATE TABLE merged_data AS
SELECT d.*, l.lab_value
FROM demographics d
LEFT JOIN lab_results l ON d.patient_id = l.patient_id;
QUIT;
Let’s simulate a real-world scenario:
Task:
- Group clinical trial data by treatment.
- Calculate the average response per treatment.
- Merge with a treatment reference table.
# Step 1: Group and summarize
treatment_summary <- trial_data %>%
group_by(treatment) %>%
summarize(avg_response = mean(response_value, na.rm = TRUE))
# Step 2: Merge with reference table (e.g., treatment codes)
treatment_ref <- tibble(
treatment = c("Drug A", "Placebo"),
treatment_code = c("A", "P")
)
final_data <- left_join(treatment_summary, treatment_ref, by = "treatment")
Output:
# A tibble: 2 Ă— 3
treatment avg_response treatment_code
<chr> <dbl> <chr>
1 Drug A 78.2 A
2 Placebo 65.5 P
group_by() to define groups (similar to SAS BY statements). No immediate changes to data until an action (e.g., summarize()) is applied.
Summarizing Data:
summarize() computes statistics per group (like PROC MEANS). Helper functions (n(), mean(), sd()) make calculations easy.
Merging Data:
left_join(), inner_join()) replace SAS PROC SQL or DATA STEP merges. Explicit by argument ensures clarity in merges.
Workflow Efficiency:
%>%) streamline multi-step operations. By mastering these Tidyverse functions, you’ll unlock efficient, reproducible workflows that are simpler to debug and maintain compared to legacy SAS procedures.
Next Steps:
- Practice with real clinical datasets.
- Explore tidyr for reshaping data (pivoting, spreading).
- Learn to visualize grouped summaries using ggplot2.
Happy coding! 🚀