If you're transitioning from Clinical SAS to R, you’ve likely spent countless hours mastering the DATA step and PROC SQL for data manipulation. The good news? R’s dplyr package provides an intuitive, efficient, and powerful alternative to these SAS staples.
dplyr is part of the tidyverse, a collection of R packages designed for data science. It simplifies data wrangling with a consistent syntax that mirrors SQL and DATA step operations—filtering, sorting, joining, summarizing, and transforming data.
In this chapter, you’ll learn:
- How dplyr verbs replace traditional SAS procedures.
- Key functions to manipulate data frames like SAS datasets.
- Practical comparisons between SAS and R for common data tasks.
By the end, you’ll see why dplyr is a game-changer for SAS programmers embracing R.
dplyr Verbs and Their SAS Equivalentsdplyr organizes data manipulation around verbs—functions named for their actions. Below is a mapping between dplyr and SAS:
R (dplyr) |
SAS Equivalent | Purpose |
|---|---|---|
filter() |
WHERE (DATA Step/PROC SQL) |
Subset rows based on conditions. |
select() |
KEEP/DROP (DATA Step) |
Select or remove columns. |
mutate() |
DATA Step calculations |
Create or modify columns. |
arrange() |
PROC SORT |
Sort data. |
summarize() + group_by() |
PROC MEANS/PROC SUMMARY |
Aggregate data by groups. |
left_join() |
PROC SQL joins |
Merge datasets. |
filter() vs. WHERE)SAS:
DATA subset;
SET original_data;
WHERE age > 30 AND sex = 'M';
RUN;
R (dplyr):
library(dplyr)
subset <- original_data %>%
filter(age > 30, sex == "M")
Key Notes:
- %>% (pipe operator) chains operations, similar to SAS RUN steps.
- Conditions in filter() use R’s logical operators (==, >, &).
select() vs. KEEP/DROP)SAS:
DATA selected_cols;
SET original_data;
KEEP subject_id age treatment;
RUN;
R (dplyr):
selected_cols <- original_data %>%
select(subject_id, age, treatment)
Pro Tip: Use - to exclude columns (like DROP in SAS):
dropped_cols <- original_data %>%
select(-address, -screening_date)
mutate() vs. DATA Step)SAS:
DATA modified;
SET original_data;
bmi = (weight / (height**2)) * 703;
IF bmi >= 30 THEN obese_flag = 1; ELSE obese_flag = 0;
RUN;
R (dplyr):
modified <- original_data %>%
mutate(
bmi = (weight / (height^2)) * 703,
obese_flag = ifelse(bmi >= 30, 1, 0)
)
Bonus: Use case_when() for multi-condition logic (like SAS SELECT-WHEN):
modified <- original_data %>%
mutate(
bmi_category = case_when(
bmi < 18.5 ~ "Underweight",
bmi < 25 ~ "Normal",
bmi < 30 ~ "Overweight",
TRUE ~ "Obese"
)
)
arrange() vs. PROC SORT)SAS:
PROC SORT DATA=original_data;
BY descending age treatment;
RUN;
R (dplyr):
sorted_data <- original_data %>%
arrange(desc(age), treatment)
summarize() + group_by() vs. PROC MEANS)SAS:
PROC MEANS DATA=original_data N MEAN STD;
CLASS treatment;
VAR age weight;
RUN;
R (dplyr):
summary_stats <- original_data %>%
group_by(treatment) %>%
summarize(
n = n(),
mean_age = mean(age, na.rm = TRUE),
sd_weight = sd(weight, na.rm = TRUE)
)
Note: na.rm = TRUE excludes missing values (like SAS NMISS handling).
*_join() vs. PROC SQL)SAS (Inner Join):
PROC SQL;
CREATE TABLE merged_data AS
SELECT a.*, b.lab_value
FROM demog_data a
INNER JOIN lab_data b ON a.subject_id = b.subject_id;
QUIT;
R (dplyr):
merged_data <- demog_data %>%
inner_join(lab_data, by = "subject_id")
Join Types in dplyr:
- left_join(): Keep all rows from the left table (like SAS LEFT JOIN).
- right_join(): Keep all rows from the right table.
- full_join(): Keep all rows from both tables (FULL JOIN in SAS).
dplyr Featurestidyr Integration)Reshape data like SAS PROC TRANSPOSE:
Long to Wide (SAS PROC TRANSPOSE):
wide_data <- long_data %>%
pivot_wider(names_from = visit, values_from = result)
Wide to Long:
long_data <- wide_data %>%
pivot_longer(cols = starts_with("visit"), names_to = "visit", values_to = "result")
rowwise())Avoid SAS macros with:
row_means <- data %>%
rowwise() %>%
mutate(avg_score = mean(c(score1, score2, score3), na.rm = TRUE))
%>%Chain operations instead of intermediate datasets:
final_output <- raw_data %>%
filter(visit == "Baseline") %>%
select(-screening_date) %>%
mutate(bmi = weight / (height^2)) %>%
group_by(treatment) %>%
summarize(mean_bmi = mean(bmi))
dplyr mirrors SAS functionality but with cleaner, more readable syntax. filter() → WHERE mutate() → DATA step calculations summarize() + group_by() → PROC MEANS *_join() vs. PROC SQL. %>%) streamline workflows, reducing temporary datasets. Practice converting your existing SAS scripts to R using dplyr. Start with simple filters and merges, then experiment with grouped summaries. The more you use dplyr, the more you’ll appreciate its efficiency—and wonder how you ever lived without it!
Ready to dive deeper? Chapter 6 explores ggplot2, R’s answer to SAS PROC SGPLOT.
This chapter equips you with the foundational dplyr skills to transition seamlessly from SAS to R. Happy coding! 🚀