Creating Sample Data Made Easy: SAS vs R – A Friendly Comparison

Creating sample data is essential for testing, simulation, and analysis. Both SAS and R provide versatile methods to generate sample datasets. In this post, we'll compare how to create sample data side by side in SAS and R, walking through several methods with examples and explanations.

Method 1: Using Basic Data Structures

SAS

In SAS, the DATA step is the most basic way to create a sample dataset.

/* SAS Example: Basic Data Step */
DATA sample_data;
    input ID Name $ Age Height Weight;
    DATALINES;
    1 John 25 68 150
    2 Jane 30 65 120
    3 Jim 22 70 180
    4 Jill 28 64 135
    ;
RUN;

PROC PRINT DATA=sample_data; 
RUN;


Explanation:

  • The DATA statement starts a new data step, creating a dataset named sample_data.
  • The input statement defines the variables: ID, Name (character type indicated by $), Age, Height, and Weight.
  • DATALINES allows you to enter data manually.
  • PROC PRINT prints the dataset to verify the data.

R

In R, you can use the data.frame function to create a similar dataset.

# R Example: Basic Data Frame
sample_data <- data.frame(
  ID = 1:4,
  Name = c("John", "Jane", "Jim", "Jill"),
  Age = c(25, 30, 22, 28),
  Height = c(68, 65, 70, 64),
  Weight = c(150, 120, 180, 135)
)

print(sample_data)


Explanation:

  • The data.frame function creates a data frame named sample_data.
  • Each variable (ID, Name, Age, Height, Weight) is defined with corresponding values.
  • print outputs the data frame to the console for verification.


Method 2: Using Random Number Functions

SAS

SAS provides random number functions that can be used within a DATA step.

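/* SAS Example: Using Random Number Functions */
DATA random_data;
    CALL STREAMINIT(123);  /* Fix the seed for reproducibility */
    DO ID = 1 TO 10;
        Age = ROUND(18 + 47*RAND('UNIFORM'));          /* Integer age, 18 to 65 */
        Height = ROUND(RAND('NORMAL', 170, 10), 0.1);  /* Mean 170, SD 10 */
        Weight = ROUND(RAND('NORMAL', 70, 15), 0.1);   /* Mean 70, SD 15 */
        OUTPUT;
    END;
RUN;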

PROC PRINT DATA=random_data; 
RUN;


Explanation:

  • DATA starts a new data step, creating a dataset named random_data.
  • DO ID = 1 TO 10 generates 10 rows with IDs from 1 to 10.
  • RAND('UNIFORM') and RAND('NORMAL') generate random values for Age, Height, and Weight.
  • ROUND rounds the generated values.
  • OUTPUT writes the generated row to the dataset.
  • PROC PRINT prints the dataset.

R

In R, you can use functions like sample and rnorm to generate random data.

# R Example: Using Random Number Functions
set.seed(123)  # For reproducibility

random_data <- data.frame(
  ID = 1:10,
  Age = sample(18:65, 10, replace = TRUE),
  Height = rnorm(10, mean = 170, sd = 10),
  Weight = rnorm(10, mean = 70, sd = 15)
)

print(random_data)


Explanation:

  • set.seed(123) ensures reproducibility of random numbers.
  • data.frame creates a data frame named random_data.
  • sample generates random ages between 18 and 65.
  • rnorm generates normally distributed values for Height and Weight.
  • print outputs the data frame.
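
To see what the seed actually buys you, here is a minimal sketch that wraps the same calls in a small function. The function name make_random_data and the second seed value 456 are made up for this illustration and are not part of the original example.

```r
# Hypothetical helper (not from the original post): builds the same kind of
# random data frame as above, re-seeding the generator on every call.
make_random_data <- function(n, seed) {
  set.seed(seed)
  data.frame(
    ID = seq_len(n),
    Age = sample(18:65, n, replace = TRUE),
    Height = round(rnorm(n, mean = 170, sd = 10), 1),
    Weight = round(rnorm(n, mean = 70, sd = 15), 1)
  )
}

first_run  <- make_random_data(10, seed = 123)
second_run <- make_random_data(10, seed = 123)
other_seed <- make_random_data(10, seed = 456)

# The same seed reproduces the data exactly; a different seed should not.
print(identical(first_run, second_run))
print(identical(first_run, other_seed))
```

Re-seeding inside the function is a design choice for this sketch: it makes every call reproducible on its own, at the cost of resetting the global random number stream each time.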

Method 4: Using Specialized Packages

SAS

SAS's PROC SURVEYSELECT can be used to create random samples from an existing dataset.

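/* SAS Example: Using PROC SURVEYSELECT */
DATA population;
    DO ID = 1 TO 1000;
        Age = FLOOR(18 + RAND('UNIFORM')*47);
        OUTPUT;
    END;
RUN;

/* METHOD=SRS requests simple random sampling without replacement; SEED= makes the draw repeatable */
PROC SURVEYSELECT DATA=population OUT=sample_data METHOD=SRS SAMPSIZE=10 SEED=123;
RUN;

PROC PRINT DATA=sample_data;
RUN;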

Explanation:

  • DATA creates a dataset named population with 1000 rows.
  • FLOOR and RAND('UNIFORM') generate random integer ages from 18 to 64.
  • PROC SURVEYSELECT selects a random sample of 10 rows from population, creating sample_data.
  • PROC PRINT prints the sampled dataset.

R

In R, you can use the dplyr package for powerful data manipulation and sampling.

# R Example: Using dplyr Package
library(dplyr)

population <- tibble(
  ID = 1:1000,
  Age = floor(18 + runif(1000, min = 0, max = 47))
)

sample_data <- population %>% sample_n(10)

print(sample_data)


Explanation:

  • library(dplyr) loads the dplyr package.
  • tibble creates a tibble named population with 1000 rows.
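  • floor and runif generate random integer ages from 18 to 64.
  • sample_n selects a random sample of 10 rows from population. In dplyr 1.0.0 and later, slice_sample(n = 10) is the recommended replacement for sample_n.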
  • print outputs the sampled data.
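
If you would rather not depend on dplyr, the same kind of row sample can be drawn with base R alone. The sketch below rebuilds the population as a plain data frame. The seed value and the name picked_rows are choices made for this illustration, not part of the original example.

```r
set.seed(123)  # arbitrary seed, fixed only so the sample is repeatable

population <- data.frame(
  ID = 1:1000,
  Age = floor(18 + runif(1000, min = 0, max = 47))
)

# sample() draws 10 distinct row indices (without replacement by default)
picked_rows <- sample(nrow(population), size = 10)
sample_data <- population[picked_rows, ]

print(sample_data)
```

Indexing with sampled row numbers keeps the original row names, which can be handy for tracing a sampled row back to the population.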

Method 5: Using Inline Data Entry

SAS

SAS allows inline data entry using CARDS or DATALINES.

/* SAS Example: Using CARDS or DATALINES */
DATA small_data;
    INPUT ID Name $ Age;
    DATALINES;
    1 John 25
    2 Jane 30
    3 Jim 22
    4 Jill 28
    ;
RUN;

PROC PRINT DATA=small_data; RUN;


Explanation:

  • DATA starts a new data step, creating a dataset named small_data.
  • INPUT defines the variables.
  • DATALINES allows manual data entry.
  • PROC PRINT prints the dataset.

R

In R, the tribble function from the tibble package provides a convenient way to enter data inline.

# R Example: Using tribble Function
library(tibble)

small_data <- tribble(
  ~ID, ~Name, ~Age,
  1, "John", 25,
  2, "Jane", 30,
  3, "Jim", 22,
  4, "Jill", 28
)

print(small_data)


Explanation:

  • library(tibble) loads the tibble package.
  • tribble creates a tibble named small_data.
  • Column names are written as formulas (~ID, ~Name, ~Age), and the values follow row by row, so the code reads like the table it creates.
  • print outputs the tibble.
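
For something even closer to the feel of DATALINES, base R can parse whitespace-separated text directly. This is a minimal sketch using read.table with its text argument. It is an alternative to tribble, not what the example above does, and it returns a plain data frame rather than a tibble.

```r
# Inline, whitespace-separated data; the first line supplies the column names
small_data <- read.table(
  text = "ID Name Age
1 John 25
2 Jane 30
3 Jim 22
4 Jill 28",
  header = TRUE,
  stringsAsFactors = FALSE  # keep Name as character on older R versions too
)

print(small_data)
```

Unlike tribble, the character values need no quotes here, but that also means a value containing a space would be split across columns.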

Summary

Both SAS and R offer a variety of methods to create sample data, each suited to different scenarios and preferences. Whether you prefer the structured environment of SAS or the flexible ecosystem of R, understanding these methods can help you efficiently generate sample datasets for your analyses.

By mastering these techniques, you can streamline your data preparation process and focus more on deriving insights from your data. Happy coding!


Feel free to leave a comment or reach out if you have any questions or need further clarifications on creating sample data in SAS or R. Stay tuned for more posts comparing these two powerful programming languages!
