Stratified Sampling in SAS

By Anand Bisen

I have just started playing with SAS for data analysis. It’s an interesting environment, but as with any new language, I keep going back to the reference guides for many common tasks. Today’s task was to split my primary dataset in two: one part for training my models and the other for testing.

With Python, pandas and NumPy I used to add a new column named randn holding uniform random values in [0, 1), then split the primary DataFrame in two: the rows where randn < 0.8 (roughly 80% of the data) for training, and the rows where randn >= 0.8 (the remaining 20% or so) for testing. This gave me good enough subsets for train and test to work with.

import numpy as np

# df is the pandas DataFrame with the values
df['randn'] = np.random.random(len(df))
train = df[df.randn <  0.8]   # roughly 80% of the rows
test  = df[df.randn >= 0.8]   # the remaining ~20%

SAS has a dedicated procedure for this, PROC SURVEYSELECT, with some additional fancy and valuable options. In data mining workflows it’s often desirable to ensure that the subsets are stratified: if the original dataset contains subpopulations defined by some categorical variable, the ideal method samples from each of those subpopulations individually.

So, for example, if my target variable contains two classes, GOOD and BAD, then I would ideally want the same ratio of GOOD to BAD in both the training and the testing subsets. Below is the code that shows how to achieve this in SAS.

The block below is all you need to create a new dataset called TRAIN that contains an additional column, selected, with values 0 and 1. Splitting this dataset on the selected column gives two datasets, one with 80% of the records and the other with 20%, while making sure that every value of the stratification variable targetc is fairly represented in both.

	STRATA targetc;
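
To make the idea behind stratifying on targetc concrete, here is a minimal sketch in plain Python (standard library only, not SAS). It is an illustration under assumptions, not a reproduction of what SAS does internally: the function name stratified_flags, the 70/30 sample data, the seed, and the choice to round each stratum's sample size to the nearest integer are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_flags(labels, rate=0.8, seed=42):
    """Return a 0/1 'selected' flag per record, sampling `rate` of each stratum."""
    rng = random.Random(seed)

    # Group record indices by their stratum (the class label).
    by_stratum = defaultdict(list)
    for idx, label in enumerate(labels):
        by_stratum[label].append(idx)

    selected = [0] * len(labels)
    for indices in by_stratum.values():
        # Assumption: round the per-stratum sample size to the nearest integer.
        k = round(rate * len(indices))
        for idx in rng.sample(indices, k):
            selected[idx] = 1
    return selected

# Hypothetical data: 70 GOOD and 30 BAD records.
targetc = ["GOOD"] * 70 + ["BAD"] * 30
selected = stratified_flags(targetc)

train = [t for t, s in zip(targetc, selected) if s == 1]
test = [t for t, s in zip(targetc, selected) if s == 0]
print(len(train), train.count("GOOD"), train.count("BAD"))  # 80 56 24
print(len(test), test.count("GOOD"), test.count("BAD"))     # 20 14 6
```

Because the sampling happens inside each class separately, both subsets keep the original 70:30 ratio of GOOD to BAD, which a single random cut over the whole dataset only achieves approximately.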

Now we will quickly validate that the split is what we want, using a cross-tabulation of selected against targetc.

	TABLES selected*targetc / PLOTS=all;
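
The cross-tabulation requested by the TABLES statement boils down to counting records for every combination of selected and targetc. As a rough sketch in plain Python (the helper name crosstab and the sample counts are hypothetical, chosen to match an 80/20 split of 70 GOOD and 30 BAD records):

```python
from collections import Counter

def crosstab(selected, targetc):
    """Count records for every (selected, targetc) combination."""
    return Counter(zip(selected, targetc))

# Hypothetical split: 56 GOOD + 24 BAD selected, 14 GOOD + 6 BAD not selected.
selected = [1] * 56 + [0] * 14 + [1] * 24 + [0] * 6
targetc = ["GOOD"] * 70 + ["BAD"] * 30

counts = crosstab(selected, targetc)
for flag in (0, 1):
    # Total number of records in this half of the split.
    total = sum(n for (s, _), n in counts.items() if s == flag)
    for label in ("GOOD", "BAD"):
        share = counts[(flag, label)] / total
        print(f"selected={flag} {label}: {counts[(flag, label)]} ({share:.0%})")
```

If the split is properly stratified, the share of each class should come out the same (here 70% GOOD and 30% BAD) for selected=0 and selected=1.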

