Just started playing with SAS for data analysis. It’s an interesting environment, but as is the case with any new language, I have to keep going back to reference guides for many common tasks. The task at hand today was to split my primary dataset into two: one for training my models and the other for testing.
In Python I used to do this with numpy: generate a new column named randn with random values in [0, 1), then split the primary pandas table into two, one where the value of randn >= 0.8 and the other where the value of randn < 0.8. This gave me good enough subsets for train and test to work with.
    import numpy as np

    # df is the pandas DataFrame with the values
    df['randn'] = np.random.random(len(df))  # uniform values in [0, 1)
    train = df[df.randn < 0.8]   # roughly 80% of the rows
    test = df[df.randn >= 0.8]   # the remaining ~20%
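One thing worth noting about this approach is that the split sizes are only approximately 80/20, and they change on every run unless the generator is seeded. Here is a minimal sketch of a reproducible variant, assuming the larger share is meant for training; the toy DataFrame and the seed value are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame.
df = pd.DataFrame({"x": range(1000)})

# Seeded generator so the split is the same on every run (seed is arbitrary).
rng = np.random.default_rng(42)
df["randn"] = rng.random(len(df))  # uniform values in [0, 1)

train = df[df["randn"] < 0.8]   # roughly 80% of the rows
test = df[df["randn"] >= 0.8]   # the rest

# The sizes are close to 800/200 but generally not exact.
print(len(train), len(test))
```

Every row lands in exactly one of the two subsets, but nothing here controls how the classes of any categorical column are distributed between them.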
SAS has a dedicated procedure to achieve this, with some additional fancy and valuable options. It’s often desirable in data mining workflows to ensure that the subsets are stratified, meaning that if the subpopulations in the original dataset vary based on some categorical variable, then the ideal method should subsample from each of those subpopulations individually.
So, for example, if my target variable contains two classes, one of which is BAD, then I would ideally want the same ratio of the two classes in both of the subsets that I create for training and testing. Here is how to achieve this in SAS.
The block below is all you need to create a new dataset called TRAIN that contains an additional column selected with values 0 and 1. Splitting this dataset on the column selected gives two datasets, one with 80% and the other with 20% of the records, while making sure that there is fair representation of values from the stratification variable targetc.
    PROC SURVEYSELECT DATA=LCSUBSET OUT=TRAIN rate=0.8 outall;
        STRATA targetc;
    RUN;
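For comparison with the pandas approach from earlier, here is a rough sketch of the same idea in Python. It is not a port of what SAS does internally, just one way to sample 80% within each class and flag the chosen rows. The toy data, the label GOOD, and the class counts are made up; only the column names targetc and selected mirror the SAS step above. It assumes pandas 1.1 or later for groupby sampling.

```python
import pandas as pd

# Hypothetical toy data: 900 rows of one class and 100 of the other.
df = pd.DataFrame({
    "targetc": ["GOOD"] * 900 + ["BAD"] * 100,
    "amount": range(1000),
})

# Draw 80% of the rows within each class separately.
picked = df.groupby("targetc").sample(frac=0.8, random_state=0)

# Flag every row with 0/1, similar in spirit to the selected column.
df["selected"] = df.index.isin(picked.index).astype(int)

train = df[df["selected"] == 1]
test = df[df["selected"] == 0]

print(train["targetc"].value_counts())
print(test["targetc"].value_counts())
```

Because the sampling happens per class, both subsets keep the 9:1 class ratio of the toy data, which the plain random-threshold split does not guarantee.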
Now let's quickly validate that the split came out the way we want it.
    PROC FREQ DATA=TRAIN;
        TABLES selected*targetc / PLOTS=all;
    RUN;
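The same kind of check can be sketched in pandas with a cross-tabulation. The DataFrame below is a hypothetical stand-in for the flagged output (the counts and the label GOOD are invented), so treat it as an illustration of what to look for rather than real results.

```python
import pandas as pd

# Hypothetical flagged output: 0/1 selected flag plus the target class.
out = pd.DataFrame({
    "selected": [1] * 800 + [0] * 200,
    "targetc": ["GOOD"] * 720 + ["BAD"] * 80 + ["GOOD"] * 180 + ["BAD"] * 20,
})

# Raw counts per (selected, targetc) cell.
counts = pd.crosstab(out["selected"], out["targetc"])

# Row percentages: class share within each subset.
row_pct = pd.crosstab(out["selected"], out["targetc"], normalize="index")

print(counts)
print(row_pct)
```

If the split is properly stratified, the rows of the row-percentage table should be nearly identical for selected = 0 and selected = 1.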