About once a year my home NAS experiences a disk failure. It's less a catastrophe than an inconvenience, since I use 5 disks in RAID-5 (RAIDZ in ZFS parlance), which can handle a single disk failure. But from the moment a disk fails, the NAS is in a critical state because it cannot sustain another failure. Time is of the essence here: replacing the failed disk with a healthy one as soon as possible is very important.

For various reasons (right or wrong) I chose not to use a commercial NAS and instead built mine using Solaris 11 & ZFS. Unfortunately, it does not come with the capability to send e-mail alerts out of the box. To fill that void I wrote a crude script to monitor my ZPOOL and alert me via mail in the event there is a disk ...
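As a rough illustration of the idea (not the script from this post), here is a minimal Python sketch. It relies on `zpool status -x` printing `all pools are healthy` when nothing is wrong. The function names and the sender and recipient addresses are hypothetical placeholders.

```python
import subprocess
from email.message import EmailMessage

# `zpool status -x` prints exactly this line when every pool is fine.
HEALTHY = "all pools are healthy"


def pools_healthy(status_output):
    """Return True when the `zpool status -x` output reports no problems."""
    return status_output.strip() == HEALTHY


def build_alert(status_output, sender, recipient):
    """Wrap the raw zpool output in an e-mail message (addresses are placeholders)."""
    msg = EmailMessage()
    msg["Subject"] = "ZFS pool needs attention"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(status_output)
    return msg


def check(sender, recipient):
    """Run zpool and return an alert message, or None if all pools are healthy."""
    result = subprocess.run(
        ["zpool", "status", "-x"], capture_output=True, text=True, check=False
    )
    if pools_healthy(result.stdout):
        return None
    return build_alert(result.stdout, sender, recipient)
```

Run from cron, the message returned by `check` could be handed to `smtplib.SMTP("localhost").send_message(msg)`, assuming a local mail relay is available on the NAS.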

## Stratified Sampling in SAS

I just started playing with SAS for data analysis. It's an interesting environment, but as with any new language I have to go back to the reference guides for many common tasks. The task at hand today was to split my primary dataset into two: one for training my models and the other for testing.

With Python, pandas & numpy I used to generate a new column named randn with random values in [0, 1), then split the primary pandas table into two: one where randn < 0.8 and the other where randn >= 0.8. This gave me good enough train & test subsets to work with.

    import numpy as np

    # df is the pandas DataFrame with the values
    df['randn'] = np.random.random(len(df))
    train = df[df.randn <  0.8]   # roughly 80% of the rows
    test  = df[df.randn >= 0.8]   # the remaining ~20%
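The random split above ignores class balance. A stratified split instead draws the same fraction from every group. Below is a minimal sketch in plain Python, so the grouping step stays explicit. The function name `stratified_split`, the `label` key and the toy rows are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, train_frac=0.8, seed=0):
    """Split rows into (train, test), taking train_frac of every stratum."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group the rows by the value of the stratification key.
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)

    train, test = [], []
    for members in strata.values():
        members = members[:]  # copy so the caller's data is not reordered
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Toy data: 50 rows labelled "a" and 10 rows labelled "b".
rows = [{"id": i, "label": "a" if i < 50 else "b"} for i in range(60)]
train, test = stratified_split(rows, key="label")
```

Each label keeps the same 80/20 proportion in both subsets, which a purely random split only achieves on average.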


SAS has a dedicated procedure to ...

## Gathering Tweets in Python

The amount of data freely available for playing around with has increased by leaps and bounds. Among the various data sources, I feel Twitter is one of the most interesting and richest. First and foremost, they have made the system very approachable (documentation & tools), and secondly, the data is interesting in the sense that anybody can find a subset that matters to them.

Occasionally I gather tweets, either to play around with or to create visualizations for identifying trends. One use case for me is to gather tweets during a conference I am interested in and see which trend or technology was most popular during the conference days. I do this several times every year, especially for Hadoop Summit, Strata and PyData.

Below is a small snippet that I use to collect the tweets. It collects tweets in multiple bite-sized compressed files, in raw form that ...
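The chunked, compressed storage idea can be sketched independently of any Twitter client. The example below is an illustration, not the snippet from this post. It assumes the tweets arrive as an iterable of raw text lines, and the function name, file naming scheme and `lines_per_file` default are hypothetical.

```python
import gzip
import os


def write_chunks(lines, out_dir, lines_per_file=1000, prefix="tweets"):
    """Write raw lines into numbered gzip files and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    handle = None
    for i, line in enumerate(lines):
        # Start a new compressed file every `lines_per_file` lines.
        if i % lines_per_file == 0:
            if handle is not None:
                handle.close()
            path = os.path.join(out_dir, f"{prefix}-{len(paths):05d}.jsonl.gz")
            handle = gzip.open(path, "wt", encoding="utf-8")
            paths.append(path)
        # Store each line unmodified, one record per line.
        handle.write(line.rstrip("\n") + "\n")
    if handle is not None:
        handle.close()
    return paths
```

Keeping the files small means an interrupted collection run loses at most one partial chunk, and each file can be decompressed and processed on its own.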

## Bokeh Maps

For some time now I have been searching for a correct shapefile for India. There are some shapefiles at Global Administrative Areas, but they are outdated and they don't correctly represent Kashmir. While cleaning my desktop I came across some shapefiles that seem to be up to date and correct; unfortunately, I don't know where they came from. If you know the source of these shapefiles, please let me know and I will duly credit the original author.

Shapefile for India with State Boundaries

With an updated shapefile and a recent interest in exploring Python for data analysis, I thought of giving Bokeh with Python a shot. I really like where Bokeh is going. It's pretty powerful, albeit not so accessible just yet, but looking at the roadmap I believe more user-friendly and intuitive interfaces (like ggplot) will be coming soon.

Some things I really like about ...