Playing with Pandas

By Anand Bisen

I would categorize myself as an occasional R user, typically reaching for R once a quarter while trying to make sense of some data I get hold of. It's a great tool with a lot of power. But I use Python a little more than R, and I came across the pandas module, which provides much of the functionality I tend to use in R. With this post, and possibly some more posts in the future, I will try to document pandas as I learn the module.

Disclaimer :) I am not a coder, so it's very likely that there are more optimal ways to achieve the same tasks. If you know how the content of this post can be improved or corrected, please feel free to use the comment box at the end of the post. Now that the disclaimer is out of the way, I can start speaking my mind.

The "pandas" module for Python provides very powerful functions and data structures for data analysis. It is heavily influenced by R (or S) and, like R, it provides a 2D data structure called the DataFrame. But the differences are big enough that sometimes it just takes time to figure out how a certain basic task should be performed. In this post I am documenting some simple data munging tasks that I regularly perform.

Working dataset (compressed)

The dataset we will be using is the system utilization of a server, collected using dstat during an IO workload. If you have not used Dag Wieers' dstat, check it out; it's a great tool for aggregating various metrics on a Linux system.

The command line I usually use to collect the metrics is shown below. It collects most of the individual metrics but aggregates the CPU utilization into a single variable; on multi-core, multi-socket systems per-core reporting would otherwise translate into many columns (16 or more) just for CPU. The output filename encodes the host the log was collected from along with a timestamp.

$ dstat -Tnlfvs -C total --output=`hostname`-`date +%Y-%m-%d-%H-%M-%S`.csv

Many times these log files become pretty big, and I end up gzipping them individually to save space while archiving them for record keeping. As you will see in the later sections, pandas can transparently deal with gzipped files, which obviates the need to uncompress the files before analysis. Pretty cool :)

The file that I will be using can be downloaded here. It's 13.4MB uncompressed and 1.3MB compressed (a 9.7x reduction). The log contains ~50K samples across 52 measures. Below are the first few lines of the log file (cleaned up for clarity).

$ gzcat dstat-output-s01-07-31-05-26-20.csv.gz | more
	"Dstat 0.7.2 CSV output"
	"Author:","Dag Wieers <dag@wieers.com>",......
	"Host:","s01",,,,"User:","root"
	"Cmdline:","dstat -Tnlfvs -C total --output=dstat-output-s01-07-31-05-26-20.csv"
	"Date:","31 Jul 2013 05:26:20 BST"

	"epoch","net/eth0",,"load avg",,,"procs",,,"memory usage",,,,.......
	"epoch","recv","send","1m","5m","15m","run","blk","new",............
	1375244780.498,0.0,0.0,0.480,0.400,0.360,0.0,0.0,2.475,.............
	1375244781.499,10020.0,1014.0,0.520,0.410,0.370,0.0,0.0,3.0,........

The log file starts with a 5-line header followed by 2 lines of column names, which are then followed by the measured values. I will skip the first 6 lines and use the 7th line (the second row of column names) as the header, reading the data from there onwards.

Reading CSV to a DataFrame

In order to do anything useful with the data, we have to start by loading it from the file into a DataFrame.

import pandas as pd

FILENAME = "dstat-output-s01-07-31-05-26-20.csv.gz"

# Skip the 5-line header plus the first row of column names, use the next
# row (line 7) as the header, and read straight from the gzipped file.
df = pd.read_csv(FILENAME, index_col=False, compression="gzip", header=0, skiprows=6)

This reads the content of the CSV file into the DataFrame df. All the arguments to pd.read_csv() are self-explanatory with the exception of index_col. Depending on how the file is laid out (for example when each line ends with a trailing delimiter), pandas can end up using the first column as the index. If you don't want the first column to be used as the index, pass index_col=False.
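A quick sanity check after loading never hurts; this minimal sketch just confirms the shape and the first few column names (the counts are what I expect from this particular log):

# Quick sanity check on what was loaded
nrows, ncols = df.shape
print nrows, ncols      # roughly 47K samples across the dstat measures
print df.columns[:6]    # first few column names picked up from line 7 of the file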

Listing the content of DataFrame

Interactive visual validation is always a must while dealing with unstructured data, and the pandas DataFrame provides many methods that help in that respect.

For a large data set, just typing df at the prompt results in an output like this:

In [13]: df
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Index: 47329 entries, 1375244780.5 to 1375292108.5
Data columns (total 52 columns):
	recv       47329  non-null values
	send       47329  non-null values
	1m         47329  non-null values
	5m         47329  non-null values
	15m        47329  non-null values
	run        47329  non-null values
	blk        47329  non-null values
	.
	.
	.
	.
	used.1     47329  non-null values
	free.1     47329  non-null values
	dtypes: float64(52)

This output is not what one would typically expect; personally I expected a tabular output with all of the values in the DataFrame. But this output is also very telling, as it lists the names of all the columns and the number of values in each column.

Moreover, for large datasets a columnar dump of the entire data set has limited value, as one would typically filter the output to make it legible. The sections below show a few different ways to view the content of a DataFrame.
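As an aside, the point at which pandas falls back to this summary view, and how wide the printed table can be, are governed by display options. The exact option names vary a little between versions, but in recent pandas something like the following works:

# Show up to 100 rows before truncating/summarizing the output
pd.set_option("display.max_rows", 100)

# Allow a wider console representation so more columns fit per line
pd.set_option("display.width", 200)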

Listing top 10 records of selected columns

In [42]: df[["int","csw","sys","wai"]].head(10)
Out[42]:
	        int       csw    sys    wai
	0  4025.386  4573.767  0.460  1.573
	1   451.000   634.000  0.103  0.000
	2   229.000   464.000  0.000  0.000
	3   267.000   503.000  0.000  0.000
	4   253.000   514.000  0.000  0.000
	5   255.000   496.000  0.000  0.000
	6   255.000   514.000  0.000  0.000
	7   260.000   504.000  0.070  0.000
	8   283.000   535.000  0.000  0.000
	9   254.000   504.000  0.000  0.000

Listing last 10 records of selected columns

In [43]: df[["int","csw","sys","wai"]].tail(10)
Out[43]:
	       int  csw    sys  wai
	47319  179  352  0.000    0
	47320  198  392  0.099    0
	47321  209  416  0.040    0
	47322  251  436  0.000    0
	47323  189  369  0.000    0
	47324  175  371  0.000    0
	47325  211  399  0.000    0
	47326  192  393  0.000    0
	47327  249  467  0.083    0
	47328  192  389  0.000    0

Listing a slice of data (columns & rows)

The df.iloc indexer used below allows you to specify the range of rows and columns you would like to look at.

In [66]: df.iloc[100:115,3:6]
Out[66]:
	       1m    5m   15m
	100  1.05  0.59  0.44
	101  1.13  0.62  0.44
	102  1.13  0.62  0.44
	103  1.13  0.62  0.44
	104  1.13  0.62  0.44
	105  1.13  0.62  0.44
	106  1.12  0.62  0.45
	107  1.12  0.62  0.45
	108  1.12  0.62  0.45
	109  1.12  0.62  0.45
	110  1.12  0.62  0.45
	111  1.19  0.65  0.46
	112  1.19  0.65  0.46
	113  1.19  0.65  0.46
	114  1.19  0.65  0.46

Listing data in JSON, CSV, Text or HTML

Pandas makes it super easy to export the entire DataFrame, or a subset of it, into various formats.

In [79]: df.iloc[100:115,3:6].to_json()
Out[79]: '{"1m":
         {"100":1.05,"101":1.13,"102":1.13,"103":1.13,"104":1.13,
          "105":1.13,"106":1.12,"107":1.12,"108":1.12,"109":1.12,
          "110":1.12,"111":1.19,"112":1.19,"113":1.19,"114":1.19},
          "5m":{"100":0.59,"101":0.62,"102":0.62,"103":0.62,"104":0.62,
          "105":0.62,"106":0.62,"107":0.62,"108":0.62,"109":0.62,"110":0.62,
          "111":0.65,"112":0.65,"113":0.65,"114":0.65},"15m":
         {"100":0.44,"101":0.44,"102":0.44,"103":0.44,"104":0.44,"105":0.44,
          "106":0.45,"107":0.45,"108":0.45,"109":0.45,"110":0.45,
          "111":0.46,"112":0.46,"113":0.46,"114":0.46}}'

Notice the use of the print statement with the .to_string() method below. Without print the output is still there, but the newline characters are shown as escape sequences rather than rendered, which defeats the purpose of visually inspecting the content interactively.

In [81]: print df.iloc[100:115,3:6].to_string()
	       1m    5m   15m
	100  1.05  0.59  0.44
	101  1.13  0.62  0.44
	102  1.13  0.62  0.44
	103  1.13  0.62  0.44
	104  1.13  0.62  0.44
	105  1.13  0.62  0.44
	106  1.12  0.62  0.45
	107  1.12  0.62  0.45
	108  1.12  0.62  0.45
	109  1.12  0.62  0.45
	110  1.12  0.62  0.45
	111  1.19  0.65  0.46
	112  1.19  0.65  0.46
	113  1.19  0.65  0.46
	114  1.19  0.65  0.46

Listing a large DataFrame

While interactively working with medium/large (it's all relative, eh?) datasets, I sometimes still find it useful to list the entire DataFrame on the console to look for patterns. The DataFrame provides a method .to_string() that converts the DataFrame into plain text that can be printed on the console.

I find the following function more convenient, as it gives me some flexibility in choosing the range of columns that I am interested in (or am able to fit on my terminal) and allows me to page through the content screen by screen.

def listdf(df, npp=15, st_idx=None, en_idx=None):

    # Inputs:
    #       df: The data-frame to page through
    #      npp: Number of records per page (default=15)
    #   st_idx: Starting column number (default=0)
    #   en_idx: Ending column number (default=last one)

    # Get the size of the DataFrame
    nrows, ncols = df.shape

    # Number of pages = number of rows / records per page
    total_pages = nrows / npp

    # Number of records in the last (partial) page
    rec_last_pg = nrows % npp

    # Set starting and ending column index if not pre-defined
    if st_idx is None:
        st_idx = 0

    if en_idx is None:
        en_idx = ncols

    ptr = 0
    for i in range(total_pages):

        # From - "fmr" | To - "tor"
        fmr = ptr
        tor = fmr + npp
        ptr = tor + 1

        print df.ix[fmr:tor, st_idx:en_idx].to_string()

        try:
            raw_input("Press 'Enter' to continue or '^C' to break...")

        except KeyboardInterrupt:
            print "Breaking loop"
            break

A sample screen shows what the output looks like using listdf():

In [67]: listdf(df, st_idx=0, en_idx=10)
           epoch   recv  send    1m    5m   15m  run  blk     new        used
0   1.375226e+09      0     0  0.52  0.42  0.39    0    0   2.482  1614442496
1   1.375226e+09    488  1014  0.52  0.42  0.39    0    0   0.000  1615060992
2   1.375226e+09   1018   406  0.52  0.42  0.39    0    0   3.000  1615052800
3   1.375226e+09    368   406  0.52  0.42  0.39    0    0   0.000  1614999552
4   1.375226e+09   1367   626  0.48  0.41  0.39    0    0   8.000  1614876672
5   1.375226e+09   1564   658  0.48  0.41  0.39    0    0   1.000  1614876672
6   1.375226e+09    410   556  0.48  0.41  0.39    0    0   0.000  1614868480
7   1.375226e+09   1209   492  0.48  0.41  0.39    0    0   3.000  1614868480
8   1.375226e+09    368   406  0.48  0.41  0.39    0    0   0.000  1615491072
9   1.375226e+09    402   610  0.52  0.42  0.39    0    0  14.000  1622495232
10  1.375226e+09    956   492  0.52  0.42  0.39    0    0   1.000  1622654976
11  1.375226e+09    768   664  0.52  0.42  0.39    1    0   7.000  1624350720
12  1.375226e+09  10242   406  0.52  0.42  0.39    0    1  32.000  1627070464
13  1.375226e+09    766   948  0.52  0.42  0.39    1    0  30.000  1627877376
14  1.375226e+09    948   492  0.48  0.41  0.39    0    0  10.000  1627705344
15  1.375226e+09    780   760  0.48  0.41  0.39    1    0   9.000  4526784512

Press 'Enter' to continue or '^C' to break...

With a single statement (using the above function) on my IPython console I can review the content of a DataFrame page by page, with some flexibility in the selection of columns.

Cleaning up the DataFrame

Converting Unix “epoch” to human readable

You will notice that the first column, "epoch", is of limited to no use in the format it's presented in. First of all, we humans (at least I) don't think in epoch time, and even if we try hard to make it work, presenting it in exponential notation makes it completely useless. So first we will convert the epoch to a human readable format. The apply function provided by pandas lets us quickly apply any function to selected rows/columns.

import time

# assuming the DataFrame "df" is still in place,
# convert each epoch value into a human-readable string
dt = df["epoch"].apply(time.ctime)

# now insert the Series dt into the DataFrame
# with the column name "datetime"
df["datetime"] = dt

Now let's see if the contents of the new column are what we would like them to be:

# Print rows 1-10 and the last few columns (45 onwards) on the console
In [170]: print df.ix[1:10, 45:].to_string()
	  usr    sys      idl    wai  hiq  siq  used.1  free.1         datetime
1   0.052  0.103   99.845    0    0    0       0       0  Tue Jul 30 21:26:21 2013
2   0.055  0.000   99.945    0    0    0       0       0  Tue Jul 30 21:26:22 2013
3   0.101  0.000   99.899    0    0    0       0       0  Tue Jul 30 21:26:23 2013
4   0.073  0.000   99.927    0    0    0       0       0  Tue Jul 30 21:26:24 2013
5   0.067  0.000   99.933    0    0    0       0       0  Tue Jul 30 21:26:25 2013
6   0.000  0.000  100.000    0    0    0       0       0  Tue Jul 30 21:26:26 2013
7   0.279  0.070   99.652    0    0    0       0       0  Tue Jul 30 21:26:27 2013
8   0.096  0.000   99.904    0    0    0       0       0  Tue Jul 30 21:26:28 2013
9   0.053  0.000   99.947    0    0    0       0       0  Tue Jul 30 21:26:29 2013
10  0.050  0.050   99.900    0    0    0       0       0  Tue Jul 30 21:26:30 2013
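As an aside, if you prefer real timestamps over the strings that time.ctime produces, pandas can do the conversion itself; a sketch, assuming a pandas version where to_datetime accepts unit="s":

# An alternative: proper pandas Timestamps instead of ctime strings
ts = pd.to_datetime(df["epoch"], unit="s")
print ts.head()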

Relabeling the column names

I like working with DataFrames that have clean column names. There are two places this could be done: ideally the ingest source already has clean names and the following process is not required. Unfortunately that is not always possible, and to address that, pandas makes it very easy to rename the columns in a single command. The following code block shows how we can replace the column names with a list of new names.

In [171]: colname=[ "epoch",                                      \
		"net_recv", "net_send",                                       \
		"load_1m", "load_5m", "load_15m",                             \
		"proc_run", "proc_blk", "proc_new",                           \
		"mem_used", "mem_buff", "mem_cache", "mem_free",              \
		"pg_in", "pg_out",                                            \
		"sda_read", "sda_writ",                                       \
		"sfa00_read", "sfa00_writ", "sfa01_read", "sfa01_writ",       \
		"sfa02_read", "sfa02_writ", "sfa03_read", "sfa03_writ",       \
		"sfa04_read", "sfa04_writ", "sfa05_read", "sfa05_writ",       \
		"sfa06_read", "sfa06_writ", "sfa07_read", "sfa07_writ",       \
		"sfa08_read", "sfa08_writ", "sfa09_read", "sfa09_writ",       \
		"sfa10_read", "sfa10_writ", "sfa11_read", "sfa11_writ",       \
		"sfa12_read", "sfa12_writ",                                   \
		"int", "csw",                                                 \
		"cpu_usr", "cpu_sys", "cpu_idl", "cpu_wai",                   \
		"cpu_hiq", "cpu_siq",                                         \
		"swap_used", "swap_free",                                     \
		"datetime" ]

In [172]: df.columns = colname

In [173]: df
Out[173]:
	<class 'pandas.core.frame.DataFrame'>
	Int64Index: 47329 entries, 0 to 47328
	Data columns (total 54 columns):
	epoch         47329  non-null values
	net_recv      47329  non-null values
	net_send      47329  non-null values
	load_1m       47329  non-null values
	load_5m       47329  non-null values
	load_15m      47329  non-null values
	proc_run      47329  non-null values
	proc_blk      47329  non-null values
	proc_new      47329  non-null values
	mem_used      47329  non-null values
	mem_buff      47329  non-null values
	mem_cache     47329  non-null values
	mem_free      47329  non-null values
	pg_in         47329  non-null values
	pg_out        47329  non-null values
	sda_read      47329  non-null values
	sda_writ      47329  non-null values
	sfa00_read    47329  non-null values
	sfa00_writ    47329  non-null values
	sfa01_read    47329  non-null values
	.
	.
	.
	.
	.
	cpu_wai       47329  non-null values
	cpu_hiq       47329  non-null values
	cpu_siq       47329  non-null values
	swap_used     47329  non-null values
	swap_free     47329  non-null values
	datetime      47329  non-null values
	dtypes: float64(53), object(1)
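If only a handful of names need fixing rather than the whole set, DataFrame.rename with a mapping does the job without touching the other columns; a small sketch (the old/new names below are purely illustrative):

# Rename a couple of columns in place, leaving the rest untouched
df.rename(columns={"recv": "net_recv", "send": "net_send"}, inplace=True)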

Deleting uninteresting columns

The next step I take towards cleaning up a DataFrame before analysis is to get rid of the columns that I have no interest in, or the ones that do not contain any interesting data; in this example, the block devices that did not register any activity.

Using the listdf() function we can tell that, out of the 13 block devices (sfa00..sfa12), only two seem to be active during this measurement window. Now let's validate that and get rid of the columns that did not register any IO activity.

# The code block below subsets the data-frame df on a condition
# where the selected rows are the rows where "sfa09_writ" !=0
In [183]: len ( df[ df["sfa09_writ"] !=0 ] )
Out[183]: 0

Additionally, the describe() method can be used to further validate that a column has not registered much (or any) activity. describe() provides a summary (count, mean, std, min, quartiles, max) of the specified column.

In [187]: df["sfa09_read"].describe()
Out[187]:
	count    47329.000000
	mean         0.259757
	std         56.482978
	min          0.000000
	25%          0.000000
	50%          0.000000
	75%          0.000000
	max      12288.000000
	dtype: float64

By using listdf for visual inspection, conditional filters, and describe(), I have identified that the sfa02, sfa04, sfa05, sfa06, sfa07, sfa08, sfa09, sfa10, sfa11 and sfa12 columns (both read and write) have not registered any IO and can be safely dropped, as shown below.
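A sketch of dropping them in one go (I build the list of read/write column pairs explicitly here, but it could just as well be derived programmatically from describe() output):

# Columns for block devices that registered no IO during the run
inactive = []
for dev in ["sfa02", "sfa04", "sfa05", "sfa06", "sfa07",
            "sfa08", "sfa09", "sfa10", "sfa11", "sfa12"]:
    inactive += [dev + "_read", dev + "_writ"]

# axis=1 tells drop() to remove columns rather than rows
df = df.drop(inactive, axis=1)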

Saving the DataFrame for re-use

Once the DataFrame is in a state that is amenable to analysis, I typically save it to a persistent store so that it can be used over and over again. For that I use the standard "shelve" module in Python.

import shelve

# open (or create) the shelve file that will hold the object
db=shelve.open("dataframe.dbm")

# assign the objects that we would like to save
db["df"]=df

# close the open file descriptor
db.close()

Once the objects have been saved using shelve they can be accessed at a later point in time. While performing shotgun analysis, I find it helpful to save the cleansed DataFrames to a file; it saves time and makes it easier to pick up from where I left off. Below is a snippet that shows how you can quickly get back to your saved objects.

import shelve

# open the file that contains your objects
db = shelve.open("dataframe.dbm")

# get the object you want from the dictionary
df = db["df"]
db.close()

# the variable df now contains the DataFrame we saved earlier
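As an alternative to shelve, recent pandas versions can also pickle a DataFrame directly; a minimal sketch (the filename is just illustrative):

# Save and restore the DataFrame with pandas' own pickle helpers
df.to_pickle("dataframe.pkl")
df = pd.read_pickle("dataframe.pkl")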