Basic plotting with matplotlib

By Anand Bisen

In this post I have documented some common plots that I typically use while analyzing system performance data. Following the last post, we will use a similar dstat dataset. In these steps I am not trying to reach an inference; rather, I am just plotting the data at hand in different forms, with the intent that these plots should automatically point us in the right direction… “We will let the data speak :-)”

So once the data is in a form that can be easily processed (as explained in the previous post), the next step for me is to analyze it visually by plotting it as simple graphs. The matplotlib module makes it very easy to quickly visualize tabular data and can natively use DataFrames from pandas. For this post I will be using yet another log file generated by dstat. The data used in the example can be downloaded here [ raw_data and cleaned, ready-to-go shelve_db ].

The structure of the DataFrame df is described in the block below. Each column contains the values of a system component identified by the column name. The column epoch holds the Unix timestamp at which the values were measured.

In [60]: df
Out[60]:
	<class 'pandas.core.frame.DataFrame'>
	Int64Index: 3078 entries, 0 to 3077
	Data columns (total 45 columns):
	epoch        3078  non-null values
	net0_recv    3078  non-null values
	net0_send    3078  non-null values
	net1_recv    3078  non-null values
	net1_send    3078  non-null values
	load_1m      3078  non-null values
	load_5m      3078  non-null values
	load_15m     3078  non-null values
	proc_run     3078  non-null values
	proc_blk     3078  non-null values
	proc_new     3078  non-null values
	mem_used     3078  non-null values
	mem_buff     3078  non-null values
	mem_cach     3078  non-null values
	mem_free     3078  non-null values
	pg_in        3078  non-null values
	pg_out       3078  non-null values
	sda_read     3078  non-null values
	sda_write    3078  non-null values
	sdb_read     3078  non-null values
	sdb_write    3078  non-null values
	sdc_read     3078  non-null values
	sdc_write    3078  non-null values
	sdd_read     3078  non-null values
	sdd_write    3078  non-null values
	sde_read     3078  non-null values
	sde_write    3078  non-null values
	sdf_read     3078  non-null values
	sdf_write    3078  non-null values
	sdg_read     3078  non-null values
	sdg_write    3078  non-null values
	sdh_read     3078  non-null values
	sdh_write    3078  non-null values
	sdi_read     3078  non-null values
	sdi_write    3078  non-null values
	int          3078  non-null values
	csw          3078  non-null values
	cpu_usr      3078  non-null values
	cpu_sys      3078  non-null values
	cpu_idl      3078  non-null values
	cpu_wai      3078  non-null values
	cpu_hiq      3078  non-null values
	cpu_siq      3078  non-null values
	swap_used    3078  non-null values
	swap_free    3078  non-null values
	dtypes: float64(45)
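The cleaned dataset is distributed as a Python shelve database. A minimal round-trip sketch of how such a database can be written and read back is shown below; the file name and the key 'df' are assumptions for illustration, not taken from the post (check db.keys() on the downloaded file for the actual key).

```python
import shelve
import pandas as pd

# Hypothetical stand-in for the post's cleaned DataFrame
demo = pd.DataFrame({'epoch': [1.0, 2.0], 'cpu_idl': [90.0, 80.0]})

# Store the DataFrame in a shelve database (how shelve_db was likely built)
with shelve.open('dstat_demo_db') as db:
    db['df'] = demo

# Load it back for analysis
with shelve.open('dstat_demo_db') as db:
    df = db['df']
```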

A good place to start exploring the dataset is to plot the graph of CPU utilization to identify the window of time when there was some action happening on the system. A quick way to do that is to plot the value of 100 - cpu_idl, as that represents the total non-idle time (cpu_usr + cpu_sys + cpu_wai, plus the interrupt components).

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# 100 - cpu_idl gives the total busy percentage
plt.plot(df['epoch'], 100 - df['cpu_idl'])
plt.xlabel("Time")
plt.ylabel("CPU Utilization")
plt.title("CPU Busy")

I really like matplotlib's ability to let you interact with the plot and zoom into a selection. The figure below shows how we can zoom into the section we are interested in.

figure 1b

The figure below shows the result of the selection, where we can see the magnified view of CPU utilization. The red boxes and arrows represent the “Zoom to selection” toolbar action and the selection that was used, respectively.

figure 2
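The toolbar zoom is interactive, but the same effect can be scripted with axis limits. A small self-contained sketch (synthetic data and illustrative limit values, not from the post's dataset):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the epoch / utilization series
epoch = np.arange(600)
util = 20 + 10 * np.random.rand(600)

plt.plot(epoch, util)
# Programmatic equivalent of the "Zoom to selection" toolbar action:
# restrict the x-axis to the window of interest
plt.xlim(200, 400)
```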

Since we are sampling the utilization every second and there is a lot of variation (even after zooming in), it sometimes helps to smooth out the graph to get a new perspective on the average utilization. I found a useful snippet for Signal Smoothing at SciPy.org that can be used to smooth the signal, which can then be plotted again using the same method.

The function smooth() implements various window types (flat, hamming, hanning, bartlett and blackman) and also provides an option to choose the window length for smoothing. For this dataset the defaults worked just fine.
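The recipe itself is short. Here is a sketch of smooth() adapted from the SciPy Cookbook signal-smoothing entry; the defaults are assumed to match the usage in this post:

```python
import numpy as np

def smooth(x, window_len=11, window='hanning'):
    """Smooth a 1-D signal with a moving window (SciPy Cookbook recipe)."""
    x = np.asarray(x, dtype=float)
    if x.size < window_len:
        raise ValueError("Input vector needs to be bigger than window size.")
    if window_len < 3:
        return x
    if window not in ('flat', 'hanning', 'hamming', 'bartlett', 'blackman'):
        raise ValueError("Unknown window type")
    # Pad the signal with mirrored copies at both ends to reduce edge effects
    s = np.r_[x[window_len-1:0:-1], x, x[-2:-window_len-1:-1]]
    if window == 'flat':  # flat window is a plain moving average
        w = np.ones(window_len)
    else:
        w = getattr(np, window)(window_len)
    # Convolve with the normalized window
    return np.convolve(w / w.sum(), s, mode='valid')
```

Note that, as in the cookbook version, the output is slightly longer than the input (by window_len - 1 samples) because of the mirrored padding.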

plt.plot(smooth(df['epoch']), smooth(100-df['cpu_idl']))

figure 3

For clarity, if we plot just the smoothed data (still zoomed into the same area of interest), it results in the following graph. The smoothed graph gives a much clearer picture of CPU utilization. The second figure is the complete (zoomed-out) figure for the entire dataset.

Next we will dig a bit deeper into the individual components of CPU utilization. One indicator of inefficient use of system resources is cpu_wai. This is time the CPU is not performing any useful work and is just waiting for an event, such as access to data (IO).

The logs from dstat provide multiple columns that represent CPU utilization broken down into individual components: cpu_wai, cpu_sys, cpu_usr and cpu_idl. So let's plot a graph with two lines, one representing potentially efficient cycles (cpu_usr + cpu_sys) and the other representing wasted cycles (cpu_wai). This time we will also add a legend to the graph.

# Clear the previous figure
plt.clf()

plt.plot(smooth(df['epoch'], window_len=33),
         smooth(df['cpu_sys'] + df['cpu_usr'], window_len=33),
         "g-", label="cpu sys+usr")
plt.plot(smooth(df['epoch'], window_len=33),
         smooth(df['cpu_wai'], window_len=33),
         "r-", label="cpu wait")

plt.legend()
plt.xlabel("Time")
plt.ylabel("CPU Utilization")
plt.title("CPU Busy")

figure 6

Another way to investigate CPU wait is the state of the processes, which is exposed by the /proc/stat interface and is collected by dstat as well. The two measures we are interested in are proc_run and proc_blk, which represent running and blocked processes respectively.

# Initialize a 2x2 subplot grid; start in the first cell
x = plt.subplot(2, 2, 1)
plt.plot(df["proc_run"], "bo")
plt.title("Running Processes")

# Now move to the second cell of the grid
x = plt.subplot(2, 2, 2)
plt.plot(df["proc_blk"], "ro")
plt.title("Blocked Processes")

# Third cell (2,1): plot the frequencies of the running processes
x = plt.subplot(2, 2, 3)
plt.hist(df["proc_run"], bins=50)
plt.title("Running Processes Frequency")
plt.xlabel("Number of Running Processes")

# Last cell: frequencies of the blocked processes
x = plt.subplot(2, 2, 4)
plt.hist(df["proc_blk"], bins=50)
plt.title("Blocked Processes Frequency")
plt.xlabel("Number of Blocked Processes")

figure 10

Now let's see if there is any correlation between the number of running processes and the number of blocked processes. This can easily be quantified with the correlation method pandas provides on a Series. Alternatively, we can see it visually using a scatter plot. As the code block below shows, there is very little correlation between the two variables; nevertheless we will go ahead and plot the data as well.

In [188]: df.proc_run.corr(df.proc_blk)
Out[188]: 0.23837856548713582
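As a self-contained illustration of that call, Series.corr computes the Pearson correlation coefficient by default. The data below is synthetic, just to show the shape of the API:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the proc_run / proc_blk columns
rng = np.random.default_rng(0)
run = pd.Series(rng.integers(0, 10, 500).astype(float))
blk = pd.Series(rng.integers(0, 5, 500).astype(float))

# Pearson correlation by default; always falls in [-1, 1]
r = run.corr(blk)
```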

plt.plot(df["proc_blk"], df["proc_run"], "ro")
plt.xlabel("Blocked Processes")
plt.ylabel("Running Processes")
plt.title("Scatter Plot of Running Processes vs Blocked Processes")

figure 8

So we know that there is a significant amount of CPU wait cycles, which generally points to a bottleneck in storage. Let's plot the cpu_wai graph aligned with the block device that is being used for IO (/dev/sdb in this case).

x = plt.subplot(3, 1, 1)
plt.title("Read IO to /dev/sdb")
x.plot(smooth(df['sdb_read'] / (1024 * 1024)), "b-")

x = plt.subplot(3, 1, 2)
plt.title("Write IO to /dev/sdb")
x.plot(smooth(df['sdb_write'] / (1024 * 1024)), "g-")

x = plt.subplot(3, 1, 3)
plt.title("CPU IO Wait")
x.plot(smooth(df['cpu_wai']), "r-")

In the next post we will continue to work with the same dataset and dig deeper to see what patterns and correlations we can find between these variables.
