In this post I document some common plots that I typically use while
analyzing system performance data. Following on from the last post, we will
use the dstat dataset. In these steps I am not trying to reach an
inference; rather, I am simply plotting the data at hand in different forms,
with the intent that these plots should automatically point us in the right
direction… “We will let the data speak :-)”
Once the data is in a form that can be easily processed (as explained in the previous post), my next step is to visually analyze it by plotting simple graphs. The matplotlib module makes it very easy to quickly visualize tabular data and works natively with pandas DataFrames. For this post I will be using another log file generated by dstat. The data used in the example can be downloaded here [ raw_data and cleaned ready-to-go shelve_db ].
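The shelve_db linked above holds the cleaned DataFrame from the previous post. As a minimal sketch of how such a round trip through Python's shelve module works — the file name `dstat_demo_db`, the key `'df'`, and the sample values below are all assumptions for illustration, not the actual names used:

```python
import shelve
import pandas as pd

# Round-trip a DataFrame through a shelve db; 'dstat_demo_db' and the
# key 'df' are hypothetical -- the real shelve_db may use different names.
demo = pd.DataFrame({'epoch': [1.0, 2.0, 3.0],
                     'cpu_idl': [99.0, 45.0, 97.0]})
with shelve.open('dstat_demo_db') as db:
    db['df'] = demo        # pickle the cleaned DataFrame once

with shelve.open('dstat_demo_db') as db:
    df = db['df']          # later sessions reload it instantly
```

Loading from shelve avoids re-parsing the raw dstat log every time the analysis is rerun.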
The structure of the DataFrame df is described in the block below. Each
column contains the values of a system component, identified by the column
name. The epoch column holds the Unix timestamp at which the values were
measured.
In : df
Out:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3078 entries, 0 to 3077
Data columns (total 45 columns):
epoch        3078 non-null values
net0_recv    3078 non-null values
net0_send    3078 non-null values
net1_recv    3078 non-null values
net1_send    3078 non-null values
load_1m      3078 non-null values
load_5m      3078 non-null values
load_15m     3078 non-null values
proc_run     3078 non-null values
proc_blk     3078 non-null values
proc_new     3078 non-null values
mem_used     3078 non-null values
mem_buff     3078 non-null values
mem_cach     3078 non-null values
mem_free     3078 non-null values
pg_in        3078 non-null values
pg_out       3078 non-null values
sda_read     3078 non-null values
sda_write    3078 non-null values
sdb_read     3078 non-null values
sdb_write    3078 non-null values
sdc_read     3078 non-null values
sdc_write    3078 non-null values
sdd_read     3078 non-null values
sdd_write    3078 non-null values
sde_read     3078 non-null values
sde_write    3078 non-null values
sdf_read     3078 non-null values
sdf_write    3078 non-null values
sdg_read     3078 non-null values
sdg_write    3078 non-null values
sdh_read     3078 non-null values
sdh_write    3078 non-null values
sdi_read     3078 non-null values
sdi_write    3078 non-null values
int          3078 non-null values
csw          3078 non-null values
cpu_usr      3078 non-null values
cpu_sys      3078 non-null values
cpu_idl      3078 non-null values
cpu_wai      3078 non-null values
cpu_hiq      3078 non-null values
cpu_siq      3078 non-null values
swap_used    3078 non-null values
swap_free    3078 non-null values
dtypes: float64(45)
A good place to start exploring the dataset is to plot the graph of CPU
utilization to identify the window of time when there was some action happening
on the system. A quick way to do that is to plot the value of the
100 - cpu_idl column, since that represents the value of
(cpu_wai + cpu_usr + cpu_sys).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

plt.plot(df['epoch'], 100 - df['cpu_idl'])
plt.xlabel("Time")
plt.ylabel("CPU Utilization")
plt.title("CPU Busy")
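As an aside, since epoch holds raw Unix timestamps, the x axis above shows seconds since 1970. Converting the column with pandas gives readable time labels instead; a small sketch with made-up timestamps (the sample values are hypothetical, not from the dataset):

```python
import pandas as pd

# Hypothetical sample timestamps; in the post you would convert df['epoch'].
epochs = pd.Series([1357034400.0, 1357034460.0])
ts = pd.to_datetime(epochs, unit='s')  # seconds since the Unix epoch
print(ts.iloc[0])  # 2013-01-01 10:00:00
```

The converted series can then be passed to plt.plot() in place of df['epoch'] so matplotlib renders proper timestamps on the x axis.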
I really like that matplotlib lets you interact with the plot and zoom into a selection. The figure below shows how we can zoom into the section we are interested in.
The next figure shows the result of the selection: a magnified view of CPU utilization. The red boxes and arrows represent the “Zoom to selection” toolbar action and the selection that was used, respectively.
Since we are sampling the utilization every second, there is a lot of variation (even after zooming in), so it sometimes helps to smooth out the graph and get a new perspective on the average utilization. I found a useful snippet for Signal Smoothing at SciPy.org that can be used to smooth the signal, which can then be plotted using the same method.
The function smooth() implements several window algorithms, such as
blackman, and also provides an option to choose the
window length for smoothing. For this dataset the defaults worked just fine.
For clarity, if we plot only the smoothed data (still zoomed into the same area of interest), we get the following graph, which gives a much clearer picture of CPU utilization. The second figure is the complete (zoomed-out) view of the entire dataset.
Next we will dig a bit deeper into the individual components of CPU utilization. One indicator of inefficient use of system resources is “cpu_wait” (the cpu_wai column in our data). This is time during which the CPU is not performing any useful work and is just waiting for an event, such as access to data (IO).
The logs from
dstat provide multiple columns that represent CPU
utilization broken down into individual components, such as
cpu_idl. So let’s plot a graph with two lines: one representing
potentially efficient cycles (cpu_usr + cpu_sys) and the other representing
wasted cycles (cpu_wai). This time we will also add legends to our graph.
# Clear the previous figure
plt.clf()
plt.plot(smooth(df['epoch'], window_len=33),
         smooth(df['cpu_sys'] + df['cpu_usr'], window_len=33),
         "g-", label="cpu sys+usr")
plt.plot(smooth(df['epoch'], window_len=33),
         smooth(df['cpu_wai'], window_len=33),
         "r-", label="cpu wait")
plt.legend()
plt.xlabel("Time")
plt.ylabel("CPU Utilization")
plt.title("CPU Busy")
Another way to investigate CPU wait is the state of processes, which is
exposed by the
/proc/stat interface and is collected by
dstat as well.
The two measures we are interested in are
proc_run and proc_blk, which
represent running and blocked processes respectively.
# Initialize a subplot grid with 2 rows and 2 columns
x = plt.subplot(2, 2, 1)
plt.plot(df["proc_run"], "bo")
plt.title("Running Processes")

# Now move to the second cell of the grid
x = plt.subplot(2, 2, 2)
plt.plot(df["proc_blk"], "ro")
plt.title("Blocked Processes")

# Third cell: plot the frequencies of the running processes
x = plt.subplot(2, 2, 3)
plt.hist(df["proc_run"], bins=50)
plt.title("Running Processes Frequency")
plt.xlabel("Number of Running Processes")

# Last cell: frequencies of the blocked processes
x = plt.subplot(2, 2, 4)
plt.hist(df["proc_blk"], bins=50)
plt.title("Blocked Processes Frequency")
plt.xlabel("Number of Blocked Processes")
Now let’s see if there is any correlation between the number of running processes and the number of blocked processes. This can be easily quantified with the correlation method pandas provides on a Series. Alternatively, we can see it visually using a scatter plot. As the code block below shows, there is very little correlation between the two variables; nevertheless we will go ahead and plot the data as well.
In : df.proc_run.corr(df.proc_blk)
Out: 0.23837856548713582

plt.plot(df["proc_blk"], df["proc_run"], "ro")
plt.xlabel("Blocked Processes")
plt.ylabel("Running Processes")
plt.title("Scatter Plot of Running Processes vs Blocked Processes")
So we know there is a significant amount of CPU wait cycles, which generally
points to a bottleneck in storage. Let’s plot the cpu_wai graph aligned with
the block device being used for IO (
/dev/sdb in this case).
x = plt.subplot(3, 1, 1)
plt.title("Read IO to /dev/sdb")
x.plot(smooth(df['sdb_read'] / (1024 * 1024)), "b-")

x = plt.subplot(3, 1, 2)
plt.title("Write IO to /dev/sdb")
x.plot(smooth(df['sdb_write'] / (1024 * 1024)), "g-")

x = plt.subplot(3, 1, 3)
plt.title("CPU IO Wait")
x.plot(smooth(df['cpu_wai']), "r-")
In the next post we will continue to work with the same dataset and dig deeper to see what patterns and correlations we can find between these variables.