Programming for Data Science¶

Visualization - matplotlib¶

Dr. Bhargavi R

SCOPE, VIT Chennai

Introduction¶

  • The fundamental role of a data scientist is to
    • Explore the data.
    • Communicate the results of the exploration/analysis efficiently and effectively to the target audience.
  • We will use matplotlib to create a few visualizations: bar charts, scatter plots, histograms, etc.
  • matplotlib is a Python plotting library for 2D and (limited) 3D graphics.
In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# %matplotlib inline
In [2]:
# let's plot a single data point (5, 3)
plt.figure()
plt.plot(5,3,'.')
Out[2]:
[<matplotlib.lines.Line2D at 0x10b017a00>]
In [3]:
plt.figure()
plt.plot(3, 2, 'o')
Out[3]:
[<matplotlib.lines.Line2D at 0x10b140880>]
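
The `Out[...]` lines above show that `plt.plot()` returns a list of `Line2D` objects. As a minimal sketch, the return value can be captured and inspected; the variable names `lines` and `point` are illustrative and do not appear in the original cells.

```python
import matplotlib.pyplot as plt

fig = plt.figure()
# plt.plot() returns a list of Line2D objects; here the list has one entry
lines = plt.plot(5, 3, '.')
point = lines[0]

# The Line2D object keeps the data and the marker style that were plotted
print(len(lines))
print(point.get_xdata(), point.get_ydata())
print(point.get_marker())
```

Keeping a reference to the returned object is useful when a line needs to be restyled later without redrawing the whole figure.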

Line charts¶

  • Line charts are good for showing a trend, i.e. how a value changes over an ordered variable such as time
In [4]:
x = list(range(1,10))
y = [item ** 2 for item in x]
plt.figure()
plt.plot(x, y, c = 'green', marker = 'o')
plt.title('Quadratic Line')
plt.xlabel('x value')
plt.ylabel('x square')
Out[4]:
Text(0, 0.5, 'x square')
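
The same chart can also be built through matplotlib's object-oriented interface, where the title and labels are set on an `Axes` object rather than through `plt` functions. The following is a sketch of that equivalent form, not a replacement for the cell above.

```python
import matplotlib.pyplot as plt

x = list(range(1, 10))
y = [item ** 2 for item in x]

# plt.subplots() with no arguments returns one Figure and one Axes
fig, ax = plt.subplots()
ax.plot(x, y, c='green', marker='o')

# Axes methods use a set_ prefix where pyplot uses plain function names
ax.set_title('Quadratic Line')
ax.set_xlabel('x value')
ax.set_ylabel('x square')
```

Working with an explicit `ax` tends to be easier to follow once a figure contains more than one plot.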
In [6]:
x = np.linspace(0, 10, 100)

fig = plt.figure()
plt.plot(x, np.sin(x), '-.')
plt.plot(x, np.cos(x), '--')
Out[6]:
[<matplotlib.lines.Line2D at 0x10b27f310>]
In [7]:
# Save the figure to a file
fig.savefig('my_figure.png')
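
`savefig()` accepts further options. The sketch below assumes a temporary output directory (the names `out_dir` and `png_path` are illustrative) and shows the `dpi` and `bbox_inches` parameters, plus one way to list the file formats the current backend can write.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
fig = plt.figure()
plt.plot(x, np.sin(x), '-.')

# Write into a temporary directory instead of the working directory
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'my_figure.png')

# dpi sets the resolution; bbox_inches='tight' trims surrounding whitespace
fig.savefig(png_path, dpi=150, bbox_inches='tight')

# The formats that can be written depend on the active backend
formats = fig.canvas.get_supported_filetypes()
print(os.path.exists(png_path), 'png' in formats)
```

The output format is inferred from the file extension, so changing `.png` to `.pdf` or `.svg` is usually enough to switch formats.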
In [8]:
plt.figure()  # create a plot figure
x = list(range(1,10))
y = [item ** 2 for item in x]
z = [item * 2 for item in x]
# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, y)

# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, z)
Out[8]:
[<matplotlib.lines.Line2D at 0x10b210d30>]
In [11]:
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(1,2)

x = list(range(1,10))
y = [item ** 2 for item in x]
z = [item * 2 for item in x]

# Call plot() method on the appropriate object
ax[0].plot(x, y, ':', color = 'c')
ax[1].plot(x, z, '-.g') # combining line color and style

fig.suptitle('This is the Figure Title', fontsize=15)
ax[0].set_title("Plot 1")
ax[1].set_title("Plot 2")
Out[11]:
Text(0.5, 1.0, 'Plot 2')
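
With more than one row and more than one column, `plt.subplots()` returns a 2D array of `Axes` objects. The sketch below uses a 2x2 grid with four illustrative functions; the `sharex` and `figsize` values are assumptions chosen for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# A 2x2 grid: axs is a 2D NumPy array indexed as axs[row, column]
fig, axs = plt.subplots(2, 2, sharex=True, figsize=(8, 6))

axs[0, 0].plot(x, np.sin(x))
axs[0, 0].set_title('sin(x)')
axs[0, 1].plot(x, np.cos(x))
axs[0, 1].set_title('cos(x)')
axs[1, 0].plot(x, x ** 2)
axs[1, 0].set_title('x squared')
axs[1, 1].plot(x, np.sqrt(x))
axs[1, 1].set_title('sqrt(x)')

# Adjust spacing so titles and tick labels do not overlap
fig.tight_layout()
```

Note that a single row or single column gives a 1D array, which is why the cell above indexes with `ax[0]` and `ax[1]`.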
In [14]:
# Adjusting axes limits
x = np.linspace(0, 10, 100)

fig = plt.figure()
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5)
plt.plot(x, np.sin(x), '-')
Out[14]:
[<matplotlib.lines.Line2D at 0x10f6aef80>]
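
To see what setting the limits changes, it can help to read the automatically chosen limits first. This sketch uses the `Axes` methods `get_xlim()`, `set_xlim()` and `set_ylim()`; the name `auto_xlim` is illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-')

# Without explicit limits, matplotlib derives them from the data plus a margin
auto_xlim = ax.get_xlim()

# Set both axis ranges explicitly, using the same values as the cell above
ax.set_xlim(-1, 11)
ax.set_ylim(-1.5, 1.5)

print(auto_xlim)
print(ax.get_xlim(), ax.get_ylim())
```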
In [15]:
day = np.arange('2020-01-01', '2020-01-11', dtype = 'datetime64[D]' )
max_temp = [28, 30, 22, 29, 24, 25, 25, 24, 20, 21]
min_temp = [14, 10, 13, 15, 11, 12, 13, 10, 10, 12]
plt.figure()
plt.plot(day, max_temp, '-o', label = 'Max Temp')
plt.plot(day, min_temp, '-o', label = 'Min Temp')

plt.legend()
# plt.legend(('maximum', 'minimum'))

plt.title('Temperature Record for January')
plt.xlabel('Day')
plt.ylabel('Temperature')
# The two plot() calls above can also be written as a single call
# (without the label arguments):
# plt.plot(day, max_temp, '-o', day, min_temp, '-o')

x_axis = plt.gca().xaxis
# rotate the tick labels for the x axis
for item in x_axis.get_ticklabels():
    item.set_rotation(45)

plt.gca().fill_between(day,
                      max_temp,
                      min_temp,
                      facecolor = 'blue',
                      alpha = 0.25)
Out[15]:
<matplotlib.collections.PolyCollection at 0x10f75a860>
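
The same temperature chart can be sketched with the object-oriented interface. Here the tick labels are rotated with a single `tick_params()` call instead of a loop; this is one possible alternative, and `legend_labels` is an illustrative name used only to show what the legend contains.

```python
import matplotlib.pyplot as plt
import numpy as np

day = np.arange('2020-01-01', '2020-01-11', dtype='datetime64[D]')
max_temp = [28, 30, 22, 29, 24, 25, 25, 24, 20, 21]
min_temp = [14, 10, 13, 15, 11, 12, 13, 10, 10, 12]

fig, ax = plt.subplots()
ax.plot(day, max_temp, '-o', label='Max Temp')
ax.plot(day, min_temp, '-o', label='Min Temp')
# Shade the band between the two curves
ax.fill_between(day, max_temp, min_temp, facecolor='blue', alpha=0.25)
ax.legend()

# Rotate every x tick label in one call instead of looping over them
ax.tick_params(axis='x', labelrotation=45)

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

Only artists that were given a `label` appear in the legend, so the shaded band is not listed.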

Scatter Plots¶

  • Scatter plots are good for showing the relationship between two paired sets of values
In [16]:
x = np.array([1,2,3,4,5,6,7,8,9,10])

y = x ** 3

# create a list with one color per point:
# eight 'green' entries, followed by 'red' and 'blue'
colors = ['green']*(len(x) -2)
colors.append('red')
colors.append('blue')


plt.figure()

# plot the points with marker size 50 and the chosen colors
plt.scatter(x, y, s=50, c=colors)
# label each point with its x value, 10 points above the marker
for i, j in zip(x, y):
    plt.annotate(str(i), (i, j), textcoords="offset points", xytext=(0, 10))
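
Instead of a list of color names, `c` can also take numeric values that are mapped to colors through a colormap. The sketch below colors each point by its y value; the `viridis` colormap and the colorbar label are choices made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 11)
y = x ** 3

fig, ax = plt.subplots()
# c=y maps each point's y value to a color through the 'viridis' colormap
points = ax.scatter(x, y, s=50, c=y, cmap='viridis')

# The colorbar shows which color corresponds to which y value
cbar = fig.colorbar(points, ax=ax)
cbar.set_label('y = x cubed')
```

This is a common way to show a third variable on a two-dimensional scatter plot.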
    

Bar Plots¶

  • A bar plot is a good choice when you want to show how a quantity varies across a discrete set of items
In [17]:
names = ['Virat','Dhoni', 'Sourav', 'Sachin']
matches_2015 = [6, 8, 6, 10]
matches_2016 = [4, 8, 9, 8]
plt.figure()
bars = plt.bar(names, matches_2015)
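
The `bars` container returned by `bar()` holds one rectangle per bar, which can be used to write each value above its bar. This is a sketch using `Axes.text()`; the y-axis label 'Matches' is an assumption based on the variable names.

```python
import matplotlib.pyplot as plt

names = ['Virat', 'Dhoni', 'Sourav', 'Sachin']
matches_2015 = [6, 8, 6, 10]

fig, ax = plt.subplots()
bars = ax.bar(names, matches_2015)

# Each element of the container is a Rectangle; write its height above it
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2,  # horizontal center of the bar
            height,
            f'{height:g}',
            ha='center', va='bottom')

ax.set_ylabel('Matches')  # assumed meaning of the values
```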
In [18]:
x = range(1,5)
matches_2015 = [6, 8, 6, 10]
matches_2016 = [4, 8, 9, 8]
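# shift the second series right by one bar width so the bars sit side by side
new_x = [a + 0.3 for a in x]

# start a new figure so these bars are not drawn on top of the previous plot
plt.figure()
plt.bar(x, matches_2015, width = 0.3)
plt.bar(new_x , matches_2016, color = 'red', width = 0.3)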
Out[18]:
<BarContainer object of 4 artists>
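
The grouped chart above uses numeric x positions, so the player names are lost. A possible refinement, sketched here, centers each pair of bars on its tick, restores the names as tick labels, and adds a legend. The labels '2015' and '2016' and the y-axis label are assumptions derived from the variable names.

```python
import matplotlib.pyplot as plt
import numpy as np

names = ['Virat', 'Dhoni', 'Sourav', 'Sachin']
matches_2015 = [6, 8, 6, 10]
matches_2016 = [4, 8, 9, 8]

width = 0.3
positions = np.arange(len(names))  # one group per player

fig, ax = plt.subplots()
# Shift each series by half a bar width so every pair is centered on its tick
ax.bar(positions - width / 2, matches_2015, width=width, label='2015')
ax.bar(positions + width / 2, matches_2016, width=width, color='red', label='2016')

# Replace the numeric tick positions with the player names
ax.set_xticks(positions)
ax.set_xticklabels(names)
ax.set_ylabel('Matches')  # assumed meaning of the values
ax.legend()
```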
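
Histograms¶

  • A histogram is good for showing how a set of numeric values is distributed across bins
In [19]: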
N_points = 100
n_bins = 5

# Draw N_points samples from a standard normal distribution (mean 0, standard deviation 1)
x = np.random.randn(N_points)

fig, axs = plt.subplots(1, 1)

# We can set the number of bins with the `bins` kwarg
axs.hist(x, bins=n_bins)
Out[19]:
(array([10., 28., 47., 11.,  4.]),
 array([-2.43961792, -1.32378085, -0.20794378,  0.90789329,  2.02373036,
         3.13956743]),
 <BarContainer object of 5 artists>)
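
The output above is the tuple that `hist()` returns: the count in each bin, the bin edges, and the container of drawn bars. The sketch below unpacks that tuple and compares it with `np.histogram()`, which computes the counts without drawing. A seeded generator is used so the numbers are repeatable; the seed value is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

N_points = 100
n_bins = 5

# A seeded generator makes the example reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
x = rng.standard_normal(N_points)

fig, ax = plt.subplots()
counts, bin_edges, patches = ax.hist(x, bins=n_bins)

# n_bins bars need n_bins + 1 edges, and every sample lands in exactly one bin
print(len(counts), len(bin_edges), len(patches))
print(counts.sum())

# np.histogram computes the same counts and edges without drawing anything
np_counts, np_edges = np.histogram(x, bins=n_bins)
```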
In [20]:
# Alternatively, call plt.hist() directly, here with more bins
N_points = 100
n_bins = 20

# Draw N_points samples from a standard normal distribution (mean 0, standard deviation 1)
x = np.random.randn(N_points)

plt.figure()

# We can set the number of bins with the `bins` kwarg
plt.hist(x, bins=n_bins)
Out[20]:
(array([ 1.,  0.,  3.,  3.,  5.,  8., 10., 10.,  8., 11., 10.,  6.,  5.,
         5., 10.,  1.,  0.,  3.,  0.,  1.]),
 array([-2.89158023, -2.57876558, -2.26595092, -1.95313627, -1.64032162,
        -1.32750697, -1.01469231, -0.70187766, -0.38906301, -0.07624836,
         0.23656629,  0.54938095,  0.8621956 ,  1.17501025,  1.4878249 ,
         1.80063956,  2.11345421,  2.42626886,  2.73908351,  3.05189816,
         3.36471282]),
 <BarContainer object of 20 artists>)
In [21]:
# The same kind of plot again, with custom fill and edge colors
N_points = 100
n_bins = 10

# Draw N_points samples from a standard normal distribution (mean 0, standard deviation 1)
x = np.random.randn(N_points)

plt.figure()

# We can set the number of bins with the `bins` kwarg
plt.hist(x, bins=n_bins, color = 'lightblue',  edgecolor = 'orange')
Out[21]:
(array([ 1.,  7., 16., 16., 20., 22., 12.,  3.,  2.,  1.]),
 array([-2.28673264, -1.78894263, -1.29115261, -0.7933626 , -0.29557259,
         0.20221743,  0.70000744,  1.19779746,  1.69558747,  2.19337749,
         2.6911675 ]),
 <BarContainer object of 10 artists>)
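
The histograms above show raw counts. With `density=True`, the bar heights are rescaled so that the total bar area is 1, which allows a comparison with a probability density curve. The following sketch assumes a larger sample of 1000 points and an arbitrary seed; both are choices made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for repeatability
x = rng.standard_normal(1000)

fig, ax = plt.subplots()
# density=True rescales bar heights so that the total bar area equals 1
heights, bin_edges, _ = ax.hist(x, bins=20, density=True,
                                color='lightblue', edgecolor='orange')

# Overlay the standard normal probability density for comparison
grid = np.linspace(-4, 4, 200)
pdf = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)
ax.plot(grid, pdf, color='black')

# Area of the bars: height times bin width, summed over all bins
total_area = np.sum(heights * np.diff(bin_edges))
```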