
Handout – Basic Statistics, Graphs and Reports in Python

 

Introduction

This handout covers basic statistics, graphs and reports. So far, we have covered basic Python programming and basic data handling; in this session we cover basic statistics. These statistical concepts are important to learn before we get into real analytics. Once we have imported a dataset, a few basic statistics tell us what the variables are, how they behave and how they are distributed. This gives us a basic understanding of the data that we have imported.

Contents

  • Taking a random sample from data
  • Descriptive statistics
    • Central Tendency
    • Variance
  • Quartiles, Percentiles
  • Box Plots
  • Graphs

Sampling in Python

Sampling is a method of selecting a few observations from a large population or a large dataset in such a way that the sample represents the underlying characteristics of the whole. A sample is simply a subset of the large dataset in which each observation is chosen at random. Because of this, a result computed on a well-chosen sample will usually be close to the result we would obtain from the whole dataset. This is the advantage of sampling. We will now see how to perform sampling in Python. To import the data we use the pandas package, and to sample it we use the sample() function of a DataFrame. The code for sampling the data is as follows:

In [14]:
#Taking a random sample 

import pandas as pd
Online_Retail=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Online_Retail_Sales_Data\Online Retail.csv", encoding = "ISO-8859-1")
In [15]:
Online_Retail.shape
Out[15]:
(541909, 8)
In [16]:
#replace=False (the boolean, not the string "False") draws each row at most once
sample_data=Online_Retail.sample(n=1000,replace=False)
sample_data.shape
Out[16]:
(1000, 8)
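
The call above draws a different sample on every run. The following is a minimal sketch, using a small made-up DataFrame (the column names `id` and `value` are illustrative and not taken from the Online Retail data), of how the `n`, `frac` and `random_state` parameters of `sample()` can be used to control the sample size and make the sample reproducible:

```python
import pandas as pd

# Hypothetical toy data standing in for a large dataset
df = pd.DataFrame({"id": range(100), "value": [i * 2 for i in range(100)]})

# Fixed number of rows, without replacement (each row picked at most once)
sample_n = df.sample(n=10, replace=False, random_state=42)

# A fraction of the rows instead of a fixed count
sample_frac = df.sample(frac=0.25, random_state=42)

# The same random_state reproduces the same sample
same_again = df.sample(n=10, replace=False, random_state=42)

print(sample_n.shape)                 # (10, 2)
print(sample_frac.shape)              # (25, 2)
print(sample_n.equals(same_again))    # True
```

Fixing `random_state` is optional; it is mainly useful when results have to be repeated exactly, for example in a lab exercise.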

LAB: Sampling in Python

  • Import “Census Income Data/Income_data.csv”
  • Create a new dataset by taking a random sample of 5000 records
In [21]:
#Import “Census Income Data/Income_data.csv”
Income=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Census Income Data\Income_data.csv")
Income.shape
Out[21]:
(32561, 15)
In [18]:
Income.head()
Out[18]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country Income_band
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
In [22]:
Income.tail(3)
Out[22]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country Income_band
32558 58 Private 151910 HS-grad 9 Widowed Adm-clerical Unmarried White Female 0 0 40 United-States <=50K
32559 22 Private 201490 HS-grad 9 Never-married Adm-clerical Own-child White Male 0 0 20 United-States <=50K
32560 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States >50K
In [23]:
 #Sample size 5000
Sample_income=Income.sample(n=5000)
Sample_income.shape
Out[23]:
(5000, 15)
In [24]:
Sample_income
Out[24]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country Income_band
10267 27 Private 185647 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 50 United-States >50K
20313 33 Private 203488 Some-college 10 Never-married Exec-managerial Own-child White Male 0 0 45 United-States <=50K
12276 21 Private 180339 Assoc-voc 11 Never-married Farming-fishing Not-in-family White Female 0 1602 30 United-States <=50K
7399 42 Local-gov 227065 Masters 14 Never-married Prof-specialty Not-in-family White Male 0 0 22 United-States <=50K
19773 49 Self-emp-not-inc 79627 Some-college 10 Married-civ-spouse Exec-managerial Husband White Male 0 0 40 United-States <=50K
9289 19 Private 145844 Assoc-acdm 12 Never-married Exec-managerial Not-in-family White Female 0 0 50 United-States <=50K
25176 33 Local-gov 175509 HS-grad 9 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States <=50K
7501 51 Self-emp-not-inc 32372 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 1672 70 United-States <=50K
3524 21 Private 197387 HS-grad 9 Never-married Handlers-cleaners Own-child White Male 0 0 20 United-States <=50K
1938 50 State-gov 159219 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 40 Canada >50K
3966 49 Private 101825 Masters 14 Married-civ-spouse Prof-specialty Wife White Female 0 1977 40 United-States >50K
12396 35 Private 267966 11th 7 Never-married Machine-op-inspct Not-in-family White Male 0 0 50 United-States <=50K
23926 59 Private 118358 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States >50K
18136 35 Self-emp-not-inc 42044 Assoc-acdm 12 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
23862 32 Private 244147 HS-grad 9 Never-married Craft-repair Unmarried White Male 0 0 10 United-States <=50K
26300 50 Self-emp-not-inc 167728 Doctorate 16 Married-civ-spouse Prof-specialty Husband White Male 0 0 60 United-States >50K
28729 70 Self-emp-inc 158437 Prof-school 15 Married-civ-spouse Prof-specialty Husband White Male 0 0 35 United-States >50K
28419 57 Self-emp-not-inc 184553 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 50 United-States >50K
27474 30 Private 108386 Assoc-acdm 12 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
1646 33 Private 240763 Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 0 0 40 United-States <=50K
9597 63 State-gov 109735 HS-grad 9 Divorced Adm-clerical Not-in-family White Female 0 0 38 United-States <=50K
12785 25 Private 120238 HS-grad 9 Married-spouse-absent Machine-op-inspct Not-in-family White Male 0 0 40 Poland <=50K
26324 18 Private 77845 Some-college 10 Never-married Adm-clerical Own-child White Male 0 1602 15 United-States <=50K
12462 44 Self-emp-not-inc 155930 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 60 United-States <=50K
2798 35 Private 25955 11th 7 Never-married Other-service Unmarried Amer-Indian-Eskimo Male 0 0 40 United-States <=50K
997 48 Federal-gov 33109 Bachelors 13 Divorced Exec-managerial Unmarried White Male 0 0 58 United-States >50K
12591 56 Private 92444 Some-college 10 Married-civ-spouse Exec-managerial Husband Black Male 0 0 50 United-States >50K
29534 43 State-gov 424094 Doctorate 16 Never-married Prof-specialty Not-in-family White Female 0 0 40 United-States <=50K
16456 46 ? 37672 HS-grad 9 Divorced ? Not-in-family White Female 0 0 15 United-States <=50K
23168 22 Private 54825 HS-grad 9 Never-married Sales Not-in-family White Male 0 0 40 United-States <=50K
32304 30 Never-worked 176673 HS-grad 9 Married-civ-spouse ? Wife Black Female 0 0 40 United-States <=50K
29365 25 Private 264055 Bachelors 13 Never-married Adm-clerical Own-child White Male 0 0 40 United-States <=50K
30967 21 Private 156980 HS-grad 9 Never-married Machine-op-inspct Own-child White Male 0 0 60 United-States <=50K
30149 50 Private 135465 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 50 United-States >50K
5904 34 Private 103596 HS-grad 9 Married-civ-spouse Handlers-cleaners Husband White Male 0 0 40 United-States <=50K
4961 28 Private 51331 Assoc-acdm 12 Married-civ-spouse Exec-managerial Wife White Female 0 0 16 United-States >50K
19127 46 Private 164682 Assoc-voc 11 Separated Prof-specialty Not-in-family White Male 0 0 40 United-States <=50K
23627 32 Private 113364 Bachelors 13 Never-married Prof-specialty Not-in-family White Male 0 0 40 United-States >50K
25536 25 Private 187577 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
11607 38 Private 199816 HS-grad 9 Divorced Machine-op-inspct Own-child White Male 0 0 40 United-States <=50K
2689 22 ? 216563 HS-grad 9 Never-married ? Other-relative White Female 0 0 40 United-States <=50K
5369 30 Private 1184622 Some-college 10 Married-civ-spouse Transport-moving Husband Black Male 0 0 35 United-States <=50K
7863 31 Private 220690 Some-college 10 Divorced Craft-repair Not-in-family White Male 0 0 80 United-States <=50K
21907 56 Private 340171 HS-grad 9 Married-civ-spouse Transport-moving Husband Black Male 0 0 40 United-States <=50K
10043 28 Private 96226 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 45 United-States <=50K
11676 25 Private 167031 10th 6 Never-married Machine-op-inspct Not-in-family White Female 0 0 40 Columbia <=50K
23909 22 Private 113760 11th 7 Never-married Other-service Own-child White Female 0 0 30 United-States <=50K
19997 72 Private 268861 7th-8th 4 Widowed Other-service Not-in-family White Female 0 0 99 ? <=50K
2155 66 ? 117778 11th 7 Married-civ-spouse ? Husband White Male 0 0 40 United-States <=50K
1371 61 ? 190997 HS-grad 9 Married-civ-spouse ? Husband White Male 0 0 6 United-States <=50K
14981 32 Private 193042 Bachelors 13 Married-civ-spouse Sales Husband White Male 0 0 44 United-States <=50K
24814 37 Private 409189 HS-grad 9 Married-civ-spouse Other-service Husband White Male 0 0 30 Mexico <=50K
5967 21 Private 90935 Assoc-voc 11 Never-married Transport-moving Own-child White Male 0 0 40 United-States <=50K
6673 38 Private 343403 Assoc-acdm 12 Divorced Adm-clerical Not-in-family White Female 0 0 16 United-States <=50K
4149 21 Private 237651 Some-college 10 Never-married Other-service Own-child White Male 0 0 35 United-States <=50K
26316 18 Private 205218 Some-college 10 Never-married Other-service Own-child White Female 0 0 20 United-States <=50K
28836 19 Private 311974 HS-grad 9 Never-married Craft-repair Other-relative White Male 0 0 25 Mexico <=50K
27070 44 Local-gov 208528 Assoc-acdm 12 Married-civ-spouse Farming-fishing Husband White Male 0 0 30 United-States <=50K
27361 20 Private 293297 Some-college 10 Never-married Other-service Own-child White Male 0 0 35 United-States <=50K
8997 42 Private 30424 11th 7 Separated Other-service Unmarried White Female 0 0 38 United-States <=50K

5000 rows × 15 columns

Descriptive statistics:

  • The basic descriptive statistics give us an idea about the variables and their distributions.
  • They permit the analyst to describe many pieces of data with a few indices.
  • Central tendencies
    • Central tendencies are the middle values of a variable. We cover two of them here.
    • Mean
    • Median
  • Dispersion
    • Dispersion shows the range or spread of a variable.
    • Range
    • Variance
    • Standard deviation

Mean

  • The arithmetic mean
  • Sum of values/ Count of values
  • Gives a quick idea on average of a variable

Median

  • Mean is not a good measure in the presence of outliers.
  • For example, consider the data vector below:
    • 1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1
  • 10 of the 11 values (about 91%) are less than 2, but the mean of the vector is 2.
  • There is an unusual value in the data vector, i.e. 9.
  • It is an outlier in the data vector.
  • Mean is not the true middle value in the presence of outliers. Mean is strongly affected by outliers.
  • We use the median, the true middle value, in such cases.
  • To find the median, sort the data in either ascending or descending order and take the middle value (or the average of the two middle values when the count is even).
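
The effect of the outlier can be checked directly on the data vector from the list above. This is a small sketch using only the Python standard library; the variable names are illustrative:

```python
import statistics

# The data vector from the text; 9 is the unusual value
values = [1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1]

mean_value = sum(values) / len(values)      # sum of values / count of values
median_value = statistics.median(values)    # middle value of the sorted data

# Share of observations strictly below 2
share_below_2 = sum(v < 2 for v in values) / len(values)

print(round(mean_value, 2))      # 2.0
print(median_value)              # 1.4
print(round(share_below_2, 2))   # 0.91
```

The single value 9 pulls the mean up to 2, while the median stays at 1.4, which is closer to where most of the values lie.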

Calculating Mean and Median in Python

We import the Income dataset and find the mean and median of the variable called capital-gain. To find the mean or median of a variable, we take the whole dataset, select the column by putting its name in square brackets, and then call .mean() or .median() respectively.

In [30]:
#Import “Census Income Data/Income_data.csv”
Income=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Census Income Data\Income_data.csv")

Income.columns.values
Out[30]:
array(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'Income_band'], dtype=object)
In [33]:
#Mean and Median on python
gain_mean=Income["capital-gain"].mean()
gain_mean
Out[33]:
1077.6488437087312
In [34]:
gain_median=Income["capital-gain"].median()
gain_median
Out[34]:
0.0
Here the mean of capital-gain is about 1077 and the median is 0, which means that at least half of the values are zero. The difference between the mean and the median is very large: a small number of very large values pull the mean up. It looks like there are outliers in the data, so we need to look at percentiles and a box plot.

LAB: Mean and Median on Python

  • Dataset: “./Online Retail Sales Data/Online Retail.csv”
  • What is the mean of “UnitPrice”
  • What is the median of “UnitPrice”
  • Is mean equal to median? Do you suspect the presence of outliers in the data?
  • What is the mean of “Quantity”
  • What is the median of “Quantity”
  • Is mean equal to median? Do you suspect the presence of outliers in the data?
In [35]:
Online_Retail=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Online_Retail_Sales_Data\Online Retail.csv", encoding = "ISO-8859-1")
Online_Retail.shape
Online_Retail.columns.values
Out[35]:
array(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'], dtype=object)
In [36]:
#Mean and median of 'UnitPrice' in Online Retail data
up_mean=Online_Retail['UnitPrice'].mean()
up_mean
Out[36]:
4.611113626083471
In [37]:
up_median=Online_Retail['UnitPrice'].median()
up_median
Out[37]:
2.08
In [38]:
#Mean of "Quantity" in Online Retail data
Quantity_mean=Online_Retail['Quantity'].mean()
Quantity_mean
Out[38]:
9.55224954743324
In [39]:
Quantity_median=Online_Retail['Quantity'].median()
Quantity_median
Out[39]:
3.0

Dispersion Measures: Variance and Standard Deviation:

Central tendencies are not enough to understand a variable. They only tell us about the middle values, not about the range or spread of the variable. So a measure of dispersion is necessary to understand the range of the values and the presence of outliers.

Dispersion

  • Just knowing the central tendency is not enough.
  • Two variables might have same mean, but they might be very different.
  • Look at the profit details of two companies, A and B, for the last 14 quarters (in millions).

 

  • Though the average profit is 15 in both cases, company B has performed more consistently than company A.
  • There were even losses for company A.
  • Measures of dispersion become vital in such cases.

Variance

  • Dispersion is the quantification of the deviation of each point from the mean value.
  • Variance is the average of the squared distances of each point from the mean value.
  • Variance is a fairly good measure of dispersion.
  • Variance in profit for company A is 352 and for company B is 4.9.
  • Company A shows high fluctuation in profit whereas company B shows low fluctuation in profit. So company B is preferred.
  • In general, the lower the variance, the less noise there is in our data.

Standard Deviation

  • Standard deviation is just the square root of variance.
  • Variance gives a good idea of dispersion, but it is in the order of squares.
  • Because the distances are squared, variance is expressed in the square of the units of the original data.
  • Standard deviation is the dispersion measure that is in the same units as the original data.
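Formula:

$$\text{Variance: } \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 \qquad \text{Standard deviation: } \sigma = \sqrt{\sigma^2}$$

Here $x_1,\dots,x_n$ are the values and $\bar{x}$ is their mean.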
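
The definitions above (variance as the average squared distance from the mean, standard deviation as its square root) can be worked through by hand. The sketch below uses a short made-up list of values, not the company profit data, and also shows the sample version that divides by n - 1, which is what the pandas functions var() and std() compute by default:

```python
import math

# Hypothetical values chosen so the arithmetic is easy to follow
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = sum(values) / len(values)
squared_distances = [(v - mean_value) ** 2 for v in values]

# Population variance: average of the squared distances (divide by n)
pop_variance = sum(squared_distances) / len(values)
pop_std = math.sqrt(pop_variance)

# Sample variance divides by n - 1 instead (the pandas default, ddof=1)
sample_variance = sum(squared_distances) / (len(values) - 1)

print(pop_variance, pop_std)       # 4.0 2.0
print(round(sample_variance, 3))   # 4.571
```

On a pandas Series, `var(ddof=0)` and `std(ddof=0)` give the population versions. For large datasets such as the ones used in this handout the two versions differ very little.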

Calculating Variance and Standard Deviation

  • Divide the Income data into two sets: USA v/s Others.
  • Find the variance of “education-num” in those two sets. Which one has higher variance?
  • Variance is calculated using the var() function and standard deviation using std(). The code is given below:
In [40]:
usa_income=Income[Income["native-country"]==' United-States']
usa_income.shape
Out[40]:
(29170, 15)
In [41]:
other_income=Income[Income["native-country"]!=' United-States']
other_income.shape
Out[41]:
(3391, 15)
In [42]:
#Var and SD for USA
var_usa=usa_income["education-num"].var()
var_usa
Out[42]:
5.735862879538104
In [43]:
std_usa=usa_income["education-num"].std()
std_usa
Out[43]:
2.394966154152936
In [44]:
var_other=other_income["education-num"].var()
var_other
Out[44]:
13.567613037808737
In [45]:
std_other=other_income["education-num"].std()
std_other 
Out[45]:
3.6834240914954033

LAB: Variance and Standard deviation

  • Dataset: “./Online Retail Sales Data/Online Retail.csv”
  • What is the variance and s.d of “UnitPrice”
  • What is the variance and s.d of “Quantity”
  • Which one of these two variables is more consistent?
In [46]:
##var and sd UnitPrice
var_UnitPrice=Online_Retail['UnitPrice'].var()
var_UnitPrice
Out[46]:
9362.469164424467
In [47]:
std_UnitPrice=Online_Retail['UnitPrice'].std()
std_UnitPrice 
Out[47]:
96.75985306119716
In [48]:
#variance and sd of Quantity
var_Quantity=Online_Retail['Quantity'].var()
var_Quantity
Out[48]:
47559.39140913822
In [49]:
std_Quantity=Online_Retail['Quantity'].std()
std_Quantity
Out[49]:
218.08115784986612

Percentiles & quartiles in python

Percentiles

A percentile (or a centile) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20% of the observations are found. For a particular variable, the population can be divided into 100 equal groups according to the distribution of values.

  • A student attended an exam along with 1000 others.
  • He got 68% marks. How well or badly did he perform in the exam?
  • What will be his overall rank?
  • What will be his rank if there were 100 students overall?

Imagine that there are 1000 students in a class and one of them got 68% marks. Can we say whether he performed well or not? Not from the marks alone: we need a relative scale that compares his marks with those of all 1000 students.

  • Let's say that with 68 marks he stood at the 910th position: there are 910 students who got less than 68% and only 89 students who got more marks than him.
  • He is standing at the 91st percentile.
  • Instead of quoting 68 marks, the 91st percentile gives a good idea of his performance.
  • Percentiles make the data easy to read.
  • pth percentile: p percent of observations below it, (100 – p)% above it.

Let's say there is a student who got 40 marks and his percentile value is 80, which means 80% of the people are below him. This kind of scaling is used in competitive exams like CAT, GATE etc.

  • Marks are 40 but the percentile is 80. What does this mean?
  • An 80th percentile in the CAT exam means 20% of the people are above and 80% are below.
  • Percentiles help us in getting an idea on outliers.
  • For example the highest income value is 400,000 but 95th percentile is 20,000 only. That means 95% of the values are less than 20,000. So the values near 400,000 are clearly outliers.
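
The idea of "p percent of observations below it" can be sketched in a few lines. The marks below are made up for illustration, and `percentile_rank` is a hypothetical helper written for this example, not a library function:

```python
# Hypothetical marks for 10 students (illustrative values only)
marks = [35, 42, 48, 51, 55, 58, 61, 64, 68, 90]

def percentile_rank(all_marks, score):
    """Percentage of observations strictly below the given score."""
    below = sum(m < score for m in all_marks)
    return 100 * below / len(all_marks)

print(percentile_rank(marks, 68))   # 80.0
print(percentile_rank(marks, 90))   # 90.0
```

Here a student with 68 marks has 8 of the 10 students below him, so he stands at the 80th percentile, whatever the absolute value of his marks.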

Quartiles

In descriptive statistics, the quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each comprising a quarter of the data. A quartile is a type of quantile. The first quartile (Q1) is the middle value between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of the data set.

  • Percentiles divide the whole population into 100 groups whereas quartiles divide the population into 4 groups
  • p = 25: First Quartile or Lower quartile (LQ)
  • p = 50: second quartile or Median
  • p = 75: Third Quartile or Upper quartile (UQ)
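
As a quick illustration of the three quartiles, the sketch below applies the pandas quantile() function to a short made-up Series (the values are illustrative only):

```python
import pandas as pd

# Hypothetical values, already sorted for readability
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# p = 25, 50 and 75 expressed as fractions
quartiles = data.quantile([0.25, 0.50, 0.75])
print(quartiles.tolist())   # [3.0, 5.0, 7.0]

# The second quartile is the median
print(quartiles.loc[0.50] == data.median())   # True
```

When a quartile falls between two observations, quantile() interpolates linearly between them by default.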

Code for Percentiles and Quantiles

In [50]:
Income["capital-gain"].describe()
Out[50]:
count    32561.000000
mean      1077.648844
std       7385.292085
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max      99999.000000
Name: capital-gain, dtype: float64
In [51]:
#Finding the percentile & quantile by using .quantile()
Income['capital-gain'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[51]:
0.0        0.0
0.1        0.0
0.2        0.0
0.3        0.0
0.4        0.0
0.5        0.0
0.6        0.0
0.7        0.0
0.8        0.0
0.9        0.0
1.0    99999.0
Name: capital-gain, dtype: float64
In [52]:
Income['capital-loss'].quantile([0, 0.1, 0.2, 0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])
Out[52]:
0.0       0.0
0.1       0.0
0.2       0.0
0.3       0.0
0.4       0.0
0.5       0.0
0.6       0.0
0.7       0.0
0.8       0.0
0.9       0.0
1.0    4356.0
Name: capital-loss, dtype: float64
In [53]:
Income['hours-per-week'].quantile([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.98,1])
Out[53]:
0.00     1.0
0.10    24.0
0.20    35.0
0.30    40.0
0.40    40.0
0.50    40.0
0.60    40.0
0.70    40.0
0.80    48.0
0.90    55.0
0.95    60.0
0.98    70.0
1.00    99.0
Name: hours-per-week, dtype: float64

LAB: percentiles & quartiles in python

  • Dataset: “./Bank Marketing/bank_market.csv”
  • Get the summary of the balance variable
  • Do you suspect any outliers in balance ?
  • Get relevant percentiles and see their distribution.
  • Are there any outliers present?
  • Get the summary of the age variable
  • Do you suspect any outliers in age?
  • Get relevant percentiles and see their distribution.
  • Are there any outliers present?
In [54]:
bank=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Bank Tele Marketing\bank_market.csv",encoding = "ISO-8859-1")
bank.shape
Out[54]:
(45211, 18)
In [55]:
#Get the summary of the balance variable
#we can find the summary of the balance variable by using .describe()
summary_bala=bank["balance"].describe()
summary_bala
Out[55]:
count     45211.000000
mean       1362.272058
std        3044.765829
min       -8019.000000
25%          72.000000
50%         448.000000
75%        1428.000000
max      102127.000000
Name: balance, dtype: float64
In [56]:
#Get relevant percentiles and see their distribution.
bank['balance'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[56]:
0.0     -8019.0
0.1         0.0
0.2        22.0
0.3       131.0
0.4       272.0
0.5       448.0
0.6       701.0
0.7      1126.0
0.8      1859.0
0.9      3574.0
1.0    102127.0
Name: balance, dtype: float64
In [57]:
#Get the summary of the age variable
summary_age=bank['age'].describe()
summary_age
Out[57]:
count    45211.000000
mean        40.936210
std         10.618762
min         18.000000
25%         33.000000
50%         39.000000
75%         48.000000
max         95.000000
Name: age, dtype: float64
In [58]:
#Get relevant percentiles and see their distribution
bank['age'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
Out[58]:
0.0    18.0
0.1    29.0
0.2    32.0
0.3    34.0
0.4    36.0
0.5    39.0
0.6    42.0
0.7    46.0
0.8    51.0
0.9    56.0
1.0    95.0
Name: age, dtype: float64

Box plots and outlier detection

A box plot is a pictorial way to find outliers. It gets its name from the box at its centre. A box plot shows 5 values: the minimum value, the 1st quartile or lower quartile (LQ), the median, the 3rd quartile or upper quartile (UQ) and the maximum value. The 1st and the 3rd quartile form the box. If there are large outliers in the data, the 3rd quartile, below which 75% of the values lie, will be small relative to the maximum, and the maximum value will be far away from the box. If the box is very small and most of the plot is a line, there are very likely outliers in the data. Outliers may be plotted as individual points.

Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data. Box plots can be drawn either horizontally or vertically.

  • Box plots have a box from LQ to UQ, with the median marked.
  • They portray a five-number graphical summary of the data: Minimum, LQ, Median, UQ, Maximum
  • Helps us to get an idea of the data distribution
  • Helps us to identify the outliers easily
  • 25% of the population is below the first quartile
  • 75% of the population is below the third quartile
  • If the box is pushed to one side and some values lie far away from the box, then it is a clear indication of outliers.
  • For example, suppose the minimum is 5, the maximum is 120, and 75% of the values are less than 15.
  • Still there are some records reaching 120. Hence it is a clear indication of outliers.
  • Sometimes the outliers are so extreme that the box appears to be a flat line in the box plot.
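
The handout judges outliers by eye. One common numeric convention, assumed here and not stated in the handout, is to flag points that lie more than 1.5 times the box height (the interquartile range, IQR) beyond the box; this is also the default rule matplotlib uses to decide which points to draw individually in plt.boxplot(). A minimal sketch on the data vector from the Median section:

```python
import pandas as pd

# The data vector from the Median section; 9 is the suspected outlier
values = pd.Series([1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1])

lq = values.quantile(0.25)   # lower quartile, one edge of the box
uq = values.quantile(0.75)   # upper quartile, the other edge of the box
iqr = uq - lq                # height of the box

# Assumed convention: points beyond 1.5 * IQR from the box are flagged
lower_fence = lq - 1.5 * iqr
upper_fence = uq + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())   # [9.0]
```

The multiplier 1.5 is a convention rather than a law; a stricter or looser threshold may suit a particular dataset better.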

Box plots and outlier detection on Python

In [64]:
#Do you suspect any outliers in balance
bank=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Bank Tele Marketing\bank_market.csv",encoding = "ISO-8859-1")
In [63]:
import matplotlib.pyplot as plt
%matplotlib inline
#Basic box plot using matplotlib.pyplot (imported as plt): plt.boxplot()
plt.boxplot(bank.balance);
In [65]:
#Get relevant percentiles and see their distribution
bank['balance'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,0.95, 1])
#Do you suspect any outliers in balance?
#Yes, outliers are present in the balance variable
Out[65]:
0.00     -8019.0
0.10         0.0
0.20        22.0
0.30       131.0
0.40       272.0
0.50       448.0
0.60       701.0
0.70      1126.0
0.80      1859.0
0.90      3574.0
0.95      5768.0
1.00    102127.0
Name: balance, dtype: float64
In [66]:
#Do you suspect any outliers in age?
#Draw a box plot of the age variable with plt.boxplot()
plt.boxplot(bank.age);

#The age values rise gradually; no extreme outliers are evident in the age variable
In [67]:
#Get relevant percentiles and see their distribution
bank['age'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95,1])
Out[67]:
0.00    18.0
0.10    29.0
0.20    32.0
0.30    34.0
0.40    36.0
0.50    39.0
0.60    42.0
0.70    46.0
0.80    51.0
0.90    56.0
0.95    59.0
1.00    95.0
Name: age, dtype: float64

Graphs or Plots

Graphs are diagrams that show the relation between variables or quantities, or give a visual description of a single variable. Graphs and plots are very important for visualizing data. They give an idea of how the data is distributed along a scale.

Scatter Plot:

  • Scatter plots give us an indication of the relation between two chosen variables.
  • The two variables have to be numerical.

Code for Scatter Plot:

In [71]:
##Scatter Plot:

cars=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Cars Data\Cars.csv",encoding = "ISO-8859-1")
cars.shape
Out[71]:
(428, 15)
In [73]:
cars.columns.values
Out[73]:
array(['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
       'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
       'Weight', 'Wheelbase', 'Length'], dtype=object)
In [74]:
cars['Horsepower'].describe()
Out[74]:
count    428.000000
mean     215.885514
std       71.836032
min       73.000000
25%      165.000000
50%      210.000000
75%      255.000000
max      500.000000
Name: Horsepower, dtype: float64
In [75]:
cars['MPG_City'].describe()
Out[75]:
count    428.000000
mean      20.060748
std        5.238218
min       10.000000
25%       17.000000
50%       19.000000
75%       21.250000
max       60.000000
Name: MPG_City, dtype: float64
In [76]:
import matplotlib.pyplot as plt
plt.plot(cars.Horsepower,cars.MPG_City)
Out[76]:
[<matplotlib.lines.Line2D at 0x2195b4ed710>]
In [77]:
plt.scatter(cars.Horsepower,cars.MPG_City)
Out[77]:
<matplotlib.collections.PathCollection at 0x2195bbeb940>

LAB: Creating Graphs:

  • Dataset: “./Sporting_goods_sales/Sporting_goods_sales.csv”
  • Draw a scatter plot between Average_Income and Sales. Is there any relation between the two variables?
  • Draw a scatter plot between Under35_Population_pect and Sales. Is there any relation between the two?
In [79]:
import matplotlib.pyplot as plt
#Sports data
sports_data=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Sporting_goods_sales\Sporting_goods_sales.csv")
sports_data.head(10)
Out[79]:
Sr_no Avg_family_size Average_Income M_F_Gender_Ratio Un_emp_rate Under35_Population_pect Number_schools Sales
0 1 3 9305.306044 46.654268 2.587691 51.426218 395.379432 140870.7288
1 2 2 8907.622334 64.505029 2.731910 28.485052 316.503520 100305.7146
2 3 2 9846.602630 63.595331 4.269577 49.452727 359.077144 135474.6688
3 4 2 8871.731173 50.451251 3.124004 44.678507 346.833014 126349.5082
4 5 4 9891.047985 51.353801 2.004201 37.664024 329.034161 117434.7267
5 6 1 8323.778337 59.561161 4.499456 55.777614 300.024063 144803.2314
6 7 1 9255.367133 64.763245 3.069215 51.349380 341.563948 128177.9573
7 8 4 9164.876835 61.532119 0.969216 37.302362 348.071965 96958.9253
8 9 3 9270.008017 48.847177 3.121700 55.352672 320.158392 138099.8432
9 10 2 9057.719234 51.379914 2.127062 32.919569 377.142785 112535.7189
In [80]:
#Draw a scatter plot between Average_Income and Sales. Is there any relation between two variables
plt.scatter(sports_data.Average_Income,sports_data.Sales)
Out[80]:
<matplotlib.collections.PathCollection at 0x2195c760e48>
In [81]:
#Draw a scatter plot between Under35_Population_pect and Sales. Is there any relation between two
plt.scatter(sports_data.Under35_Population_pect,sports_data.Sales,color="red")
Out[81]:
<matplotlib.collections.PathCollection at 0x2195c74ab38>

Bar Chart:

  • Bar charts are used to summarize categorical variables and to see the frequency (count) of each category.

Code for Bar Chart:

In order to plot a bar chart for a categorical variable, we first find the frequency distribution of the variable using the function value_counts(). Then we split the frequency table into its values and its index: values holds the counts and index holds the categories. The bar chart is plotted between the index and the values using the function plt.bar().
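
Before the full example on the Cars data, here is a minimal sketch of what value_counts() returns, using a short made-up Series (the values are illustrative and not taken from Cars.csv):

```python
import pandas as pd

# Hypothetical categorical column (illustrative values only)
cylinders = pd.Series([4, 6, 4, 8, 6, 4, 6, 6])

freq = cylinders.value_counts()      # counts per category, most frequent first
print(freq.index.tolist())           # [6, 4, 8]
print(freq.values.tolist())          # [4, 3, 1]

# Sorting by category instead of by count lists the categories in order
freq_sorted = freq.sort_index()
print(freq_sorted.index.tolist())    # [4, 6, 8]
print(freq_sorted.values.tolist())   # [3, 4, 1]
```

Either table can be passed to plt.bar() as index and values; for a numeric index the bars are positioned by category value regardless of the order of the table.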

In [82]:
#Bar charts used to summarize the categorical variables

import pandas as pd
cars=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Cars Data\Cars.csv",encoding = "ISO-8859-1")
cars.shape
Out[82]:
(428, 15)
In [83]:
cars.columns.values
Out[83]:
array(['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
       'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
       'Weight', 'Wheelbase', 'Length'], dtype=object)
In [84]:
freq=cars.Cylinders.value_counts()
In [86]:
freq.values
Out[86]:
array([190, 136,  87,   7,   3,   2,   1], dtype=int64)
In [87]:
freq.index
Out[87]:
Float64Index([6.0, 4.0, 8.0, 5.0, 12.0, 10.0, 3.0], dtype='float64')
In [88]:
import matplotlib.pyplot as plt
plt.bar(freq.index,freq.values)
Out[88]:
<Container object of 7 artists>

LAB: Bar Chart:

  • Dataset: “./Sporting_goods_sales/Sporting_goods_sales.csv”
  • Create a bar chart summarizing the information on family size.
In [90]:
sports_data=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Sporting_goods_sales\Sporting_goods_sales.csv",encoding = "ISO-8859-1")
In [91]:
sports_data.shape
Out[91]:
(150, 8)
In [92]:
sports_data.columns.values
Out[92]:
array(['Sr_no', 'Avg_family_size', 'Average_Income', 'M_F_Gender_Ratio',
       'Un_emp_rate', 'Under35_Population_pect', 'Number_schools', 'Sales'], dtype=object)
In [93]:
freq=sports_data.Avg_family_size.value_counts()
freq.values
Out[93]:
array([61, 57, 18, 14], dtype=int64)
In [94]:
freq.index
Out[94]:
Int64Index([3, 2, 4, 1], dtype='int64')
In [95]:
import matplotlib.pyplot as plt
plt.bar(freq.index,freq.values)
Out[95]:
<Container object of 4 artists>

Trend Chart:

  • A trend chart is used for time series datasets.
  • It shows how the value of a variable changes over time.

Code for Trend Chart:

We are taking the AirPassengers dataset for plotting the trend chart. The plot() function is used for plotting the trend chart.

In [96]:
AirPassengers=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Air Travel Data\Air_travel.csv", encoding = "ISO-8859-1")
In [97]:
AirPassengers.head()
Out[97]:
DATE AIR
0 JAN49 112
1 FEB49 118
2 MAR49 132
3 APR49 129
4 MAY49 121
In [98]:
AirPassengers.columns.values
Out[98]:
array(['DATE', 'AIR'], dtype=object)
In [99]:
import matplotlib.pyplot as plt
plt.plot(AirPassengers.AIR)
Out[99]:
[<matplotlib.lines.Line2D at 0x2195bfb8630>]
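
The plot above uses the row number on the x-axis because DATE is stored as text such as JAN49. A possible way to put real dates on the axis is sketched below, using only the five rows shown in the head() output. It assumes that the two-digit years belong to the 1900s and that month abbreviations are parsed in an English locale; neither assumption is stated in the handout:

```python
import pandas as pd

# The first five rows shown in the head() output above
air = pd.DataFrame({
    "DATE": ["JAN49", "FEB49", "MAR49", "APR49", "MAY49"],
    "AIR": [112, 118, 132, 129, 121],
})

# Assumption: the years are 19xx. Parsing "%y" directly would read 49 as 2049,
# so the century is inserted explicitly before parsing.
month = air["DATE"].str[:3].str.title()   # "JAN" -> "Jan"
year = "19" + air["DATE"].str[3:]         # "49"  -> "1949"
air["DATE"] = pd.to_datetime(month + year, format="%b%Y")

# With dates as the index, the series carries its own x-axis values
series = air.set_index("DATE")["AIR"]
print(series.index[0])   # 1949-01-01 00:00:00
```

Calling `plt.plot(series.index, series.values)` would then draw the same trend with dates instead of row numbers on the x-axis.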

Conclusion:

  • In this session we discussed some basic data reporting and graphs.
  • Studying descriptive statistics is essential before we start advanced modeling. It gives us an idea of the variable distributions.
  • We also discussed drawing graphs using some useful packages in Python.
