Introduction

This session covers basic statistics, graphs and reports. So far we have covered basic Python programming and basic data handling. Statistical concepts are important to learn before we get into real analytics. Once a dataset is imported, a few basic statistics tell us what the variables are, what values they take and how they are distributed. This gives us a first understanding of the data we have imported.

Sampling in Python

Sampling is a method of selecting a few observations from a large population or dataset in such a way that the sample represents the underlying characteristics of the whole. A sample is simply a subset of the larger dataset in which each observation is chosen at random. The advantage of sampling is that a result computed on a well-drawn sample is usually close to the result computed on the whole dataset. We will now see how to perform sampling in Python. We use the pandas package to import the data and the sample() function to draw the sample. The code is as follows:

In [14]:

#Taking a random sample 

import pandas as pd
Online_Retail=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Online_Retail_Sales_Data\Online Retail.csv", encoding = "ISO-8859-1")

In [15]:

Online_Retail.shape

Out[15]:

(541909, 8)

In [16]:

#replace must be the boolean False; the string "False" is truthy and would sample with replacement
sample_data=Online_Retail.sample(n=1000,replace=False)
sample_data.shape

Out[16]:

(1000, 8)
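
A random sample changes on every run unless the random number generator is seeded. The sketch below uses a small made-up DataFrame (`toy`, not one of the course datasets) to show two standard parameters of the pandas sample() function: `random_state`, which makes the sample repeatable, and `frac`, which draws a fraction of the rows instead of a fixed count.

```python
import pandas as pd

# Hypothetical toy data standing in for a large dataset
toy = pd.DataFrame({"id": range(100), "value": [i * 2 for i in range(100)]})

# Fixed-size sample without replacement; random_state makes it repeatable
sample_a = toy.sample(n=10, replace=False, random_state=42)
sample_b = toy.sample(n=10, replace=False, random_state=42)

# Sample 20% of the rows instead of a fixed number of rows
sample_frac = toy.sample(frac=0.2, random_state=42)

print(sample_a.shape)             # (10, 2)
print(sample_a.equals(sample_b))  # True: same seed, same rows
print(sample_frac.shape)          # (20, 2)
```

Fixing the seed is mainly useful when results need to be reproduced later, for example in a lab report.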

LAB: Sampling in Python

Import “Census Income Data/Income_data.csv”
Create a new dataset by taking a random sample of 5000 records

In [21]:

#Import “Census Income Data/Income_data.csv”
Income=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Census Income Data\Income_data.csv")
Income.shape

Out[21]:

(32561, 15)

In [18]:

Income.head()

Out[18]:

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	Income_band
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

In [22]:

Income.tail(3)

Out[22]:

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	Income_band
32558	58	Private	151910	HS-grad	9	Widowed	Adm-clerical	Unmarried	White	Female	0	40	United-States	<=50K
32559	22	Private	201490	HS-grad	9	Never-married	Adm-clerical	Own-child	White	Male	0	20	United-States	<=50K
32560	52	Self-emp-inc	287927	HS-grad	9	Married-civ-spouse	Exec-managerial	Wife	White	Female	15024	40	United-States	>50K

In [23]:

#Sample size 5000
Sample_income=Income.sample(n=5000)
Sample_income.shape

Out[23]:

(5000, 15)

In [24]:

Sample_income

Out[24]:

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	Income_band
10267	27	Private	185647	HS-grad	9	Married-civ-spouse	Machine-op-inspct	Husband	White	Male	0	0	50	United-States	>50K
20313	33	Private	203488	Some-college	10	Never-married	Exec-managerial	Own-child	White	Male	0	0	45	United-States	<=50K
12276	21	Private	180339	Assoc-voc	11	Never-married	Farming-fishing	Not-in-family	White	Female	0	1602	30	United-States	<=50K
7399	42	Local-gov	227065	Masters	14	Never-married	Prof-specialty	Not-in-family	White	Male	0	0	22	United-States	<=50K
19773	49	Self-emp-not-inc	79627	Some-college	10	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	40	United-States	<=50K
9289	19	Private	145844	Assoc-acdm	12	Never-married	Exec-managerial	Not-in-family	White	Female	0	0	50	United-States	<=50K
25176	33	Local-gov	175509	HS-grad	9	Married-civ-spouse	Protective-serv	Husband	White	Male	0	0	40	United-States	<=50K
7501	51	Self-emp-not-inc	32372	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	1672	70	United-States	<=50K
3524	21	Private	197387	HS-grad	9	Never-married	Handlers-cleaners	Own-child	White	Male	0	0	20	United-States	<=50K
1938	50	State-gov	159219	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	40	Canada	>50K
3966	49	Private	101825	Masters	14	Married-civ-spouse	Prof-specialty	Wife	White	Female	0	1977	40	United-States	>50K
12396	35	Private	267966	11th	7	Never-married	Machine-op-inspct	Not-in-family	White	Male	0	0	50	United-States	<=50K
23926	59	Private	118358	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	40	United-States	>50K
18136	35	Self-emp-not-inc	42044	Assoc-acdm	12	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	40	United-States	<=50K
23862	32	Private	244147	HS-grad	9	Never-married	Craft-repair	Unmarried	White	Male	0	0	10	United-States	<=50K
26300	50	Self-emp-not-inc	167728	Doctorate	16	Married-civ-spouse	Prof-specialty	Husband	White	Male	0	0	60	United-States	>50K
28729	70	Self-emp-inc	158437	Prof-school	15	Married-civ-spouse	Prof-specialty	Husband	White	Male	0	0	35	United-States	>50K
28419	57	Self-emp-not-inc	184553	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	50	United-States	>50K
27474	30	Private	108386	Assoc-acdm	12	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	40	United-States	<=50K
1646	33	Private	240763	Some-college	10	Married-civ-spouse	Machine-op-inspct	Husband	Black	Male	0	0	40	United-States	<=50K
9597	63	State-gov	109735	HS-grad	9	Divorced	Adm-clerical	Not-in-family	White	Female	0	0	38	United-States	<=50K
12785	25	Private	120238	HS-grad	9	Married-spouse-absent	Machine-op-inspct	Not-in-family	White	Male	0	0	40	Poland	<=50K
26324	18	Private	77845	Some-college	10	Never-married	Adm-clerical	Own-child	White	Male	0	1602	15	United-States	<=50K
12462	44	Self-emp-not-inc	155930	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	60	United-States	<=50K
2798	35	Private	25955	11th	7	Never-married	Other-service	Unmarried	Amer-Indian-Eskimo	Male	0	0	40	United-States	<=50K
997	48	Federal-gov	33109	Bachelors	13	Divorced	Exec-managerial	Unmarried	White	Male	0	0	58	United-States	>50K
12591	56	Private	92444	Some-college	10	Married-civ-spouse	Exec-managerial	Husband	Black	Male	0	0	50	United-States	>50K
29534	43	State-gov	424094	Doctorate	16	Never-married	Prof-specialty	Not-in-family	White	Female	0	0	40	United-States	<=50K
16456	46	?	37672	HS-grad	9	Divorced	?	Not-in-family	White	Female	0	0	15	United-States	<=50K
23168	22	Private	54825	HS-grad	9	Never-married	Sales	Not-in-family	White	Male	0	0	40	United-States	<=50K
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
32304	30	Never-worked	176673	HS-grad	9	Married-civ-spouse	?	Wife	Black	Female	0	0	40	United-States	<=50K
29365	25	Private	264055	Bachelors	13	Never-married	Adm-clerical	Own-child	White	Male	0	0	40	United-States	<=50K
30967	21	Private	156980	HS-grad	9	Never-married	Machine-op-inspct	Own-child	White	Male	0	0	60	United-States	<=50K
30149	50	Private	135465	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	50	United-States	>50K
5904	34	Private	103596	HS-grad	9	Married-civ-spouse	Handlers-cleaners	Husband	White	Male	0	0	40	United-States	<=50K
4961	28	Private	51331	Assoc-acdm	12	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	0	16	United-States	>50K
19127	46	Private	164682	Assoc-voc	11	Separated	Prof-specialty	Not-in-family	White	Male	0	0	40	United-States	<=50K
23627	32	Private	113364	Bachelors	13	Never-married	Prof-specialty	Not-in-family	White	Male	0	0	40	United-States	>50K
25536	25	Private	187577	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	40	United-States	<=50K
11607	38	Private	199816	HS-grad	9	Divorced	Machine-op-inspct	Own-child	White	Male	0	0	40	United-States	<=50K
2689	22	?	216563	HS-grad	9	Never-married	?	Other-relative	White	Female	0	0	40	United-States	<=50K
5369	30	Private	1184622	Some-college	10	Married-civ-spouse	Transport-moving	Husband	Black	Male	0	0	35	United-States	<=50K
7863	31	Private	220690	Some-college	10	Divorced	Craft-repair	Not-in-family	White	Male	0	0	80	United-States	<=50K
21907	56	Private	340171	HS-grad	9	Married-civ-spouse	Transport-moving	Husband	Black	Male	0	0	40	United-States	<=50K
10043	28	Private	96226	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	45	United-States	<=50K
11676	25	Private	167031	10th	6	Never-married	Machine-op-inspct	Not-in-family	White	Female	0	0	40	Columbia	<=50K
23909	22	Private	113760	11th	7	Never-married	Other-service	Own-child	White	Female	0	0	30	United-States	<=50K
19997	72	Private	268861	7th-8th	4	Widowed	Other-service	Not-in-family	White	Female	0	0	99	?	<=50K
2155	66	?	117778	11th	7	Married-civ-spouse	?	Husband	White	Male	0	0	40	United-States	<=50K
1371	61	?	190997	HS-grad	9	Married-civ-spouse	?	Husband	White	Male	0	0	6	United-States	<=50K
14981	32	Private	193042	Bachelors	13	Married-civ-spouse	Sales	Husband	White	Male	0	0	44	United-States	<=50K
24814	37	Private	409189	HS-grad	9	Married-civ-spouse	Other-service	Husband	White	Male	0	0	30	Mexico	<=50K
5967	21	Private	90935	Assoc-voc	11	Never-married	Transport-moving	Own-child	White	Male	0	0	40	United-States	<=50K
6673	38	Private	343403	Assoc-acdm	12	Divorced	Adm-clerical	Not-in-family	White	Female	0	0	16	United-States	<=50K
4149	21	Private	237651	Some-college	10	Never-married	Other-service	Own-child	White	Male	0	0	35	United-States	<=50K
26316	18	Private	205218	Some-college	10	Never-married	Other-service	Own-child	White	Female	0	0	20	United-States	<=50K
28836	19	Private	311974	HS-grad	9	Never-married	Craft-repair	Other-relative	White	Male	0	0	25	Mexico	<=50K
27070	44	Local-gov	208528	Assoc-acdm	12	Married-civ-spouse	Farming-fishing	Husband	White	Male	0	0	30	United-States	<=50K
27361	20	Private	293297	Some-college	10	Never-married	Other-service	Own-child	White	Male	0	0	35	United-States	<=50K
8997	42	Private	30424	11th	7	Separated	Other-service	Unmarried	White	Female	0	0	38	United-States	<=50K

5000 rows × 15 columns

Descriptive statistics:

Basic descriptive statistics give us an idea about the variables and their distributions.
They permit the analyst to describe many pieces of data with a few indices.
Central tendencies
- Central tendencies are the middle values of a variable. We look at two of them:
- Mean
- Median
Dispersion
- Dispersion shows the range or spread of a variable. Common measures are:
- Range
- Variance
- Standard deviation

Mean

The arithmetic mean
Sum of values/ Count of values
Gives a quick idea on average of a variable

Median

Mean is not a good measure in the presence of outliers.
For example, consider the data vector below:
- 1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1
10 of the 11 values (about 91%) are less than 2, but the mean of the vector is 2.
There is one unusual value in the vector, namely 9.
It is an outlier in the data vector.
Mean is not the true middle value in the presence of outliers, because it is strongly affected by them.
In such cases we use the median, the true middle value.
To find it, sort the data in ascending or descending order and take the middle value (or the average of the two middle values when the count is even). Here the median is 1.4.
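
The effect of the outlier can be checked directly on the data vector from this section. This is a minimal sketch using Python's built-in statistics module rather than pandas; the `trimmed` list is introduced here only for illustration.

```python
from statistics import mean, median

values = [1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1]

print(round(mean(values), 2))     # 2.0, pulled up by the outlier 9
print(median(values))             # 1.4, the middle of the sorted values

# Dropping the outlier changes the mean a lot but barely moves the median
trimmed = [v for v in values if v != 9]
print(round(mean(trimmed), 2))    # 1.3
print(round(median(trimmed), 2))  # 1.3
```

Without the value 9 the mean and the median agree, which is what we expect when no outliers are present.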

Calculating Mean and Median in Python

We import the Income dataset and find the mean and median of the variable called capital-gain. To find the mean or median of a variable, we take the whole dataset, use square brackets to select the column by name, and then call .mean() or .median() respectively.

In [30]:

#Import “Census Income Data/Income_data.csv”
Income=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Census Income Data\Income_data.csv")

Income.columns.values

Out[30]:

array(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'Income_band'], dtype=object)

In [33]:

#Mean and Median on python
gain_mean=Income["capital-gain"].mean()
gain_mean

Out[33]:

1077.6488437087312

In [34]:

gain_median=Income["capital-gain"].median()
gain_median

Out[34]:

0.0

Here the mean of capital-gain is about 1077 and the median is 0, which means that at least 50% of the values are zero. The difference between the mean and the median is very large: a small number of very high values pull the mean up. It looks like there are outliers in the data, so we need to look at percentiles and a box plot.

LAB: Mean and Median on Python

Dataset: “./Online Retail Sales Data/Online Retail.csv”
What is the mean of “UnitPrice”?
What is the median of “UnitPrice”?
Is the mean equal to the median? Do you suspect the presence of outliers in the data?
What is the mean of “Quantity”?
What is the median of “Quantity”?
Is the mean equal to the median? Do you suspect the presence of outliers in the data?

In [35]:

Online_Retail=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Online_Retail_Sales_Data\Online Retail.csv", encoding = "ISO-8859-1")
Online_Retail.shape
Online_Retail.columns.values

Out[35]:

array(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'], dtype=object)

In [36]:

#Mean and median of 'UnitPrice' in Online Retail data
up_mean=Online_Retail['UnitPrice'].mean()
up_mean

Out[36]:

4.611113626083471

In [37]:

up_median=Online_Retail['UnitPrice'].median()
up_median

Out[37]:

2.08

In [38]:

#Mean of "Quantity" in Online Retail data
Quantity_mean=Online_Retail['Quantity'].mean()
Quantity_mean

Out[38]:

9.55224954743324

In [39]:

Quantity_median=Online_Retail['Quantity'].median()
Quantity_median

Out[39]:

3.0

Dispersion Measures: Variance and Standard Deviation:

Central tendencies are not enough to understand a variable. They only tell us about the middle values, not about how far the values stretch. A measure of dispersion is needed to understand the spread of the values and to detect outliers.

Dispersion

Just knowing the central tendency is not enough.
Two variables might have the same mean, yet be very different.
Look at the profit details of two companies, A and B, for the last 14 quarters (in millions).

Though the average profit is 15 in both cases, company B has performed more consistently than company A.
There were even losses for company A.
Measures of dispersion become vital in such cases.

Variance

Dispersion is the quantification of the deviation of each point from the mean value.
Variance is the average of the squared distances of each point from the mean value.
Variance is a fairly good measure of dispersion.
The variance in profit is 352 for company A and 4.9 for company B.
Company A shows high fluctuation in profit whereas company B shows low fluctuation, so company B is preferred.
In general, the lower the variance, the less noise in the data.
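
The definition above can be worked through by hand. The sketch below uses two short made-up profit series with the same mean (they are not the 14-quarter figures discussed in this section) and compares a hand-written `average_squared_distance` helper, a name introduced here for illustration, with `pvariance` from Python's statistics module.

```python
from statistics import mean, pvariance

# Made-up quarterly profits, chosen so that both companies average 15
company_a = [10, 30, -5, 25, 15]
company_b = [14, 16, 15, 13, 17]

def average_squared_distance(values):
    """Variance as defined above: mean of squared deviations from the mean."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values) / len(values)

print(mean(company_a), mean(company_b))     # 15 15
print(average_squared_distance(company_a))  # 150.0
print(average_squared_distance(company_b))  # 2.0

# The library function gives the same population variance
print(pvariance(company_a) == average_squared_distance(company_a))  # True
```

Both series have the same mean, but the first one is far more spread out, which is exactly what the variance captures.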

Standard Deviation

Standard deviation is just the square root of the variance.
Variance gives a good idea of dispersion, but it is on the order of squares.
It is clear from the formula that the units of variance are the square of the units of the original data.
Standard deviation is the dispersion measure that is in the same units as the original data.

Formula:
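
Written out from the definitions above (variance as the average squared distance from the mean, standard deviation as its square root), with x_i the N observations and \mu their mean. The notation is the conventional one and is assumed here:

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2,
\qquad
\sigma = \sqrt{\sigma^2}

% Sample versions, dividing by n - 1 instead of N
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2}
```

Note that the pandas var() and std() functions use the sample versions by default (ddof=1), so on small datasets their results are slightly larger than the plain average of squared distances.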

Calculating Variance and Standard Deviation

Divide the Income data into two sets: USA v/s Others.
Find the variance of “education-num” in those two sets. Which one has the higher variance?
Variance is calculated using the var() function and standard deviation using the std() function. The code is given below:

In [40]:

usa_income=Income[Income["native-country"]==' United-States']
usa_income.shape

Out[40]:

(29170, 15)

In [41]:

other_income=Income[Income["native-country"]!=' United-States']
other_income.shape

Out[41]:

(3391, 15)

In [42]:

#Var and SD for USA
var_usa=usa_income["education-num"].var()
var_usa

Out[42]:

5.735862879538104

In [43]:

std_usa=usa_income["education-num"].std()
std_usa

Out[43]:

2.394966154152936

In [44]:

var_other=other_income["education-num"].var()
var_other

Out[44]:

13.567613037808737

In [45]:

std_other=other_income["education-num"].std()
std_other

Out[45]:

3.6834240914954033

LAB: Variance and Standard deviation

Dataset: “./Online Retail Sales Data/Online Retail.csv”
What is the variance and s.d of “UnitPrice”?
What is the variance and s.d of “Quantity”?
Which one of these two variables is more consistent?

In [46]:

##var and sd UnitPrice
var_UnitPrice=Online_Retail['UnitPrice'].var()
var_UnitPrice

Out[46]:

9362.469164424467

In [47]:

std_UnitPrice=Online_Retail['UnitPrice'].std()
std_UnitPrice

Out[47]:

96.75985306119716

In [48]:

#variance and sd of Quantity
var_Quantity=Online_Retail['Quantity'].var()
var_Quantity

Out[48]:

47559.39140913822

In [49]:

std_Quantity=Online_Retail['Quantity'].std()
std_Quantity

Out[49]:

218.08115784986612

Percentiles & quartiles in python

Percentiles

A percentile (or centile) is a measure used in statistics that indicates the value below which a given percentage of observations in a group fall. For example, the 20th percentile is the value (or score) below which 20% of the observations are found. For a particular variable, the population can be divided into 100 equal groups according to the distribution of values.

A student took an exam along with 1000 others.
He got 68% marks. How well or badly did he perform in the exam?
What will be his overall rank?
What will be his rank if there were 100 students overall?

Imagine that there are 1000 students in a class and one of them got 68% marks. Can we say whether he performed well or not? Not from the marks alone: we need to rank him relative to the other students.

Let's say that 910 students got less than 68% and only 89 students got more marks than him.
He is at the 91st percentile.
Instead of the raw 68 marks, the 91st percentile gives a good idea of his performance.
Percentiles make the data easy to read.
pth percentile: p percent of the observations are below it and (100 - p)% are above it.

Let's say there is a candidate who got 40 marks and whose percentile is 80, which means 80% of the people are below him. This kind of scaling is used in competitive exams like CAT, GATE etc.

Marks are 40 but the percentile is 80; what does this mean?
An 80th percentile in the CAT exam means 20% of the people are above and 80% are below.
Percentiles help us get an idea of outliers.
For example, the highest income value is 400,000 but the 95th percentile is only 20,000. That means 95% of the values are less than 20,000, so the values near 400,000 are clearly outliers.
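
The student example can be reproduced in a few lines. This is a sketch only: the `percentile_rank` helper and the marks in `scores` are made up for illustration, arranged so that 910 students are below 68 and 89 are above.

```python
def percentile_rank(scores, score):
    """Percentage of observations strictly below `score`."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

# Made-up class of 1000 students: 910 below 68, one at 68, 89 above
scores = [50] * 910 + [68] + [80] * 89

print(len(scores))                  # 1000
print(percentile_rank(scores, 68))  # 91.0
```

The same marks would give a very different percentile in a stronger or weaker class, which is why the percentile says more about performance than the raw marks.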

Quartiles

In descriptive statistics, the quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each comprising a quarter of the data. A quartile is a type of quantile. The first quartile (Q1) is the middle value between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of the data set.

Percentiles divide the whole population into 100 groups, whereas quartiles divide the population into 4 groups.
p = 25: First Quartile or Lower quartile (LQ)
p = 50: second quartile or Median
p = 75: Third Quartile or Upper quartile (UQ)
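
As a quick check of these definitions, the sketch below computes the three quartiles of a small made-up series with the pandas quantile() function, which takes p as a fraction between 0 and 1.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])  # made-up values

# p = 25, 50 and 75 expressed as fractions
quartiles = data.quantile([0.25, 0.50, 0.75])

print(quartiles.tolist())                # [3.0, 5.0, 7.0]
print(quartiles[0.50] == data.median())  # True: the second quartile is the median
```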

Code for Percentiles and Quantiles

In [50]:

Income["capital-gain"].describe()

Out[50]:

count    32561.000000
mean      1077.648844
std       7385.292085
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max      99999.000000
Name: capital-gain, dtype: float64

In [51]:

#Finding the percentile & quantile by using .quantile()
Income['capital-gain'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

Out[51]:

0.0        0.0
0.1        0.0
0.2        0.0
0.3        0.0
0.4        0.0
0.5        0.0
0.6        0.0
0.7        0.0
0.8        0.0
0.9        0.0
1.0    99999.0
Name: capital-gain, dtype: float64

In [52]:

Income['capital-loss'].quantile([0, 0.1, 0.2, 0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])

Out[52]:

0.0       0.0
0.1       0.0
0.2       0.0
0.3       0.0
0.4       0.0
0.5       0.0
0.6       0.0
0.7       0.0
0.8       0.0
0.9       0.0
1.0    4356.0
Name: capital-loss, dtype: float64

In [53]:

Income['hours-per-week'].quantile([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.95,0.98,1])

Out[53]:

0.00     1.0
0.10    24.0
0.20    35.0
0.30    40.0
0.40    40.0
0.50    40.0
0.60    40.0
0.70    40.0
0.80    48.0
0.90    55.0
0.95    60.0
0.98    70.0
1.00    99.0
Name: hours-per-week, dtype: float64

LAB: percentiles & quartiles in python

Dataset: “./Bank Marketing/bank_market.csv”
Get the summary of the balance variable
Do you suspect any outliers in balance ?
Get relevant percentiles and see their distribution.
Are there any outliers present?
Get the summary of the age variable
Do you suspect any outliers in age?
Get relevant percentiles and see their distribution.
Are there any outliers present?

In [54]:

bank=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Bank Tele Marketing\bank_market.csv",encoding = "ISO-8859-1")
bank.shape

Out[54]:

(45211, 18)

In [55]:

#Get the summary of the balance variable
#we can find the summary of the balance variable by using .describe()
summary_bala=bank["balance"].describe()
summary_bala

Out[55]:

count     45211.000000
mean       1362.272058
std        3044.765829
min       -8019.000000
25%          72.000000
50%         448.000000
75%        1428.000000
max      102127.000000
Name: balance, dtype: float64

In [56]:

#Get relevant percentiles and see their distribution.
bank['balance'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

Out[56]:

0.0     -8019.0
0.1         0.0
0.2        22.0
0.3       131.0
0.4       272.0
0.5       448.0
0.6       701.0
0.7      1126.0
0.8      1859.0
0.9      3574.0
1.0    102127.0
Name: balance, dtype: float64

In [57]:

#Get the summary of the age variable
summary_age=bank['age'].describe()
summary_age

Out[57]:

count    45211.000000
mean        40.936210
std         10.618762
min         18.000000
25%         33.000000
50%         39.000000
75%         48.000000
max         95.000000
Name: age, dtype: float64

In [58]:

#Get relevant percentiles and see their distribution
bank['age'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])

Out[58]:

0.0    18.0
0.1    29.0
0.2    32.0
0.3    34.0
0.4    36.0
0.5    39.0
0.6    42.0
0.7    46.0
0.8    51.0
0.9    56.0
1.0    95.0
Name: age, dtype: float64

Box plots and outlier detection

A box plot is a pictorial way to find outliers. It gets its name from the box drawn in the middle of the plot. A box plot summarizes five values: the minimum, the 1st quartile or lower quartile (LQ), the median, the 3rd quartile or upper quartile (UQ) and the maximum. The 1st and 3rd quartiles form the box. If there are large outliers in the data, the 3rd quartile, below which 75% of the values lie, will be small compared with the maximum, and the maximum will lie far away from the box. If the box is very small and most of the plot is a line, outliers are very likely present. Outliers may be plotted as individual points. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers. Box plots can be drawn either horizontally or vertically.

Box plots have a box from LQ to UQ, with the median marked.
They portray a five-number graphical summary of the data: minimum, LQ, median, UQ, maximum.
They help us get an idea of the data distribution.
They help us identify outliers easily.
25% of the population is below the first quartile.
75% of the population is below the third quartile.
If the box is pushed to one side and some values are far away from the box, it is a clear indication of outliers.

Suppose a set of values lies far away from the box.
For example, the minimum is 5, the maximum is 120, and 75% of the values are less than 15.
Still, some records reach 120, which is a clear indication of outliers.
Sometimes the outliers are so extreme that the box appears to be a horizontal line in the box plot.
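
The five-number summary can also be computed without drawing anything. The sketch below uses made-up values with a minimum of 5 and a maximum of 120, and flags outliers with the common 1.5 x IQR rule. That rule is an assumption added here, not something this handout defines; it is the convention matplotlib's boxplot uses by default for its whiskers.

```python
import pandas as pd

values = pd.Series([5, 7, 8, 9, 10, 11, 12, 13, 15, 120])  # made-up data

# Five-number summary: minimum, LQ, median, UQ, maximum
q1, median, q3 = values.quantile([0.25, 0.50, 0.75])
print(values.min(), q1, median, q3, values.max())

# Points more than 1.5 box-lengths beyond the box are flagged as outliers
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(outliers.tolist())  # [120]
```

Here the box runs from 8.25 to 12.75 while the maximum is 120, so the box would look squeezed to one side of the plot.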

Box plots and outlier detection on Python

In [64]:

#Do you suspect any outliers in balance
bank=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Bank Tele Marketing\bank_market.csv",encoding = "ISO-8859-1")

In [63]:

import matplotlib.pyplot as plt
%matplotlib inline
#Draw a basic box plot with plt.boxplot() from matplotlib.pyplot
plt.boxplot(bank.balance);

In [65]:

#Get relevant percentiles and see their distribution
bank['balance'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1])
#Do you suspect any outliers in balance?
#Yes: outliers are present in the balance variable

Out[65]:

0.00     -8019.0
0.10         0.0
0.20        22.0
0.30       131.0
0.40       272.0
0.50       448.0
0.60       701.0
0.70      1126.0
0.80      1859.0
0.90      3574.0
0.95      5768.0
1.00    102127.0
Name: balance, dtype: float64

In [66]:

#Do you suspect any outliers in age?
#Detect the outliers in the age variable with plt.boxplot()
plt.boxplot(bank.age);

#Outliers are not present in the age variable

In [67]:

#No outliers are present
#Get relevant percentiles and see their distribution
bank['age'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95,1])

Out[67]:

0.00    18.0
0.10    29.0
0.20    32.0
0.30    34.0
0.40    36.0
0.50    39.0
0.60    42.0
0.70    46.0
0.80    51.0
0.90    56.0
0.95    59.0
1.00    95.0
Name: age, dtype: float64

Graphs or Plots

Graphs are diagrams that show the relation between variables or quantities, or give a visual description of a single variable. Graphs and plots are very important for visualizing data. They give an idea of how the data is distributed across its range.

Scatter Plot:

Scatter plots give us an indication of the relation between two chosen variables.
Both variables have to be numerical.
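
A scatter plot shows a relation visually; one common way to put a number on it is the Pearson correlation coefficient, which ranges from -1 to 1. The coefficient is not covered in this handout, so the sketch below is an optional aside. The `pearson` helper and the two short lists are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

horsepower = [100, 150, 200, 250, 300]  # made-up values
mpg_city = [30, 26, 22, 18, 14]         # falls steadily as horsepower rises

print(round(pearson(horsepower, mpg_city), 2))  # -1.0
```

For DataFrame columns, pandas offers the same calculation through the corr() method of a Series.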

Code for Scatter Plot:

In [71]:

##Scatter Plot:

cars=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Cars Data\Cars.csv",encoding = "ISO-8859-1")
cars.shape

Out[71]:

(428, 15)

In [73]:

cars.columns.values

Out[73]:

array(['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
       'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
       'Weight', 'Wheelbase', 'Length'], dtype=object)

In [74]:

cars['Horsepower'].describe()

Out[74]:

count    428.000000
mean     215.885514
std       71.836032
min       73.000000
25%      165.000000
50%      210.000000
75%      255.000000
max      500.000000
Name: Horsepower, dtype: float64

In [75]:

cars['MPG_City'].describe()

Out[75]:

count    428.000000
mean      20.060748
std        5.238218
min       10.000000
25%       17.000000
50%       19.000000
75%       21.250000
max       60.000000
Name: MPG_City, dtype: float64

In [76]:

import matplotlib.pyplot as plt
#plt.plot() joins the points with lines in row order, which is hard to read for this data
plt.plot(cars.Horsepower,cars.MPG_City)

Out[76]:

[<matplotlib.lines.Line2D at 0x2195b4ed710>]

In [77]:

plt.scatter(cars.Horsepower,cars.MPG_City)

Out[77]:

<matplotlib.collections.PathCollection at 0x2195bbeb940>

LAB: Creating Graphs:

Dataset: “./Sporting_goods_sales/Sporting_goods_sales.csv”
Draw a scatter plot between Average_Income and Sales. Is there any relation between the two variables?
Draw a scatter plot between Under35_Population_pect and Sales. Is there any relation between the two?

In [79]:

import matplotlib.pyplot as plt
#Sports data
sports_data=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Sporting_goods_sales\Sporting_goods_sales.csv")
sports_data.head(10)

Out[79]:

	Sr_no	Avg_family_size	Average_Income	M_F_Gender_Ratio	Un_emp_rate	Under35_Population_pect	Number_schools	Sales
0	1	3	9305.306044	46.654268	2.587691	51.426218	395.379432	140870.7288
1	2	2	8907.622334	64.505029	2.731910	28.485052	316.503520	100305.7146
2	3	2	9846.602630	63.595331	4.269577	49.452727	359.077144	135474.6688
3	4	2	8871.731173	50.451251	3.124004	44.678507	346.833014	126349.5082
4	5	4	9891.047985	51.353801	2.004201	37.664024	329.034161	117434.7267
5	6	1	8323.778337	59.561161	4.499456	55.777614	300.024063	144803.2314
6	7	1	9255.367133	64.763245	3.069215	51.349380	341.563948	128177.9573
7	8	4	9164.876835	61.532119	0.969216	37.302362	348.071965	96958.9253
8	9	3	9270.008017	48.847177	3.121700	55.352672	320.158392	138099.8432
9	10	2	9057.719234	51.379914	2.127062	32.919569	377.142785	112535.7189

In [80]:

#Draw a scatter plot between Average_Income and Sales. Is there any relation between two variables
plt.scatter(sports_data.Average_Income,sports_data.Sales)

Out[80]:

<matplotlib.collections.PathCollection at 0x2195c760e48>

In [81]:

#Draw a scatter plot between Under35_Population_pect and Sales. Is there any relation between two
plt.scatter(sports_data.Under35_Population_pect,sports_data.Sales,color="red")

Out[81]:

<matplotlib.collections.PathCollection at 0x2195c74ab38>

Bar Chart:

Bar charts are used to summarize categorical variables and to see the frequency, or count, of each category.

Code for Bar Chart:

In order to plot a bar chart for a categorical variable, we first find the frequency distribution of the variable using the function value_counts(). We then split the frequency table into values and index: values holds the counts and index holds the categories. The bar chart is plotted from the index and the values using plt.bar().
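
The frequency-table step can be tried on a small made-up column before running it on the real data. This sketch stops short of plotting; the `cylinders` values are invented for illustration.

```python
import pandas as pd

cylinders = pd.Series([4, 6, 6, 8, 4, 6])  # made-up categorical column

# value_counts() returns the counts, most frequent category first
freq = cylinders.value_counts()
print(freq.index.tolist())   # categories: [6, 4, 8]
print(freq.values.tolist())  # matching counts: [3, 2, 1]

# sort_index() orders the table by category instead of by count
by_category = freq.sort_index()
print(by_category.index.tolist())   # [4, 6, 8]
print(by_category.values.tolist())  # [2, 3, 1]
```

Either table can be passed to plt.bar() as index and values; the sorted one simply makes the order of the table match the order of the bars on the axis.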

In [82]:

#Bar charts used to summarize the categorical variables

import pandas as pd
cars=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Cars Data\Cars.csv",encoding = "ISO-8859-1")
cars.shape

Out[82]:

(428, 15)

In [83]:

cars.columns.values

Out[83]:

array(['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
       'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
       'Weight', 'Wheelbase', 'Length'], dtype=object)

In [84]:

freq=cars.Cylinders.value_counts()

In [86]:

freq.values

Out[86]:

array([190, 136,  87,   7,   3,   2,   1], dtype=int64)

In [87]:

freq.index

Out[87]:

Float64Index([6.0, 4.0, 8.0, 5.0, 12.0, 10.0, 3.0], dtype='float64')

In [88]:

import matplotlib.pyplot as plt
plt.bar(freq.index,freq.values)

Out[88]:

<Container object of 7 artists>

LAB: Bar Chart:

Dataset: “./Sporting_goods_sales/Sporting_goods_sales.csv”
Create a bar chart summarizing the information on family size.

In [90]:

sports_data=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Sporting_goods_sales\Sporting_goods_sales.csv",encoding = "ISO-8859-1")

In [91]:

sports_data.shape

Out[91]:

(150, 8)

In [92]:

sports_data.columns.values

Out[92]:

array(['Sr_no', 'Avg_family_size', 'Average_Income', 'M_F_Gender_Ratio',
       'Un_emp_rate', 'Under35_Population_pect', 'Number_schools', 'Sales'], dtype=object)

In [93]:

freq=sports_data.Avg_family_size.value_counts()
freq.values

Out[93]:

array([61, 57, 18, 14], dtype=int64)

In [94]:

freq.index

Out[94]:

Int64Index([3, 2, 4, 1], dtype='int64')

In [95]:

import matplotlib.pyplot as plt
plt.bar(freq.index,freq.values)

Out[95]:

<Container object of 4 artists>

Trend Chart:

A trend chart is used for time series datasets.
It shows the value of a variable at successive intervals of time.

Code for Trend Chart:

We use the AirPassengers dataset to plot the trend chart. The plot() function is used for plotting it.

In [96]:

AirPassengers=pd.read_csv(r"C:\Users\ADMIN\Documents\Python Scripts\Py Programming\Session 1\Datasets\Air Travel Data\Air_travel.csv", encoding = "ISO-8859-1")

In [97]:

AirPassengers.head()

Out[97]:

	DATE	AIR
0	JAN49	112
1	FEB49	118
2	MAR49	132
3	APR49	129
4	MAY49	121

In [98]:

AirPassengers.columns.values

Out[98]:

array(['DATE', 'AIR'], dtype=object)
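
The DATE column holds text labels such as JAN49 rather than real dates. If dates are wanted on the time axis, the labels can be converted first. The sketch below is one possible approach, assuming the labels are an English month abbreviation followed by a two-digit year in the 1900s; the `parse_month_label` helper and its `century` parameter are introduced here for illustration.

```python
from datetime import datetime

def parse_month_label(label, century=1900):
    """Turn a label such as 'JAN49' into a date in the assumed century."""
    parsed = datetime.strptime(label, "%b%y")  # month names match case-insensitively
    # %y alone would place '49' in 2049, so the century is set explicitly
    return parsed.replace(year=century + parsed.year % 100)

labels = ["JAN49", "FEB49", "MAR49", "APR49", "MAY49"]
dates = [parse_month_label(label) for label in labels]

print(dates[0].date())  # 1949-01-01
print(dates[4].date())  # 1949-05-01
```

A list of dates built this way can be passed to plt.plot() as the x values, with the AIR column as the y values.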

In [99]:

import matplotlib.pyplot as plt
plt.plot(AirPassengers.AIR)

Out[99]:

[<matplotlib.lines.Line2D at 0x2195bfb8630>]

Conclusion:

In this session we discussed basic data reporting and graphs.
Studying descriptive statistics is essential before we start advanced modeling. It gives us an idea of how each variable is distributed.
We also discussed drawing graphs using some useful packages in Python.

Handout – Basic Statistics, Graphs and Reports in Python

Before starting the lesson, please download the datasets.