You can download the datasets and R code file for this session here.

Taking a random sample from data
Descriptive statistics
Central Tendency
Variance
Quartiles, Percentiles
Box Plots
Graphs

Sampling in R

We use the sample() function to draw a random sample of n rows.

Sample_set <- dataset[sample(1:nrow(dataset), n), ]

Retail_data <- read.csv("~\\Online Retail Sales Data\\Online Retail.csv")
dim(Retail_data)

## [1] 541909      8

Sample size 10000

Sample_set <- Retail_data[sample(1:nrow(Retail_data),10000), ]
dim(Sample_set)

## [1] 10000     8
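By default sample() draws without replacement, so no row is picked twice, and the rows chosen change on every run. The sketch below shows one way to make the draw reproducible. It assumes the built-in mtcars data instead of the retail file, and the seed value 42 is arbitrary.

```r
# Reproducible random sample of rows (mtcars is a stand-in dataset)
set.seed(42)                        # arbitrary seed, chosen only for illustration
n <- 10
rows <- sample(nrow(mtcars), n)     # n distinct row indices, drawn without replacement
mtcars_sample <- mtcars[rows, ]
dim(mtcars_sample)                  # 10 rows, 11 columns
```

Running the block again with the same seed returns the same rows.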

LAB: Sampling in R

1. Import “Census Income Data/Income_data.csv”
1. Create a new dataset by taking a random sample of 5000 records

Income_data<- read.csv("~\\Census Income Data\\Income_data.csv")
income_sample <- Income_data[sample(1:nrow(Income_data),5000),]
dim(income_sample)

## [1] 5000   15

Descriptive Statistics

Basic descriptive statistics give us an idea of the variables and their distributions
They permit the analyst to describe many pieces of data with a few indices
Central tendencies
Mean
Median
Dispersion
Range
Variance
Standard deviation

Central tendencies

Mean

The arithmetic mean
Sum of values/ Count of values
Gives a quick idea on average of a variable

Median

The mean is not a good measure in the presence of outliers
For example, consider the data vector 1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1
10 of the 11 values (about 90%) are less than 2, but the mean of the vector is 2
There is one unusual value in the vector, i.e. 9
Such a value is known as an outlier.
The mean is not the true middle value in the presence of outliers. The mean is strongly affected by outliers.
In such cases we use the median, the true middle value
To find it, sort the data in either ascending or descending order and take the middle value

Vector	Sorted Vector
1.5	0.7
1.7	0.8
1.9	0.8
0.8	1.1
0.8	1.2
1.2	1.4
1.9	1.5
1.4	1.7
9	1.9
0.7	1.9
1.1	9

Mean of the data is 2
Median of the data is 1.4
Even if the outlier were 90 instead of 9, the median would stay the same
The median is a positional measure; it does not depend on how extreme the outliers are
When there are no outliers, the mean and median will be nearly equal
When the mean is far from the median, it hints at the presence of outliers in the data
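The points above can be checked directly in R. This is a small sketch using the example vector from the table; the name x_extreme is introduced here only for illustration.

```r
# Example vector from the table above
x <- c(1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1)
mean(x)      # 2
median(x)    # 1.4

# Replace the outlier 9 with 90: the mean moves, the median does not
x_extreme <- replace(x, x == 9, 90)
mean(x_extreme)     # 103 / 11, about 9.36
median(x_extreme)   # 1.4
```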

Mean and Median on R

Income  <- read.csv("~\\Census Income Data\\Income_data.csv")

mean(Income$capital.gain)

## [1] 1077.649

median(Income$capital.gain)

## [1] 0

The mean is far from the median. It looks like there are outliers, so we need to look at the percentiles and a box plot.

LAB: Mean and Median on R

1. Import Dataset: “./Online Retail Sales Data/Online Retail.csv”
1. What is the mean of “UnitPrice”
1. What is the median of “UnitPrice”
1. Is mean equal to median? Do you suspect the presence of outliers in the data?
1. What is the mean of “Quantity”
1. What is the median of “Quantity”
1. Is mean equal to median? Do you suspect the presence of outliers in the data?

Solutions

1. Import Dataset: “./Online Retail Sales Data/Online Retail.csv”

Online_Retail<-read.csv("~\\Online Retail Sales Data\\Online Retail.csv")

1. What is the mean of “UnitPrice”

mean(Online_Retail$UnitPrice)

## [1] 4.611114

1. What is the median of “UnitPrice”

median(Online_Retail$UnitPrice)

## [1] 2.08

1. Is mean equal to median? Do you suspect the presence of outliers in the data?

The mean (4.61) and the median (2.08) are not equal, but they are fairly close. However, we still cannot conclude that there are no outliers: if outliers on either side of the median balance each other, the mean and median can still be close.

1. What is the mean of “Quantity”

mean(Online_Retail$Quantity)

## [1] 9.55225

1. What is the median of “Quantity”

median(Online_Retail$Quantity)

## [1] 3

1. Is mean equal to median? Do you suspect the presence of outliers in the data?

No. It looks like there are outliers.

Dispersion

Just knowing the central tendency is not enough.
Two variables might have same mean, but they might be very different.
Dispersion gives us an idea about the spread of the data.
Look at these two variables: the profit details of two companies, A and B, for the last 14 quarters, in millions (MM)

Company	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	Q10	Q11	Q12	Q13	Q14	Mean
Company A	43	44	0	25	20	35	-8	13	-10	-8	32	11	-8	21	15
Company B	17	15	12	17	15	18	12	15	12	13	18	18	14	14	15

Though the average profit is 15 in both cases,
Company B has performed more consistently than Company A.
There were even losses for Company A
Measures of dispersion become vital in such cases

Variance and Standard Deviation

Dispersion quantifies the deviation of each point from the mean value.
Variance is the average of the squared distances of each point from the mean
Variance is a fairly good measure of dispersion.
The variance in profit is 352 for Company A and 4.9 for Company B

$\sigma ^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n}$

Variance Calculation

Value	Value-Mean	(Value-Mean)^2
43	28	784
44	29	841
0	-15	225
25	10	100
20	5	25
35	20	400
-8	-23	529
13	-2	4
-10	-25	625
-8	-23	529
32	17	289
11	-4	16
-8	-23	529
21	6	36
Mean: 15.0		Variance: 352

Value	Value-Mean	(Value-Mean)^2
17	2	4
15	0	0
12	-3	9
17	2	4
15	0	0
18	3	9
12	-3	9
15	0	0
12	-3	9
13	-2	4
18	3	9
18	3	9
14	-1	1
14	-1	1
Mean: 15.0		Variance: 4.9
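The two tables can be reproduced step by step in R. The sketch below follows the formula above, which divides by n. Note that R's built-in var() divides by n - 1 instead, so it returns slightly larger values than the tables. The variable names are chosen here for illustration.

```r
company_a <- c(43, 44, 0, 25, 20, 35, -8, 13, -10, -8, 32, 11, -8, 21)
company_b <- c(17, 15, 12, 17, 15, 18, 12, 15, 12, 13, 18, 18, 14, 14)

# Follow the table: deviation from the mean, squared, then averaged over n
dev_a <- company_a - mean(company_a)
pop_var_a <- sum(dev_a^2) / length(company_a)   # 4932 / 14, about 352.3

dev_b <- company_b - mean(company_b)
pop_var_b <- sum(dev_b^2) / length(company_b)   # 68 / 14, about 4.9

# var() divides by n - 1 rather than n
var(company_a)   # 4932 / 13, about 379.4
var(company_b)   # 68 / 13, about 5.2
```

For large datasets the difference between dividing by n and by n - 1 is negligible, but on 14 values it is visible.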

Standard Deviation

Standard deviation is just the square root of the variance
Variance gives a good idea of dispersion, but it is expressed in squared units.
As the formula shows, the units of variance are the square of the units of the original data.
Standard deviation is the dispersion measure that is in the same units as the original data

$\sigma = \sqrt{ \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n}}$
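A minimal sketch of this formula in R, applied to the two companies' profits. The helper pop_sd is hypothetical (it is not part of base R). The built-in sd() uses n - 1 in the denominator, so its result differs slightly.

```r
# pop_sd: hypothetical helper implementing the formula above (divides by n)
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

company_a <- c(43, 44, 0, 25, 20, 35, -8, 13, -10, -8, 32, 11, -8, 21)
company_b <- c(17, 15, 12, 17, 15, 18, 12, 15, 12, 13, 18, 18, 14, 14)

pop_sd(company_a)   # about 18.8, in the same units as the profits
pop_sd(company_b)   # about 2.2
sd(company_a)       # about 19.5, because sd() divides by n - 1
```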

Variance and Standard Deviation in R

Divide the Income data into two sets. USA vs Others

usa_income<-Income[(Income$native.country==" United-States"), ] 
other_income<-Income[!(Income$native.country==" United-States"),]
nrow(usa_income)

## [1] 29170

nrow(other_income)

## [1] 3391

Find the variance of “education.num” in those two sets. Which one has higher variance?
Variance and SD for USA

var(usa_income$education.num)

## [1] 5.735863

sd(usa_income$education.num)

## [1] 2.394966

Variance and SD for Other

var(other_income$education.num)

## [1] 13.56761

sd(other_income$education.num)

## [1] 3.683424

other_income dataset has a higher variance
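Instead of building one subset per group, the same comparison can be made in a single call. The sketch below uses tapply() on the built-in mtcars data, an assumption made so that the example runs without the census file. Here cyl stands in for the grouping column and mpg for the variable of interest.

```r
# Variance and standard deviation of mpg within each cylinder group
group_var <- tapply(mtcars$mpg, mtcars$cyl, var)
group_sd  <- tapply(mtcars$mpg, mtcars$cyl, sd)

group_var                     # one variance per group
names(which.max(group_var))   # label of the group with the highest variance
```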

LAB: Variance and Standard deviation

1. Import Dataset: “./Online Retail Sales Data/Online Retail.csv”
1. What are the variance and s.d. of “UnitPrice”?
1. What are the variance and s.d. of “Quantity”?
1. Which one of these two variables is more consistent?

Online_Retail<-read.csv("~\\Online Retail Sales Data\\Online Retail.csv")

Solutions

1. What are the variance and standard deviation of “UnitPrice”?

var(Online_Retail$UnitPrice)

## [1] 9362.469

sd(Online_Retail$UnitPrice)

## [1] 96.75985

1. What are the variance and s.d. of “Quantity”?

var(Online_Retail$Quantity)

## [1] 47559.39

sd(Online_Retail$Quantity)

## [1] 218.0812

1. Which one of these two variables is more consistent?

UnitPrice

Percentiles

A student took an exam in which 1000 students appeared in total.
He got 68% marks. How well or badly did he perform in the exam?
What will be his rank overall?
What would his rank be if there were only 100 students overall?
For example, with 68 marks he stood at the 90th position. There are 910 students who got less than 68, and only 89 students got more marks than him
He is standing at the 91st percentile.
Instead of stating 68 marks, the 91st percentile gives a better idea of his performance
Percentiles make the data easy to read
$p^{th}$ percentile: p percent of observations are below it, (100 – p)% are above it.
Marks are 40 but the percentile is 80; what does this mean?
An 80th percentile in the CAT exam means
20% are above and 80% are below
Percentiles help us get an idea of outliers.
For example, the highest income value is 400,000 but the 95th percentile is only 20,000. That means 95% of the values are less than 20,000, so the values near 400,000 are clearly outliers
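The student example can be worked through in R. The marks below are hypothetical, built only to match the counts in the text: 910 students below 68, the student himself at 68, and 89 students above.

```r
# Hypothetical marks matching the counts in the example
marks <- c(rep(50, 910), 68, rep(80, 89))
student <- 68

# Rank: one more than the number of students who scored higher
rank_overall <- sum(marks > student) + 1                    # 90

# Percentile: share of students who scored lower
percentile <- 100 * sum(marks < student) / length(marks)    # 91
```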

Quartiles

Percentiles divide the whole population into 100 groups, whereas quartiles divide the population into 4 groups
p = 25: First Quartile or Lower Quartile (LQ)
p = 50: Second Quartile or Median
p = 75: Third Quartile or Upper Quartile (UQ)
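As a quick illustration, the quartiles of the example vector from the Median section can be computed with quantile(). With its default settings quantile() interpolates between neighbouring sorted values, so a quartile need not be one of the data points.

```r
x <- c(1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1)

quartiles <- quantile(x, c(0.25, 0.50, 0.75))
quartiles           # 25% = 0.95, 50% = 1.4, 75% = 1.8

# A high percentile far below the maximum points to an outlier
quantile(x, 0.90)   # 1.9
max(x)              # 9
```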

Percentiles & Quartiles in R

By default, summary() gives the quartiles along with the minimum, mean and maximum

summary(Income$capital.gain)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    1078       0  100000

quantile(Income$capital.gain, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))

##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##     0     0     0     0     0     0     0     0     0     0 99999

quantile(Income$capital.loss, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))

##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##    0    0    0    0    0    0    0    0    0    0 4356

quantile(Income$hours.per.week, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))

##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##    1   24   35   40   40   40   40   40   48   55   99

It looks like some people are working more than 90 hours per week.
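The list of probabilities does not have to be typed out each time. A small sketch, assuming the built-in mtcars data as a stand-in, that builds the same 0%, 10%, ..., 100% grid with seq():

```r
# Probabilities 0, 0.1, ..., 1 generated instead of typed by hand
deciles <- seq(0, 1, by = 0.1)

hp_deciles <- quantile(mtcars$hp, deciles)
names(hp_deciles)   # "0%" "10%" ... "100%"
hp_deciles
```

The 0% and 100% entries are simply the minimum and maximum of the variable.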

LAB: Percentiles & Quartiles in R

1. Import Dataset: “./Bank Marketing/bank_market.csv”
1. Get the summary of the balance variable
1. Do you suspect any outliers in balance ?
1. Get relevant percentiles and see their distribution.
1. Are there really some outliers present?
1. Get the summary of the age variable
1. Do you suspect any outliers in age?
1. Get relevant percentiles and see their distribution.
1. Are there really some outliers present?

Solutions

1. Import Dataset: “./Bank Marketing/bank_market.csv”

bank_market <- read.csv("~\\Bank Marketing\\bank_market.csv")

1. Get the summary of the balance variable

summary(bank_market$balance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -8019      72     448    1362    1428  102100

1. Do you suspect any outliers in balance?

Yes

1. Get relevant percentiles and see their distribution.

quantile(bank_market$balance, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))

##     0%    10%    20%    30%    40%    50%    60%    70%    80%    90% 
##  -8019      0     22    131    272    448    701   1126   1859   3574 
##   100% 
## 102127

1. Are there really some outliers present?

Yes

1. Get the summary of the age variable

summary(bank_market$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   33.00   39.00   40.94   48.00   95.00

1. Do you suspect any outliers in age?

1. Get relevant percentiles and see their distribution.

quantile(bank_market$age, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))

##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##   18   29   32   34   36   39   42   46   51   56   95

1. Are there really some outliers present?

Yes

Box Plots and Outlier Detection

Box plots have a box from the LQ to the UQ, with the median marked.
They portray a five-number graphical summary of the data: Minimum, LQ, Median, UQ, Maximum
They help us get an idea of the data distribution
They help us identify outliers easily
25% of the population is below the first quartile,
75% of the population is below the third quartile
If the box is pushed to one side and some values are far away from the box, it is a clear indication of outliers

A set of values far away from the box gives us a clear indication of outliers.
In this example the minimum is 5, the maximum is 120, and 75% of the values are less than 15
Still, there are some records reaching 120, hence a clear indication of outliers

Sometimes the outliers are so extreme that the box appears to be a horizontal line in the box plot.
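R's box plot functions flag outliers with a rule that the text above does not spell out. By default, any value lying more than 1.5 times the box length beyond either end of the box is treated as an outlier. A small sketch with boxplot.stats() on the example vector from the Median section:

```r
x <- c(1.5, 1.7, 1.9, 0.8, 0.8, 1.2, 1.9, 1.4, 9, 0.7, 1.1)

stats <- boxplot.stats(x)   # default coef = 1.5
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # 9, the value flagged as an outlier
```

Calling boxplot(x) applies the same rule and draws the flagged value as a separate point.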

Box plots and Outlier Detection on R

boxplot(usa_income$capital.gain)

boxplot(usa_income$education.num)

LAB: Box plots and outlier detection

1. Import Dataset: “./Bank Marketing/bank_market.csv”
1. Draw a box plot for balance variable
1. Do you suspect any outliers in balance ?
1. Get relevant percentiles and see their distribution.
1. Draw a box plot for age variable
1. Do you suspect any outliers in age?
1. Get relevant percentiles and see their distribution.

Solutions

1. Import Dataset: “./Bank Marketing/bank_market.csv”

bank_market <- read.csv("~\\Bank Marketing\\bank_market.csv")

1. Draw a box plot for balance variable

boxplot(bank_market$balance)

1. Do you suspect any outliers in balance?

Yes

1. Get relevant percentiles and see their distribution.

quantile(bank_market$balance, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))

##     0%    10%    20%    30%    40%    50%    60%    70%    80%    90% 
##  -8019      0     22    131    272    448    701   1126   1859   3574 
##   100% 
## 102127

1. Draw a box plot for age variable

boxplot(bank_market$age)

1. Do you suspect any outliers in age?

Yes

1. Get relevant percentiles and see their distribution.

quantile(bank_market$age, c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))

##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##   18   29   32   34   36   39   42   46   51   56   95

Creating Graphs

Scatter Plot:
Scatter plots give us an indication on the relation between the two chosen variables.
Example:

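data()                        # list the datasets available in the loaded packages
cars                          # print the built-in cars dataset (speed and stopping distance)
plot(cars$speed, cars$dist)   # scatter plot of speed (x-axis) against dist (y-axis)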
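The visual impression from a scatter plot can be backed by a single number. As an optional check, here is a sketch using cor(), which returns the correlation coefficient between the two variables. Values near 1 or -1 indicate a strong linear relation and values near 0 indicate little or none.

```r
# Correlation between speed and stopping distance in the built-in cars data
r <- cor(cars$speed, cars$dist)
round(r, 2)   # 0.81, a fairly strong positive relation
```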

LAB: Graphs

1. Import Dataset: “./Sporting_goods_sales/Sporting_goods_sales.csv”
1. Draw a scatter plot between Average_Income and Sales. Is there any relation between two variables?
1. Draw a scatter plot between Under35_Population_pect and Sales. Is there any relation between two?

Solutions

1. Import Dataset: “./Sporting_goods_sales/Sporting_goods_sales.csv”

Sporting_goods_sales <- read.csv("~\\Sporting_goods_sales\\Sporting_goods_sales.csv")

1. Draw a scatter plot between Average_Income and Sales. Is there any relation between two variables?

plot(Sporting_goods_sales$Average_Income, Sporting_goods_sales$Sales)

There appears to be no clear relation between Average_Income and Sales.

1. Draw a scatter plot between Under35_Population_pect and Sales. Is there any relation between two?

plot(Sporting_goods_sales$Under35_Population_pect, Sporting_goods_sales$Sales)

There is a strong positive relationship between Under35_Population_pect and Sales.

Bar Chart

bar_table<-table(mtcars$cyl)
barplot(bar_table, main="Cars Data Details",   xlab="Cylinders", ylab="Cars Count")

bar_table_gear<-table(mtcars$gear)
barplot(bar_table_gear, main="Cars Data Details",   xlab="gears", ylab="Cars Count" , col = 4)
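barplot() draws one bar per entry of the table passed to it, so it helps to look at the table first. A short sketch of what table() returns for the cylinder column:

```r
# Frequency table behind the first bar chart
bar_table <- table(mtcars$cyl)

bar_table          # 11 cars with 4 cylinders, 7 with 6, 14 with 8
names(bar_table)   # "4" "6" "8", used as the bar labels
sum(bar_table)     # 32, the number of cars in mtcars
```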

LAB: Bar Chart

1. Import Dataset: “./Sporting_goods_sales/Sporting_goods_sales.csv”
1. Create a bar chart summarizing the information on family size.

Solutions

Sporting_goods_sales <- read.csv("~\\Sporting_goods_sales\\Sporting_goods_sales.csv")

1. Create a bar chart summarizing the information on family size.

table_family.size<-table(Sporting_goods_sales$Avg_family_size)
barplot(table_family.size, main="Family Size Details",   xlab="Avg_family_size", ylab="Count" , col = 4)

Trend Chart

A trend chart is used for time series datasets.

plot(AirPassengers)
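plot() draws a trend line here because AirPassengers is stored as a time series ("ts") object rather than a plain vector. A quick sketch to inspect it before plotting:

```r
class(AirPassengers)       # "ts"
start(AirPassengers)       # 1949 1  (January 1949)
end(AirPassengers)         # 1960 12 (December 1960)
frequency(AirPassengers)   # 12 observations per year, i.e. monthly data
```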

Lab : Trend chart

1. Draw a trend chart for the UKgas data from the datasets package
1. Plot JohnsonJohnson data in a trend chart

Solutions

1. Draw a trend chart for the UKgas data from the datasets package

plot(UKgas)

1. Plot JohnsonJohnson data in a trend chart

plot(JohnsonJohnson)

ggplot for Better Charts

ggplot2 is a good plotting library. It has a lot of options to make our graphs look polished
Below, the cars data from the scatter plot example is plotted again

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.1.3

qplot(speed, dist, data=cars, colour = I("red"), size=I(5))

Scatter plot between Average_Income and Sales

Scatter plot between Under35_Population_pect and Sales

Conclusion

In this session we discussed some basic data reporting and graphs
Studying descriptive statistics is essential before we start advanced modeling. It gives us an idea of the distribution of each variable
We also discussed drawing graphs using some useful packages in R
There are some good visualization packages in R, like ggplot2 and Shiny. You can make use of them for storytelling and visualizations.

Handout – Basic Statistics, Plots and Reporting in R

