
103.3.1 Basic Statistics, Graphs and Reports

Dive into Statistics with basics

This is the third session of the R programming series. The first session was an introduction to R: how to write R code, how to run it, and how to view the output. The second session was all about data handling: importing data from various sources into R, merging data sets, creating new variables, filtering data, combining several R data sets into one, and exporting data. Session 3 covers sampling, basic statistics, quartiles, percentiles, box plots, and graphs.

Sampling

Often we need only a part of the data, that is, a sample or subset, rather than the entire data set. For example, consider sales or purchase-order data covering the last 20 years. We might not be interested in all 20 years of data; we might need only the last 2 years for the analysis. How do we take a sample? We take the data set and use the sample() function.

Syntax: sampleset<-dataset[sample(1:nrow(dataset),n),]
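
As a minimal sketch of this pattern, the snippet below applies it to mtcars, a small data set that ships with base R. The choice of mtcars and of n = 10 is an assumption made only so the example runs without any external file.

```r
# mtcars ships with base R (32 rows, 11 columns); here it stands in for `dataset`.
dataset <- mtcars
n <- 10

# sample() draws n distinct row numbers from 1..nrow(dataset);
# the empty slot after the comma keeps every column.
sampleset <- dataset[sample(1:nrow(dataset), n), ]

dim(sampleset)   # 10 rows, 11 columns
```

Because the rows are drawn at random, running the snippet again will usually return a different set of 10 rows.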

Let us consider the Online Retail data set.

>Retail_data<-read.csv("C:\\Amrita\\Datavedi\\Online Retail Sales Data\\Online Retail.csv")
>dim(Retail_data)

So there are 541909 rows and 8 columns. We do not want all 541909 rows; we want a sample of only 10000 rows from the Online Retail data set.

>Sample_set<-Retail_data[sample(1:nrow(Retail_data),10000), ]

This is the syntax for drawing a random sample of 10000 observations from the whole data set. Sample_set is the new object to which the sampled rows are assigned. Retail_data is the original data from which the 10000 rows are extracted. Inside the square brackets we give two parts, one for the rows and one for the columns. The row part is “sample(1:nrow(Retail_data),10000)”, which picks 10000 row numbers at random from all the rows, i.e., from 1 to nrow(Retail_data). The column part is left blank because we need all the columns.

Instead of “sample(1:nrow(Retail_data),10000)”, we can also write “sample(1:50000,10000)”, which considers only the first 50000 rows of the original data set when drawing the 10000 rows. The range must contain at least as many rows as the sample size, because sample() draws without replacement by default. If the nrow() call seems confusing, we can simply write “sample(1:541909,10000)”, since we know there are 541909 rows in total. The 10000 rows we get from sampling are picked at random from the data set.
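
The effect of the range passed to sample() can be sketched on a small made-up data frame. The names toy_data, whole, first_part, too_many and with_repeats are hypothetical and used only for illustration; the sizes are scaled down so the example runs quickly.

```r
# Hypothetical stand-in for Retail_data: 1000 rows, 2 columns.
toy_data <- data.frame(id = 1:1000, amount = round(runif(1000, 1, 100), 2))

# Draw 100 rows from the whole data set.
whole <- toy_data[sample(1:nrow(toy_data), 100), ]

# Draw 100 rows from the first 500 rows only.
first_part <- toy_data[sample(1:500, 100), ]
max(first_part$id)   # never exceeds 500

# Asking for more rows than the range holds fails, because sample()
# draws without replacement by default; tryCatch() captures the message.
too_many <- tryCatch(sample(1:50, 100), error = function(e) conditionMessage(e))

# With replace = TRUE the draw succeeds, but row numbers can repeat.
with_repeats <- sample(1:50, 100, replace = TRUE)
```

Sampling with replace = TRUE changes what the sample means, since the same row can appear more than once, so it is shown here only to explain the error, not as a recommended default.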

Sampling in R

Let us consider the census income data that is there in the dataset folder.

>Income_data<-read.csv("C:\\Amrita\\Datavedi\\Census Income Data\\Income_data.csv")

The exercise is to take a sample of 5000 records from Income_data, which is a fairly large data set.

>dim(Income_data) 
#32561 15

The Income_data consists of 32561 rows and 15 columns.

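>Income_sample<-Income_data[sample(1:nrow(Income_data),5000),]

The above command stores the 5000 records in the object Income_sample. A name other than sample is used so that the object is not confused with the sample() function. If we check the dimensions with dim(Income_sample), we get 5000 rows and 15 columns. This is how sampling is done.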
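
Because the rows are chosen at random, each run returns a different sample. The sketch below shows one way to check the dimensions and to make a draw repeatable with set.seed(), which is an addition not covered in this session. The data frame income_like is a hypothetical stand-in with the same shape as Income_data, and the seed value 42 is arbitrary.

```r
# Hypothetical stand-in with the same shape as Income_data (32561 x 15).
income_like <- as.data.frame(matrix(rnorm(32561 * 15), nrow = 32561, ncol = 15))

set.seed(42)   # arbitrary seed, chosen only for illustration
rows_a <- sample(1:nrow(income_like), 5000)

set.seed(42)   # resetting the same seed reproduces the same draw
rows_b <- sample(1:nrow(income_like), 5000)

income_sample <- income_like[rows_a, ]
dim(income_sample)          # 5000 rows, 15 columns
identical(rows_a, rows_b)   # TRUE: same seed, same rows
```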

In the next section, we will study descriptive statistics.

20th June 2017

DV Analytics
