First course.

To begin with, download a dataset and load it into R.

Let's take a dataset from datamarket.com, for example compensation per hour in the Manufacturing Sector.

To load the downloaded CSV into R, call the ‘read.csv’ function as follows.

> compensation = read.csv("compensationperhour.csv",header=TRUE,
+ sep=";",quote="\"",dec=".",fill=TRUE)
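To try these arguments without the downloaded file, here is a minimal sketch that writes a tiny semicolon-separated file to a temporary location and reads it back with the same call. The two rows and the quarter labels are made up for illustration and are not taken from the datamarket.com dataset.

```r
# Hypothetical two-row file: semicolon-separated, quoted text, with a header row.
tmp <- tempfile(fileext = ".csv")
writeLines(c('"Quarter";"Manufacturing Sector"',
             '"Q1";49.004',
             '"Q2";49.5'), tmp)

# Same arguments as in the call above.
demo <- read.csv(tmp, header = TRUE, sep = ";", quote = "\"", dec = ".", fill = TRUE)

names(demo)   # read.csv turns the space in the header into a dot
str(demo)     # shows the type of each column
unlink(tmp)   # remove the temporary file
```

By default read.csv converts headers into syntactically valid R names, which is why a header containing a space shows up as a name with a dot.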

You can check whether the loaded data is an R data.frame.

> is.data.frame(compensation)
[1] TRUE

To view the columns of the dataset:

> ls(compensation)
[1] "Manufacturing.Sector" "Quarter"

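Given a data.frame, the ls function lists its column names in alphabetical order. The names function returns them in the order they appear in the data.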
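The difference between the two ways of listing columns is easy to see on a small data frame. This is only a sketch; the data frame `df` and its values are made up.

```r
# Made-up data frame whose columns are deliberately not in alphabetical order.
df <- data.frame(Quarter = c("Q1", "Q2"), Manufacturing.Sector = c(49.0, 49.5))

ls(df)      # column names, sorted alphabetically
names(df)   # column names, in the order they appear in the data frame
```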

You can use the rm function to remove an object, such as an entire dataset, from the workspace.

> rm(compensation)

Do not run the above command. Let's not delete what we just loaded.
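If you want to see what rm does without losing the dataset, you can try it on a throwaway object instead. A small sketch, where `scratch` is a hypothetical name:

```r
scratch <- c(1, 2, 3)   # hypothetical throwaway object
exists("scratch")       # TRUE: the object is in the workspace
rm(scratch)
exists("scratch")       # FALSE: it is gone and would have to be recreated
```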

Now let's explore measures of the center and spread of the above data.

The ‘$’ symbol below can be used to select the column we are interested in. There are many ways to do that; let's follow this one for now.

> mean(compensation$Manufacturing.Sector)
[1] 79.90603
> sd(compensation$Manufacturing.Sector)
[1] 21.8031

The mean and standard deviation are the most used measures of center and spread because of their relationship with the normal distribution, but both are sensitive to long tails and outliers.

> median(compensation$Manufacturing.Sector)
[1] 74.4225

The **median** is the numerical value separating the higher half of a sample, a population, or a probability distribution from the lower half. It is a resistant measure of center, so it is much less affected by the long tails and outliers that distort the mean and standard deviation.
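A quick way to see this resistance is to change one value into an extreme one and compare the results. The numbers below are made up for illustration.

```r
x     <- c(50, 60, 70, 80, 90)    # made-up values
x_out <- c(50, 60, 70, 80, 900)   # same values, with the largest replaced by an extreme point

mean(x);     median(x)       # 70 and 70
mean(x_out); median(x_out)   # 232 and 70: only the mean moves
sd(x);       sd(x_out)       # the standard deviation is inflated as well
```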

The idea of a **quantile** generalizes the median. The *p* quantile (also known as the 100*p*th percentile) is the point in the data where 100*p*% of the data is less and 100(1-*p*)% is larger. If there are *n* data points, then the *p* quantile occurs at position 1+(*n*-1)*p* in the sorted data, with weighted averaging if this falls between integers.

For example, for the above data:

> quantile(compensation$Manufacturing.Sector,c(.25,.75))
     25%      75%
62.07575 99.73125
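The position rule can be checked by hand on a small vector. This is a sketch with made-up, already sorted values; the names `pos`, `lo` and `by_hand` are only for illustration, and the hand calculation does not cover the edge case *p* = 1.

```r
x <- c(10, 20, 30, 40, 50, 60)   # made-up, already sorted, n = 6
p <- 0.25

pos <- 1 + (length(x) - 1) * p   # 2.25: a quarter of the way from x[2] to x[3]
lo  <- floor(pos)
by_hand <- x[lo] + (pos - lo) * (x[lo + 1] - x[lo])   # 20 + 0.25 * 10 = 22.5

by_hand
quantile(x, p, names = FALSE)    # R's default method gives the same value
```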

The hinges are Tukey's version of the quartiles. The **lower hinge** is the median of the lower half of the sorted data; as computed by R's fivenum, the overall median is counted in both halves when the number of data points is odd. The **upper hinge** is similarly defined.

> fivenum(compensation$Manufacturing.Sector)
[1]  49.0040  61.9970  74.4225  99.9350 119.7230


The fivenum function returns Tukey’s five number summary (**minimum**, **lower hinge**, **median**, **upper hinge**, **maximum**) for the input data.
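The hinges from fivenum are close to, but not always equal to, the quartiles from quantile, which is why the two outputs above differ slightly. A small sketch on made-up values shows the same effect:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)   # made-up values, n = 8

h <- fivenum(x)                               # 1.0 2.5 4.5 6.5 8.0
q <- quantile(x, c(.25, .75), names = FALSE)  # 2.75 6.25

h[c(2, 4)]   # lower and upper hinge
q            # first and third quartile from quantile's default method
```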

Another way of viewing many details at once is the summary function:

> summary(compensation$Manufacturing.Sector)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  49.00   62.08   74.42   79.91   99.73  119.70

More resistant measures of center and spread.

R has the functions IQR() for the **interquartile range** and mad() for the **median absolute deviation**. R is case-sensitive, so the latter must be written in lower case.

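`mad(x)` computes `1.4826 * median(abs(x - median(x)))`. The factor 1.4826 is the default value of its `constant` argument and makes the result comparable to the standard deviation for normally distributed data.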

> mad(compensation$Manufacturing.Sector)
[1] 27.16642
> IQR(compensation$Manufacturing.Sector)
[1] 37.6555
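Both measures can be reproduced by hand, which also shows the scaling constant that mad applies by default. The values below are made up, and `raw_mad` and `iqr_by_hand` are illustrative names.

```r
x <- c(2, 4, 4, 5, 7, 9, 30)   # made-up values with one outlier

raw_mad <- median(abs(x - median(x)))   # unscaled median absolute deviation: 2
raw_mad * 1.4826                        # what mad(x) returns by default
mad(x)
mad(x, constant = 1)                    # same as raw_mad

iqr_by_hand <- diff(quantile(x, c(.25, .75), names = FALSE))   # 8 - 4 = 4
IQR(x)                                                         # same value
```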

**Now plotting the data and analyzing it visually.**

R provides a very simple function to plot the data points.

> plot(compensation)

To draw a histogram of univariate data, a simple R command would be:

> hist(compensation$Manufacturing.Sector)

You can also specify the breaks for the histogram as follows. A single number is treated only as a suggestion for the number of bins.

> hist(compensation$Manufacturing.Sector,breaks=6)

Or give the break points explicitly:

> hist(compensation$Manufacturing.Sector,breaks=c(49,65,67,70,80,85,90,92,
+ max(compensation$Manufacturing.Sector)))
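To check which bins hist actually used, you can ask it to compute the histogram without drawing it. This is a sketch on randomly generated, made-up values, not on the compensation data.

```r
set.seed(1)
x <- runif(100, min = 50, max = 120)   # made-up values

h <- hist(x, breaks = 6, plot = FALSE)   # plot = FALSE: compute the bins only
h$breaks   # the break points actually used, chosen as "pretty" values
h$counts   # number of observations in each bin

# An explicit vector of breaks is used as given, but it must span all the data.
h2 <- hist(x, breaks = c(50, 70, 100, 120), plot = FALSE)
h2$breaks
h2$counts
```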

**Boxplots**

A **box plot** is a convenient way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum).

Boxplots display differences between populations without making any assumptions of the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers. Boxplots can be drawn either horizontally or vertically.

For example:

> boxplot(compensation$Manufacturing.Sector,horizontal=TRUE)
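The numbers behind a boxplot can be inspected with boxplot.stats. Note that with the default settings the whiskers only reach the most extreme points within 1.5 times the box length of the box, and anything beyond that is drawn as an individual outlier. A sketch on made-up values:

```r
x <- c(50, 55, 60, 62, 65, 70, 75, 80, 200)   # made-up values, one far from the rest

s <- boxplot.stats(x)
s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker: 50 60 65 75 80
s$out     # points drawn individually as outliers: 200
```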