First course.

Let's take a dataset from datamarket.com, for example compensation per hour in the manufacturing sector.

```> compensation = read.csv("compensationperhour.csv",header=TRUE,
+ sep=";",quote="\"",dec=".",fill=TRUE)```
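As a rough sketch of what these arguments do, the snippet below writes a tiny semicolon-separated file and reads it back. The file contents and the name `toy` are hypothetical and only mimic the two columns of the real dataset.

```r
# Hypothetical two-row file written to a temporary path, for illustration only.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Quarter;Manufacturing.Sector",
             "1990-Q1;49.004",
             "1990-Q2;50.25"), tmp)

# header=TRUE: first line holds column names; sep=";": fields split on semicolons;
# dec=".": decimal point; fill=TRUE: pad short rows with blanks/NA.
toy <- read.csv(tmp, header = TRUE, sep = ";", quote = "\"", dec = ".", fill = TRUE)
str(toy)
unlink(tmp)  # remove the temporary file
```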

You can check whether the loaded object is an R data.frame.

```> is.data.frame(compensation)
[1] TRUE```

To view the columns of the dataset:

```> ls(compensation)
[1] "Manufacturing.Sector" "Quarter"```

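When given a data.frame, the ls function lists its column names in alphabetical order. The names function is the more common choice and keeps the original column order.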
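A minimal sketch of the difference, using a hypothetical two-column data frame called `toy` rather than the real dataset:

```r
# Hypothetical data frame with the columns in a non-alphabetical order.
toy <- data.frame(Quarter = c("Q1", "Q2"),
                  Manufacturing.Sector = c(49.0, 50.3))

col_order  <- names(toy)  # column names in the order they are stored
col_sorted <- ls(toy)     # the same names, sorted alphabetically
col_order
col_sorted
```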

You can use the rm() function to remove an object, such as this data frame, from the workspace.

`>rm(compensation)`

Do not run the command above. Let's not delete what we just loaded.

Now let's explore measures of the center and spread of the above data.

The `$` operator used below selects a column by name. There are many ways to do that, but let's follow this one for now.

```> mean(compensation$Manufacturing.Sector)
[1] 79.90603

> sd(compensation$Manufacturing.Sector)
[1] 21.8031```

The mean and standard deviation are the most used measures of center and spread because of their relationship with the normal distribution, but they are sensitive to long tails and outliers.

```> median(compensation$Manufacturing.Sector)
[1] 74.4225```

The median is the value separating the higher half of a sample, a population, or a probability distribution from the lower half. It is a resistant measure: unlike the mean and standard deviation, it is barely affected by long tails or outliers.
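To see what "resistant" means in practice, here is a small sketch with made-up numbers (not the compensation data): one extreme value is added and the three statistics are compared before and after.

```r
# Hypothetical sample, then the same sample with one extreme value appended.
x     <- c(10, 12, 11, 13, 12)
x_out <- c(x, 100)

# The mean and standard deviation move a lot...
c(mean(x), mean(x_out))
c(sd(x), sd(x_out))

# ...while the median stays where it was.
c(median(x), median(x_out))
```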

The idea of a quantile generalizes the median. The p quantile (also known as the 100p-th percentile) is the point in the data where 100p% of the values are smaller and 100(1-p)% are larger. If there are n data points, the p quantile occurs at position 1+(n-1)p in the sorted data, with weighted averaging of the two neighbouring values if this position falls between integers.
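The position rule can be checked by hand. The sketch below uses a small made-up vector and assumes R's default quantile method (type 7), which is the one that follows the 1+(n-1)p rule.

```r
# Hypothetical, already sorted data.
x <- c(2, 4, 6, 7, 9, 10, 13, 15)
p <- 0.25
n <- length(x)

pos <- 1 + (n - 1) * p   # position of the p quantile, here 2.75
lo  <- floor(pos)        # neighbour below the position
hi  <- ceiling(pos)      # neighbour above the position
w   <- pos - lo          # weight given to the upper neighbour

by_hand  <- (1 - w) * x[lo] + w * x[hi]
built_in <- quantile(x, p, names = FALSE)
c(by_hand, built_in)
```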

For example, for the above data:

```> quantile(compensation$Manufacturing.Sector,c(.25,.75))
25%      75%
62.07575 99.73125 ```

The hinges are Tukey's version of the quartiles. The lower hinge is the median of all the data to the left of the median; when the number of data points is odd, R's fivenum counts the median itself as part of both halves. The upper hinge is defined similarly.

```> fivenum(compensation$Manufacturing.Sector)
[1]  49.0040  61.9970  74.4225  99.9350 119.7230 ```

fivenum returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum) for the input data.
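As a sketch of how the hinges come about, the summary can be rebuilt by hand on a small made-up vector. Note that R's fivenum counts the median in both halves when the number of points is odd, which is what `half` below reproduces; the variable names are hypothetical.

```r
# Hypothetical data with an odd number of points.
x <- c(1, 3, 4, 6, 8, 9, 12)
s <- sort(x)
n <- length(s)

# Size of each half; for odd n this includes the median itself.
half <- ceiling(n / 2)

lower_hinge <- median(s[1:half])
upper_hinge <- median(s[(n - half + 1):n])

by_hand <- c(min(s), lower_hinge, median(s), upper_hinge, max(s))
by_hand
fivenum(x)
```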

Another way of viewing many details at once is the summary function:

```> summary(compensation$Manufacturing.Sector)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
49.00   62.08   74.42   79.91   99.73  119.70 ```

More resistant measures of spread.

R has the function IQR() for the interquartile range and mad() for the median absolute deviation.

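By default `mad(x)` computes `1.4826 * median(abs(x - median(x)))`; the constant makes the result comparable to the standard deviation for normally distributed data.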
```> mad(compensation$Manufacturing.Sector)
[1] 27.16642
> IQR(compensation$Manufacturing.Sector)
[1] 37.6555 ```
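Both numbers can be reproduced by hand. The sketch below uses a small made-up vector; it shows that mad() applies a scaling constant of 1.4826 by default and that IQR() is the difference between the 75% and 25% quantiles.

```r
# Hypothetical data.
x <- c(2, 4, 6, 7, 9, 10, 13, 15)

# Median absolute deviation, without and with R's default scaling constant.
raw_mad    <- median(abs(x - median(x)))
scaled_mad <- 1.4826 * raw_mad
c(raw_mad, mad(x, constant = 1))  # unscaled versions agree
c(scaled_mad, mad(x))             # scaled versions agree

# Interquartile range as the distance between the quartiles.
iqr_by_hand <- diff(quantile(x, c(0.25, 0.75), names = FALSE))
c(iqr_by_hand, IQR(x))
```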

Now let's plot the data and analyze it visually.

R provides a very simple function to plot the data points.

`>plot(compensation)`

To draw a histogram of univariate data, a simple R command is:

```> hist(compensation$Manufacturing.Sector)

```

You can also specify the breaks for the histogram. A single number is taken as a suggested number of cells, which R may adjust to get round break points:

```> hist(compensation$Manufacturing.Sector,breaks=6)
```

Or give the break points explicitly:

```> hist(compensation$Manufacturing.Sector,breaks=c(49,65,67,70,80,85,90,92,
+ max(compensation$Manufacturing.Sector)))```
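To inspect how the break points bin the data without drawing anything, hist() can be called with plot = FALSE. This is a sketch on made-up values; the breaks are hypothetical and only need to cover the whole range of the data.

```r
# Hypothetical data and explicit, unequally spaced break points.
x <- c(49, 52, 58, 61, 66, 70, 74, 81, 88, 95, 103, 120)
h <- hist(x, breaks = c(49, 65, 80, 90, max(x)), plot = FALSE)

h$breaks    # the cell boundaries actually used
h$counts    # number of observations in each cell
h$equidist  # FALSE here, so a drawn histogram would show densities, not counts
```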

Boxplots

A box plot is a convenient way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum).

Boxplots display differences between populations without making any assumptions of the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers. Boxplots can be drawn either horizontally or vertically.

For example:

```> boxplot(compensation$Manufacturing.Sector, horizontal=TRUE)

```
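The numbers behind a boxplot can be inspected with boxplot.stats(). The sketch below uses made-up data with one deliberately extreme value; by default, points further than 1.5 times the hinge spread from the box are reported as outliers.

```r
# Hypothetical data; 200 is far above the rest.
x <- c(49, 52, 58, 61, 66, 70, 74, 81, 88, 95, 103, 200)
b <- boxplot.stats(x)

b$stats  # lower whisker end, lower hinge, median, upper hinge, upper whisker end
b$out    # points drawn individually as outliers
```

Here the upper whisker stops at the largest value that is not an outlier, so it differs from the sample maximum.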