R Basics: Descriptive Statistics
You have to start somewhere!
Friday, Nov 06, 2020 By Hope Snyder.

When learning a new hobby or skill, I often want to jump right into the exciting and flashy things even though they are beyond beginner level. I have realized that I sometimes do the same thing when I'm sharing a skill that I am passionate about. However, before you can run, you need to learn how to walk. So today, I am finally going to go over a few of the basic spells and ingredients in R.
For these examples, I will use the mtcars data set that comes with R when you install it, and I am going to cover most of the basic descriptive statistics. Note that to access a single column (variable) of a data set in R, you use the $ symbol. For example, to get only the values of the mpg variable from the mtcars data set, use the code mtcars$mpg.
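To make the $ access concrete, here is a minimal sketch. The variable name mpg is just my own choice for illustration; mtcars itself ships with R, so nothing needs to be installed.

```r
# mtcars comes with base R (the datasets package), so no library() call is needed.
mpg <- mtcars$mpg        # pull out the mpg column as a plain numeric vector

length(mpg)              # 32, one value per car
head(mpg, 3)             # 21.0 21.0 22.8

# Double brackets with the column name give the same vector.
identical(mpg, mtcars[["mpg"]])   # TRUE
```

Everything in the rest of this post works on a vector like this one.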
Measures of the Middle
First, we will look at some measures of the center of the data. These statistics describe a typical value in your data, such as the average or the value that is seen most often.
Mean and Median
The great thing about R is that most functions are named for exactly what they do. For the mean and median, the functions are mean() and median(). However, R has no built-in function for the statistical mode (the base function mode() reports how an object is stored, not its most common value). If you want the mode, you will need to write your own function (Need help? See here!).
mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$mpg)
## [1] 19.2
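Since the mode needs a homemade function, here is one minimal sketch of how it could look. The name stat_mode is hypothetical (it is not part of base R), and this version returns every tied value if there is more than one mode.

```r
# stat_mode is a hypothetical helper name, not a base R function.
stat_mode <- function(x) {
  counts <- table(x)                                  # how often each distinct value occurs
  top <- names(counts)[as.vector(counts) == max(counts)]
  # table() stores the values as character names, so convert back for numeric input
  if (is.numeric(x)) as.numeric(top) else top
}

stat_mode(mtcars$gear)   # 3 (15 of the 32 cars have 3 gears)
stat_mode(mtcars$cyl)    # 8 (14 of the 32 cars have 8 cylinders)
```

Note that table() drops missing values by default, so NA is never reported as the mode here.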
Frequency Tables
For frequency tables, I will use an extra package called summarytools. This package has quite a few nice functions for showing descriptive statistics. Its freq() function reports counts, percentages, cumulative percentages, and information about missing data.
library(summarytools)
## Warning: package 'summarytools' was built under R version 3.6.3
## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp
## For best results, restart R session and update pander using devtools:: or remotes::install_github('rapporter/pander')
freq(mtcars$gear)
## Frequencies
## mtcars$gear
## Type: Numeric
##
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           3     15     46.88          46.88     46.88          46.88
##           4     12     37.50          84.38     37.50          84.38
##           5      5     15.62         100.00     15.62         100.00
##        <NA>      0                               0.00         100.00
##       Total     32    100.00         100.00    100.00         100.00
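If you would rather not install a package, base R can produce the same counts and percentages with table() and prop.table(). This is only a rough sketch of an alternative, not what freq() does internally, and the variable names are my own.

```r
# Counts for each distinct value; useNA = "ifany" adds an NA row only when needed.
gear_counts <- table(mtcars$gear, useNA = "ifany")
gear_counts                  # 15 cars with 3 gears, 12 with 4, 5 with 5

# prop.table() turns counts into proportions; multiply by 100 for percentages.
gear_pct <- 100 * prop.table(gear_counts)
gear_pct

# Running total of the percentages, like the cumulative columns of freq().
cumsum(gear_pct)
```

The output is plainer than the summarytools table, but the numbers match.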
Measures of the Spread
Now we will look at some measures of dispersion within the data. It is often useful to see how much the data varies, and doing so may reveal outliers.
Range
The range of the data can be given by a few different functions in R. The range() function gives you the minimum and maximum together as a vector. Alternatively, you can use min() and max() to get the values separately.
range(mtcars$mpg)
## [1] 10.4 33.9
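Here is a short sketch of the separate min() and max() calls. It also shows one way to get the width of the range as a single number, using diff(), which subtracts the first element from the second.

```r
min(mtcars$mpg)          # 10.4
max(mtcars$mpg)          # 33.9

# range() returns c(min, max), so diff() gives max - min.
mpg_range <- range(mtcars$mpg)
diff(mpg_range)          # 23.5
```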
Quantiles
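Quantiles, quartiles, and percentiles are all found the same way in R. The function is quantile(), and you will mostly use two of its arguments. The first is the data vector, and the second is the quantile you want, given as a proportion between 0 and 1 (so .25 is the 25th percentile, or first quartile).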
quantile(mtcars$mpg, .25)
## 25%
## 15.425
quantile(mtcars$mpg, .87)
## 87%
## 27.261
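The second argument can also be a vector, so you can ask for several quantiles in one call. Here is a small sketch with variable names of my own, along with the built-in IQR() function for the interquartile range.

```r
# Ask for the three quartiles at once by passing a vector of proportions.
mpg_quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))
mpg_quartiles            # 25%: 15.425, 50%: 19.2, 75%: 22.8

# IQR() is the distance between the third and first quartiles.
IQR(mtcars$mpg)          # 7.375
```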
Summary Function
R has a few convenience functions that compute many of these descriptive statistics at once. The summary() function is built into R and gives the minimum, first quartile, median, mean, third quartile, and maximum for every numeric variable in your data set.
summary(mtcars)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
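summary() also works on a single column, and you can pull individual values out of the result by name. The sketch below uses variable names of my own, and the last line uses sapply() as one possible way to add a statistic that summary() leaves out.

```r
# summary() on one column returns a small named vector.
mpg_summary <- summary(mtcars$mpg)
names(mpg_summary)         # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
mpg_summary[["Median"]]    # 19.2

# summary() does not report the standard deviation, so apply sd() to a few columns.
col_sds <- sapply(mtcars[c("mpg", "hp", "wt")], sd)
col_sds
```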
Variance
The final measures of dispersion are standard deviation and variance. The function names are abbreviated in R: sd() and var(), respectively. Note that both functions always treat the data as a sample, dividing by n - 1 rather than n.
sd(mtcars$mpg)
## [1] 6.026948
var(mtcars$mpg)
## [1] 36.3241
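If your data is the whole population rather than a sample, base R has no separate function for that, but you can rescale the sample variance yourself. This is a minimal sketch; the variable names are my own.

```r
x <- mtcars$mpg
n <- length(x)

sample_var <- var(x)                    # sum of squared deviations divided by n - 1
pop_var    <- sample_var * (n - 1) / n  # rescaled so the divisor is n
pop_sd     <- sqrt(pop_var)

# Check against the definition: the mean squared deviation from the mean.
all.equal(pop_var, mean((x - mean(x))^2))   # TRUE

# The standard deviation is always the square root of the variance.
all.equal(sd(x), sqrt(var(x)))              # TRUE
```

With 32 cars the two versions are close, but the gap grows as the sample gets smaller.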
There are obviously more basics to go over in R: plots for data visualization, importing your data, and test statistics for inference (for two groups or more). Join me next week for some basics about plots in R.