Real Magic

R Basics: Descriptive Statistics

You have to start somewhere!

When learning a new hobby or skill, I often want to jump right into the exciting and flashy things even though they are beyond beginner level. I have realized that I sometimes do the same thing when I am sharing a skill that I am passionate about. However, before you can run, you need to learn how to walk. So today, I am finally going to go over a few of the basic spells and ingredients in R.

For these examples, I will use the mtcars data set that comes with R when you install it. Here, I am going to cover most of the basic descriptive statistics. Note that to access a single column (variable) of a data set in R, you use the $ symbol. For example, to get only the values of the mpg variable from the mtcars data set, use the code mtcars$mpg.
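If you want to see the $ symbol in action, here is a small sketch. The variable names (mpg_dollar and so on) are just ones I made up for illustration; the two bracket forms are standard base R alternatives that return the same vector.

```r
# Three equivalent ways to pull the mpg column out of mtcars as a numeric vector
mpg_dollar  <- mtcars$mpg        # the $ shortcut used throughout this post
mpg_bracket <- mtcars[["mpg"]]   # double brackets with the column name
mpg_index   <- mtcars[, "mpg"]   # row/column indexing, all rows

# Check that all three hold exactly the same values
identical(mpg_dollar, mpg_bracket)
identical(mpg_dollar, mpg_index)

# Peek at the first few values
head(mpg_dollar)
```

I stick with $ in the rest of the post because it is the shortest to type.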

Measures of the Middle

First, we will look at some measures of the center of the data. These statistics tell you what a typical value in your data looks like, or which value is seen most often.

Mean and Median

The great thing about R is that most of the functions are named exactly what you want to do. For the measures of central tendency, the functions are mean() and median(). However, R does not have a built-in function for the statistical mode (the existing mode() function reports how an object is stored, not its most frequent value). If you want the mode, you will need to write your own function (Need help? See here!).

mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$mpg)
## [1] 19.2
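Since the mode has no built-in function, here is one minimal way you might write your own. The name stat_mode() is hypothetical (it is not part of base R or any package), and I am assuming you want every tied value back when there is more than one mode.

```r
# Hypothetical helper (not part of base R): returns the most frequent value(s)
stat_mode <- function(x) {
  x <- x[!is.na(x)]                       # ignore missing values
  values <- unique(x)                     # each distinct value once
  counts <- tabulate(match(x, values))    # how often each distinct value occurs
  values[counts == max(counts)]           # keep every value tied for the top count
}

stat_mode(mtcars$gear)   # most common number of gears
stat_mode(mtcars$cyl)    # most common number of cylinders
```

On a continuous variable like mpg, expect several tied values back, which is one reason the mode is more useful for variables with only a few distinct values.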

Frequency Tables

For frequency tables, you will need an extra package called summarytools. This package has quite a few nice functions for showing descriptive statistics. Its freq() function reports frequencies, proportions, and information about missing data.

library(summarytools)
## Warning: package 'summarytools' was built under R version 3.6.3
## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp
## For best results, restart R session and update pander using devtools:: or remotes::install_github('rapporter/pander')
freq(mtcars$gear)
## Frequencies  
## mtcars$gear  
## Type: Numeric  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           3     15     46.88          46.88     46.88          46.88
##           4     12     37.50          84.38     37.50          84.38
##           5      5     15.62         100.00     15.62         100.00
##        <NA>      0                               0.00         100.00
##       Total     32    100.00         100.00    100.00         100.00
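If you would rather not install an extra package, a rough equivalent can be sketched with base R's table() and prop.table(). This is only an approximation of what freq() prints, and the object names gear_counts and gear_pct are my own.

```r
# Counts for each value of gear; useNA = "ifany" adds an NA row only if needed
gear_counts <- table(mtcars$gear, useNA = "ifany")
gear_counts

# Convert the counts to percentages of the total
gear_pct <- round(100 * prop.table(gear_counts), 2)
gear_pct

# Running (cumulative) percentage, like the "Cum." columns above
cumsum(gear_pct)
```

You lose the tidy layout of freq(), but the numbers are the same ones shown in the table above.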

Measures of the Spread

Now we will look at some measures of dispersion within the data. It is often useful to look at how much the data varies, since this may reveal outliers in your data.

Range

The range of the data can be given by a few different functions in R. The range() function gives you the minimum and maximum together as a vector. Alternatively, you can use min() and max() to get the values separately.

range(mtcars$mpg)
## [1] 10.4 33.9
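To get the two ends separately, or the width of the range as a single number, something like this should work. The variable names are just for illustration.

```r
mpg_min <- min(mtcars$mpg)             # smallest value
mpg_max <- max(mtcars$mpg)             # largest value
mpg_width <- diff(range(mtcars$mpg))   # width of the range: max minus min

mpg_min
mpg_max
mpg_width
```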

Quantiles

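Quantiles, quartiles, and percentiles are all found the same way in R. The function used is quantile(), and you will mostly use two of its arguments. The first is the data vector and the second is the quantile you would like computed, given as a probability between 0 and 1 (so .25 is the 25th percentile).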

quantile(mtcars$mpg, .25)
##    25% 
## 15.425
quantile(mtcars$mpg, .87)
##    87% 
## 27.261
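The second argument can also be a vector, so you can ask for several quantiles in one call. Here is a short sketch that gets all three quartiles at once, plus the interquartile range from the built-in IQR() function. The name quartiles is my own.

```r
# Ask for the 25th, 50th, and 75th percentiles in a single call
quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.50, 0.75))
quartiles

# Interquartile range: third quartile minus first quartile
IQR(mtcars$mpg)
```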

Summary Function

R has a few convenience functions that will allow you to compute many of these descriptive statistics all at once. The summary() function is built into R and, for numeric variables, will give the minimum, maximum, first and third quartiles, median, and mean for every variable in your data set.

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
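You do not have to summarize the whole data set. As a small sketch, summary() also works on a single column, and you can pull one statistic out of the result by its label. The name mpg_summary is just for illustration.

```r
# Six-number summary for a single variable
mpg_summary <- summary(mtcars$mpg)
mpg_summary

# Pull one statistic out of the result by its label
mpg_summary[["Median"]]
```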

Variance

The final measures of dispersion are standard deviation and variance. These function names are abbreviated in R as sd() and var(), respectively. Note that in R these measures are always computed as if the data is a sample, meaning the sum of squared deviations is divided by n - 1 rather than n.

sd(mtcars$mpg)
## [1] 6.026948
var(mtcars$mpg)
## [1] 36.3241
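If you ever need the population versions instead (dividing by n), base R has no dedicated function that I know of, but you can rescale the sample values yourself. A minimal sketch, with variable names of my own choosing:

```r
x <- mtcars$mpg
n <- length(x)

sample_var <- var(x)                    # sample variance: divides by n - 1
pop_var <- sample_var * (n - 1) / n     # rescale so the divisor is n instead
pop_sd <- sqrt(pop_var)                 # population standard deviation

pop_var
pop_sd

# Sanity check against the definition: mean squared deviation from the mean
all.equal(pop_var, mean((x - mean(x))^2))
```

With 32 cars the difference is small, but it grows as the sample gets smaller.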

There are obviously more basics to go over in R: plots for data visualization, importing your data, and test statistics for inference (for two groups or more). Join me next week for some basics about plots in R.