R

These are my notes from the freeCodeCamp course on R.

Set-up

  • Download R
  • Download RStudio (its Environment pane lists the saved variables)

Basic Keyboard Shortcuts

  • Ctrl + Enter # runs the line where the cursor is
  • To get help with something in R, just type a question mark: ?plot
  • For contributed packages (p_help() is part of pacman):
    - # Get info on package
      p_help(psych)           # Opens package PDF in browser
      p_help(psych, web = F)  # Opens help in R Viewer
    

Packages

There are two types of packages:

  • Base, which are installed with R but are not all loaded by default
  • Contributed, which need to be downloaded, installed and loaded separately. They are available at:
    • CRAN (Comprehensive R Archive Network)
    • Crantastic! (redirects to CRAN)
    • GitHub

Install and load packages

A helpful package manager is pacman. Install it first and then use it for the installation of all other libraries:

# pacman is useful for contributed packages
install.packages("pacman")

# load the package pacman
library(pacman)  

# Or, by using "pacman::p_load" you can call the p_load function without loading pacman first.
# These are generally useful packages.
pacman::p_load(pacman, dplyr, GGally, ggplot2, ggthemes, 
  ggvis, httr, lubridate, plotly, rio, rmarkdown, shiny, 
  stringr, tidyr)  

library(datasets)  # Load/unload base packages manually

# CLEAN UP #################################################

# Clear packages
p_unload(dplyr, tidyr, stringr) # Clear specific packages
p_unload(all)  # Easier: clears all add-ons
detach("package:datasets", unload = TRUE)  # For base
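
To see the base/contributed distinction on your own machine, here is a minimal sketch that uses only base R functions (no pacman needed). The helper name is_attached is made up for this example:

```R
# Packages with priority "base" ship with every R installation
base_pkgs <- rownames(installed.packages(priority = "base"))
"datasets" %in% base_pkgs   # datasets is a base package

# search() lists what is currently attached to the session
is_attached <- function(pkg) paste0("package:", pkg) %in% search()  # hypothetical helper
is_attached("stats")        # attached by default
is_attached("splines")      # base package, but normally not attached until library(splines)
```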

Plot

plot() chooses a suitable plot type for your data automatically, based on the class of what you pass in. To try this, first load the dataset:

library(datasets)  # Load/unload base packages manually

Now, to plot the data, you can either specify the one or two variables you want to show:

plot(iris$Species)  # Categorical variable
plot(iris$Petal.Length)  # Quantitative variable
plot(iris$Species, iris$Petal.Width)  # Cat x quant
plot(iris$Petal.Length, iris$Petal.Width)  # Quant pair

Or you can just plot the whole dataset, which creates a matrix of scatterplots for all pairs of variables and can be useful for an overview:

plot(iris)  # Entire data frame
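
The automatic choice comes from the class of the object you hand to plot(). A small sketch to check the classes behind the calls above (iris ships with the datasets package):

```R
library(datasets)

class(iris$Species)        # "factor"     -> bar chart of counts
class(iris$Petal.Length)   # "numeric"    -> index plot / scatterplot
class(iris)                # "data.frame" -> scatterplot matrix
```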

Of course it is also possible to adjust the visuals of your graph:

# Plot with options
plot(iris$Petal.Length, iris$Petal.Width,
  col = "#cc0000",  # Hex code for datalab.cc red
  pch = 19,         # pch stands for point character --> Use solid circles for points
  main = "Iris: Petal Length vs. Petal Width", # headline of the graph
  xlab = "Petal Length",
  ylab = "Petal Width")

Basic graphics - bar chart

Open the dataset. To create a bar chart you first have to change the format of the data. For instance, if you have a list of students with their ages and you want a plot that shows how many of them are 21 years old, you first need to turn your data into counts. In this case you do that by putting it into a table:

barplot(mtcars$cyl)             # Runs, but draws one bar per car instead of counts

# Need a table with frequencies for each category
cylinders <- table(mtcars$cyl)  # Create table
barplot(cylinders)              # Bar chart
plot(cylinders)                 # Default X-Y plot (lines)
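
The student example from the paragraph above, sketched with made-up ages (the vector ages is hypothetical, not course data):

```R
# Hypothetical ages of ten students
ages <- c(21, 22, 21, 23, 21, 22, 20, 21, 23, 22)

age_counts <- table(ages)   # frequency of each age
age_counts["21"]            # how many students are 21

barplot(age_counts,
  xlab = "Age",
  ylab = "Number of students")
```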

Basic graphics - Histogram

Good for looking at shape, gaps, outliers or symmetry.

# BASIC HISTOGRAMS #########################################

hist(iris$Sepal.Length)
hist(iris$Sepal.Width)
hist(iris$Petal.Length)
hist(iris$Petal.Width)

# HISTOGRAM BY GROUP #######################################

# Put graphs in 3 rows and 1 column, i.e. stack 3 plots above one another.
# This is helpful if you want to plot the 3 different species of iris.
par(mfrow = c(3, 1))

# Histograms for each species using options
hist(iris$Petal.Width[iris$Species == "setosa"],
  xlim = c(0, 3),
  breaks = 9,
  main = "Petal Width for Setosa",
  xlab = "",
  col = "red")
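
The block above draws only the setosa panel. A sketch of the same idea for all three species, using a loop (my own shortcut, not from the course) and restoring the plot layout afterwards:

```R
library(datasets)

old_par <- par(mfrow = c(3, 1))   # 3 rows, 1 column; keep the old settings

for (sp in levels(iris$Species)) {
  hist(iris$Petal.Width[iris$Species == sp],
    xlim = c(0, 3),               # same x-axis for all panels
    breaks = 9,
    main = paste("Petal Width for", sp),
    xlab = "",
    col = "red")
}

par(old_par)                      # restore the previous layout
```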

Basic graphics - Scatterplot

Visualize the association between two quantitative variables. Check whether the association is linear, whether the spread is even, whether there are outliers, and how strongly the two variables are correlated.

plot(mtcars$wt, mtcars$mpg) # plot weight against miles per gallon
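
To back up the visual check with numbers, here is a sketch that adds the correlation and a fitted straight line. Using lm() here is my choice; the course plot stops at the scatterplot:

```R
library(datasets)

plot(mtcars$wt, mtcars$mpg,
  xlab = "Weight (1000 lbs)",
  ylab = "Miles per gallon")

# Straight line fitted by least squares, drawn on top of the points
abline(lm(mpg ~ wt, data = mtcars), col = "red")

# Pearson correlation: negative, heavier cars get fewer miles per gallon
r <- cor(mtcars$wt, mtcars$mpg)
r
```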

Overlaying plots

Overlaying plots increases information density. For example you can overlay your distribution with a normal distribution using the mean and standard deviation of your own data.

# Histogram on the density scale, so that the curves below share its y-axis
hist(lynx, freq = FALSE)

# Add a normal distribution
curve(dnorm(x, mean = mean(lynx), sd = sd(lynx)),
      col = "thistle4",  # Color of curve
      lwd = 2,           # Line width (2 = twice the default)
      add = TRUE)        # Superimpose on previous graph

# Add two kernel density estimators
lines(density(lynx), col = "blue", lwd = 2)
lines(density(lynx, adjust = 3), col = "purple", lwd = 2)

# Add a rug plot
rug(lynx, lwd = 2, col = "gray")

Above you can also see how to add a kernel density estimator or a rug plot.

Basic statistics

summary()

You can get a summary that includes statistics like the mean, quartiles, min and max for quantitative variables, and counts per category for categorical variables.

summary(iris$Species)       # Categorical variable
summary(iris$Sepal.Length)  # Quantitative variable
summary(iris)   

describe()

This is a function from the contributed psych package. It gives you: n, mean, SD, median, 10% trimmed mean, MAD (median absolute deviation), min/max, range, skewness, kurtosis, SE (standard error).

# Load psych first, e.g. with pacman::p_load(psych)
# For quantitative variables only.
describe(iris$Sepal.Length)  # One quantitative variable
describe(iris)               # Entire data frame
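
Several of these statistics are also available in base R. A sketch for a single variable, handy when psych is not installed (I have not compared every value against describe(), so treat it as an approximation of that output):

```R
library(datasets)

x <- iris$Sepal.Length

stats <- c(
  n       = length(x),
  mean    = mean(x),
  sd      = sd(x),
  median  = median(x),
  trimmed = mean(x, trim = 0.1),     # 10% trimmed mean
  mad     = mad(x),                  # median absolute deviation
  min     = min(x),
  max     = max(x),
  se      = sd(x) / sqrt(length(x))  # standard error of the mean
)
round(stats, 2)
```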

Selecting cases

For example, make a histogram only for the species versicolor: just select versicolor. You can also use a pair of selectors simultaneously:

# Short Virginica petals only
hist(iris$Petal.Length[iris$Species == "virginica" & 
  iris$Petal.Length < 5.5],
  main = "Petal Length: Short Virginica")

You can also create a subset of your data, for instance if you plan to focus on only one species.

# Format: data[rows, columns]
# Leave rows or columns blank to select all
i.setosa <- iris[iris$Species == "setosa", ]
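
A quick sketch to check what the subset contains and how to select rows and columns at the same time (the name i.setosa.petal is made up here):

```R
library(datasets)

i.setosa <- iris[iris$Species == "setosa", ]
nrow(i.setosa)              # iris has 50 rows per species

# Rows and columns in one step: setosa rows, petal columns only
i.setosa.petal <- iris[iris$Species == "setosa",
                       c("Petal.Length", "Petal.Width")]
head(i.setosa.petal)
summary(i.setosa.petal$Petal.Length)
```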

Data formats

  • Data types
    • numeric (integer, single and double)
    • character
    • logical
    • complex
    • raw
  • Data structures
    • Vector: one or more values in a 1D array, all of the same data type.
    • Matrix / array: a matrix is two-dimensional, with columns of the same length and the same data class; columns are referred to by index number. An array is the same with three or more dimensions.
    • Data frame: vectors of multiple types, all with the same length –> the closest analogue to a spreadsheet
    • List: the most flexible data format; any class, length or structure; can include other lists

Numeric, character, Logical, Vector

Here are some examples:

# Numeric
n1 <- 15  # Double precision by default
n2 <- 1.5

# Character
c1 <- "c"
c2 <- "a string of text"

# Logical
l1 <- TRUE
l2 <- F

# DATA STRUCTURES ##########################################

## Vector ##################################################
v1 <- c(1, 2, 3, 4, 5)
v2 <- c("a", "b", "c")
v3 <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
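
To check what R actually stored, class() and typeof() are useful. A short sketch (variable names follow the examples above; n3 is added here):

```R
n1 <- 15     # stored as double, even though it looks like an integer
n3 <- 15L    # the L suffix makes it a true integer
typeof(n1)   # "double"
typeof(n3)   # "integer"

v2 <- c("a", "b", "c")
v3 <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
class(v2)    # "character"
class(v3)    # "logical"
length(v3)   # 5
```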

Data can also be saved as a matrix. A matrix is filled column by column by default, so the output of this:

m1 <- matrix(c(T, T, F, F, T, F), nrow = 2)

looks like this:

     [,1]  [,2]  [,3]
[1,] TRUE FALSE  TRUE
[2,] TRUE FALSE FALSE

If you want it to be filled row by row instead, you need byrow:

m2 <- matrix(c("a", "b", 
               "c", "d"), 
               nrow = 2,
               byrow = T)

Which produces the following output:

     [,1] [,2]
[1,] "a"  "b" 
[2,] "c"  "d" 
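
Individual cells, rows and columns are accessed with [row, column]. A sketch comparing the two fill orders on the same numbers (m_col and m_row are names chosen for this example):

```R
m_col <- matrix(1:6, nrow = 2)                 # filled column by column
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row

dim(m_col)      # 2 rows, 3 columns
m_col[1, 2]     # 3
m_row[1, 2]     # 2
m_row[2, ]      # whole second row: 4 5 6
m_col[, 3]      # whole third column: 5 6
```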

Array

To produce an array, give it the data (in this case 1 to 24) and the dimensions: here 4 rows, 3 columns and 2 tables (because it is 3-dimensional):

a1 <- array(1:24, c(4, 3, 2))

The array looks like this:

, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24
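
Elements of an array are addressed with one index per dimension, [row, column, table]. A short sketch on the same array:

```R
a1 <- array(1:24, c(4, 3, 2))

dim(a1)        # 4 3 2
a1[2, 3, 1]    # row 2, column 3, first table: 10
a1[4, 1, 2]    # row 4, column 1, second table: 16
a1[, , 2]      # the whole second table as a 4 x 3 matrix
```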

Data Frame

A data frame is similar to a spreadsheet, where you can combine columns of different data types:

# Can combine vectors of the same length

vNumeric   <- c(1, 2, 3)
vCharacter <- c("a", "b", "c")
vLogical   <- c(T, F, T)

# Makes a data frame with three different data types.
# CAVE: use data.frame() directly. cbind() first builds a matrix, which
# coerces every value to character before as.data.frame() sees it.
df <- data.frame(vNumeric, vCharacter, vLogical)
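
It is worth checking the column types after building a data frame, because cbind() on mixed vectors first creates a matrix, and a matrix can hold only one type. A sketch comparing both routes (df_direct and df_cbind are names chosen here):

```R
vNumeric   <- c(1, 2, 3)
vCharacter <- c("a", "b", "c")
vLogical   <- c(TRUE, FALSE, TRUE)

# Route 1: data.frame() keeps each vector's own type
df_direct <- data.frame(vNumeric, vCharacter, vLogical)
sapply(df_direct, class)

# Route 2: cbind() coerces everything to character first
typeof(cbind(vNumeric, vCharacter, vLogical))   # "character"
df_cbind <- as.data.frame(cbind(vNumeric, vCharacter, vLogical))
sapply(df_cbind, class)
```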

Lists

Here is an example of creating a list:

o1 <- c(1, 2, 3)
o2 <- c("a", "b", "c", "d")
o3 <- c(T, F, T, T, F)

list1 <- list(o1, o2, o3)

list2 <- list(o1, o2, o3, list1)  # Lists within lists!
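
List elements are pulled out with double brackets, and naming the elements makes that easier to read. A sketch (list3 and its element names are invented for this example):

```R
o1 <- c(1, 2, 3)
o2 <- c("a", "b", "c", "d")
o3 <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

list1 <- list(o1, o2, o3)
list1[[2]]        # the whole character vector
list1[[2]][3]     # its third element: "c"

# Named elements can be reached with $
list3 <- list(numbers = o1, letters = o2, flags = o3)
list3$numbers
length(list3)     # 3 elements
```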

Coercing

R automatically coerces your data to the least restrictive data type. This, for example, will be coerced to character:

(coerce1 <- c(1, "b", TRUE))

If you want to coerce your data into a specific data format yourself you can do the following:

(coerce3 <- as.integer(5))
(coerce5 <- as.numeric(c("1", "2", "3")))
 
## Coerce matrix to data frame #############################

(coerce6 <- matrix(1:9, nrow= 3))
(coerce7 <- as.data.frame(matrix(1:9, nrow= 3)))
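
Coercion does not always succeed, so it helps to look at the result. A sketch of what happens in the mixed vector above and when text cannot be turned into a number (coerce8 is a name added here):

```R
coerce1 <- c(1, "b", TRUE)
class(coerce1)              # "character"
coerce1                     # "1" "b" "TRUE"

# Text that is not a number becomes NA (R also raises a warning)
coerce8 <- suppressWarnings(as.numeric(c("1", "b", "3")))
coerce8                     # 1 NA 3
is.na(coerce8)

# Logical values turn into 1/0 when coerced to numeric
as.numeric(c(TRUE, FALSE))  # 1 0
```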

Factors

A factor is an attribute of a vector that specifies the possible values and their order. It is basically values with value labels. This is easier to understand if you just look at the example:

y   <- 1:9   # nine observations; x4 below is recycled to match
x4  <- c(1:3)
df4 <- cbind.data.frame(x4, y)
df4$x4 <- factor(df4$x4,
  levels = c(1, 2, 3),
  labels = c("macOS", "Windows", "Linux"))

The output is as follows:

> str(df4)
'data.frame':	9 obs. of  2 variables:
 $ x4: Factor w/ 3 levels "macOS","Windows",..: 1 2 3 1 2 3 1 2 3
 $ y : int  1 2 3 4 5 6 7 8 9
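
A smaller, standalone sketch of the same idea, showing how the labels relate to the underlying codes (os is a made-up vector):

```R
os <- factor(c(1, 3, 2, 1),
  levels = c(1, 2, 3),
  labels = c("macOS", "Windows", "Linux"))

os               # prints the labels, not the numbers
levels(os)       # "macOS" "Windows" "Linux"
as.integer(os)   # underlying codes: 1 3 2 1
table(os)        # counts per level
```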

Entering Data

For a small amount of data that can be typed in, there are many methods.

The assignment operator consists of an arrow pointing to the left: <-. The RStudio shortcut for it is Alt + -.

The ways data can be entered are varied but easy. A quick search usually turns up a function for almost any kind of input:

# Assigns numbers 0 through 10 to x1
x1 <- 0:10

# Descending order
x2 <- 10:0

# SEQ ######################################################
# Ascending values (duplicates 1:10)
(x3 <- seq(10))

# Specify change in values
(x4 <- seq(30, 0, by = -3))

# ENTER MULTIPLE VALUES WITH C #############################
# c = concatenate (or combine or collect)
x5 <- c(5, 4, 1, 6, 7, 2, 2, 3, 2, 8)

# SCAN #####################################################
x6 <- scan()  # After running this command, go to console
# Hit enter after each number
# Hit enter twice to stop

# REP ###################################################
x7 <- rep(TRUE, 5)

# Repeats set
x8 <- rep(c(TRUE, FALSE), 5)

# Repeats items in set
x9 <- rep(c(TRUE, FALSE), each = 5)
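
seq() and rep() take a few more arguments that are handy when typing data in. A sketch (x10 to x12 continue the numbering above and are not from the course):

```R
# Fixed number of evenly spaced values instead of a step size
x10 <- seq(0, 1, length.out = 5)           # 0.00 0.25 0.50 0.75 1.00

# Repeat each item a different number of times
x11 <- rep(c("a", "b"), times = c(2, 3))   # "a" "a" "b" "b" "b"

# Combine each and times: each item twice, whole set three times
x12 <- rep(c(TRUE, FALSE), each = 2, times = 3)
length(x12)                                # 12
```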

Importing data

  • csv (comma-separated values)
  • txt
  • xlsx (Excel) -> Do not use this format if avoidable.
  • JSON (JavaScript Object Notation)

rio (R import/export) is a package that combines all of R’s import functions into one simple utility with consistent syntax and functionality. You need to load rio first.

library(datasets)  # Load built-in datasets

# SUMMARIZE DATA ###########################################

head(iris)         # Show the first six lines of iris data
summary(iris)      # Summary statistics for iris data
plot(iris)         # Scatterplot matrix for iris data

# CLEAN UP #################################################

# Clear packages
detach("package:datasets", unload = TRUE)  # For base

# Clear plots
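dev.off()  # But only if there IS a plot

# IMPORT WITH RIO ##########################################

# Load rio first, e.g. with library(rio) or pacman::p_load(rio)
# CSV, this will also be sortable in the viewer!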
rio_csv <- import("~/Desktop/mbb.csv")

# TXT
rio_txt <- import("~/Desktop/mbb.txt")

# Excel XLSX
rio_xlsx <- import("~/Desktop/mbb.xlsx")

Without rio you need to use R's own import functions yourself, for example read.csv() or read.table().
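
A sketch of the base R route that runs without any files on the Desktop: it writes a small made-up table to a temporary CSV file and reads it back (scores, tmp_csv and the values are invented for this example):

```R
# Small hypothetical data set
scores <- data.frame(name = c("Ann", "Ben", "Cy"),
                     points = c(12, 7, 9))

tmp_csv <- tempfile(fileext = ".csv")
write.csv(scores, tmp_csv, row.names = FALSE)

# read.csv is a base R way to read CSV files
from_csv <- read.csv(tmp_csv)
str(from_csv)

unlink(tmp_csv)   # remove the temporary file
```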

Modeling Data - Hierarchical clustering

Like with like. This depends on criteria for similarity or distance.

  • Hierarchical vs. set k
  • measures of distance
  • divisive or agglomerative

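The course names Euclidean distance, hierarchical clustering and the divisive method as the most common choices. Note that hclust(), used below, is agglomerative: it starts with single cases and merges them step by step.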

# LOAD DATA ################################################

?mtcars
head(mtcars)
cars <- mtcars[, c(1:4, 6:7, 9:11)]  # Select variables
head(cars)

# COMPUTE AND PLOT CLUSTERS ################################
# Save hierarchical clustering to "hc". This code uses
# pipes from dplyr, so dplyr has to be loaded.
hc <- cars   %>%  # Get cars data; %>% means pipe
      dist   %>%  # Compute distance/dissimilarity matrix
      hclust      # Compute hierarchical clusters
  
plot(hc)          # Plot dendrogram

# ADD BOXES TO PLOT for better view ###############################

rect.hclust(hc, k = 2, border = "gray")
rect.hclust(hc, k = 3, border = "blue")
rect.hclust(hc, k = 4, border = "green4")
rect.hclust(hc, k = 5, border = "darkred")
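
To get the cluster membership as data rather than as boxes on the plot, cutree() cuts the tree into k groups. A sketch without pipes, so it runs with base R only (the name groups is chosen here):

```R
library(datasets)

cars <- mtcars[, c(1:4, 6:7, 9:11)]   # same variables as above
hc   <- hclust(dist(cars))            # defaults: Euclidean distance, complete linkage

groups <- cutree(hc, k = 3)           # cluster number for each car
table(groups)                         # cluster sizes
head(groups)
```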

PCA (Principal component analysis)

Extract the principal components. Steps:

  • Two variables, draw regression line
  • Perpendicular distance
  • Collapse by sliding each point down to the regression line
  • Rotate so it is flat

=> From 2D to 1D while the most important information is kept.

This is complicated and, in my opinion, best understood and studied when needed.

This is an example that shows the essence of PCA:

# LOAD DATA ################################################
head(mtcars)
cars <- mtcars[, c(1:4, 6:7, 9:11)]  # Select variables
head(cars)

# COMPUTE PCA ##############################################
# For entire data frame ####################################
pc <- prcomp(cars,
        center = TRUE,  # Centers means to 0 (optional)
        scale = TRUE)   # Sets unit variance (helpful)
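
Once pc exists, a few standard accessors show what the PCA found. A sketch on the same data (var_explained is a name chosen here; scale. is the full spelling of the argument that scale abbreviates above):

```R
library(datasets)

cars <- mtcars[, c(1:4, 6:7, 9:11)]
pc   <- prcomp(cars, center = TRUE, scale. = TRUE)

summary(pc)          # standard deviation and proportion of variance per component
head(pc$x[, 1:2])    # each car's score on the first two components

# Proportion of variance explained, computed by hand from the standard deviations
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
round(var_explained, 3)
```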

Regression

Out of many variables, one variable. Use many variables to predict scores on one variable.
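
As a sketch of what that looks like in code, here is a linear regression with lm(). The choice of mtcars and of the predictors wt and hp is mine, not taken from the course:

```R
library(datasets)

# Predict miles per gallon from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)    # coefficients, R-squared, p-values
coef(fit)       # intercept plus one slope per predictor

# Predicted mpg for a hypothetical car: 3000 lbs, 110 hp
predict(fit, newdata = data.frame(wt = 3, hp = 110))
```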