Data Science with R

Introduction

The R programming language has become very popular and is heavily used in machine learning and data science projects. I have been learning data science through an online course on Coursera, and here are some of the R programming concepts that I explored.

I assume that the reader has basic knowledge of programming, since I do not explain ground-level concepts such as variables and functions. Basic knowledge of any programming language should be enough to follow the material. I used command-line R to run the scripts below, but RStudio works just as well.

Setting up the workspace

getwd() returns the current working directory.
Syntax for defining a function:

foo <- function() {
	x <- rnorm(100)	# rnorm draws 100 random numbers from a normal distribution
	mean(x)
}

To pass an argument, declare it inside the parentheses of function().
To load an R script into the console, use source("script_name.R").
To view the objects defined in the workspace (including loaded functions), use ls().
A call such as foo(4:10) passes the whole vector 4:10 as a single argument; it does not call foo once per value. (foo must be defined with a parameter for this to work.)
Print something with print(x).
x <- 1:20 creates an integer vector.
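
As a minimal sketch of the notes above, here is a function that takes an argument. The name sample_mean and its parameter n are made up for illustration; they are not from the course.

```r
# Hypothetical helper: mean of n draws from a normal distribution
sample_mean <- function(n) {
	x <- rnorm(n)
	mean(x)
}

set.seed(1)                # make the random draws reproducible
m <- sample_mean(100)      # a single number

# A vector passed to a function arrives as ONE argument
total <- sum(4:10)         # sum() receives the whole vector

# To call a function once per value, loop over the values explicitly
squares <- sapply(4:10, function(i) i^2)
```

After sourcing a script that defines sample_mean, ls() should list it along with m, total and squares.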

Data Types

Atomic classes
- Character
- Numeric (double precision real numbers)
- Integer
- Complex
- Logical
The most basic object is a vector, which contains objects of the same class. The exception is a list, which can hold objects of different classes.
Empty vectors can be created with vector().
By default numbers are numeric. An integer can be declared explicitly with the L suffix, e.g. 1L.
Two special values: Inf (infinity) and NaN (not a number).
Objects can have attributes such as names, dimensions, length and class. They can be accessed with attributes().
c() creates vectors by concatenating its arguments. vector() can also be used.

x <- c(0.5,1)
x <- vector("numeric",length=10)

Objects of different classes can be mixed in a vector. Coercion will occur so that everything is of the same class.
Get the class of an object with class(x). Convert to numeric with as.numeric(x); the same pattern works for as.character(), as.logical(), etc.
Converting a character string that is not a number to numeric gives NA (with a warning).
x <- list(1,"a", TRUE) creates a list.
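
A small sketch of coercion, contrasted with a list, which keeps each element's class. The variable names are arbitrary.

```r
x <- c(1.7, "a")     # character: the number becomes the string "1.7"
y <- c(TRUE, 2)      # numeric: TRUE becomes 1
z <- c("a", TRUE)    # character: TRUE becomes "TRUE"

ok  <- as.numeric(c("1", "2.5"))             # strings that look like numbers convert fine
bad <- suppressWarnings(as.numeric("abc"))   # NA, normally with a warning

l <- list(1, "a", TRUE)                      # a list does not coerce its elements
classes <- sapply(l, class)
```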
Matrices are vectors with a dimension attribute. m <- matrix(nrow = 2, ncol = 2)
dim(m) returns the number of rows and columns, in that order.
Matrices are constructed column-wise: m <- matrix(1:6, nrow = 2, ncol = 3)
Creating matrices from vectors:

m <- 1:10
dim(m) <- c(2,5)

cbind(a, b) binds vectors a and b column-wise; rbind(a, b) binds them row-wise.
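
To make the column-wise filling and the two bind functions concrete, here is a short sketch (a and b are arbitrary example vectors):

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
# m is filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

a <- 1:3
b <- 10:12
cb <- cbind(a, b)   # 3 rows, 2 columns (a and b become columns)
rb <- rbind(a, b)   # 2 rows, 3 columns (a and b become rows)
```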
Factors represent categorical data and can be ordered or unordered. They are treated specially by modelling functions like glm() and lm().
Factors can be thought of as integer vectors where each integer has a label (level).
A NaN value is also an NA, but the converse is not true.
is.na() and is.nan() test for them. NaN marks an undefined mathematical result (e.g. 0/0), while NA marks any other missing value.
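
A brief sketch of both ideas: a factor as labelled integers, and the NA/NaN distinction. The data values are invented for the example.

```r
f <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))
lv    <- levels(f)       # "yes" "no" (order set explicitly by levels =)
codes <- as.integer(f)   # 1 2 1, the underlying integer codes

v <- c(1, NA, NaN, 4)
na_flags  <- is.na(v)    # FALSE  TRUE  TRUE FALSE  (NaN counts as NA)
nan_flags <- is.nan(v)   # FALSE FALSE  TRUE FALSE  (NA is not NaN)
```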
Data frames are used to store tabular data.
- Each element is a column; all columns have the same length, but they can hold different types of objects
- row.names is an attribute
- read.table() and read.csv() create a data frame
- can be converted to a matrix with data.matrix(). Coercion may occur
- ncol() and nrow() work on data frames
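
A minimal data frame sketch with made-up columns, showing the points in the list above:

```r
df <- data.frame(
	id   = 1:3,
	name = c("a", "b", "c"),
	ok   = c(TRUE, FALSE, TRUE),
	stringsAsFactors = FALSE   # keep name as character on every R version
)

n_rows <- nrow(df)         # 3
n_cols <- ncol(df)         # 3
rn     <- row.names(df)    # "1" "2" "3"

# Coercion: every column is forced to a number
m <- data.matrix(df)
```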
R objects can also have a names() attribute.

x <- 1:3
names(x) <- c("foo","bar","nor")

# Also list elements can have names
x <- list(a=1,b=2,c=3)

# Matrices can also have names for rows and columns, set with dimnames()
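
For the matrix case, a short sketch (the row and column labels are arbitrary):

```r
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))   # row names first, then column names

value <- m["a", "d"]   # row "a", column "d"
```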

swirl (http://swirlstats.com) lets you learn R in R :p (Debug tip: on Debian/Ubuntu, run sudo apt-get install libcurl4-openssl-dev libssl-dev before installing swirl.)
- Download a course in swirl and get going.

Reading and Writing data into R

read.table and read.csv read tabular data from a file into a data frame (the opposites are write.table and write.csv).
readLines reads text as a character vector (the opposite is writeLines).
source reads R code (the opposite is dump).
Some other functions are dget, dput, load, save, unserialize.

data <- read.table("foo.txt")
# Lines starting with # are skipped (the default comment.char is "#")

The dataset must fit in RAM; otherwise the read needs to be optimized. Setting nrows helps with memory usage; see the help page (?read.table) for hints. Specifying colClasses generally speeds up the read, and the classes can be worked out from the first few rows with sapply().
Calculate the size of your dataset before reading it. Rough estimate: 8 bytes per numeric value.
dput preserves the metadata: it writes R code that regenerates the object, which dget can read back.
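
The colClasses trick and the dput/dget round trip might look like the sketch below. It writes a tiny invented dataset to temporary files so that it runs anywhere; the column names are hypothetical.

```r
# Create a small CSV to read back (stand-in for a large file)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, score = c(1.5, 2, 3.5, 4, 5)),
          path, row.names = FALSE)

# Read a few rows, work out the column classes, then reuse them
initial <- read.csv(path, nrows = 2)
classes <- sapply(initial, class)
full    <- read.csv(path, colClasses = classes)

# dput writes R code describing the object; dget evaluates it again
dput_path <- tempfile(fileext = ".R")
dput(full, file = dput_path)
restored <- dget(dput_path)
```

For a rough size estimate, a data frame of 1,500,000 rows and 120 numeric columns needs about 1,500,000 x 120 x 8 bytes, i.e. roughly 1.4 GB.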

Interface to Outside world

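file, gzfile, bzfile and url open connections to a file, a gzip-compressed file, a bzip2-compressed file and a web page respectively.
Generally connections are handled implicitly.
Reading from a URL: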

con <- url("http://eunotech.com")
x <- readLines(con)
head(x) # Prints the first few lines
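
The same pattern works offline with file and gzfile. This is a sketch using temporary files, so the paths are placeholders rather than anything from the course:

```r
# Plain text file connection
path <- tempfile(fileext = ".txt")
writeLines(c("first", "second", "third"), path)

con <- file(path, "r")              # open explicitly for reading
first_two <- readLines(con, n = 2)  # read only two lines
close(con)                          # always close what you open

# gzip-compressed connection
gz_path <- tempfile(fileext = ".gz")
gz <- gzfile(gz_path, "w")
writeLines("compressed line", gz)
close(gz)

gz_in <- gzfile(gz_path, "r")
back <- readLines(gz_in)
close(gz_in)
```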

Some more basics

Subsetting comes in two types: numeric indexing and logical indexing.
'[' returns an object of the same class as the original and can select more than one element. One exception: selecting a single row, column or element of a matrix returns a vector instead of a matrix; this behaviour can be turned off with x[1, 2, drop = FALSE].
'[[' extracts a single element of a list or data frame; the result may not be of the same class as the original.
'$' extracts elements by name (from a list or data frame).
R is 1-indexed (not zero-indexed).
Logical subsetting can be done with x[x > "a"]. u <- x > "a" returns a logical vector.
x$foo returns the element named foo from list x.
x[c(1,3)] returns the 1st and 3rd elements of x.
Partial matching of names is allowed with x[["a", exact = FALSE]]. $ does partial matching by default.
In x[[name]], name can be a literal index or a variable holding an index or a name. x$name works only if x contains an element literally named 'name'.
x[[c(1,3)]] and x[[1]][[3]] mean the same thing: nested access.
In matrix subsetting, x[1, ] is valid and means the 1st row.
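
The subsetting operators side by side, as a sketch on an invented list and matrix:

```r
x <- list(foo = 1:4, bar = 0.6, baz = list(10, 20, 30))

one_list <- x[1]            # '[' keeps the class: a list holding foo
one_elem <- x[[1]]          # '[[' extracts the element: the vector 1:4

name <- "foo"
by_var  <- x[[name]]        # computed name works with '[['
by_name <- x$name           # NULL: no element is literally called "name"

nested <- x[[c(3, 2)]]      # same as x[[3]][[2]]

exact_miss <- x[["f"]]                 # NULL: '[[' matches exactly by default
partial    <- x[["f", exact = FALSE]]  # partial match finds foo

m <- matrix(1:6, nrow = 2, ncol = 3)
row1   <- m[1, ]                  # first row, returned as a vector
cell   <- m[1, 2]                 # a plain number
cell_m <- m[1, 2, drop = FALSE]   # stays a 1 x 1 matrix
```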
To remove missing values

bad <- is.na(x)
x[!bad] # Returns a vector with only the non-missing elements

good <- complete.cases(x,y) # Logical vector that is TRUE where both x and y are present (not NA); also works on data frames
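
A runnable sketch of both approaches with invented data:

```r
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")

good <- complete.cases(x, y)   # TRUE where neither x nor y is NA
x_clean <- x[good]
y_clean <- y[good]

# On a data frame, complete.cases flags rows without any NA
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
df_clean <- df[complete.cases(df), ]
```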

x + y adds vectors element-wise. Similarly, x > 2 returns a logical vector, and so on.
For matrix multiplication, x %*% y must be used, since x * y gives element-wise multiplication.
Use vectorized operations whenever possible; they make the code simpler and faster.
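
A quick sketch of the difference between the element-wise and matrix products, with arbitrary values:

```r
x <- 1:4
y <- 6:9
s    <- x + y    # 7 9 11 13
mask <- x > 2    # FALSE FALSE TRUE TRUE

a <- matrix(1:4, 2, 2)          # columns are (1, 2) and (3, 4)
b <- matrix(rep(10, 4), 2, 2)   # all tens

elementwise <- a * b     # each cell multiplied by 10
product     <- a %*% b   # true matrix product
```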

Control Structures

Generally used in R script, and not in interactive sessions

if condition

if (<condition1>) {
	# do something
} else if (<condition2>) {
	# do something else
} else {
	# fallback
}

The if structure can also be used as an expression that returns a value:

y <- if(x>3){
	10
} else {
	0
}

The else branch is optional.
For loops can be defined as follows:

for(i in 1:10){
	print(i)
}

seq_along() generates an integer sequence from 1 to the length of the vector passed to it, which is handy for looping over indices.
If the body is a single statement, you can omit the curly braces and write the loop on one line.
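
For instance, a sketch with an arbitrary character vector:

```r
x <- c("a", "b", "c", "d")
idx <- seq_along(x)            # 1 2 3 4

# Single-statement body, so no curly braces are needed
for (i in seq_along(x)) print(x[i])

# Looping over the elements directly works too
for (letter in x) print(letter)

# seq_along is safe on empty vectors, unlike 1:length(x)
empty <- character(0)
n_iter_safe   <- length(seq_along(empty))   # 0 iterations
n_iter_unsafe <- length(1:length(empty))    # 2 iterations: 1 and 0
```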
A while loop can be used as follows:

count <- 0
while (count < 10) {	# further conditions can be combined with &&
	print(count)
	count <- count + 1
}

repeat starts an infinite loop. It can be exited using break.

repeat {
	x <- rnorm(1)	# some work inside the loop
	if (x > 1) {
		# Break on some condition
		break
	}
}

next skips the rest of the current iteration (like continue in other languages).
return signals that a function should exit and return the given value.
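
A sketch that combines next, break and return. The function first_negative is a made-up example, not something from the course:

```r
# Collect even numbers up to 10, skipping odd ones
evens <- integer(0)
for (i in 1:100) {
	if (i %% 2 == 1) next   # skip odd numbers
	if (i > 10) break       # stop the loop entirely
	evens <- c(evens, i)
}

# Hypothetical helper: return exits the function as soon as a match is found
first_negative <- function(v) {
	for (value in v) {
		if (value < 0) return(value)
	}
	NULL                    # nothing found
}

hit  <- first_negative(c(3, -2, -7))
miss <- first_negative(c(1, 2))
```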

Functions

Write functions in files rather than in the interactive shell.

add2 <- function(x,y) {
	x + y
}
# can be called with add2(3,5)

Here is how to take a vector as a parameter:

above10 <- function(x, n=10){ # Default values can be used
	use <- x > n # Returns a logical vector
	x[use] # The value of the last expression is returned automatically
}

Have a look at this function

column_mean <- function(y, removeNA = TRUE){
	nc <- ncol(y)
	means <- numeric(nc) # Initialization of vector (default zeros)
	for (i in seq_len(nc)){ # seq_len is safe even when nc is 0
		means[i] <- mean(y[,i], na.rm = removeNA) # mean() takes an na.rm argument
	}
	means # this will get returned
}

Functions in R are "first class objects". They are R objects of class "function".
Functions can be passed as arguments to other functions, and can be nested.
The arguments included in a function definition are known as 'formal arguments'. They can be listed with formals().
Arguments can be matched by name or by position: sd(x = mydata) and sd(mydata) are the same.
args(lm) shows the arguments taken by the function lm().
Named and unnamed arguments can be mixed. Named arguments are matched first and removed from the argument list; the rest are matched positionally in the order of the definition.
Order of operations for argument matching:
- Check for exact match
- Check for partial match
- Check for positional match
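
The matching rules can be tried out on sd(), whose formal arguments are x and na.rm. This is a sketch; mydata holds invented numbers.

```r
mydata <- c(2, 4, 4, 4, 5, 5, 7, 9)

s1 <- sd(mydata)                      # positional match
s2 <- sd(x = mydata)                  # exact name match
s3 <- sd(na.rm = FALSE, mydata)       # named argument removed first, mydata then goes to x
s4 <- sd(mydata, na = FALSE)          # partial match: "na" uniquely identifies na.rm

arg_names <- names(formals(sd))       # the formal arguments of sd
```

All four calls compute the same standard deviation; only the way the arguments are matched differs.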
NULL means nothing. It is commonly used as the default value of a formal argument.
Arguments to functions are evaluated lazily, i.e. only when they are needed.

f <- function(a,b){
	a^2
}
f(2)
# Argument b is never evaluated, so the call works. Argument a is matched positionally.

The special argument ... can be used when extending another function and you don't want to copy the entire argument list of the original function.

myplot <- function(x,y,type = "l", ...){
	plot(x, y, type=type, ...) # All other arguments are passed as it is
}
# Also generic functions use ... so that extra arguments can be passed to methods.

... is also necessary when the number of arguments passed to the function cannot be known in advance. Examples are paste() and cat(); see args(paste).
Note: arguments that appear after ... in the definition must be named explicitly and in full; partial matching is not allowed.
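
Both points can be seen with paste(), whose definition starts with ... followed by sep and collapse. The helper count_args below is a hypothetical function written only for this sketch.

```r
# Hypothetical helper: accepts any number of arguments
count_args <- function(...) {
	length(list(...))
}
n0 <- count_args()
n3 <- count_args(1, "two", TRUE)

# sep comes after ... in paste(), so it must be named in full
full_name    <- paste("a", "b", sep = ":")   # "a:b"
partial_name <- paste("a", "b", se = ":")    # "a b :" because se is just another string to paste
```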

Scoping Rules

You can create a function with a name that already exists, for instance lm <- function(x) { x + 1 }.
When R looks up a symbol, it searches a series of environments; see search(). First .GlobalEnv is checked, then the packages on the search list, in order.
Hence the order of the packages on the search list matters.
Note that R has separate namespaces for functions and non-functions, so it's possible to have an object named c and a function named c.
R uses lexical scoping, also called static scoping. The alternative is dynamic scoping.
Free variables are those used inside a function that are neither formal arguments nor local variables.
The values of free variables are searched for in the environment in which the function was defined. An environment is a collection of (symbol, value) pairs. Environments form a hierarchy: each has a parent.
A function + an environment = a closure, or function closure.
The value of a symbol is first searched for in the current environment; if it is not found, the search continues in the parent environment, and so on.
For functions defined in the workspace, the defining environment is the global environment; for functions in a package, it is the package namespace.
In R, functions can be defined inside other functions. In that case, the defining environment is the body of the enclosing function.

make.power <- function(n){
	pow <- function(x){
		x^n
	}
	pow
}

cube <- make.power(3) # Returns a function that gives a cube
square <- make.power(2)

cube(3) # Gives 27
square(2) # Gives 4
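
To see what a closure carries with it, you can inspect the environment of the returned function. The sketch below repeats make.power so that it runs on its own:

```r
make.power <- function(n) {
	pow <- function(x) {
		x^n          # n is a free variable, found in the defining environment
	}
	pow
}

cube <- make.power(3)

env_contents <- ls(environment(cube))        # symbols stored with the closure
n_value      <- get("n", environment(cube))  # the n that cube remembers
```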

See this example: the value of f(3) differs between lexical and dynamic scoping.

y <- 10
f <- function(x){
	y <- 2
	y^2 + g(x)
}

g <- function(x){
	x*y # Lexical scoping: y is 10, the value in the environment where g is defined. Dynamic scoping: y would be 2, the value in the calling environment.
	# The calling environment is also known as the parent frame
}

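# Note: the difference shows only when a function is defined in one environment and called from another. When a function is both defined and called in the global environment, the defining and calling environments are the same, so lexical scoping can look like dynamic scoping.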
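
Running the example confirms the lexical result. This sketch restates f and g in compact form so that it is self-contained:

```r
y <- 10

f <- function(x) {
	y <- 2           # local to f, invisible to g
	y^2 + g(x)
}

g <- function(x) {
	x * y            # y is looked up where g was defined, so y is 10
}

result <- f(3)       # 2^2 + 3 * 10 = 34
```

Under dynamic scoping, g would have seen y = 2 and the result would have been 2^2 + 3 * 2 = 10.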

Languages that support lexical scoping.
- Scheme
- Perl
- Python
- Common Lisp (all languages converge to Lisp)
A few optimization routines in R, such as optim, nlm and optimize, take a function as an argument and try to find its minimum or maximum.
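
As a minimal sketch of passing a function to an optimizer, the example below minimizes an invented one-dimensional function whose minimum is known to be at x = 2:

```r
# Hypothetical objective: a parabola with its minimum at x = 2
objective <- function(x) (x - 2)^2 + 1

# optimize searches a one-dimensional interval
res <- optimize(objective, interval = c(0, 5))
res$minimum      # location of the minimum, close to 2
res$objective    # value at the minimum, close to 1

# optim handles general problems; BFGS is one of its methods
res2 <- optim(par = 0, fn = objective, method = "BFGS")
res2$par         # also close to 2
```

The results are numerical approximations, so compare them with a tolerance rather than with ==.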

Coding Standards

Use text editors / text files
Indent code (4 spaces)
Limit the width of your code (80 columns?)
Limit the length of functions

Dates and Times in R

Dates are represented by the Date class.
Times are represented by the POSIXct or the POSIXlt class.
Dates are stored internally as the number of days since 1970-01-01; times (POSIXct) as the number of seconds since 1970-01-01.
unclass(x) removes the class from an object x, exposing the underlying value.
A string can be converted to a date with as.Date("1970-01-01").
POSIX is a family of operating system standards; the POSIXct and POSIXlt classes are named after it.
Check ?strptime for details on date formats. You can do arithmetic and comparisons on date-time objects.

x <- Sys.time()
print(x)

p <- as.POSIXlt(x)
names(unclass(p))
## Will give sec, min, hour, mday, etc.
p$sec
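
A few more date and time operations, sketched with fixed dates so that the results are reproducible. The time zone is set to UTC as an assumption to keep the numbers independent of the machine.

```r
d <- as.Date("1970-01-02")
days_since_origin <- unclass(d)       # 1: one day after 1970-01-01

tm <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
secs_since_origin <- as.numeric(tm)   # 60 seconds after the origin

# Parse a string with an explicit format (see ?strptime)
p <- strptime("2012-01-10 10:40:00", "%Y-%m-%d %H:%M:%S", tz = "UTC")
p$hour                                # 10
p$min                                 # 40

# Arithmetic on dates accounts for leap years
gap <- as.Date("2012-03-01") - as.Date("2012-02-28")
gap_days <- as.numeric(gap)           # 2, because 2012 has a 29 February
```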

Loop functions

lapply: loops over a list and evaluates a function on each element, returning a list
sapply: same as lapply, but also tries to simplify the result
Others are apply, tapply and mapply; split is a useful companion
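
A short tour of these functions on small invented inputs, as a sketch:

```r
x <- list(a = 1:5, b = c(10, 20))

as_list   <- lapply(x, mean)    # list with a = 3 and b = 15
as_vector <- sapply(x, mean)    # simplified to a named numeric vector

m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)    # apply over rows (margin 1): 9 12

# tapply: apply a function to groups of a vector
group_sums <- tapply(c(1, 2, 3, 4), c("u", "v", "u", "v"), sum)

# mapply: loop over several arguments in parallel
reps <- mapply(rep, 1:3, 3:1)   # rep(1, 3), rep(2, 2), rep(3, 1)

# split: break a vector into groups, often followed by lapply
groups <- split(1:6, c("p", "p", "q", "q", "r", "r"))
group_means <- sapply(groups, mean)
```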

I hope that was helpful. Comment below for hugs and bugs.