In this first chapter, we’ll start by establishing a common language for models and taking a deep view of the predictive modeling process. Much of the predictive modeling involves the key concepts of statistics and machine learning, and this chapter will provide a brief tour of the core distinctions of these fields that are essential knowledge for a predictive modeler. In particular, we’ll emphasize the importance of knowing how to evaluate a model that is appropriate to the type of problem we are trying to solve. Finally, we will showcase our first model, the k-nearest neighbors model, as well as a caret, a very useful R package for predictive modelers.
Models are at the heart of predictive analytics and for this reason, we’ll begin our journey by talking about models and what they look like. In simple terms, a model is a representation of a state, process, or system that we want to understand and reason about. We make models so that we can draw inferences from them and, more importantly for us in this book, make predictions about the world. Models come in a multitude of different formats and flavors, and we will explore some of this diversity in this book. Models can be equations linking quantities that we can observe or measure; they can also be a set of rules. A simple model with which most of us are familiar with school is Newton’s Second Law of Motion. This states that the net sum of force acting on an object causes the object to accelerate in the direction of the force applied and at a rate proportional to the resulting magnitude of the force and inversely proportional to the object’s mass.
We often summarize this information via an equation using the letters F, m, and a for the quantities involved. We also use the capital Greek letter sigma (Σ) to indicate that we are summing over the force and arrows above the letters that are vector quantities (that is, quantities that have both magnitude and direction):
This simple but powerful model allows us to make some predictions about the world. For example, if we apply a known force to an object with a known mass, we can use the model to predict how much it will accelerate. Like most models, this model makes some assumptions and generalizations. For example, it assumes that the color of the object, the temperature of the environment it is in, and its precise coordinates in space are all irrelevant to how the three quantities specified by the model interact with each other. Thus, models abstract away the myriad of details of a specific instance of a process or system in question, in this case, the particular object in whose motion we are interested, and limit our focus only to properties that matter.
Newton’s Second Law is not the only possible model to describe the motion of objects. Students of physics soon discover other more complex models, such as those taking into account relativistic mass. In general, models are considered more complex if they take a larger number of quantities into account or if their structure is more complex. Nonlinear models are generally more complex than linear models for example. Determining which model to use in practice isn’t as simple as picking a more complex model over a simpler model. In fact, this is a central theme that we will revisit time and again as we progress through the many different models in this book. To build our intuition as to why this is so, consider the case where our instruments that measure the mass of the object and the applied force are very noisy. Under these circumstances, it might not make sense to invest in using a more complicated model, as we know that the additional accuracy in the prediction won’t make a difference because of the noise in the inputs. Another situation where we may want to use the simpler model is if in our application we simply don’t need the extra accuracy. A third situation arises where a more complex model involves a quantity that we have no way of measuring. Finally, we might not want to use a more complex model if it turns out that it takes too long to train or make a prediction because of its complexity.
In this book, the models we will study have two important and defining characteristics. The first of these is that we will not use mathematical reasoning or logical induction to produce a model from known facts, nor will we build models from technical specifications or business rules; instead, the field of predictive analytics builds models from data. More specifically, we will assume that for any given predictive task that we want to accomplish, we will start with some data that is in some way related to or derived from the task at hand. For example, if we want to build a model to predict annual rainfall in various parts of a country, we might have collected (or have the means to collect) data on rainfall at different locations, while measuring potential quantities of interest, such as the height above sea level, latitude, and longitude. The power of building a model to perform our predictive task stems from the fact that we will use examples of rainfall measurements at a finite list of locations to predict the rainfall in places where we did not collect any data.
The second important characteristic of the problems for which we will build models is that during the process of building a model from some data to describe a particular phenomenon, we are bound to encounter some source of randomness. We will refer to this as the stochastic or nondeterministic component of the model. It may be the case that the system itself that we are trying to model doesn’t have any inherent randomness in it, but it is the data that contains a random component. A good example of a source of randomness in data is the measurement of the errors from the readings taken for quantities such as temperature. A model that contains no inherent stochastic component is known as a deterministic model, Newton’s Second Law is a good example of this. A stochastic model is one that assumes that there is an intrinsic source of randomness to the process being modeled. Sometimes, the source of this randomness arises from the fact that it is impossible to measure all the variables that are most likely impacting a system, and we simply choose to model this using probability. A well-known example of a purely stochastic model is rolling an unbiased six-sided die. Recall that in probability, we use the term random variable to describe the value of a particular outcome of an experiment or of a random process. In our die example, we can define the random variable, Y, as the number of dots on the side that lands face up after a single roll of the die, resulting in the following model:
This model tells us that the probability of rolling a particular digit, say, three is one in six. Notice that we are not making a definite prediction on the outcome of a particular roll of the die; instead, we are saying that each outcome is equally likely.
Probability is a term that is commonly used in everyday speech, but at the same time, sometimes results in confusion with regard to its actual interpretation. It turns out that there are a number of different ways of interpreting probability. Two commonly cited interpretations are the Frequentist probability and the Bayesian probability. Frequentist probability is associated with repeatable experiments, such as rolling a one-sided die. In this case, the probability of seeing the digit three, is just the relative proportion of the digit three coming up if this experiment were to be repeated an infinite number of times. Bayesian probability is associated with a subjective degree of belief or surprise in seeing a particular outcome and can, therefore, be used to give meaning to one-off events, such as the probability of a presidential candidate winning an election. In our die rolling experiment, we are equally surprised to see the number three come up as with any other number. Note that in both cases, we are still talking about the same probability numerically (1/6), only the interpretation differs.
In the case of the die model, there aren’t any variables that we have to measure. In most cases, however, we’ll be looking at predictive models that involve a number of independent variables that are measured, and these will be used to predict a dependent variable. Predictive modeling draws on many diverse fields and as a result, depending on the particular literature you consult, you will often find different names for these. Let’s load a dataset into R before we expand on this point. R comes with a number of commonly cited data sets already loaded, and we’ll pick what is probably the most famous of all, the iris data set:
To see what other data sets come bundled with R, we can use the data() command to obtain a list of data sets along with a short description of each. If we modify the data from a data set, we can reload it by providing the name of the data set in question as an input parameter to the data() command, for example, data(iris) reloads the iris data set.
head(iris, n = 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
The iris data set consists of measurements made on a total of 150 flower samples of three different species of iris. In the preceding code, we can see that there are four measurements made on each sample, namely the lengths and widths of the flower petals and sepals. The iris data set is often used as a typical benchmark for different models that can predict the species of an iris flower sample, given the four previously mentioned measurements. Collectively, the sepal length, sepal width, petal length, and petal width are referred to as features, attributes, predictors, dimensions, or independent variables in literature. In this book, we prefer to use the word feature, but other terms are equally valid. Similarly, the species column in the data frame is what we are trying to predict with our model, and so it is referred to as the dependent variable, output, or target. Again, in this book, we will prefer one form for consistency, and will use output. Each row in the data frame corresponding to a single data point is referred to as an observation, though it typically involves observing the values of a number of features.
As we will be using data sets, such as the iris data described earlier, to build our predictive models, it also helps to establish some symbol conventions. Here, the conventions are quite common in most of the literature. We’ll use the capital letter, Y, to refer to the output variable, and subscripted capital letter, Xi, to denote the ith feature. For example, in our iris data set, we have four features that we could refer to as X1 through X4. We will use lower case letters for individual observations so that x1 corresponds to the first observation. Note that x1 itself is a vector of feature components, xij, so that x12 refers to the value of the second feature in the first observation. We’ll try to use double suffixes sparingly and we won’t use arrows or any other form of vector notation for simplicity. Most often, we will be discussing either observations or features and so the case of the variable will make it clear to the reader which of these two is being referenced.
When thinking about a predictive model using a data set, we are generally making the assumption that for a model with n features, there is a true or ideal function, f, that maps the features to the output:
We’ll refer to this function as our target function. In practice, as we train our model using the data available to us, we will produce our own function that we hope is a good estimate for the target function. We can represent this by using a caret on top of the symbol f to denote our predicted function, and also for the output, Y, since the output of our predicted function is the predicted output. Our predicted output will, unfortunately, not always agree with the actual output for all observations (in our data or in general):
Given this, we can essentially summarize the process of predictive modeling as a process that produces a function to predict a quantity, while minimizing the error it makes compared to the target function. A good question we can ask at this point is, where does the error come from? Put differently, why are we generally not able to exactly reproduce the underlying target function by analyzing a data set?
The answer to this question is that in reality there are several potential sources of error that we must deal with. Remember that each observation in our data set contains values for n features, and so we can think about our observations geometrically as points in an n-dimensional feature space. In this space, our underlying target function should pass through these points by the very definition of the target function. If we now think about this general problem of fitting a function to a finite set of points, we will quickly realize that there are actually infinite functions that could pass through the same set of points. The process of predictive modeling involves making a choice in the type of model that we will use for the data thereby constraining the range of possible target functions to which we can fit our data. At the same time, the data’s inherent randomness cannot be removed no matter what model we select. These ideas lead us to an important distinction in the types of error that we encounter during modeling, namely the reducible error and the irreducible error respectively.
The reducible error essentially refers to the error that we as predictive modelers can minimize by selecting a model structure that makes valid assumptions about the process being modeled and whose predicted function takes the same form as the underlying target function. For example, as we shall see in the next chapter, a linear model imposes a linear relationship between the features in order to compose the output. This restrictive assumption means that no matter what training method we use, how much data we have, and how much computational power we throw in, if the features aren’t linearly related in the real world, then our model will necessarily produce an error for at least some possible observations. By contrast, an example of an irreducible error arises when trying to build a model with an insufficient feature set. This is typically the norm and not the exception. Often, discovering what features to use is one of the most time-consuming activities of building an accurate model.
Sometimes, we may not be able to directly measure a feature that we know is important. At other times, collecting the data for too many features may simply be impractical or too costly. Furthermore, the solution to this problem is not simply an issue of adding as many features as possible. Adding more features to a model makes it more complex and we run the risk of adding a feature that is unrelated to the output thus introducing noise in our model. This also means that our model function will have more inputs and will, therefore, be a function in a higher dimensional space. Some of the potential practical consequences of adding more features to a model include increasing the time it will take to train the model, making convergence on a final solution harder, and actually reducing model accuracy under certain circumstances, such as with highly correlated features. Finally, another source of an irreducible error that we must live with is the error in measuring our features so that the data itself may be noisy.
Reducible errors can be minimized not only through selecting the right model but also by ensuring that the model is trained correctly. Thus, reducible errors can also come from not finding the right specific function to use, given the model assumptions. For example, even when we have correctly chosen to train a linear model, there are infinitely many linear combinations of the features that we could use. Choosing the model parameters correctly, which in this case would be the coefficients of the linear model, is also an aspect of minimizing the reducible error. Of course, a large part of training a model correctly involves using a good optimization procedure to fit the model. In this book, we will at least give a high-level intuition of how each model that we study is trained. We generally avoid delving deep into the mathematics of how optimization procedures work but we do give pointers to the relevant literature for the interested reader to find out more.
So far we’ve established some central notions behind models and a common language to talk about data. In this section, we’ll look at what the core components of a statistical model are. The primary components are typical:
As we’ll see in this book, most models, such as neural networks, linear regression, and support vector machines have certain parameterized equations that describe them. Let’s look at a linear model attempting to predict the output, Y, from three input features, which we will call X1, X2, and X3:
This model has exactly one equation describing it and this equation provides the linear structure of the model. The equation is parameterized by four parameters, known as coefficients in this case, and they are the four β parameters. In the next chapter, we will see exactly what roles these play, but for this discussion, it is important to note that a linear model is an example of a parameterized model. The set of parameters is typically much smaller than the amount of data available.
Given a set of equations and some data, we then talk about training the model. This involves assigning values to the model’s parameters so that the model describes the data more accurately. We typically employ certain standard measures that describe a model’s goodness of fit to the data, which is how well the model describes the training data. The training process is usually an iterative procedure that involves performing computations on the data so that new values for the parameters can be computed in order to increase the model’s goodness of fit. For example, a model can have an objective or error function. By differentiating this and setting it to zero, we can find the combination of parameters that give us the minimum error. Once we finish this process, we refer to the model as a trained model and say that the model has learned from the data. These terms are derived from the machine learning literature, although there is often a parallel made with statistics, a field that has its own nomenclature for this process. We will mostly use the terms from machine learning in this book.
Our first model: k-nearest neighbors
In order to put some of the ideas in this chapter into perspective, we will present our first model for this book, k-nearest neighbors, which is commonly abbreviated as kNN. In a nutshell, this simple approach actually avoids building an explicit model to describe how the features in our data combine to produce a target function. Instead, it relies on the notion that if we are trying to make a prediction on a data point that we have never seen before, we will look inside our original training data and find the k observations that are most similar to our new data point. We can then use some kind of averaging technique on the known value of the target function for these k neighbors to compute a prediction. Let’s use our iris data set to understand this by way of an example. Suppose that we collect a new unidentified sample of an iris flower with the following measurements:
Sepal.Length Sepal.Width Petal.Length Petal.Width
4.8 2.9 3.7 1.7
We would like to use the kNN algorithm in order to predict which species of flower we should use to identify our new sample. The first step in using the kNN algorithm is to determine the k-nearest neighbors of our new sample. In order to do this, we will have to give a more precise definition of what it means for two observations to be similar to each other. A common approach is to compute a numerical distance between two observations in the feature space. The intuition is that two observations that are similar will be close to each other in the feature space and therefore, the distance between them will be small. To compute the distance between two observations in the feature space, we often use the Euclidean distance, which is the length of a straight line between two points. The Euclidean distance between two observations, x1 and x2, is computed as follows:
Recall that the second suffix, j, in the preceding formula corresponds to the jth feature. So, what this formula is essentially telling us is that for every feature, take the square of the difference in values of the two observations, sum up all these squared differences, and then take the square root of the result. There are many other possible definitions of distance, but this is one of the most frequently encountered in the kNN setting. We’ll see more distance metrics in Chapter 11, Recommendation Systems.
In order to find the nearest neighbors of our new sample iris flower, we’ll have to compute the distance to every point in the iris data set and then sort the results. First, we’ll begin by subsetting the iris data frame to include only our features, thus excluding the species column, which is what we are trying to predict. We’ll then define our own function to compute the Euclidean distance. Next, we’ll use this to compute the distance to every iris observation in our data frame using the apply() function. Finally, we’ll use the sort() function of R with the index. return parameter set to TRUE, so that we also get back the indexes of the row numbers in our iris data frame corresponding to each distance computed:
List of 2
$ x : num [1:150] 0.574 0.9 0.9 0.949 0.954 …
$ ix: int [1:150] 60 65 107 90 58 89 85 94 95 99 …
The $x attribute contains the actual values of the distances computed between our sample iris flower and the observations in the iris data frame. The $ix attribute contains the row numbers of the corresponding observations. If we want to find the five nearest neighbors, we can subset our original iris data frame using the first five entries from the $ix attribute as the row numbers:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
60 5.2 2.7 3.9 1.4 versicolor
65 5.6 2.9 3.6 1.3 versicolor
107 4.9 2.5 4.5 1.7 virginica
90 5.5 2.5 4.0 1.3 versicolor
58 4.9 2.4 3.3 1.0 versicolor
As we can see, four of the five nearest neighbors to our sample are the Versicolor species, while the remaining one is the virginica species. For this type of problem where we are picking a class label, we can use a majority vote as our averaging technique to make our final prediction. Consequently, we would label our new sample as belonging to the versicolor species. Notice that setting the value of k to an odd number is a good idea because it makes it less likely that we will have to contend with tie votes (and completely eliminates ties when the number of output labels is two). In the case of a tie, the convention is usually to just resolve it by randomly picking among the tied labels. Notice that nowhere in this process have we made any attempt to describe how our four features are related to our output. As a result, we often refer to the kNN model as a lazy learner because essentially, all it has done is memorize the training data and use it directly during a prediction. We’ll have more to say about our kNN model, but first, we’ll return to our general discussion on models and discuss different ways to classify them.
With a broad idea of the basic components of a model, we are ready to explore some of the common distinctions that modelers use to categorize different models.
We’ve already looked at the iris data set, which consisted of four features and one output variable, namely the species variable. Having the output variable available for all the observations in the training data is the defining characteristic of the supervised learning setting, which represents the most frequent scenario encountered. In a nutshell, the advantage of training a model under the supervised learning setting is that we have the correct answer that we should be predicting for the data points in our training data. As we saw in the previous section, kNN is a model that uses supervised learning, because the model makes its prediction for an input point by combining the values of the output variable for a small number of neighbors to that point. In this book, we will primarily focus on supervised learning.
Using the availability of the value of the output variable as a way to discriminate between different models, we can also envisage a second scenario in which the output variable is not specified. This is known as the unsupervised learning setting. An unsupervised version of the iris data set would consist of only the four features. If we don’t have the species output variable available to us, then we clearly have no idea as to which species each observation refers. Indeed, we won’t know how many species of flower are represented in the dataset, or how many observations belong to each species. At first glance, it would seem that without this information, no useful predictive task could be carried out. In fact, what we can do is examine the data and create groups of observations based on how similar they are to each other, using the four features available to us. This process is known as clustering. One benefit of clustering is that we can discover natural groups of data points in our data; for example, we might be able to discover that the flower samples in an unsupervised version of our iris set form three distinct groups which correspond to three different species.
Between unsupervised and supervised methods, which are two absolutes in terms of the availability of the output variable, reside the semi-supervised and reinforcement learning settings. Semi-supervised models are built using data for which a (typically quite small) fraction contains the values for the output variable, while the rest of the data is completely unlabeled. Many such models first use the labeled portion of the data set in order to train the model coarsely, then incorporate the unlabeled data by projecting labels predicted by the model trained up this point.
In a reinforcement learning setting the output, variable is not available, but other information that is directly linked with the output variable is provided. One example is predicting the next best move to win a chess game, based on data from complete chess games. Individual chess moves do not have output values in the training data, but for every game, the collective sequence of moves for each player resulted in either a win or a loss. Due to space constraints, semi-supervised and reinforcement settings aren’t covered in this book.
In a previous section, we noted how most of the models we will encounter are parametric models, and we saw an example of a simple linear model. Parametric models have the characteristic that they tend to define a functional form. This means that they reduce the problem of selecting between all possible functions for the target function to a particular family of functions that form a parameter set. Selecting the specific function that will define the model essentially involves selecting precise values for the parameters. So, returning to our example of a three feature linear model, we can see that we have the two following possible choices of parameters (the choices are infinite, of course; here we just demonstrate two specific ones):
Here, we have used a subscript on the output Y variable to denote the two different possible models. Which of these might be a better choice? The answer is that it depends on the data. If we apply each of our models on the observations in our data set, we will get the predicted output for every observation. With supervised learning, every observation in our training data is labeled with the correct value of the output variable. To assess our model’s goodness of fit, we can define an error function that measures the degree to which our predicted outputs differ from the correct outputs. We then use this to pick between our two candidate models in this case, but more generally to iteratively improve a model by moving through a sequence of progressively better candidate models.
Some parametric models are more flexible than linear models, meaning that they can be used to capture a greater variety of possible functions. Linear models, which require that the output be a linearly weighted combination of the input features, are considered strict. We can intuitively see that a more flexible model is more likely to allow us to approximate our input data with greater accuracy; however, when we look at overfitting, we’ll see that this is not always a good thing. Models that are more flexible also tend to be more complex and, thus, training them often proves to be harder than training less flexible models.
Models are not necessarily parameterized, in fact, the class of models that have no parameters is known (unsurprisingly) as nonparametric models. Nonparametric models generally make no assumptions on the particular form of the output function. There are different ways of constructing a target function without parameters. Splines are a common example of a nonparametric model. The key idea behind splines is that we envisage the output function, whose form is unknown to us, as being defined exactly at the points that correspond to all the observations in our training data. Between the points, the function is locally interpolated using smooth polynomial functions. Essentially, the output function is built in a piecewise manner in the space between the points in our training data. Unlike most scenarios, splines will guarantee 100 percent accuracy on the training data, whereas, it is perfectly normal to have some errors in our training data. Another good example of a nonparametric model is the k-nearest neighbor algorithm that we’ve already seen.
The distinction between regression and classification models has to do with the type of output we are trying to predict, and is generally relevant to supervised learning. Regression models try to predict a numerical or quantitative value, such as the stock market index, the amount of rainfall, or the cost of a project. Classification models try to predict a value from a finite (though still possibly large) set of classes or categories. Examples of this include predicting the topic of a website, the next word that will be typed by a user, a person’s gender, or whether a patient has a particular disease given a series of symptoms. The majority of models that we will study in this book fall quite neatly into one of these two categories, although a few, such as neural networks can be adapted to solve both types of problems. It is important to stress here that the distinction made is on the output only, and not on whether the feature values that are used to predict the output are quantitative or qualitative themselves. In general, features can be encoded in a way that allows both qualitative and quantitative features to be used in regression and classification models alike. Earlier, when we built a kNN model to predict the species of iris based on measurements of flower samples, we were solving a classification problem as our species output variable could take only one of three distinct labels. The kNN approach can also be used in a regression setting; in this case, the model combines the numerical values of the output variable for the selected nearest neighbors by taking the mean or median in order to make its final prediction. Thus, kNN is also a model that can be used in both regression and classification settings.
R may be classified as a complete analytical environment for the following reasons.
Multiple platforms and interfaces to input commands: R has multiple interfaces ranging from command line to numerous specialized graphical user interfaces (GUIs) (Chap. 2) for working on desktops. For clusters, cloud computing, and remote server environments, R now has extensive packages including SNOW, RApache, RMpi, R Web, and Rserve.
Software compatibility: Official commercial interfaces to R have been developed by numerous commercial vendors including software makers who had previously thought of R as a challenger in the analytical space (Chap. 4). Oracle, ODBC, Microsoft Excel, PostgreSQL, MySQL, SPSS, Oracle Data Miner, SAS/IML, JMP, Pentaho Kettle, and Jaspersoft BI are just a few examples of commercial software that are compatible with R usage. In terms of the basic SAS language, a WPS software reseller offers a separate add-on called the Bridge to R. Revolution Analytics offers primarily analytical products licensed in the R language, but other small companies have built successful R packages and applications commercially.
Interoperability of data: Data from various file formats as well as various databases can be used directly in R, connected via a package, or reduced to an intermediate format for importing into R (Chap. 2).
Extensive data visualization capabilities: These include much better animation and graphing than other software (Chap. 5).
Largest and fastest growing open source statistical library: The current number of statistical packages and the rate of growth at which new packages continue to be upgraded ensures the continuity of R as a long-term solution to analytical problems.
A wide range of solutions from the R package library for statistical, analytical, data mining, dashboard, data visualization, and online applications make it the broadest analytical platform in the field.
So what all is extra in R? The list below shows some of the additional features in R that make it superior to other analytical software.
R’s source code is designed to ensure complete custom solutions and embedding for a particular application. Open source code has the advantage of being extensively peer-reviewed in journals and the scientific literature. This means bugs will found, information about them shared, and solutions delivered transparently.
A wide range of training material in the form of books is available for the R analytical platform (Chap. 12).
R offers the best data visualization tools in analytical software (apart from Tableau Software’s latest version). The extensive data visualization available in R comprises a wide variety of customizable graphics as well as animation. The principal reason why third-party software initially started creating interfaces to R is that the graphical library of packages in R was more advanced and was acquiring more features by the day.
An R license is free for academics and thus budget friendly for small and large analytical teams.
R offers flexible programming for your data environment. This includes packages that ensure compatibility with Java, Python, and C.
It is easy to migrate from other analytical platforms to the R platform. It is relatively easy for a non-R platform user to migrate to the R platform, and there is no danger of vendor lock-in due to the GPL nature of the source code and the open community, the GPL can be seen at HTTP://WWW.GNU.ORG/COPYLEFT/GPL.HTML.
The latest and broadest range of statistical algorithms are available in R. This is due to R’s package structure in which it is rather easier for developers to create new packages than in any other comparable analytics platform.
Sometimes the distinction between statistical computing and analytics does come up. While statistics is a tool- and technique-based approach, analytics is more concerned with business objectives. Statistics are basically numbers that inform (descriptive), advise (prescriptive), or forecast (predictive). Analytics is a decision-making-assistance tool. Analytics on which no decision is to be made or is being considered can be classified as purely statistical and nonanalytical. Thus the ease with which a correct decision can be made separates a good analytical platform from a not-so-good one. The distinction is likely to be disputed by people of either background, and business analysis requires more emphasis on how practical or actionable the results are and less emphasis on the statistical metrics in a particular data analysis task. I believe one way in which business analytics differs from statistical analysis is the cost of perfect information (data costs in the real world) and the opportunity cost of delayed and distorted decision making.
The only cost of using R is the time spent learning it. The lack of a package or application marketplace in which developers can be rewarded for creating new packages hinders the professional mainstream programmer’s interest in R to the degree that several other platforms like iOS and Android and Salesforce offer better commercial opportunities to coding professionals. However, given the existing enthusiasm and engagement of the vast numbers of mostly academia-supported R developers, the number of R packages has grown exponentially over the past several years. The following list enumerates the advantages of R by business analytics, data mining, and business intelligence/data visualization as these have three different domains in the data sciences.
R is available for free download.
1. R is one of the few analytical platforms that work on Mac OS.
2. Its results have been established in journals like the Journal of Statistical Software, in places such as LinkedIn and Google, and by Facebook’s analytical teams.
3. It has open source code for customization as per GPL and adequate intellectual protection for developers wanting to create commercial packages.
4. It also has a flexible option for enterprise users from commercial vendors like Revolution Analytics (who support 64-bit Windows and now Linux) as well as big data processing through its RevoScaleR package.
5. It has interfaces from almost all other analytical software including SAS, SPSS, JMP, Oracle Data Mining, and RapidMiner. Exist huge library of packages is available for regression, time series, finance, and modeling.
6. High-quality data visualization packages are available for use with R.
As a computing platform, R is better suited to the needs of data mining for the following reasons.
1. R has a vast array of packages covering standard regression, decision trees, association rules, cluster analysis, machine learning, neural networks, and exotic specialized algorithms like those based on chaos models.
2. R provides flexibility in tweaking a standard algorithm by allowing one to see the source code.
3. The Rattle GUI remains the standard GUI for data miners using R. This GUI offers easy access to a wide variety of data mining techniques. It was created and developed in Australia by Prof. Graham Williams. Rattle offers a very powerful and convenient free and open source alternative to data mining software.
Business dashboards and reporting are an essential piece of business intelligence and decision making systems in organizations.
1. R offers data visualization through ggplot, and GUIs such as Deducer, GrapheR, and Red-R can help even business analysts who know none or very little of the R language in creating a metrics dashboard.
2. For online dashboards, R has packages like RWeb, RServe, and R Apache that, in combination with data visualization packages, offer powerful dashboard capabilities. Well-known examples of these will be shown later.
3. R can also be combined with Microsoft Excel using the R Excel package to enable R capabilities for importing within Excel. Thus an Excel user with no knowledge of R can use the GUI within the R Excel plug-in to take advantage of the powerful graphical and statistical capabilities.
4. R has extensive capabilities to interact with and pull data from databases including those by Oracle, MySQL, PostGresSQL, and Hadoop-based data. This ability to connect to databases enables R to pull data and summarize them for processing in the previsualization stage.
What follows is a brief collection of resources that describe how to use SAS Institute products and R: Base SAS, SAS/Stat, SAS/Graph.
An indicator of the long way R has come from being a niche player to a broadly accepted statistical computing platform is the SAS Institute’s acceptance of R as a complementary language. What follows is a brief extract from a February 2012 interview with researcher Kelci Miclaus from the JMP division at SAS Institute that includes a case study on how adding R can help analytics organizations even more.
How has JMP been integrating with R? What has been the feedback from customers so far? Is there a single case study you can point to where the combination of JMP and R was better than either one of them alone?
Feedback from customers has been very positive. Some customers use JMP to foster collaboration between SAS and R modelers within their organizations. Many use JMP’s interactive visualization to complement their use of R. Many SAS and JMP users use JMP’s integration with R to experiment with more bleeding-edge methods not yet available in commercial software. It can be used simply to smooth the transition with regard to sending data between the two tools or to build complete custom applications that take advantage of both JMP and R.
One customer has been using JMP and R together for Bayesian analysis. He uses R to create MCMC chains and has found that JMP is a great tool for preparing data for analysis and for displaying the results of the MCMC simulation. For example, the control chart and bubble plot platforms in JMP can be used to quickly verify convergence of an algorithm. The use of both tools together can increase productivity since the results of an analysis can be achieved faster than through scripting and static graphics alone.
I, along with a few other JMP developers, have written applications that use JMP scripting to call out to R packages and perform analysis like multidimensional scaling, bootstrapping, support vector machines, and modern variable selection methods. These really show the benefit of interactive visual analysis coupled with modern statistical algorithms. We’ve packaged these scripts as JMP add-ins and made them freely available on our JMP User Community file exchange. Customers can download them and employ these methods as they would a regular JMP platform. We hope that our customers familiar with scripting will also begin to contribute their own add-ins so a wider audience can take advantage of these new tools (see HTTP://WWW.DECISIONSTATS.COM/JMP-AND-R-RSTATS/).
How is R a complementary fit to JMP’s technical capabilities?
R has an incredible breadth of capabilities. JMP has extensive interactive, dynamic visualization intrinsic to its largely visual analysis paradigm, in addition to a strong core of statistical platforms. Since our brains are designed to visually process pictures and animated graphics more efficiently than numbers and text, this environment is all about supporting faster discovery. Of course, JMP also has a scripting language (JSL) that allows you to incorporate SAS code and R code and to build analytical applications for others to leverage SAS, R, and other applications for users who don’t code or who don’t want to code. JSL is a powerful scripting language on its own.
It can be used for dialog creation, automation of JMP statistical platforms, and custom graphic scripting. In other ways, JSL is very similar to the R language. It can also be used for data and matrix manipulation and to create new analysis functions. With the scripting capabilities of JMP, you can create custom applications that provide both a user interface and an interactive visual backend to R functionality. Alternatively, you could create a dashboard using statistical or graphical platforms in JMP to explore the data and, with the click of a button, send a portion of the data to R for further analysis.
Another JMP feature that complements R is the add-in architecture, which is similar to how R packages work. If you’ve written a cool script or analysis workflow, you can package it into a JMP add-in file and send it to your colleagues so they can easily use it.
What is the official view of R at your organization? Do you think it is a threat or a complementary product or statistical platform that coexists with your offerings?
Most definitely, we view R as complementary. R contributors provide a tremendous service to practitioners, allowing them to try a wide variety of methods in the pursuit of more insight and better results. The R community as a whole provides a valued role to the greater analytical community by focusing attention on newer methods that hold the most promise in so many application areas. Data analysts should be encouraged to use the tools available to them in order to drive discovery, and JMP can help with that by providing an analytic hub that supports both SAS and R integration.
Since you do use R, are there any plans to give back something to the R community in terms of your involvement and participation (say at use R events) or sponsoring contests?
We are certainly open to participating in use of R groups. At Predictive Analytics World in New York last October, they didn’t have a local use R group, but they did have a Predictive Analytics meet-up group comprised of many R users. We were happy to sponsor this. Some of us within the JMP division have joined local R user groups, myself included. Given that some local R user groups have entertained topics like Excel and R, Python and R, databases and R, we would be happy to participate more fully here. I also hope to attend the useR annual meeting later this year to gain more insight on how we can continue to provide tools to help both the JMP and R communities with their work. We are also exploring options to sponsor contests and would invite participants to use their favorite tools, languages, etc.