Supervised Learning

A version of this tutorial has been included as a section in the following book chapter:

Machine Learning in Materials Science: Recent Progress and Emerging Applications
Reviews in Computational Chemistry 29, 186 (2016)
Edited by Abby L. Parrill and Kenneth B. Lipkowitz

I would like to thank my co-authors and editors, as well as Hariharan Iyer and David LaVan at NIST, for their valuable comments and assistance in editing this text. I would also like to thank Alberto Hernandez-Valle for converting this material to HTML. This work was supported by the National Science Foundation under grant DMR-1352373. If you find this tutorial useful, please cite the above chapter. If you have any feedback, please let me know.

One of the fundamental goals of science is the development of theories that can be used to make accurate predictions. Predictive theories are generated through the scientific method, in which existing knowledge is used to formulate hypotheses, which are then used to make predictions that can be empirically tested with the goal of identifying the hypotheses that make the most accurate predictions. The scientific method can be expressed mathematically by considering predictive theory as a function f that maps a set of input data x to a predicted outcome y. The function may be relatively simple, as in Newton’s laws of motion, or it may be complex, as in models that predict the weather based on meteorological observations. The collection of known input (x) and output (y) values, called training data, may be generated through observations or controlled experiments. The goal of the scientist is to use such training data, as well as any other prior knowledge, to identify a function that is able to predict the output value for a new set of input data accurately. The process of identifying such a function from a set of known x and y values is called supervised learning. If the allowed output values y form a continuous range (e.g., melting points), then the process of searching for a function is known as regression. If the allowed output values form a discrete set (e.g., space groups), the process is then known as classification.

Figure 1 Four different functions (blue lines) fit to the same set of training data (black dots). a), b), and c) reproduce the data, and d) does not.

The hypothesis space contains all hypotheses (i.e., functions) that could be returned by the learning algorithm. An appealing choice for a hypothesis space might be the space in which every possible function is considered to be equally viable. However there will be many functions in this space that exactly reproduce the training data (Figure 1), and there will be no way to determine which functions would make the most accurate predictions. Thus, to identify a predictive function, it is necessary to use a hypothesis space that is constrained in a way that excludes some functions from consideration and/or is weighted to favor some functions more than others. For example, it is common to constrain the hypothesis space so that only functions expressing a linear relationship between the input and output values are considered (Figure 1d).

There may be no function in the hypothesis space that produces the observed output values for all possible sets of input values. This could happen for some combination of several reasons:

• The hypothesis space has been constrained in a way that excludes the function that perfectly maps the input values to the output values. For example, the hypothesis space might be constrained to include only linear functions, whereas no linear function of the input variables can reproduce the observed output values (Figure 1d).
• The observed output values are the result of a process that is inherently non-deterministic.
• Some input data that are relevant to calculating the correct output values are missing and/or unknown.

In these situations, any function in the hypothesis space will result in some error in the predicted values. To account for this error, the output values can be expressed as:

 $y = f\left( {\textbf{x}} \right) + E$ [1]

where $f$ is a function contained in the hypothesis space and $E$ is a random error. We will let $g$ represent the probability distribution from which $E$ is drawn. In other words, $g(a)$ is the probability density that $y - f({\bf{x}}) = a$. The distribution $g$ may depend on the input data, but for simplicity we will generally assume that it does not. In general, both the function $f$ and the probability distribution $g$ are unknown.

For a given probability distribution $g$, we can estimate the probability density that a function $f$ satisfies equation [1]. This probability density is expressed as $P\left( {f|{\bf{D}},g} \right)$, where ${\bf{D}}$ is the observed training data. This probability density can be evaluated using Bayes’ rule.1

Bayes’ rule is a fundamental statistical theorem that can be derived from the fact that the probability of two events, $A$ and $B$, occurring, is given by the probability of $B$ occurring times the conditional probability that $A$ occurs given that $B$ has occurred. Mathematically, this is written as:

 $P\left( {A,B} \right) = P\left( B \right)P\left( {A|B} \right)$ [2]

Similarly,

 $P\left( {A,B} \right) = P\left( A \right)P\left( {B|A} \right)$ [3]

Combining equations [2] and [3] yields Bayes’ rule:

 $P\left( {A|B} \right) = \frac{{P\left( {B|A} \right)}}{{P\left( B \right)}}P\left( A \right)$ [4]

In the context of supervised learning, Bayes’ rule yields:

 $P\left( {f|{\bf{D}},g} \right) = \frac{{P\left( {{\bf{D}}|f,g} \right)}}{{P\left( {{\bf{D}}|g} \right)}}P\left( {f|g} \right)$ [5]

The probability distribution $P\left( {f|g} \right)$ gives the probability density that a function $f$ satisfies equation [1] given $g$ and prior to the observation of any training data. For this reason, it is known as the prior probability distribution, which is sometimes simply referred to as the prior. The probability distribution $P\left( {f|{\bf{D}},g} \right)$ represents the probability of the same event after accounting for the training data. It is known as the posterior probability distribution. The distribution $P\left( {{\bf{D}}|f,g} \right)$, commonly known as the likelihood function, is the probability of observing the training data ${\bf{D}}$ given $f$ and $g$. The remaining term in equation [5], $P\left( {{\bf{D}}|g} \right)$, does not depend on $f$ and can effectively be treated as a normalization constant.

Bayes’ rule provides a natural and intuitive framework (or basis) for understanding learning. Initially, the hypothesis space is constrained and / or weighted through the prior probability distribution. All functions that are excluded from the hypothesis space are assigned a probability of zero, and functions that are not excluded are assigned non-zero prior probabilities. These probabilities represent the prior belief that any particular function satisfies equation [1]. As training data are observed, these probabilities are updated to account for the new knowledge, resulting in the posterior probability distribution – this is the learning step. If additional training data (${{\bf{D}}_{\bf{2}}}$) were to be observed, a re-application of Bayes’ rule could be used to update the probability distributions further:

 $P\left( {f|{{\bf{D}}_2},{\bf{D}},g} \right) = \frac{{P\left( {{{\bf{D}}_2}|f,{\bf{D}},g} \right)}}{{P\left( {{{\bf{D}}_2}|{\bf{D}},g} \right)}}P\left( {f|{\bf{D}},g} \right)$ [6]

where the posterior distribution in equation [5] has become the prior distribution in equation [6]. Thus repeated application of Bayes’ rule can be used to update the likelihood that a particular hypothesis is best as new data come in. An example of how the posterior distributions change with new data points is shown in Figure 2.

Figure 2 An example of how the posterior probability density changes with the addition of new training data points. In this example, $f$ is a constant value indicated by the black triangle and $g$ is a normal distribution with zero mean and known variance. The diamonds represent training data. Assuming a uniform prior in which all values for $f$ are considered equally likely, the posterior probability densities for $f$ after the addition of the first (orange), second (red), third (purple), fourth (blue) and fifth (green) training data points are shown.

Because it can be awkward to deal with a probability distribution defined over many functions, learning algorithms will commonly return the single function $\hat f$ that maximizes the posterior probability density:

 $\hat f = \mathop {\arg \max }\limits_f P\left( {f|{\bf{D}},g} \right)$ [7]

Combining equation [7] with equation [5] yields:

 $\hat f = \mathop {\arg \max }\limits_f \left[ {P\left( {{\bf{D}}|f,g} \right)P\left( {f|g} \right)} \right]$ [8]

Because the natural log function is monotonically increasing, equation [8] can be equivalently written

 $\hat f = \mathop {\arg \min }\limits_f \left[ { - \ln \left( {P\left( {{\bf{D}}|f,g} \right)} \right) - \ln \left( {P\left( {f|g} \right)} \right)} \right]$ [9]

The term in square brackets on the right side of equation [9] is the objective function to be minimized. The function with the maximum posterior probability of satisfying equation [1] is the function that minimizes the objective function. The idea of finding a function that minimizes an objective function is common to most machine learning algorithms, and the Bayesian analysis enables interpretation of the components of the objective function. The first term, $- \ln \left( {P\left( {{\bf{D}}|f,g} \right)} \right)$, represents the empirical risk, a measure of how well a given function reproduces the training data. The second term, $- \ln \left( {P\left( {f|g} \right)} \right)$, is the regularization term, representing the constraints and weights that are put on the hypothesis space before the training data are known.

Equation [9] is a commonly used framework for supervised learning algorithms. To determine the function $\hat f$ that is returned by the algorithm, there are five choices that typically need to be made. It is necessary to determine both what the hypothesis space should be and how the empirical risk will be calculated. There must also be an optimization algorithm that is capable of selecting a function from the hypothesis space (e.g., by minimizing the objective function). In addition, it is sometimes necessary to select the input values to be included in the training data and to estimate the prediction error for the selected function. In the following sections, each of these five choices is discussed in more detail.

It is through the prior probability distribution $P\left( {f|g} \right)$ that the hypothesis space is constrained and / or weighted. Thus selecting the prior probability distribution is equivalent to deciding which hypotheses will be considered and to what extent some hypotheses should be preferred over others. Hypotheses often take the form of models, or parameterized functions, for which the parameters are unknown. Often the prior distribution is implicitly specified simply by choice of the model to be used; for example, the assumption that a function is linear (as in Figure 1d) implicitly assigns zero prior probability to all non-linear functions.

The choice of the prior probability distribution can impact the effectiveness of the learning process significantly. Functions that are assigned a prior probability of zero are excluded from the hypothesis space and will not be considered no matter how strongly they are supported by the training data. On the other hand, a function that is assigned a small, but non-zero, prior probability could have a large posterior probability, provided it predicts the training data with relatively high accuracy. It is advantageous to choose prior distributions that assign the highest probability to the most accurate functions, as these distributions will generally result in an accurate choice of $\hat f$ with little (or no) additional training data.

The prior probability distribution is incorporated into the objective function through the regularization term, $- \ln \left( {P\left( {f|g} \right)} \right)$. This term is often described as a way to penalize functions that are unlikely to make good predictions, and the benefit of a regularization term can be derived independently from Bayesian analysis.2 The use of a regularization term can make an ill-posed problem, in which there is no unique solution and/or the solution does not change continuously with the training data, into a problem that is well-posed.

The act of choosing a prior probability distribution can be controversial because it can introduce a subjective element into the learning process. However the choice of a prior must be made, either implicitly (e.g., by constraining the hypothesis space to only include certain functions) or explicitly. Five common strategies for selecting the prior probability distribution / regularization term are described below.

If no information is known about the process being modeled, a natural choice for the prior distribution would be one that makes the fewest assumptions about the relative merits of different candidate functions. Such prior probability distributions are known as uninformative priors. An uninformative prior distribution that assigns equal probabilities to all possible functions would make it impossible to differentiate between functions that perfectly reproduce the training data. Thus if an uninformative prior distribution is to be used among functions in the hypothesis space, it is necessary to constrain the hypothesis space to only include certain functions.

The appropriate choice of an uninformative prior may depend on the problem being modeled and the way in which the functions in the hypothesis space are parameterized. A number of different strategies have been proposed, and a thorough review of these can be found in reference 3. One strategy for choosing uninformative priors is the principle of maximum entropy, championed by E. T. Jaynes.4 This principle states that the prior probability distribution should be chosen in a way that maximizes the information entropy within the constraints of existing knowledge, where the information entropy of a probability density $p$ is a measure of uncertainty defined as

 $S\left( p \right) = - \int\limits_{ - \infty }^\infty {p\left( x \right)\ln \left( {p\left( x \right)} \right)dx}$ [10]

Jaynes has made the case that the distribution that maximizes the information entropy is the “maximally noncommittal” probability distribution.4 For example, if a single scalar parameter is to be determined, and nothing is known about it prior to the observation of the training data, then the principle of maximum entropy would state that the appropriate prior distribution is the uniform prior in which all possible values are considered equally likely.

A uniform prior that assigns equal prior probabilities to an infinite number of functions is an improper prior, in that it cannot be normalized to 1. In practice, such improper priors are widely used, as they may result in a proper (i.e., normalizable) posterior distribution.

One strategy for selecting a prior distribution is to evaluate multiple possible priors and choose the one giving the posterior distribution with the lowest expected prediction error. To this end, the methods described in the section “Estimating the Prediction Error” may be employed. A common variation of this approach is model selection, in which the learning process is broken into two steps. In the first step, a parameterized function (a.k.a model) is selected, effectively assigning zero prior probability to all other functions. (Different types of models are discussed in the section on “Supervised Learning Algorithms”; common examples include linear models, neural networks, etc.) In the second step the parameter values that minimize the objective function for the selected model are determined.

A general approach to model selection can be derived through a Bayesian analysis. For a given model and a given set of training data, different values of the model parameters will result in different values for the empirical risk. Let $r$ represent the minimum value of the empirical risk achievable within a given model for a given set of training data. Schwartz demonstrated that under a wide range of conditions, the model with the greatest expected posterior probability is the one for which the following value is smallest:

 $2r + k\ln \left( n \right)$ [11]

where $k$ is the number of parameters in the model and $n$ is the number of elements (${\bf{x}},y$ pairs) in the training data.5 The expression in equation [11] is commonly known as the Bayesian information criterion (BIC). Separately, Akaike derived a similar term to be minimized for model selection, commonly known as the Akaike information criterion (AIC):6

 $2r + 2k$ [12]

Both the Bayesian information criterion and Akaike information criterion are valuable for revealing a general rule for model selection: among models that reproduce the training data equally well, the one with the fewest parameters can be expected to have the lowest prediction error. This result is similar to Occam’s razor, a commonly used heuristic that all else being equal, simple hypotheses should be favored over more complex ones. Intuitively, this insight can be understood by considering that if there are more parameters in a model, there is greater risk of selecting a function that happens to reproduce the training data well but has little predictive ability. This is known as overfitting the training data.

Determining the best way to constrain and/or weight the hypothesis space can be accomplished by incorporating prior knowledge about the process being modeled into the prior distribution. For example, physical arguments might suggest that the output value should be a linear function of a particular input value, or the expected magnitudes of some of the parameters that define the function might be known. The prior probability distribution can be constructed in a way that accounts for this knowledge, directing the learning process towards functions that are expected to be reasonable even before the observation of the training data. The use of a prior probability distribution that effectively takes into account external knowledge can significantly accelerate the learning process by making use of all available knowledge.

The principle of maximum entropy can be combined with prior knowledge to create a prior distribution that incorporates existing knowledge in a “maximally noncommittal” way. For example, if estimates for the mean and the variance of a parameter’s value are available, then a Gaussian distribution over possible parameter values will maximize the information entropy and would therefore be the appropriate choice of a prior distribution for the parameter value under the principle of maximum entropy.

Prior knowledge can be especially useful for setting the mean of the prior probability distribution. If the prior probability distribution has a non-zero mean $\bar f$, then the learning process can be re-cast in terms of a prior probability distribution with zero mean by replacing the function $f$ with

 $\Delta f = f - \bar f$ [13]

Accordingly, each value ${y_i}$ in the training data is replaced with

 $\Delta {y_i} = {y_i} - \bar f\left( {{{\bf{x}}_i}} \right)$ [14]

For example, if $f$ calculates the energy of a compound, then $\bar f$ might be the composition-weighted average of the energies of the constitutive elements, and $\Delta f$ would be the formation energy.

The function $\Delta f$ represents the difference between the actual function $f$ and the expected function $\bar f$. Because $f$ is expected to resemble $\bar f$, it can be expected that the norm of $\Delta f$, represented by $\left\| {\Delta f} \right\|$, is more likely to be small than it is to be large. For this reason, it is common to use a term that monotonically increases with $\left\| {\Delta f} \right\|$ as the regularization term.

A variety of different norms can be used, and some of the most popular take the form of the ${L^p}$ norm, defined by

 $\left\| f \right\| = {\left( {\int {{{\left| {f\left( {\bf{x}} \right)} \right|}^p}d{\bf{x}}} } \right)^{1/p}}$ [15]

where p 1 and the integral is over all values of x. The ${L^1}$ and ${L^2}$ norms are commonly used, as are smoothing norms that favor functions that vary smoothly with the input values (i.e., their higher-order derivatives are small). An example of a widely used smoothing norm is the ${L^2}$ norm of the second derivative of $\Delta f$.

It is generally a good idea to transform $f$ to $\Delta f$ if prior knowledge can be used to make a reasonable estimate for $\bar f$; otherwise $\bar f = 0$ is implicitly used. The approaches to supervised learning discussed in this chapter are equally applicable to $f$ and $\Delta f$.

An alternative approach to selecting a particular prior distribution is to assign a probability distribution over the space of possible prior distributions. Such a probability distribution is known as a hyperprior. For example, if a Gaussian distribution with zero mean is used as the prior, a hyperprior could be constructed as a probability distribution over possible values of the variance of the Gaussian. The posterior distribution can then be calculated as a weighted average over prior distributions:

 $P\left( {f|{\bf{D}},g} \right) = \int {\frac{{P\left( {{\bf{D}}|f,g} \right)}}{{P\left( {{\bf{D}}|g} \right)}}P\left( {f|g} \right)} P\left( {P\left( {f|g} \right)} \right)$ [16]

where $P\left( {f|g} \right)$ is the prior, $P\left( {P\left( {f|g} \right)} \right)$ is the hyperprior, and the integral is over all possible prior distributions.

Many of the same challenges for determining a prior exist for determining a hyperprior, and there is an extra integration step that needs to be performed to arrive at the posterior. However the hyperprior allows for an extra layer of abstraction in situations in which the posterior may be particularly sensitive to the choice of a prior. It is possible to similarly define hyperhyperpriors, etc, but in practice this is rarely done.

The empirical risk represents the negative log probability of observing the training data for a given $f$ and $g$. Assuming that all of the observations in the training data are independent, from equation [1] and the definition of $g$, we can write:

 $P\left( {{\bf{D}}|f,g} \right) = \prod\limits_i {g\left( {{y_i} - f\left( {{{\bf{x}}_i}} \right)} \right)}$ [17]

where ${{\bf{x}}_i}$ is the ${i^{th}}$ set of input values in the training set, ${y_i}$ is the ${i^{th}}$ output value, and the product is over all elements in the training set. Thus, under the assumption that the observations in the training set are independent of each other, the empirical risk can be written as:

 $- \ln \left( {P\left( {{\bf{D}}|f,g} \right)} \right) = \sum\limits_i { - \ln \left( {g\left( {{y_i} - f\left( {{{\bf{x}}_i}} \right)} \right)} \right)}$ [18]

where the sum is over all elements in the training set. For example, if $g$ is assumed to be Gaussian, then the empirical risk would depend on the sum of the squared differences between ${y_i}$ and $f\left( {{{\bf{x}}_i}} \right)$. This approach leads to least-squares fitting, discussed in more detail in the section on “Regularized Least Squares”.

The empirical risk is sometimes written more generally as

 $\sum\limits_i {L\left( {{y_i},f\left( {{{\bf{x}}_i}} \right)} \right)}$ [19]

where $L$ is a loss function that calculates the penalty (a.k.a loss) for large differences between ${y_i}$ and$f\left( {{{\bf{x}}_i}} \right)$. In practice many commonly used loss functions can be written in the form of equation [18], as a function of ${y_i} - f\left( {{{\bf{x}}_i}} \right)$.

Unlike the prior probability distribution, the empirical risk depends on the function$g$, which is in general unknown. As both $f$ and $g$ are unknown, it might make sense to treat the two similarly and to search for the pair of functions that together are most likely to satisfy equation [1]. The posterior and prior distributions would then be defined for the pair, $(f,g)$, and application of Bayes’ rule would yield:

 $P\left( {f,g|{\bf{D}}} \right) = \frac{{P\left( {{\bf{D}}|f,g} \right)}}{{P\left( {\bf{D}} \right)}}P\left( {f,g} \right)$ [20]

as an alternative to equation [5]. However such an approach is not commonly used. Instead, it is more common to make the prior assumption that $g$, and hence the loss function, is known.

If a uniform prior probability distribution is used for all functions in the hypothesis spaces, then the optimization of the objective function (equation [9]) can be accomplished by minimizing the empirical risk. This approach is known empirical risk minimization. Empirical risk minimization is equivalent to selecting the function that maximizes the likelihood function $P\left( {{\bf{D}}|f,g} \right)$ (i.e., the function that best reproduces the training data). Although empirical risk minimization is a widely used method, there is a risk of overfitting training data. This risk can often be mitigated by replacing the uniform prior used in empirical risk minimization with a prior distribution that favors simple functions or takes into account prior knowledge.

Many machine learning algorithms involve the optimization of an objective function, as in equation [9]. For practical purposes, the functions in the hypothesis space are typically characterized by a set of unknown parameters; e.g., the functions may be expressed as a linear combination of basis functions, in which the linear expansion coefficients are unknown parameters. Thus the problem of searching for the optimal function becomes a problem of finding the set of parameters that minimize the objective function. Many algorithms have been developed to address such optimization problems (e.g., gradient descent approaches, simulated annealing, etc.), and the field of general optimization algorithms is too large to discuss here. Instead we refer the reader to some of the many comprehensive books on the subject (e.g., references 7-9).

For some objective functions, there may be no known algorithm that is able to find the globally optimal function and / or verify whether a particular function is globally optimal with a reasonable computational cost. However many machine learning algorithms use an objective function and hypothesis space that have been designed to facilitate the rapid identification of the globally optimal function. The ways in which this is done are described in the context of individual machine learning algorithms in the section on “Supervised Learning Algorithms”.

The training data may be generated by observations of external events that cannot easily be controlled, such as climatological data used in weather forecasting. However in many cases it is possible to generate training data through controlled experiments. In each experiment, a set of input values are evaluated, and the corresponding output value becomes known once the experiment is complete. Because the generation of training data can be an expensive and/or time-consuming step in the learning process, it is desirable to minimize the total number of experiments that must be performed to achieve an acceptable level of prediction error. The field of active learning, also known as design of experiments, deals with determining the best set of experiments to perform (i.e. determining the best elements to include in training data) to minimize the total cost of generating the training data.

There are many different approaches to active learning, and we will not review them all here. Good overviews can be found in references 10-12. A common approach to active learning is uncertainty sampling,13 in which the next experiment is performed on input values for which there is a large amount of uncertainty in the predicted output value. A related approach is query by committee,14 in which several different models are trained on the same data, and the next experiment is performed on a data point about which there is the least agreement among the models. However this approach can result in the sampling of outlier data points that are not representative of the space of possible input values. Alternatively, if the distribution of all possible input values is known, the input values can be selected in a way that takes this distribution into account. Such density-weighted methods can result in significant performance improvements over methods that do not account for the distribution of possible input values.11

If the hypothesis space consists of parameterized functions, the training data may be chosen in a way that minimizes some measure of the variance of the estimated parameter values. This is often accomplished by optimizing the observed information matrix, sometimes called simply the information matrix.15 The observed information matrix is the Hessian of the empirical risk with respect to the function parameters, and it is often evaluated at the parameter values that minimize the empirical risk. It is an indicator of how informative the current training data are about the parameter values.

A number of different criteria have been developed to optimize the observed information matrix. Among the most common are A-optimality, in which the trace of the inverse information matrix is minimized, and D-optimality, in which the determinant of the information matrix is maximized.16-18 An overview of these and many other optimality criteria can be found in reference 19. As the information matrix does not take into account the prior probability distribution over possible parameter values, an alternative approach is to use the Hessian of the objective function in place of the information matrix. In this approach, sometimes referred to as Bayesian experimental design,20, 21 the same matrix optimality criteria (A-optimality, D-optimality, etc.) may be used. In materials science, optimization of the information matrix has been used to select training data for cluster expansion models.

The goal of a machine learning algorithm is to identify a function that makes accurate predictions for input values that are not included in the training data. Thus to evaluate the results of a learning algorithm, it is not sufficient to evaluate how well the function reproduces the training data (i.e., the empirical risk). Rather it is best to use a method that is capable of estimating the prediction error of a function over the distribution of all possible input values.

The estimation of the prediction error can be accomplished using resampling methods, in which functions are trained on one or more subsamples of the training data using the same learning algorithm that is used for the entire set of training data. These functions are then evaluated using subsamples of the training data on which they were not trained, providing an estimate of the predictive ability of the functions identified by the learning algorithm.

A common resampling technique is cross-validation,22-25 in which the set of known observations are partitioned into two subsets. The first subset, the training set, is used to identify a likely function. The predictive power of this function is evaluated by calculating its prediction error when applied to the second subset, known as the test set. Averaging the cross-validation prediction error over multiple different partitions provides a measure of the estimated prediction error for a function trained on the entire known set of observations. A common variation of cross-validation approach is k-fold cross validation, in which the set of observations are partitioned into k different subsets, and the prediction error on each subset is evaluated for a function trained on the k-1 remaining subsets. When k equals the number of samples in the training set, this approach is known as leave-one-out cross validation.

Another popular resampling technique is bootstrapping,26-28 in which the subsamples are drawn from the set of training data with replacement, meaning the same element can appear in the subsample more than once. Functions trained on these subsamples are then compared to a function trained on the entire set of training data. Bootstrapping is commonly used to estimate statistics such as the bias and variance in the output of the learning algorithm. Additional details about cross-validation, bootstrapping, and other resampling methods can be found in references 29-33.

Resampling methods provide the best estimates of prediction error when the distribution of input values in the set of known observations is representative of the distribution of all possible input values. Training data that consist of uncontrollable empirical observations generally fit this description, provided the observations are effectively randomly drawn from the distribution of all possible input values. Alternatively, density-weighted active learning methods can generate a set of observations that are representative of the distribution of possible input values and are well-suited for use in cross-validation algorithms.

Many supervised learning algorithms have been developed, and it is not feasible to describe them all here. Instead, a brief overview of some of the most common approaches is provided. These approaches are all described in the context of the framework presented in the Introduction. In each of these approaches, there is a hypothesis space, objective function, and optimization algorithm. There are also sometimes algorithm-specific active learning methods used to generate an efficient set of training data. Approaches for estimating prediction error are fairly universal, so they will not be described in detail here.

One of the most widely used methods for fitting a function to data is least squares regression. The least-squares approach is characterized by the assumption that $g$ is Gaussian:

 $g\left( {{y_i} - f\left( {{{\bf{x}}_i}} \right)} \right) = \frac{1}{{\sigma \sqrt {2\pi } }}{e^{\frac{{{{\left( {{y_i} - f\left( {{{\bf{x}}_i}} \right)} \right)}^2}}}{{2{\sigma ^2}}}}}$ [21]

The empirical risk is therefore:

 $- \ln \left( {P\left( {{\bf{D}}|f,g} \right)} \right) = \frac{1}{{2{\sigma ^2}}}\sum\limits_i {\left[ {{{\left( {{y_i} - f\left( {{{\bf{x}}_i}} \right)} \right)}^2} - \ln \left( {\frac{1}{{\sigma \sqrt {2\pi } }}} \right)} \right]}$ [22]

where the sum is over all elements in the training set. The loss function depends on the squared difference between the observed and predicted output values, and empirical risk minimization yields:

 $\hat f = \mathop {\arg \min }\limits_f \sum\limits_i {{{\left( {{y_i} - f\left( {{{\bf{x}}_i}} \right)} \right)}^2}}$ [23]

Equation [23] describes a least-squares fit, in which the selected function minimizes the sum of the squared errors over all of the elements in the training set. The loss function in equation [23], known as the squared error loss, is commonly used in machine learning algorithms. Under the assumptions that the errors are normally distributed with zero mean and constant variance (equation [21]), the least squares fit returns the function that maximizes the probability of observing the training data. Similarly, if a uniform prior is assumed for the functions in the hypothesis space, the least-squares fit returns the function that maximizes the posterior probability distribution.

The least-squares fit is often a justifiable choice for function fitting. The assumption that $g$ is Gaussian can be justified using the principle of maximum entropy, and the use of a uniform prior can be justified on the grounds that it is uninformative and hence will not bias the results. Perhaps most importantly from a practical perspective, the least-squares fit is conceptually simple and easy to implement. In a common implementation, known as linear least squares or ordinary least squares, the hypothesis space is restricted to include only linear functions of the input values. The optimal set of coefficients, ${\bf{\hat \beta }}$, is given by

 ${\bf{\hat \beta }} = \mathop {\arg \min }\limits_\beta \sum\limits_i {{{\left( {{y_i} - {{\bf{x}}_i}{\bf{\beta }}} \right)}^2}}$ [24]

where ${{\bf{x}}_i}$ is a row-vector containing the input values and ${\bf{\beta }}$ is a column-vector containing the unknown input values. From equation [24], the exact optimal solution can be calculated:

 ${\bf{\hat \beta }} = {\left( {{{\bf{X}}^{\bf{T}}}{\bf{X}}} \right)^{ - 1}}{{\bf{X}}^{\bf{T}}}{\bf{y}}$ [25]

where ${\bf{y}}$ is a column-vector in which the ith element is ${y_i}$, and ${\bf{X}}$ is a matrix in which the ith row is ${{\bf{x}}_i}$. A unique solution to equation [25] only exists if the matrix ${{\bf{X}}^{\bf{T}}}{\bf{X}}$ is non-singular.

It is often possible to improve upon least squares fits by using a non-uniform prior distribution. For example, consider the situation in which a multivariate normal distribution is used as the prior:

 $P\left( {{\bf{\beta }}|g} \right) \propto {e^{\frac{{ - {{\bf{\beta }}^T}{\bf{\Lambda \beta }}}}{{2{\sigma ^2}}}}}$ [26]

where ${\bf{\Lambda }}$ is a positive-definite matrix. The set of coefficients that minimize the objective function is given by

 ${\bf{\hat \beta }} = {\left( {{{\bf{X}}^{\bf{T}}}{\bf{X}} + {\bf{\Lambda }}} \right)^{ - 1}}{{\bf{X}}^{\bf{T}}}{\bf{y}}$ [27]

Equation [27] represents a type of regularized least-squares fit known as Tikhonov regularization.2 When a normal least-squares fit is ill-posed (e.g., when ${{\bf{X}}^{\bf{T}}}{\bf{X}}$ is singular), Tikhonov regularization can make the problem well-posed, such that a unique solution is guaranteed. Variants of Tikhonov regularization are popular in part because they are often easy to justify, robust, and the optimal solution can be found with only slightly more computational cost than a least-squares fit.

When ${\bf{\Lambda }} = \lambda {\bf{I}}$ in equation [27], where ${\bf{I}}$ is the identity matrix, it is known as ridge regression.34 Ridge regression is equivalent to using the squared ${\ell ^2}$ norm of the coefficients as the regularization term, where the ${\ell ^p}$ norm for a vector is defined as

 $\left\| {\bf{\beta }} \right\| = {\left( {\sum\limits_i {{\beta _i}^p} } \right)^{1/p}}$ [28]

for p 1. (It is the discrete version of the ${L^p}$ norm described by equation [15].) Another popular form of regularized least-squares includes the use of an ${\ell ^1}$ norm instead of an ${\ell ^2}$ norm. The use of the ${\ell ^1}$ norm often results in a minimizing vector of coefficients, ${\bf{\hat \beta }}$, in which many elements are identically 0. This approach is sometimes known as the Lasso estimator or compressive sensing,35, 36 and it is useful when the solution is sparse. It is in general not as straightforward as using the ${\ell ^2}$ norm, but efficient algorithms for finding the optimal set of coefficients exist.37

Linear least squares is a particularly convenient method for active learning approaches (described in the section on "Training Data"), as the information matrix is independent of the values of the coefficients and proportional to ${{\bf{X}}^{\bf{T}}}{\bf{X}}$. For Bayesian experimental design, the matrix ${{\bf{X}}^{\bf{T}}}{\bf{X}} + {\bf{\Lambda }}$ may be used in place of ${{\bf{X}}^{\bf{T}}}{\bf{X}}$.

Classification problems differ from regression problems in that the set of allowed output values is discrete, with each allowed output value corresponding to a different class. The same general learning framework could be used for classification problems as for regression problems, but in practice it can be difficult to work with discontinuous output values. Many classification problems can be simplified by recognizing that different classes often correspond to different regions in input-value space (Figure 3). Thus instead of searching for a discontinuous function that predicts the output values directly, it is possible to search instead for a continuous dividing surface between the regions corresponding to different classes.

For example, consider a situation in which there are two classes, corresponding to $y = 1$ and $y = - 1$. If the prior assumption is made that there is a linear hyperplane in the space of input values that divides the two different classes, then the classification function takes the form:

 $f\left( {\bf{x}} \right) = \left\{ {\begin{array}{*{20}{c}} {1,} & {{\rm{if\hspace{1em}}}{\bf{w}} \cdot {\bf{x}} - b > 0}\\ { - 1,} & {{\rm{if\hspace{1em}}}{\bf{w}} \cdot {\bf{x}} - b < 0} \end{array}} \right.$ [29]

where the vector of coefficients ${\bf{w}}$ (commonly known as weights) and the offset $b$ define the hyperplane that divides the two classes (Figure 3). Thus the discrete classification function, $f\left( {\bf{x}} \right)$, can be determined by learning the continuous function ${\bf{w}} \cdot {\bf{x}} - b$. We will call this continuous function $h\left( {\bf{x}} \right)$.

The training data are linearly separable if there is at least one hyperplane that perfectly separates the two classes. There are usually many competing hyperplanes that can separate linearly separable data, and the goal of the learning algorithm is to find the one that is most likely to correctly separate data points that are not in the training set. One approach is to find the hyperplane that is farthest from the input values in the training data. In other words, the best hyperplane is the one with the longest support vectors, which are defined as the shortest vectors between the hyperplane and the nearest training data point of each class (Figure 3). Support vector machines are widely used supervised learning algorithms that identify such a hyperplane.38 If there are more than two different classes present, support vector machines can be used to find the hyperplanes separating all pairs of classes. A brief introduction to support vector machines is presented below, and a more comprehensive review of different support vector machine approaches can be found in references 39 and 40.

Figure 3 A linearly separable data set containing two classes (grey and white). Each point corresponds to coordinates $\left( {{x_1},{x_2}} \right)$ given by the input values, and the colors are determined by the corresponding $y$ values. The dashed line shows the dividing plane, and the short thick lines are the support vectors.

Hyperplanes are defined by the unknown parameters ${\bf{w}}$ and $b$, and the optimal values for these parameters can be found by minimizing an objective function similar to the one in equation [9]. The loss function ( equation [19]) is given by

 $L\left( {{y_i},h\left( {{{\bf{x}}_i}} \right)} \right) = \left\{ {\begin{array}{*{20}{c}} {0,} & {{\rm{if\hspace{1em} }}{y_i}h\left( {{{\bf{x}}_i}} \right) \ge 1}\\ {\infty ,} & {{\rm{if\hspace{1em}}}{y_i}h\left( {{{\bf{x}}_i}} \right) < 1} \end{array}} \right.$ [30]

and the regularization term is simply $\frac{1}{2}{\left\| {\bf{w}} \right\|^2}$. The loss function ensures that the plane separates the two classes, and the regularization term is minimized for the plane with the longest support vectors. This is a constrained quadratic optimization problem that can be solved using quadratic programming.41, 42 A similar approach, least-squares support vector machines (LS-SVM),43 enables the calculation of the optimal weights by solving a linear system.

If the training data are not linearly separable, then the loss function in equation [30] will always be infinite for at least one element of the training set, and no optimal set of parameters will be found. This problem can be addressed by introducing non-negative slack variables, ${\xi _i}$, that allow for some data points to be misclassified.38 The slack variables effectively measure the distance between the hyperplane and the misclassified data points. The loss function becomes:

 $L\left( {{y_i},h\left( {{{\bf{x}}_i}} \right)} \right) = \left\{ {\begin{array}{*{20}{c}} {0,} & {{\rm{if\hspace{1em}}}{y_i}h\left( {{{\bf{x}}_i}} \right) \ge 1 - {\xi _i}}\\ {\infty ,} & {{\rm{if\hspace{1em}}}{y_i}h\left( {{{\bf{x}}_i}} \right) < 1 - {\xi _i}} \end{array}} \right.$ [31]

and the regularization term is $\frac{1}{2}{\left\| {\bf{w}} \right\|^2} + C\sum\limits_i {{\xi _i}}$, where $C$ is an adjustable parameter that determines how severely misclassification should be penalized. An alternative equivalent formulation is to use the hinge loss:

 $L\left( {{y_i},h\left( {{{\bf{x}}_i}} \right)} \right) = \left\{ {\begin{array}{*{20}{c}} {0,} & {{\rm{if\hspace{1em} }}{y_i}h\left( {{{\bf{x}}_i}} \right) \ge 1}\\ {C\left( {1 - {y_i}h\left( {{{\bf{x}}_i}} \right)} \right),} & {{\rm{if\hspace{1em}}}{y_i}h\left( {{{\bf{x}}_i}} \right) < 1} \end{array}} \right.$ [32]

where the corresponding regularization term is $\frac{1}{2}{\left\| {\bf{w}} \right\|^2}$. By comparing the loss functions in equations [30] and [32], it can be seen that the hinge loss simply replaces the infinite penalty for misclassification with a penalty that scales linearly with the degree of misclassification.

The optimal set of weights for support vector machines can be found be solving the dual problem, in which the weights are written as

 ${\bf{w}} = \sum\limits_i {{\alpha _i}} {y_i}{{\bf{x}}_i}$ [33]

where the sum is over all elements in the training set. The vector of coefficients ${\bf{\alpha }}$ is given by

 $\mathop {\arg \max }\limits_{\bf{\alpha }} \left( {\sum\limits_i {{\alpha _i}} - \sum\limits_i {\sum\limits_j {{\alpha _i}{\alpha _j}{y_i}{y_j}\left( {{{\bf{x}}_i} \cdot {{\bf{x}}_j}} \right)} } } \right)$ [34]

subject to the constraints

 $\begin{array}{l} 0 \le {\alpha _i} \le C,\\ \sum\limits_i {{\alpha _i}{y_i}} = 0 \end{array}$ [35]

where each sum is over all elements in the training set. The dual problem formulation makes it possible to use support vector machines to classify data that are not linearly separable using the kernel trick, in which the input variables are transformed in a way that makes the data linearly separable.

Linear least squares and support vector machines are popular due to their speed and simplicity, which come from the underlying assumption that the solution must be a linear function of the input variables. However in many cases this assumption is unrealistic. For example, consider the classification problem shown in Figure 4a. There is no linear plane that will separate the data, which means that a linear support vector machine will fail to find a good solution. However it is possible to transform the original one-dimensional input value $\left( {{x_1}} \right)$ to a two-dimensional feature space, $\left( {{x_1},{x_1}^2} \right)$, in which linear separation is possible (Figure 4b). Such transformations of the input variables are known as feature maps.

Figure 4 a) A one-dimensional data set that is not linearly separable. b) The same data set mapped to two dimensions using a feature map that makes it linearly separable. The dashed line is the separating plane.

Feature maps can be used in machine learning algorithms by simply substituting the transformed input variables, $\varphi \left( {\bf{x}} \right)$, for the original input variables ${\bf{x}}$. However in some cases, such as when the feature space has an infinite number of dimensions, it may not be practical to explicitly make such a transformation. Alternatively, many algorithms, including support vector machines (equation [34]) and linear least squares, can be expressed in a way that depends only on the input variables through inner products such as ${{\bf{x}}_1} \cdot {{\bf{x}}_2}$. In such cases, it is sufficient to find a kernel $k\left( {{{\bf{x}}_i},{{\bf{x}}_j}} \right)$ that returns the dot product between $\varphi \left( {{{\bf{x}}_i}} \right)$ and $\varphi \left( {{{\bf{x}}_j}} \right)$:

 $k\left( {{{\bf{x}}_i},{{\bf{x}}_j}} \right) \equiv \varphi \left( {{{\bf{x}}_i}} \right) \cdot \varphi \left( {{{\bf{x}}_j}} \right)$ [36]

To use a learning algorithm in a given feature space, all that needs to be done is to use the kernel $k\left( {{{\bf{x}}_i},{{\bf{x}}_j}} \right)$ in place of the inner product ${{\bf{x}}_i} \cdot {{\bf{x}}_j}$ throughout the learning algorithm. This is known as the kernel trick.44

The direct use of the kernel function saves the trouble of having to apply a feature map and calculate the inner products explicitly. When using the kernel trick, it is not even necessary to know which feature maps produce the kernel -- any symmetric, continuous, positive definite kernel can be used, as every such kernel corresponds to an inner product in some feature space.45 (A positive definite kernel is defined as a kernel for which $\sum\limits_{i = 1}^n {\sum\limits_{j = 1}^n {{c_i}{c_j}k({{\bf{x}}_{\bf{i}}},{{\bf{x}}_{\bf{j}}})} } \ge 0$ for any real-valued $\left\{ {{c_1},\;...\;,{c_n}} \right\}$and $\left\{ {{{\bf{x}}_1},\;...\;,{{\bf{x}}_n}} \right\}$.) Such kernels are known as Mercer kernels or reproducing kernels. Some examples of reproducing kernels are shown in Table I.

 Name $k({{\bf{x}}_i},{{\bf{x}}_j})$ Linear ${{\bf{x}}_i} \cdot {{\bf{x}}_j} + c$ Polynomial kernel with degree $d$ ${\left( {{c_1}\left( {{{\bf{x}}_i} \cdot {{\bf{x}}_j}} \right) + {c_2}} \right)^d}$ Gaussian ${e^{\frac{{ - {{\left\| {{{\bf{x}}_i} - {{\bf{x}}_j}} \right\|}^2}}}{{2{\sigma ^2}}}}}$ Laplacian ${e^{\frac{{ - \left\| {{{\bf{x}}_i} - {{\bf{x}}_j}} \right\|}}{\sigma }}}$

Table I Examples of symmetric, continuous, positive definite kernels. The scalar parameters $c$, ${c_1}$, ${c_2}$, and $\sigma$ are all adjustable, with the constraints that $\sigma$ and ${c_1}$ are positive and $d$ is a positive integer.

For a reproducing kernel$k\left( {{{\bf{x}}_1},{{\bf{x}}_2}} \right)$, consider the space of functions $f$ that are given by

 $f\left( {\bf{x}} \right) \equiv \sum\limits_j {{\alpha _j}k\left( {{{\bf{x}}_j},{\bf{x}}} \right)}$ [37]

where ${\alpha _j}$ are scalar coefficients and the sum may contain any number of terms. The inner product ${\left\langle {} \right\rangle _H}$ on this space is defined such that:

 ${\left\langle {k\left( {{{\bf{x}}_i},{\bf{x}}} \right),k\left( {{{\bf{x}}_j},{\bf{x}}} \right)} \right\rangle _H} = k\left( {{{\bf{x}}_i},{{\bf{x}}_j}} \right)$ [38]

This function space is known as a reproducing kernel Hilbert space (RKHS) 46. Each RKHS has a norm, ${\left\| f \right\|_H}$, which is defined as

 ${\left\| f \right\|_H} \equiv \sqrt {{{\left\langle {f,f} \right\rangle }_H}}$ [39]

The norm of the RKHS depends on the underlying kernel. For example, the norm of the linear RKHS is given by

 $\left\| f \right\|_{_H}^2 = {w^2}$ [40]

where $f\left( x \right) = wx$. Thus in a linear RKHS, functions with a steeper slope will have larger norms. The norm of the Gaussian RKHS is

 $\left\| f \right\|_{_H}^2 = \int\limits_{ - \infty }^\infty {F{{\left( \omega \right)}^2}{e^{\frac{{{\sigma ^2}{\omega ^2}}}{2}}}d\omega }$ [41]

where $F\left( \omega \right)$ is the Fourier transform of$f\left( x \right)$. Thus in the Gaussian RKHS, functions that fluctuate more rapidly will have larger norms.

Reproducing kernel Hilbert spaces have special properties that make them particularly useful for supervised machine learning. Consider a situation in which the hypothesis space is a RKHS. If a Gaussian kernel is used, for example, the hypothesis space would consist of linear combinations of Gaussian functions. Let the regularization term be given by $r\left( {{{\left\| f \right\|}_H}} \right)$, where $r$ is a monotonically increasing function. Let the empirical risk take the general form of $\sum\limits_i {L\left( {{y_i},f\left( {{{\bf{x}}_i}} \right)} \right)}$, where the sum is over all data points in the training set. The objective function is therefore

 $\sum\limits_i {L\left( {{y_i},f\left( {{{\bf{x}}_i}} \right)} \right)} + r\left( {{{\left\| {\Delta f} \right\|}_H}} \right)$ [42]

The representer theorem then states that the function that minimizes this objective function must take the form

 $\hat f\left( {\bf{x}} \right) = \sum\limits_i {{c_i}k\left( {{{\bf{x}}_{\bf{i}}},{\bf{x}}} \right)}$ [43]

where the sum is over all elements of the training set.47, 48 Thus the problem of finding $\hat f$ is reduced to the problem of finding the coefficients ${c_i}$ that minimize the objective function.

If a squared loss function is used and $r\left( {{{\left\| {\Delta f} \right\|}_H}} \right) = \lambda \left\| {\Delta f} \right\|_H^2$ for some positive scalar $\lambda$, then it is straightforward to show that the objective function can be written as:

 $\frac{1}{2}{\left\| {{\bf{y - Kc}}} \right\|^2} + \lambda {{\bf{c}}^{\bf{T}}}{\bf{Kc}}$ [44]

where ${\bf{K}}$ is a matrix in which ${K_{ij}} = k\left( {{{\bf{x}}_{\bf{i}}},{{\bf{x}}_{\bf{j}}}} \right)$ for some ${{\bf{x}}_{\bf{i}}}$ and ${{\bf{x}}_{\bf{j}}}$ in the training data, and ${\bf{c}}$ is a column vector in which the ${i^{th}}$ element is ${c_i}$. The unique solution to this problem is

 ${\bf{\hat c}} = {\left( {{\bf{K}} + 2\lambda {\bf{I}}} \right)^{ - 1}}$ [45]

This approach is known as kernel ridge regression.49 It is similar to regularized linear least squares (equation [27]) with two major differences. The first is that it is no longer necessary to work in a hypothesis space of linear functions – this solution holds for any RKHS. The second is that the number of rows and columns in the matrix to be inverted is now equal to the number of elements in the training set, rather than the number of input variables. Thus although the calculations may take longer in situations in which there are a lot of training data, this approach can be used for a much wider variety of hypothesis spaces.

The flexibility and simplicity of kernel ridge regression has made it a popular tool in materials science. It has been used for a variety of ends, including the prediction of material properties from descriptors, development of model Hamiltonians, and generation of density functionals.

An intuitive and effective approach to machine learning is to mimic the biological brain. This is the idea behind a class of machine learning algorithms known as neural networks.50 In a neural network, artificial neurons (a.k.a. nodes) are linked together in a way that resembles the connections between neurons in the brain (Figure 5). The input values (${\bf{x}}$) are passed directly into a set of neurons that comprise the first layer of the neural network, and these neurons use activation functions to calculate output values that are then used as input values by the next set of neurons. This process proceeds throughout the network until reaching a neuron that produce the final output value ($y$). Some networks may be constructed to output multiple values.

Figure 5 a) A perceptron b) A multi-layer neural network containing many perceptron-like nodes. Nodes representing input variables (${x_1}$, ${x_2}$, …) are grey, and nodes with activation functions (${a_1}$, ${a_2}$, …) are black.

The hypothesis space in a neural network is defined by the topology of the connections between nodes and the parameterized activation functions used by the nodes. One of the simplest neural networks, consisting only of a single node, is known as a perceptron (Figure 5a).51 The activation function of a perceptron compares a weighted sum of the input values to a threshold value. If the weighted sum is larger than the threshold value, the perceptron produces “1” as the output value. If it is lower, the perceptron produces “0” as the output value. Mathematically, we write this activation function as:

 $a\left( {\bf{x}} \right) = \left\{ {\begin{array}{*{20}{c}} 1 & {{\rm{if\hspace{1em} }}{\bf{w}} \cdot {\bf{x}} > b}\\ 0 & {{\rm{if\hspace{1em} }}{\bf{w}} \cdot {\bf{x}} \le b} \end{array}} \right.$ [46]

where $b$ is the threshold value and ${\bf{w}}$ is the vector of weights (Figure 6a). The perceptron is a linear classifier that is similar to a support vector machine, where the plane that separates two classes is determined by ${\bf{w}}$. The weights in a perceptron are typically optimized using a gradient descent algorithm to minimize the squared loss.

A perceptron is an example of a feed-forward network, in which there are no loops. More complicated feed-forward neural networks can be created by combining multiple perceptron-like nodes in a multi-layer network (Figure 5b). Such networks are sometimes referred to as multi-layer perceptrons. The optimization of the weights of a multi-layer perceptron is more complicated than that of single node, but it can be accomplished efficiently using a backpropagation algorithm that effectively minimizes the squared loss using gradient descent.52 However the backpropagation algorithm requires that the activation function is differentiable, which is not the case when using the function in equation [46]. To avoid this problem, in multi-layer neural networks the discontinuous step function used in equation [46] is replaced by a continuous sigmoid function such as a logistic function (Figure 6b). The use of continuous activation functions results in neural networks that can output a continuous range of output values, making neural networks valuable tools for regression as well as classification.53, 54

Figure 6 a) The discontinuous activation function in equation [46] b) A sigmoid activation function.

Multilayer perceptrons with sigmoid activation functions parameterized using backpropagation have been widely and successfully used for classification and regression. However there are a variety of alternatives to this approach. One alternative is recurrent neural networks, in which loops are allowed in the network, enabling the network to model dynamic processes. Different activation functions and optimization algorithms have also been developed to improve the performance of neural networks. A more extensive discussion of the different types of neural networks can be found in references 55 and 56. Across all types of neural networks, regularization is typically done by penalizing the complexity of the network as measured by factors such as the number of nodes in the network and the norm of the network weights.

Neural networks have had a long history of success in materials science and engineering, especially in the development of accurate interatomic potentials, and in the mapping of complex materials behavior (flow stress, fatigue behavior, microstructure, etc.) to materials processing parameters (heat treatment, deformation, cold working, etc.).

Decision trees are among the oldest approaches to machine learning, particularly for classification problems. Historically they have been among the most widely studied machine learning methods, and more comprehensive reviews of decision trees can be found in references 57-60. Within materials science, decision trees have been recently used to predict tribological properties (specifically, the coefficient of friction) of various materials based on easily accessible properties (or descriptors) of the materials and their constituents (e.g., melting point, Madelung constant, density, etc.).61 They have also been used to classify zeolite structures based on topological descriptors.62 Here we provide a brief overview of the idea behind decision trees and some common implementations.

Decision trees are similar to neural networks in that the function is represented as a network of connected nodes. However in a decision tree, the network takes a hierarchical tree-like structure, in which each node may only have a single parent node (Figure 7). The evaluation of the function starts at the top-most parent node, known as the root node, and proceeds down through the tree until reaching a node with no children, known as the leaf node. At each node along the way there are multiple possible branches, each of which leads to a different child node. The choice of which branch to follow at each node is determined by value of one of the input variables. Thus the set of all input values determines the path through the tree, and the output value is determined by the leaf node that is reached at the end of the path. Decision trees are commonly used for classification, in which each leaf node corresponds to a different class. However they may also be used for regression, in which each leaf node corresponds to a different numerical value.

Figure 7 An example of a simple decision tree. The leaf nodes are in black.

There will generally be many different decision trees that are capable of reproducing the training data. The most efficient trees will be those that have, on average, the shortest path between the root and leaf nodes. Unfortunately the problem of finding such trees is NP-complete,63 meaning that it is unlikely that the globally optimal solution can be found using an algorithm that scales as a polynomial of the number of possible output values. Given the computational cost of finding a globally optimal solution, decision trees are commonly constructed using a greedy algorithm that finds a solution by making a locally optimal choice at each node. Some of the most successful algorithms are based on an approach known as Top-Down Induction of Decision Trees (TDIDT).64 In the TDIDT approach the tree is recursively built from the training data, starting at the root node. At each node an input variable is selected, and child nodes are created for each possible value of that variable. The training data are then divided into subsets based on the value of the input variable and passed to the appropriate child nodes. The process is then repeated at each child node, with each of the subsets divided and passed on to the next level of child nodes. If any of the subsets are empty, then a child node is not created, and if only one child node is created then the node becomes a leaf node. A common constraint is that there can be no path through the tree in which the same variable is evaluated more than once.

The TDIDT approach provides a general framework for creating decision trees, but it does not specify the order in which the input variables should be selected. A common approach for classification problems is to choose the input variable that produces subsets with the lowest average information entropy, where the information entropy for a subset is given by:

 $- N\sum\limits_i {{p_i}\ln {p_i}}$ [47]

In equation [47] the sum is over all classes in the subset, $N$ is the total number of elements in the subset, and ${p_i}$ is the fraction of elements in the subset that belong to the ${i^{th}}$ class. Information entropy is minimized for subsets that contain only one class and maximized for subsets that contain equal numbers of every class, and as a result this approach facilitates the rapid subdivision of the training data into subsets that are pure. A similar approach is to use the Gini Index in place of the information entropy,65 where the Gini Index is defined as:

 $N\left( {1 - \sum\limits_i {{p_i}^2} } \right)$ [48]

When using a decision tree for regression there is a continuous range of allowed output values, and it is common for every element in the training set to have a different output value. The objective of the decision tree is to provide an estimate of the output value that is close to the true output value. At each node in the tree, an estimate of the output value can be calculated based on the average of all output values in the leaf nodes below that node (Figure 8). The tree can then be constructed in a way that attempts to minimize the average estimation error along the path. One way to accomplish this within the TDIDT framework is to choose at each node the attribute that minimizes the average variance in output values within each of the subsets.65, 66

Figure 8 An illustration of how the output value could be estimated at each node in a regression decision tree trained on four output values.

There are two general strategies that can be used to improve the predictive accuracy of decision trees. The first is a regularization approach, in which the complexity of the tree is reduced by pruning, or removing, branches of the tree.65, 67, 68 The pruned branches are typically those that were poorly represented in the training data and / or are many layers from the root node. For regression, the pruned branches can be replaced by a function (e.g., using linear regression) that estimates the output values for the remaining set of input values. Resampling methods (described in the section on “Estimating the Prediction Error”) may be used to determine the optimal degree of pruning. An alternative approach to improving the prediction power is to generate ensembles of decision trees. For example, in the random forest approach, an ensemble of decision trees is created by introducing a stochastic element into the algorithm for constructing trees.69 The prediction is then based on the mode (for classification) or average (for regression) of predictions made by the members of the ensemble. Variations of this approach, known as ensemble learning, are described in the next section.

In ensemble learning, the function $\hat f$ is created by combining the outputs from an ensemble of different learning algorithms (e.g., by taking the average). This approach can be a simple and effective way to increase the size of the hypothesis space, and it often results in a function with greater predictive accuracy than any of the individual algorithms in the ensemble. It is particularly popular for classification problems.

Various approaches to ensemble learning are well-reviewed in references 70-72. Two of the most common classes of algorithms are bagging algorithms,73 in which the members of the ensemble are created by randomly resampling the training data, and boosting algorithms,74, 75 in which the training data is resampled in a way that assigns extra weight to input values for which the losses (described in the section on “The Empirical Risk”) are particularly high. The performance of some boosting algorithms can be adversely sensitive to random classification noise (i.e., erroneous classifications) in the training data.76

It is often desirable to express a function symbolically, as a simple combination of variables and basic mathematical operators. For example, the simple formula $F = ma$ is generally preferable to expressing force as a linear combination of many basis functions. The field of identifying such simple formulas that best predict the output values is known as symbolic regression. In symbolic regression the hypothesis space contains all symbolic formulas that combine the input variables with a set of mathematical operators, and the space is typically regularized in a way that favors simple formulas (e.g., formulas that contain fewer variables or operators). Functions are typically represented in a tree-like structure,77 where input variables and constant values are represented by the leaf nodes and the remaining nodes are mathematical operators or simple functions (e.g., trigonometric functions). An example of such a diagram is given in Figure 9.

Figure 9 A tree-like representation of the expression $1.8 + \frac{2}{x}$.

As with other supervised learning methods, symbolic regression is typically accomplished by searching through hypothesis space to find the formula that minimizes an objective function. A common way to do this is through genetic programming,78, 79 in which the search for the optimal function is performed using a genetic algorithm. In genetic programming a population of candidate formulas evolves in a way that mimics the process of natural selection, favoring the formulas that are most fit, i.e., formulas that produce the lowest values for the objective function. In a typical implementation, an initial population of candidate functions is created, and those that are least fit are discarded. The remaining functions are used to generate a new generation of functions by using crossover operations, in which features of the functions are combined to create “children”, and mutation operations, in which a feature of the function is randomly changed. Examples of crossover and mutation are shown in Figure 10. Repetition of this process results in a series of increasingly fit generations of functions.

Figure 10 a) An example of a crossover operation b) An example of a mutation operation.

The genetic programming algorithm will generate a set of candidate functions with varying degrees of fitness and complexity. The complexity of the function may be measured by factors such as the number of nodes in the tree-like representation of the function, and it is common for complexity to be measured in a way that penalizes advanced operators such as acos more than simple operators like addition. Although it is possible to use a complexity-dependent regularization term in the objective function,78, 80 an alternative approach is to generate a Pareto frontier of the functions with respect to fitness and complexity.81 The Pareto frontier is defined as the set of all candidate functions for which there is no other known function that is both more fit and less complex. The Pareto frontier approach enables a transparent evaluation of the fitness-complexity trade-off in the population of functions generated by the genetic programming algorithm (Figure 11).

Figure 11 The Pareto frontier for a set of functions with different levels of fitness and complexity. Each dot represents a function, and the red dots (lower left dots connected by dashed line) are the ones on the Pareto Frontier.

In materials science and engineering, genetic programming has been used by several researchers to develop predictive models of the properties of cement and concrete,82-84 asphalt,85 and the effects of processing parameters on metal alloys.86-88 It has also recently been applied at the atomic scale, to determine the structural features of hydrogenated amorphous silicon that most strongly influence hole trap depths.89 A thorough review of genetic programming can be found in reference 90.

There are many comprehensive resources available for readers interested in learning more about machine learning algorithms and analysis. Two in particular are:

1. T Hastie, J Friedman, and R Tibshirani, The Elements of Statistical Learning, Springer, 2009.
2. CM Bishop, Pattern Recognition and Machine Learning, Springer, 2007.

Open source machine learning packages exist for all the algorithms described here. The open-source Weka software library is a collection of machine learning algorithms implemented in Java. The Python package scikit-learn has a library of methods with appropriate documentation. The commercial product MATLAB also has a library of methods as part of the toolboxes such as Statistics, Optimization, Neural Networks, Signal Processing, Image Processing, and Computer Vision. These methods come with extensive documentation including examples. For methods specific to image and video processing, the package OpenCV is extensive but may require background knowledge in machine learning. The commercial software package Eureqa provides a straightforward interface for genetic programming.