In terms of the Bias/Variance decomposition, very flexible (i.e., complex) Supervised Machine Learning systems may lead to unbiased estimators but with high variance. A rigid model, in contrast, may lead to small variance but high bias. There is a trade-off between the bias and variance contributions to the error, and the optimal performance is achieved at the point that balances them.
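For reference, the standard form of the decomposition for the quadratic loss can be written as follows. The notation is an assumption introduced here, not taken from the text: the target is y = f(x) + ε with zero-mean noise of variance σ², and the expectations are taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
```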
In this work we present three schemes related to the control of the Bias/Variance decomposition for Feed-forward Neural Networks (FNNs) with the (sometimes modified) quadratic loss function:
1. An algorithm for sequential approximation with FNNs, named Sequential Approximation with Optimal Coefficients and Interacting Frequencies (SAOCIF). Most of the sequential approximations proposed in the literature select the new frequencies (the non-linear weights) guided by the approximation of the residue of the partial approximation. We propose a sequential algorithm where the new frequency is selected taking into account its interactions with the previously selected ones. The interactions are discovered by means of their optimal coefficients (the linear weights). A number of heuristics can be used to select the new frequencies. The aim is to achieve the same level of approximation with fewer hidden units than when the new frequency is chosen only to match the residue as well as possible. In terms of the Bias/Variance decomposition, it will be possible to obtain simpler models with the same bias. The idea behind SAOCIF can be extended to approximation in Hilbert spaces, maintaining orthogonal-like properties. In this case, the importance of the interacting frequencies lies in the expectation of increasing the rate of approximation. Experimental results show that the idea of interacting frequencies allows better approximations to be constructed than matching the residue does.
2. A study and comparison of different criteria to perform Feature Selection (FS) with Multi-Layer Perceptrons (MLPs) and the Sequential Backward Selection (SBS) procedure within the wrapper approach. FS procedures control the Bias/Variance decomposition by means of the input dimension, establishing a clear connection with the curse of dimensionality. Several critical decision points are studied and compared. First, the stopping criterion. Second, the data set where the value of the loss function is measured. Finally, we also compare two ways of computing the saliency (i.e., the relative importance) of a feature: either first train a network and then temporarily remove each feature in turn, or train a different network with each feature temporarily removed. The experiments are performed for linear and non-linear models. Experimental results suggest that the increase in computational cost associated with retraining a different network with each feature temporarily removed before computing the saliency can be rewarded with a significant performance improvement, especially if non-linear models are used. Although this idea may seem very intuitive, it has hardly been used in practice. Regarding the data set where the value of the loss function is measured, it seems clear that the SBS procedure for MLPs benefits from measuring the loss function on a validation set. A somewhat non-intuitive conclusion is drawn regarding the stopping criterion, where it can be seen that forcing overtraining may be as useful as early stopping.
3. A modification of the quadratic loss function for classification problems, inspired by Support Vector Machines (SVMs) and the AdaBoost algorithm, named Weighted Quadratic Loss (WQL) function. The modification consists of weighting the contribution of every example to the total error. In the linearly separable case, the solution of the hard margin SVM also minimizes the proposed loss function. The hardness of the resulting solution can be controlled, as in SVMs, so that this scheme may also be used for the non-linearly separable case. The error weighting proposed in WQL forces the training procedure to pay more attention to the points with a smaller margin. Therefore, the scheme tries to control variance by not attempting to overfit the points that are already well classified. The model shares several properties with the SVM framework, with some additional advantages. On the one hand, the final solution is neither restricted to have an architecture with as many hidden units as points (or support vectors) in the data set nor to use kernel functions. The frequencies are not restricted to be a subset of the data set. On the other hand, it allows multiclass and multilabel problems to be handled in a natural way. Experimental results are shown confirming these claims.
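For the first scheme, the idea of selecting a new frequency through the optimal coefficients of all hidden units can be sketched as below. This is a minimal illustration and not the SAOCIF algorithm itself. The function names, the tanh activation, and the random-candidate heuristic are assumptions made here. For each candidate frequency, the linear weights of every unit (old and new) are recomputed by least squares, and the candidate giving the smallest total quadratic error is kept.

```python
import numpy as np

def hidden_unit(X, w, b):
    """Output of one tanh hidden unit with frequency (w, b)."""
    return np.tanh(X @ w + b)

def sequential_fit(X, y, n_hidden=5, n_candidates=50, seed=0):
    """Greedy sequential approximation (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    H = np.ones((n, 1))                 # constant column (output bias)
    coef = np.array([y.mean()])         # optimal coefficient with no hidden units
    frequencies, errors = [], []
    for _ in range(n_hidden):
        best = None
        for _ in range(n_candidates):
            # Assumed heuristic: draw candidate frequencies at random.
            w, b = rng.normal(size=d), rng.normal()
            H_try = np.hstack([H, hidden_unit(X, w, b)[:, None]])
            # Optimal coefficients of ALL units, old and new, for this candidate.
            c, *_ = np.linalg.lstsq(H_try, y, rcond=None)
            err = float(np.sum((H_try @ c - y) ** 2))
            if best is None or err < best[0]:
                best = (err, w, b, H_try, c)
        err, w, b, H, coef = best       # keep the best interacting frequency
        frequencies.append((w, b))
        errors.append(err)
    return frequencies, coef, errors
```

A residue-matching variant would instead fit each candidate to `y - H @ coef` alone and leave the earlier coefficients untouched. The sketch differs only in refitting all coefficients before comparing candidates.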
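For the second scheme, the saliency computation that retrains a model with each feature removed can be sketched as follows. This is a hedged illustration. A linear least-squares model stands in for the MLP to keep the example short, the loss is measured on a validation set, and the function names and the fixed `n_keep` stopping rule are assumptions introduced here.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear model with a bias term (stand-in for an MLP)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def quadratic_loss(coef, X, y):
    """Mean quadratic loss of the linear model on (X, y)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean((A @ coef - y) ** 2))

def sbs_with_retraining(X_tr, y_tr, X_val, y_val, n_keep):
    """Sequential Backward Selection, retraining once per candidate removal."""
    selected = list(range(X_tr.shape[1]))
    history = []
    while len(selected) > n_keep:
        losses = {}
        for f in selected:
            subset = [g for g in selected if g != f]
            # Retrain a separate model with feature f removed.
            coef = fit_linear(X_tr[:, subset], y_tr)
            # Saliency of f: validation loss obtained without it.
            losses[f] = quadratic_loss(coef, X_val[:, subset], y_val)
        # Drop the feature whose removal hurts the validation loss least.
        least_salient = min(losses, key=losses.get)
        selected.remove(least_salient)
        history.append((least_salient, losses[least_salient]))
    return selected, history
```

The cheaper alternative mentioned in the text would train a single model on the current feature set and evaluate it with each feature temporarily removed, which avoids the inner retraining step.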
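For the third scheme, the effect of weighting each example's quadratic error by its margin can be illustrated with the sketch below. The exact WQL weighting function is not given in this text, so the exponential weighting used here is only an assumption in the spirit of AdaBoost. The `hardness` parameter and the margin definition `target * output` for targets in {-1, +1} are likewise hypothetical.

```python
import numpy as np

def weighted_quadratic_loss(outputs, targets, hardness=1.0):
    """Quadratic loss with per-example weights that decrease with the margin.

    The weighting exp(-hardness * margin) is a hypothetical choice, not the
    WQL definition; hardness = 0 recovers the ordinary mean quadratic loss.
    """
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    margins = targets * outputs              # assumed margin for +/-1 targets
    weights = np.exp(-hardness * margins)    # small margin -> large weight
    weights = weights / weights.sum()        # normalise weights to sum to 1
    loss = float(np.sum(weights * (outputs - targets) ** 2))
    return loss, weights
```

With this choice, points that are already classified with a large margin contribute little to the total error, so training effort concentrates on the points near the decision boundary.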
Extensive experiments have been carried out with the proposed schemes, including artificial data sets, well-known benchmark data sets and two real-world problems from the Natural Language Processing domain. In addition to widely used activation functions, such as the hyperbolic tangent or the Gaussian function, other activation functions have been tested. In particular, sinusoidal MLPs behaved very well. The experimental results can be considered very satisfactory. The schemes presented in this work have been found to be very competitive when compared to other existing schemes described in the literature. In addition, they can be combined with one another, since they deal with complementary aspects of the whole learning process.