Predictive Modeling Using Logistic Regression Course Notes


This course covers advanced topics in predictive modeling using SAS Enterprise Miner. It continues the development of predictive models begun in earlier material and assumes that participants have completed a statistics course covering linear regression and logistic regression.

As the cutoff decreases, more and more cases are allocated to class 1; hence sensitivity increases and specificity decreases.

As the cutoff increases, more and more cases are allocated to class 0, hence the sensitivity decreases and specificity increases.

Consequently, the ROC curve passes through (0,0) and (1,1). If the posterior probabilities were arbitrarily assigned to the cases, then the ratio of false positives to true positives would be the same as the ratio of the total actual negatives to the total actual positives.

Consequently, the baseline random model is a 45° line through the origin. As the ROC curve bows above this diagonal, the predictive power increases. A perfect model would reach the point (0,1), where both sensitivity and specificity equal 1. The cumulative gains chart displays the positive predicted value and depth for a range of cutoff values.
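As an illustrative sketch (not part of the original notes), the following Python code computes sensitivity, specificity, and the corresponding ROC coordinates over a grid of cutoffs; `y` and `p_hat` are hypothetical arrays holding the binary target and the predicted posterior probabilities.

```python
import numpy as np

def roc_points(y, p_hat, cutoffs):
    """Sensitivity, specificity, and ROC coordinates for each cutoff."""
    y = np.asarray(y)
    p_hat = np.asarray(p_hat)
    points = []
    for c in cutoffs:
        pred = (p_hat >= c).astype(int)          # allocate to class 1 at or above the cutoff
        tp = np.sum((pred == 1) & (y == 1))
        tn = np.sum((pred == 0) & (y == 0))
        fp = np.sum((pred == 1) & (y == 0))
        fn = np.sum((pred == 0) & (y == 1))
        sens = tp / (tp + fn)                    # true positive rate
        spec = tn / (tn + fp)                    # true negative rate
        points.append((c, sens, spec, 1 - spec)) # ROC plots sensitivity vs. 1 - specificity
    return points

# Example: lower cutoffs raise sensitivity and lower specificity, and vice versa.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
p_hat = np.clip(0.5 * y + rng.normal(0.25, 0.2, 1000), 0, 1)
for c, se, sp, fpr in roc_points(y, p_hat, [0.1, 0.3, 0.5, 0.7, 0.9]):
    print(f"cutoff={c:.1f}  sensitivity={se:.2f}  specificity={sp:.2f}  1-spec={fpr:.2f}")
```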

As the cutoff increases, the depth decreases. If the posterior probabilities were arbitrarily assigned to the cases, then the gains chart would be a horizontal line at the overall proportion of events. The gains chart is widely used in database marketing to decide how deep into a database to go with a promotion. The simplest way to construct the curve is to sort and bin the predicted posterior probabilities (for example, into deciles). The gains chart is easily augmented with revenue and cost information.
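A rough sketch of how such a decile-based gains table could be computed, again with hypothetical `y` and `p_hat`; the binning convention here (ten equal-size bins sorted from highest to lowest predicted probability) is an assumption, not the notes' exact recipe.

```python
import numpy as np
import pandas as pd

def gains_table(y, p_hat, n_bins=10):
    """Sort by predicted probability, bin into deciles, and compute depth,
    positive predicted value (response rate), and cumulative lift."""
    df = pd.DataFrame({"y": y, "p": p_hat}).sort_values("p", ascending=False)
    df["bin"] = np.ceil(np.arange(1, len(df) + 1) / (len(df) / n_bins)).astype(int)
    overall_rate = df["y"].mean()
    rows = []
    for b in range(1, n_bins + 1):
        top = df[df["bin"] <= b]                 # everyone down to this depth
        depth = len(top) / len(df)
        ppv = top["y"].mean()                    # positive predicted value at this depth
        rows.append({"depth": depth, "ppv": ppv, "lift": ppv / overall_rate})
    return pd.DataFrame(rows)

# gains_table(y, p_hat) returns one row per decile, from the best-scored cases downward.
```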

A plot of sensitivity versus depth is sometimes called a Lorenz curve, concentration curve, or lift curve (although lift value is not explicitly displayed). This plot and the ROC curve are very similar because depth and 1 − specificity are monotonically related. If the proper adjustments were made when the model was fitted, then the predicted posterior probabilities are correct.

However, the confusion matrices would be incorrect with regard to the population because the event cases are over-represented. Sensitivity and specificity, however, are not affected by separate sampling because they do not depend on the proportion of each class in the sample.

For example, if the sample represented the population, then n1 cases are in class 1.

The proportion of those that were allocated to class 1 is Se; thus, there are n1·Se true positives. Knowledge of the population priors, sensitivity, and specificity is sufficient to fill in the confusion matrices.
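A minimal sketch of that reconstruction, assuming the class-1 prior `pi1` is known and `n` is the number of cases to which the matrix should be scaled:

```python
def population_confusion_matrix(pi1, sensitivity, specificity, n=1.0):
    """Reconstruct the population confusion matrix (scaled to n cases)
    from the class-1 prior and the sample-based sensitivity and specificity,
    which are unaffected by separate (over-) sampling."""
    pi0 = 1.0 - pi1
    tp = n * pi1 * sensitivity
    fn = n * pi1 * (1.0 - sensitivity)
    tn = n * pi0 * specificity
    fp = n * pi0 * (1.0 - specificity)
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}

# Example: a 2% event rate with Se = 0.70 and Sp = 0.80, per 10,000 cases.
print(population_confusion_matrix(0.02, 0.70, 0.80, n=10_000))
```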

Several additional statistics can be calculated in a DATA step. The selected cutoffs occur where the values of the estimated posterior probability change, provided the posterior probabilities exceed a minimum value. The ROC data set and the various plots can also be specified using the point-and-click interface.

To determine the optimal cutoff, a performance criterion needs to be defined. If the goal were to increase the sensitivity of the classifier, then the optimal classifier would allocate all cases to class 1. If the goal were to increase specificity, then the optimal classifier would be to allocate all cases to class 0.

For realistic data, there is a tradeoff between sensitivity and specificity. Higher cutoffs decrease sensitivity and increase specificity. Lower cutoffs decrease specificity and increase sensitivity. The decision-theoretic approach starts by assigning misclassification costs (losses) to each type of error: false positives and false negatives.

The optimal decision rule minimizes the total expected cost (risk).

The Bayes rule is the decision rule that minimizes the expected cost. In the two-class situation, the Bayes rule can be determined analytically. If you classify a case into class 1, then the expected cost is FPcost · (1 − p), where p is the true posterior probability that the case belongs to class 1; if you classify it into class 0, the expected cost is FNcost · p.
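Written out symbolically (a sketch using the FPcost and FNcost notation above):

```latex
% Classify a case into class 1 when the expected cost of doing so is smaller:
\mathrm{FP_{cost}}\,(1 - p) \;<\; \mathrm{FN_{cost}}\,p
\quad\Longleftrightarrow\quad
p \;>\; p^{*} \;=\; \frac{\mathrm{FP_{cost}}}{\mathrm{FP_{cost}} + \mathrm{FN_{cost}}}
\;=\; \frac{1}{1 + \mathrm{FN_{cost}}/\mathrm{FP_{cost}}}
% With equal costs, p* = 1/2, and only the cost ratio matters.
```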

Solving for p gives the optimal cutoff probability. Since p must be estimated from the data, the plug-in Bayes rule is used in practice. Note that the Bayes rule depends only on the ratio of the costs, not on their actual values. If the misclassification costs are equal, then the Bayes rule corresponds to a cutoff of 0.5. Hand commented that "the use of error rate often suggests insufficiently careful thought about the real objectives."

When the target event is rare, the cost of a false negative is usually greater than the cost of a false positive. The cost of not soliciting a responder is greater than the cost of sending a promotion to someone who does not respond. The cost of accepting an applicant who will default is greater than the cost of rejecting someone who would pay-off the loan.

The cost of approving a fraudulent transaction is greater than the cost of denying a legitimate one. Such considerations dictate cutoffs that are less (often much less) than 0.5. Examining the performance of a classifier over a range of cost ratios can be useful.

The central cutoff, corresponding to a cost ratio of 1, tends to maximize the mean of sensitivity and specificity. Because increasing sensitivity usually corresponds to decreasing specificity, the central cutoff tends to equalize sensitivity and specificity. Statistics that summarize the performance of a classifier across a range of cutoffs can also be useful for assessing global discriminatory power. One approach is to measure the separation between the predicted posterior probabilities for each class.

The more the distributions overlap, the weaker the model. The simplest statistics are based on the difference between the means of the two distributions. In credit scoring, the divergence statistic is a scaled difference between the means (Nelson). Hand discusses several summary measures based on the difference between the means. The well-known t-test for comparing two distributions is based on the difference between the means. The t-test has many optimal properties when the two distributions are symmetric with equal variance and have light tails.

However, the distributions of the predicted posterior probabilities are typically asymmetric with unequal variance. Many other two-sample tests have been devised for non-normal distributions (Conover). One such test is the two-sample Kolmogorov-Smirnov (K-S) test. Its test statistic, D, is the maximum vertical difference between the two empirical cumulative distribution functions. If D equals zero, the distributions are everywhere identical.

The maximum value of the K-S statistic, 1, occurs when the distributions are perfectly separated. Use of the K-S statistic for comparing predictive models is popular in database marketing.
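An illustrative computation of D from hypothetical `y` and `p_hat` arrays (scipy's `ks_2samp` returns the same statistic):

```python
import numpy as np

def ks_statistic(y, p_hat):
    """Two-sample Kolmogorov-Smirnov D: the maximum vertical distance between
    the empirical CDFs of the predicted probabilities for events and non-events."""
    y = np.asarray(y)
    p_hat = np.asarray(p_hat)
    cuts = np.unique(p_hat)
    cdf_event = np.array([np.mean(p_hat[y == 1] <= c) for c in cuts])
    cdf_nonevent = np.array([np.mean(p_hat[y == 0] <= c) for c in cuts])
    return np.max(np.abs(cdf_event - cdf_nonevent))

# scipy.stats.ks_2samp(p_hat[y == 1], p_hat[y == 0]) gives the same D (plus a p-value).
```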

An oversampled validation data set does not affect D because the empirical distribution function is unchanged if each case represents more than one case in the population.

In the predictive modeling context, it could be argued that location differences are paramount. Because of its generality, the K-S test is not particularly powerful at detecting location differences. The most powerful nonparametric two-sample test is the Wilcoxon-Mann-Whitney test. The Wilcoxon version of this popular two-sample test is based on the ranks of the data. In the predictive modeling context, the predicted posterior probabilities would be ranked from smallest to largest. The test statistic is based on the sum of the ranks in the classes.

A perfect ROC curve would be a horizontal line at one; that is, sensitivity and specificity would both equal one for all cutoffs. In this case, the c statistic would equal one. The c statistic technically ranges from zero to one, but in practice it should not get much lower than 0.5. A perfectly random model, where the posterior probabilities were assigned arbitrarily, would give a straight 45° ROC line through the origin; hence, it would give a c statistic of 0.5.


Oversampling does not affect the area under the ROC curve because sensitivity and specificity are unaffected. The area under the ROC curve is also closely related to the Gini coefficient (Gini = 2 × AUC − 1), which is used to summarize the performance of a Lorenz curve (Hand). The results of the Wilcoxon test can be used to compute the c statistic.
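A sketch of that computation, using average ranks so that ties are handled; `y` and `p_hat` are hypothetical arrays as before:

```python
import numpy as np
from scipy.stats import rankdata

def c_statistic(y, p_hat):
    """c statistic (area under the ROC curve) computed from the Wilcoxon rank sum
    of the predicted probabilities of the events."""
    y = np.asarray(y)
    ranks = rankdata(p_hat)                      # average ranks handle ties
    n1 = np.sum(y == 1)
    n0 = np.sum(y == 0)
    rank_sum_events = ranks[y == 1].sum()
    u = rank_sum_events - n1 * (n1 + 1) / 2      # Mann-Whitney U for the events
    return u / (n1 * n0)

# Common credit-scoring convention: gini = 2 * c_statistic(y, p_hat) - 1
```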

To correct for the optimistic bias, a common strategy is to hold out a portion of the development data for assessment. Statistics that measure the predictive accuracy of the model include sensitivity and positive predicted value. Graphics such as the ROC curve, the gains chart, and the lift chart can also be used to assess the performance of the model.

If the assessment data set was obtained by splitting oversampled data, then the assessment data set needs to be adjusted. This can be accomplished by using the sensitivity, the specificity, and the prior probabilities. In predictive modeling, the ultimate use of logistic regression is to allocate cases to classes. To determine the optimal cutoff probability, the plug-in Bayes rule can be used. The information you need is the ratio of the cost of false negatives to the cost of false positives.

This optimal cutoff will minimize the total expected cost. For example, the cost of not soliciting a responder is greater than the cost of sending a promotion to someone who does not respond.

Such considerations dictate cutoffs that are usually much less than 0.5. A popular statistic that summarizes the performance of a model across a range of cutoffs is the Kolmogorov-Smirnov statistic. However, this statistic is not as powerful at detecting location differences as the Wilcoxon-Mann-Whitney test.

Thus, the c statistic should be used to assess the performance of a model across a range of cutoffs. Decision trees (Breiman et al.) recursively partition the input space in order to isolate regions where the class composition is homogeneous.

Trees are flexible, interpretable, and scalable. When the target is binary, ordinary scatter plots of the target against an input are not very enlightening. A useful plot for detecting nonlinear relationships is a plot of empirical logits. A simple, scalable, and robust smoothing method is to plot empirical logits for quantiles of the input variables.

These logits use a minimax estimate of the proportion of events in each bin (Duffy and Santner). This eliminates the problem caused by zero counts and reduces variability. The number of bins determines the amount of smoothing: the fewer the bins, the more smoothing. One large bin would give a constant logit. For very large data sets and interval-scaled inputs, a fairly large number of bins often works well. If the standard logistic model were true, then the plots should be linear.

Sample variability can cause apparent deviations, particularly when the bin size is too small. However, serious nonlinearities, such as nonmonotonicity, are usually easy to detect. The bins will be equal size quantiles except when the number of tied values exceeds the bin size, in which case the bin will be enlarged to contain all the tied values. The empirical logits are plotted against the mean of the input variable in each bin.
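A rough sketch of how such binned empirical logits could be computed; the additive constant `eps` here is a simple stand-in for the Duffy-Santner minimax estimate used in the notes, and the default bin count is only a placeholder:

```python
import numpy as np
import pandas as pd

def empirical_logits(x, y, n_bins=10, eps=0.5):
    """Bin an interval input into (roughly) equal-size quantile bins and compute a
    smoothed empirical logit per bin; eps avoids zero counts (the notes use the
    Duffy-Santner minimax estimate, which differs slightly). More bins mean less
    smoothing."""
    df = pd.DataFrame({"x": x, "y": y})
    df["bin"] = pd.qcut(df["x"], q=n_bins, duplicates="drop")  # tied values enlarge bins
    grp = df.groupby("bin", observed=True)
    events = grp["y"].sum()
    n = grp["y"].count()
    elogit = np.log((events + eps) / (n - events + eps))
    return pd.DataFrame({"mean_x": grp["x"].mean(), "elogit": elogit})
```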

The mean of the input variable in each bin needs to be computed as well (and stored in the BINS data set). There are four ways to deal with nonlinearity: (1) hand-crafted new input variables, (2) polynomial models, (3) flexible multivariate function estimators, or (4) do nothing. Skilled and patient data analysts can accommodate nonlinearities in a logistic regression model by transforming or discretizing the input variables, but this can become impractical with high-dimensional data and increases the risk of overfitting. Flexible multivariate function estimators include classification trees, generalized additive models, projection pursuit, multivariate adaptive regression splines, radial basis function networks, and multilayer perceptrons.

Standard linear logistic regression can produce powerful and useful classifiers even when the estimates of the posterior probabilities are poor. Often more flexible approaches do not show enough improvement to warrant the effort. A linear model (first-degree polynomial) is planar.

A quadratic model (second-degree polynomial) is a paraboloid or a saddle surface. Squared terms allow the effect of the inputs to be nonmonotonic. Moreover, the effect of one variable can change linearly across the levels of the other variables. Cubic models (third-degree polynomials) allow for a minimum and a maximum in each dimension. Moreover, the effect of one variable can change quadratically across the levels of the other variables.

Higher-degree polynomials allow even more flexibility. A polynomial model can be specified by adding new terms to the model.

The new terms are products of the original variables.


A full polynomial model of degree d includes terms for all possible products involving up to d factors. The squared terms allow the effect of the inputs to be nonmonotonic. The cross-product terms allow for interactions among the inputs. Cubic terms allow for even more flexibility. Polynomial models have drawbacks, however; chief among them is the curse of dimensionality. The number of terms in a d-degree polynomial increases at a rate proportional to the number of dimensions raised to the power d (for example, the number of terms in a full quadratic model increases at a quadratic rate with dimension).
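As a small illustration of the term growth (not the notes' SAS code), a degree-2 expansion of a hypothetical design matrix followed by a logistic fit:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Hypothetical design matrix X (n cases x k inputs) and binary target y.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] ** 2 - X[:, 0] * X[:, 2] + rng.normal(size=500) > 0).astype(int)

# Degree-2 expansion: original inputs, their squares, and all two-way cross products.
quad = PolynomialFeatures(degree=2, include_bias=False)
X_quad = quad.fit_transform(X)
print(X_quad.shape)          # the number of terms grows quadratically with dimension

model = LogisticRegression(max_iter=1000).fit(X_quad, y)
```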

For example, the influence that individual cases have on the predicted value at a particular point (the equivalent kernel) does not necessarily increase the closer they are to that point. Higher-degree polynomials are not reliable smoothers. One suggested approach is to include all the significant input variables and use forward selection to detect any two-factor interactions and squared terms.


This approach will miss interactions and squared terms involving non-significant input variables, but including all inputs would make the number of possible two-factor interactions and squared terms too large for the LOGISTIC procedure to assess. The c statistic increased. However, this model should be assessed using the validation data set because the inclusion of many higher-order terms may increase the risk of overfitting. The input layer contains input units, one for each input dimension.

In a feed-forward neural network, the input layer can be connected to a hidden layer containing hidden units (neurons). The hidden layer may be connected to other hidden layers, which are eventually connected to the output layer.

The hidden units transform the incoming inputs with an activation function. The output layer represents the expected target. The output activation function transforms its inputs so that the predictions are on an appropriate scale.

A neural network with a skip layer has the input layer connected directly to the output layer. Neural networks are universal approximators; that is, they can theoretically fit any nonlinear function (though not necessarily in practice). The price they pay for flexibility is incomprehensibility: they are a black box with regard to interpretation. A network diagram is a graphical representation of an underlying mathematical model. Many popular neural networks can be viewed as extensions of ordinary logistic regression models.

An MLP is a feed-forward network in which each hidden unit nonlinearly transforms a linear combination of the incoming inputs. The nonlinear transformation (activation function) is usually sigmoidal, such as the hyperbolic tangent function. When the target is binary, the appropriate output activation function is the logistic function (the inverse logit). A skip layer represents a linear combination of the inputs that is untransformed by hidden units. Consequently, an MLP with a skip layer for a binary target is a standard logistic regression model that includes new variables representing the hidden units.

These new variables model the higher-order effects not accounted for by the linear model. An MLP with no hidden layers is just the standard logistic regression model (a brain with no neurons). The parameters can be estimated in all the usual ways, including Bernoulli maximum likelihood. However, the numerical aspects can be considerably more challenging. The sigmoidal surfaces are themselves combined with a planar surface (the skip layer). This combination is transformed by the logistic function to give the estimate of the posterior probability.
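A minimal sketch of that forward pass in Python, with hypothetical weight matrices standing in for the fitted parameters:

```python
import numpy as np

def mlp_skip_forward(X, W_hidden, b_hidden, w_out, w_skip, b_out):
    """Forward pass of a multilayer perceptron with one hidden layer and a skip layer:
    hidden units apply tanh to linear combinations of the inputs, the skip layer passes
    the inputs straight to the output, and the logistic function maps the result to a
    posterior probability (the weights here are hypothetical placeholders)."""
    hidden = np.tanh(X @ W_hidden + b_hidden)     # sigmoidal activation in the hidden layer
    logit = hidden @ w_out + X @ w_skip + b_out   # hidden-unit effects plus the linear (skip) part
    return 1.0 / (1.0 + np.exp(-logit))           # inverse logit = logistic output activation

# Shapes: X is n x k, W_hidden is k x h, b_hidden is length h, w_out is length h,
# w_skip is length k, and b_out is a scalar.
```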

The first step is to drastically reduce the dimension using a decision tree. A decision tree is a flexible multivariate model that effectively defies the curse of dimensionality. Retaining only the variables selected by a decision tree is one way to reduce the dimension without ignoring nonlinearities and interactions. Set the model role of INS to target. Right click in the open area where the prior profiles are activated and select Add.

Select Prior vector and change the prior probabilities to represent the true proportions in the population. Right-click on Prior vector and select Set to use. Close the target profiler and select Yes to save the changes.

The Data Replacement node does imputation for the missing values. Select the Imputation Methods tab and choose median as the method. A variable selection tree can be fit in the Variable Selection node using the chi-squared method under the Target Associations tab. Run the flow and note that 10 inputs were accepted. In the Neural Network node, select the Basic tab. For the Network architecture, select Multilayer Perceptron. Set the number of hidden neurons to 3 and Yes to Direct connections.

Also select 3 for Preliminary runs, Default as the Training technique, and 2 hours for the Runtime limit. Run the flow from the Assessment node. After nonlinear relationships are identified, you can transform the input variables, create dummy variables, or fit a polynomial model.

The problem with hand-crafted input variables is the time and effort it takes. The problem with higher-degree polynomial models is their non-local behavior. Another solution to dealing with nonlinearities and interactions is to fit a neural network in Enterprise Miner.

Neural networks are a class of flexible nonlinear regression models that can theoretically fit any nonlinear function. However, the price neural networks pay for flexibility is incomprehensibility. Neural networks are useful, though, if the only goal is prediction, and understanding the model is of secondary importance.

Score new data using the final parameter estimates from a weighted logistic regression. Create a weight variable that adjusts for oversampling. The proportion of events in the population is. Calculate the probabilities and print out the first 25 observations.
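As an illustrative sketch of the weight calculation (in Python rather than a SAS DATA step), where `pi1` is a placeholder for the population event proportion given in the exercise:

```python
import numpy as np

def oversampling_weights(y, pi1):
    """Sampling weights that adjust an oversampled development sample back to the
    population: events get pi1/rho1 and non-events (1 - pi1)/(1 - rho1), where rho1
    is the event proportion in the sample and pi1 the (assumed known) population
    proportion. The value of pi1 is supplied by the exercise, not by this sketch."""
    y = np.asarray(y)
    rho1 = y.mean()
    return np.where(y == 1, pi1 / rho1, (1 - pi1) / (1 - rho1))
```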

Name the variable with the logit values INS1. Print out the first 25 observations. Compare this scoring method with the method in step d. Replace missing values using group-median imputation. The VAR statement lists the variables to be grouped. The objective is to cluster the variables and choose a representative one in each cluster. Alternatively, use the results of 2.

Use the Output Delivery System to print out the last iteration of the clustering algorithm. Use the TREE procedure to produce a dendrogram of the variable clustering. Which criterion created the fewest clusters? Which criterion created the most clusters? Submit the following code. Compare variable selection methods.

Use all of the numeric variables and the RES categorical variable. Use all of the numeric variables and the dummy variables for RES. Use the Output Delivery System to determine which model is best according to the SBC criterion. Fit a stepwise linear regression model using the REG procedure. Submit the program below to compute the c statistic. Assess a weighted logistic regression using a validation data set.

Use a seed of To create a more meaningful graph, select an appropriate range for the cost ratio values. What is the cost ratio that optimizes both sensitivity and specificity?

Compare the performance of different models on the validation data set. What is the value of the c statistic? What is the value of the c statistic on the validation data set? Which model had the highest c statistic on the validation data set?

Appendix A: Additional Resources

Cohen, A.
Conover, W.
Donner, A.
Duffy, T.
Greenacre, M. Academic Press.
Hand, D.
Harrell, F. School of Medicine, University of Virginia.
Hastie, T. Chapman and Hall.
Huber, P. AAAI Press.
Jackson, J.
Jones, M.
Lawless, J.
Mantel, N.
Magee, L.
McLachlan, G.
Nelson, R.
Prentice, R.
Ripley, B.
Sarle, W.
SAS Institute Inc. Changes and Enhancements through Release 6.
Scott, A.

Technical specifications: The Enterprise Miner SVM (support vector machine) node was used with default settings, with one exception: the estimation method was set to least squares support vector machine rather than decomposed quadratic programming, because the latter failed to find a conclusive result on one of the six data sets (the claim prediction data set).

Memory-based reasoning uses k-nearest-neighbour principles to classify observations in a data set. When a new observation is evaluated, the algorithm allows the k nearest observations in the development set to 'vote' on that observation's classification (their votes are based on the values of their target variables). These votes then represent the probabilities of the new observation belonging to each specific target value.

Technical specifications: The memory-based reasoning node was used with the default settings provided by the SAS Enterprise Miner software. Decision trees are simple classifiers that produce prediction rules that are easy to interpret and apply; they are commonly referred to as CART (classification and regression trees).

For this reason they are also quite popular in the industry. Two changes were made to the default settings. The splitting criterion for nominal input variables was changed from chi-squared probability to Gini, as the Gini is the measure we used for model performance in this paper.

In addition, the number of branches or subsets that a splitting rule can produce was increased to six, which allows results that are more granular. Gradient boosting draws its concept from the greedy decision tree approach proposed by Friedman. The algorithm creates a number of small decision trees on the development set, and these trees are combined to produce the model's output.

The technique can be linked to the techniques used in random forests, in that a number of different trees are developed. Technical specifications: The gradient boosting node was used with its default setting in Enterprise Miner. Model performance: In order to compare model performance, each data set was first divided into two equal sets to form a development set and a validation set. The development set, sometimes referred to as the training data, is used to develop the predictive models, whilst the validation set, alternatively known as the holdout data, is used to test the lift in model performance as measured by the Gini coefficient (hereafter 'lift').

The Gini coefficient is one of the most popular measures used in retail credit scoring and has the added advantage that it is a single number. The development set and validation set were randomly sampled with even sizes.
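A sketch of how the validation Gini could be computed, assuming the common credit-scoring convention Gini = 2 × AUC − 1 (the paper does not spell out its exact formula):

```python
from sklearn.metrics import roc_auc_score

def validation_gini(y_valid, p_valid):
    """Gini on the holdout (validation) set via the convention Gini = 2*AUC - 1,
    an assumption; y_valid and p_valid are hypothetical target and score arrays."""
    return 2.0 * roc_auc_score(y_valid, p_valid) - 1.0

# Even 50/50 split into development and validation sets (X, y hypothetical):
# from sklearn.model_selection import train_test_split
# X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
```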

In order to measure the combined Gini of the segmented models on the validation set, the predicted probabilities of all segments were combined, and the Gini was calculated on the overall, combined set. In summary, the eight modelling techniques are shown in Table 1. Data sets: The modelling techniques described above were compared on six different data sets. All explanatory variables were standardised. Standardising data is a pre-processing step applied to variables with the aim of scaling them to a similar range.

The first data set ('direct marketing') analysed was obtained from one of South Africa's largest banks. The data set contains information about the bank's customers, the products they have with the bank, and their utilisation of and behaviour regarding those products.

The target variable was binary: whether or not the customer responded to a direct marketing campaign for a personal loan. This data set contains 24 explanatory variables and observations. The second data set ('protein structure') was obtained from the UCI Machine Learning Repository and contains results of experiments performed by the Protein Structure Prediction Centre on the latest protein structure prediction algorithms. These experiments were labelled the 'Critical Assessment of Protein Structure Prediction' experiments.

Structures are further influenced by a number of physico-chemical properties, which further complicates the task of accurate prediction. One way of measuring how far a predicted structure deviates from the experimentally determined one is the root-mean-square deviation. The protein structure data set contains various physico-chemical properties of proteins, and the target variable is based on the root-mean-square-deviation measurement, indicating how much the predicted protein structures deviate from experimentally determined structures.

The binary target used was whether or not the root-mean-square deviation had exceeded a certain value. Our goal was therefore to determine which physico-chemical properties cause protein structure prediction algorithms to deviate more than the norm from experimentally determined protein structures.

This data set contains nine explanatory variables and 45 observations. The third data set ('credit application') was obtained from the Kaggle website (www.kaggle.com). The data set contains 10 characteristics of customers who applied for credit, and the target variable is binary, indicating whether or not the customer experienced a delinquency of a certain number of days or longer. The data set is used in a number of studies covering various areas of predictive modelling.

The fourth data set ('wine quality') was also obtained from the UCI Machine Learning Repository and was collected between May and February. The repository entry consists of two data sets: one for white wines and one for red wines. For the purposes of this exercise, the two data sets were combined.

The data set has 11 explanatory variables and observations. The fifth data set ('chess king-rook vs king') is based on game theory and was obtained from the UCI Machine Learning Repository. In this endgame, first described by Clarke, the white player has both its king and its rook left, whilst the black player has only its king left; it is widely known as the 'KRK endgame' and is still the focus of many studies. The database stores the positions of each piece as well as the number of moves taken to finish the game from those positions (assuming minimax-optimal play, black to move first).

The target variable is binary, and indicates whether the game will be completed within 12 moves or less.


Minimax-optimal play is an algorithm often used by computers to obtain the best combination of moves in a chess game and is based on the minimax game theory introduced by Neumann. More information can be found in a number of texts, for example Casti and Casti, and Russell and Norvig. To the 6 explanatory variables, another 12 derived variables were added (row distances, column distances, total distances, and diagonal indicators).

This data set contains 28 observations.

The sixth data set ('insurance claim'), also obtained from the Kaggle website, contains information about bodily injury liability insurance. The competition was named 'Claim Prediction Challenge (Allstate)'. The binary target was whether or not a claim payment was made.

The independent variables have been hidden, but according to the website, the data set contains information about the vehicle to which the insurance applies as well as some particulars about the policy itself. The data set itself has many observations. Oversampling in cases in which events are rare is a common technique applied in the industry.

We compared the performance achieved by linear modelling techniques, applied after first segmenting the data, with the accuracy of popular non-linear modelling techniques. Table 2 summarises the performance of the modelling techniques when applied to the 'direct marketing' data set, as measured by the Gini coefficient calculated on the validation set.

The gradient boosting technique achieved the best result on this data set, with decision tree segmentation running a close second. Neural networks could not converge to a model without overfitting, and the resulting Gini on the validation set is therefore effectively equal to zero. What can be seen additionally from Table 2 is that segmentation-based techniques take up positions two through four as ranked by the Gini coefficient on the validation set.

Table 3 summarises the Gini results of the various techniques as applied to the data set on 'protein tertiary structures'. As evidenced by the table, the ranking order of the techniques is completely different from the order seen in Table 2. As a start, gradient boosting ranks third from the bottom, at number six. The technique that achieves the best results in this case is memory-based reasoning.

In Table 2, memory-based reasoning was ranked at position seven. Table 4 shows that, for the 'credit application' data set, neural networks outperform all other techniques. In Tables 2 and 3, neural networks ranked last each time.

However, in this case the structure of the data set evidently suited the technique well. Similar to what was seen in Table 2, segmentation-based techniques take up positions two to four for this data set, with supervised segmentation decision trees performing best.

At this point, a trend is emerging that segmentation-based techniques may not always render the best results, but seem to deliver results that are consistently amongst the top. Table 5 shows that for the 'wine quality' data set, segmentation-based techniques occupy the top two positions, with supervised segmentation decision trees in position four. The results are generally very close, with only decision trees and support vector machines not doing particularly well.

Table 6 shows that decision trees are best suited for the non-linear nature of the chess king-rook vs king data set. This data set is the first for which segmentation-based techniques fail to be among the top two techniques, with supervised segmentation decision trees in third place.

Table 7 shows the results of the last data set to be analysed - the 'insurance claim prediction' data set. It can be seen from the table that the first two positions are again held by segmentation-based techniques, with SSSKMIV achieving the best results. The best non-segmentation-based technique is gradient boosting in position three followed by unsupervised k-means segmentation.

The Gini coefficients for this application are low. Conclusions: Although it was not the focus of this paper to do an exhaustive comparison of modelling techniques, we provide an overview of how some of the more popular non-linear techniques perform when compared to segmented linear regression.

Perhaps because of the diverse nature of the data sets used in this paper, it was interesting to see that no single technique dominated the top position.


In summary, the Gini coefficients on the validation set of the eight modelling techniques were compared across the six data sets.