When creating a model with the Data Mining Wizard, you can mark each column as a key, input, or predictable column. In the Data Mining Designer, on the Mining Models pane, a column can be marked as Ignore, Input, Predict, or Predict Only. So what do these designations mean and how can you use them in your modeling?
Keys
In general the “key” of a table in a data mining model uniquely identifies the row. For a case table, this is pretty self-explanatory: it’s simply the key of the table. However, for nested tables, which column represents the key can be confusing. Suppose, for example, that you have a table you wish to model as a nested table, and it contains the following columns:
OrderID | ProductName | Promotion | SalesPrice | Quantity
What would be the key of your nested table? Many people choose OrderID as it is the foreign key to a table of orders. This is incorrect. The foreign key to the case table is never the nested table key. The nested table key is the column that represents what you are modeling at the nested level on a case-by-case basis; that column usually isn’t a key at all in the relational table. In some cases that column may be part of a relational composite key. In this example, you have choices. Are you modeling sales and quantities of products? Or do you want to learn which promotions are being combined in a single order? In the first case, you would make the ProductName column the nested key; in the latter, the Promotion column.
So, while we’re on the topic, what happens when you make a nested table? A nested table column creates a collection of related attributes that are named by the nested key. For example, if you used the table above as a nested table column called “Products”, you may have attributes called “Products(Milk).Promotion,” “Products(Milk).SalesPrice”, etc.
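The naming scheme can be illustrated with a small Python sketch. This is not how Analysis Services is implemented; the function nested_attributes and the sample rows are hypothetical, and the sketch only reproduces the "Table(Key).Column" naming described above.

```python
from collections import defaultdict

def nested_attributes(rows, case_key, nested_key, table_name):
    """Flatten nested-table rows into per-case attributes named by the nested key."""
    cases = defaultdict(dict)
    for row in rows:
        key_value = row[nested_key]
        for column, value in row.items():
            if column in (case_key, nested_key):
                # In this sketch the foreign key and the nested key are only
                # used for grouping and naming, not emitted as attributes.
                continue
            cases[row[case_key]][f"{table_name}({key_value}).{column}"] = value
    return dict(cases)

# Hypothetical sample data using the columns from the example table.
rows = [
    {"OrderID": 1, "ProductName": "Milk", "Promotion": "None", "SalesPrice": 1.99, "Quantity": 2},
    {"OrderID": 1, "ProductName": "Bread", "Promotion": "Coupon", "SalesPrice": 2.49, "Quantity": 1},
]
attributes = nested_attributes(rows, "OrderID", "ProductName", "Products")
for name in sorted(attributes[1]):
    print(name, "=", attributes[1][name])
```

Choosing Promotion instead of ProductName as the nested_key argument would name the attributes by promotion, which is the second modeling choice discussed earlier.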
Non-Key columns
All the other columns in the model are used by the algorithms to perform their work. In general a column that is “Input” is used by the algorithm to train towards its target, and a column that is marked “Predict” or “Predict Only” is an output that is selectable from the model in a PREDICTION JOIN statement. If a column is marked “Predict” then the column is treated both as an input and an output, whereas “Predict Only” is, as it says, only an output.
For all algorithms, only predictable columns can be selected from the model in a PREDICTION JOIN statement. Also, for each algorithm, the value of the predictable attribute, when present, is not used to predict itself. That is, if your model predicts “Gender” and you tell it in an input case that the “Gender” is “Female”, the algorithm ignores that information in its prediction. Beyond those similarities, however, each algorithm treats the usage flags differently. Let’s run through the algorithms and see how they are applied.
Decision Trees
Decision Trees uses each input attribute as a split candidate or regressor to predict each output. The algorithm creates a tree for each predictable, or output, attribute. If you mark a column as “Predict Only” using the Decision Trees algorithm, a tree will be created for it, but it will not be considered when creating trees for other columns.
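As a rough illustration of these rules, the Python sketch below works out which columns could serve as split candidates for each tree. The function tree_inputs and the column names other than Gender are hypothetical; the sketch encodes only the usage semantics described here, not the algorithm itself.

```python
def tree_inputs(usage):
    """Map each predictable column to the columns allowed as its split candidates."""
    # One tree is built per Predict or Predict Only column.
    predictable = [c for c, u in usage.items() if u in ("Predict", "Predict Only")]
    # Only Input and Predict columns may act as inputs to a tree.
    usable_as_input = [c for c, u in usage.items() if u in ("Input", "Predict")]
    # A column is never used to predict itself.
    return {t: [c for c in usable_as_input if c != t] for t in predictable}

# Hypothetical usage flags for a small model.
usage = {"Age": "Input", "Gender": "Predict", "Income": "Predict Only", "Notes": "Ignore"}
plan = tree_inputs(usage)
for target, inputs in plan.items():
    print(target, "<-", inputs)
```

In this sketch Income gets its own tree but never appears as an input to the Gender tree, which is the effect of marking it Predict Only.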
Clustering and Sequence Clustering
The clustering algorithm uses each input attribute to create the clusters inside of the model. Predict Only columns are not used in creating the clusters or determining cluster membership during PREDICTION JOIN statements. Setting a column as Predict Only causes the clustering algorithm to perform an additional data pass after it has completed the clustering operation to assign the distributions of the Predict Only attributes to the clusters. This allows you to see how attributes may be distributed across clusters created from different variables.
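The additional data pass can be pictured with the following Python sketch. It assumes each case has already been assigned to exactly one cluster (a simplification; the real algorithm may assign membership differently), and the function predict_only_distributions and the sample values are hypothetical.

```python
from collections import Counter, defaultdict

def predict_only_distributions(cluster_ids, values):
    """Tally a Predict Only attribute per cluster after clustering has finished."""
    counts = defaultdict(Counter)
    # The extra pass: the attribute plays no part in forming the clusters,
    # it is only counted against the cluster each case already belongs to.
    for cluster, value in zip(cluster_ids, values):
        counts[cluster][value] += 1
    return {
        cluster: {value: n / sum(counter.values()) for value, n in counter.items()}
        for cluster, counter in counts.items()
    }

# Hypothetical cluster assignments and a Predict Only column for four cases.
cluster_ids = [0, 0, 0, 1]
home_owner = ["Yes", "Yes", "No", "No"]
dist = predict_only_distributions(cluster_ids, home_owner)
for cluster, distribution in dist.items():
    print(cluster, distribution)
```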
Association Rules
Association Rules uses only input attributes on the left-hand side of a rule. Output attributes only show up on the right-hand side of a rule. Therefore marking a column as “Predict” will allow it to show up on either side of a rule, whereas “Input” and “Predict Only” columns are limited to a single side.
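The side restrictions can be expressed as a simple check, sketched here in Python. The function rule_is_allowed and the column names are hypothetical, and the sketch covers only where a column may appear, not how rules are mined.

```python
def rule_is_allowed(lhs, rhs, usage):
    """Check that a candidate rule respects the usage flags of its columns."""
    # Left-hand side: only columns usable as inputs.
    lhs_ok = all(usage.get(c) in ("Input", "Predict") for c in lhs)
    # Right-hand side: only columns usable as outputs.
    rhs_ok = all(usage.get(c) in ("Predict", "Predict Only") for c in rhs)
    return lhs_ok and rhs_ok

# Hypothetical usage flags.
usage = {"Milk": "Input", "Bread": "Predict", "Cheese": "Predict Only"}

print(rule_is_allowed(["Milk"], ["Bread"], usage))    # Input -> Predict
print(rule_is_allowed(["Bread"], ["Cheese"], usage))  # Predict -> Predict Only
print(rule_is_allowed(["Cheese"], ["Bread"], usage))  # Predict Only on the left
print(rule_is_allowed(["Bread"], ["Milk"], usage))    # Input on the right
```

The first two calls return True and the last two return False, since a Predict Only column cannot appear on the left and an Input column cannot appear on the right.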
Time Series
To predict a time series you mark the numerical column containing the series values as predictable. If you mark a column as “Predict”, that column will be used to predict itself and will be considered in predicting other series in the model. Marking a column as “Predict Only” implies that you do not want that column to be considered for cross-series prediction.
Neural Nets
Neural Nets creates a network for each set of independent output variables. That is, if all predictable columns are “Predict Only”, the neural network algorithm can represent them all with a single net. If more than one attribute is marked “Predict”, then the algorithm has to split the networks so that no predictable attribute has itself as an input, in effect creating an independent network for each such attribute.
Naïve Bayes
Naïve Bayes treats Predict, Input and Predict Only in much the same manner as Decision Trees. Predict and Input columns are used as inputs to determine the values of Predict and Predict Only columns.
Using Predict and Predict Only in Modeling
So now that you know how these usages are applied, how would you apply them? That, of course, depends on your problem. If you want to see how home ownership is related to your other variables when treated independently, you could create a clustering model and mark that column as Predict Only. If you want to see how product sales in March lead to those in April, you could create two nested tables filtered by month – the March table would be input, the April table Predict Only.