A common scenario in creating Mining Model is to predict their accuracy by running them against a test data set which is different from the training data. This prevents the problem of over fitting and gives a better measure of the accuracy of the generated models.
There are many different ways by which a single dataset can be split into a test and a training set. An important requirement for any split method is to split randomly i.e. no split should have a stronger pattern for any attribute that what was in the original dataset. Furthermore, the split has to be repeatable across queries, which means that we cannot split using a true random number generator. In this tip, we use a feature available in SQL Server 2005 to split a data table/view from a relational data source into two sets randomly. We’ll also include a test in the end to verify the randomness of the two sets.
We’ll illustrate doing this using the AdventureWorksDW database from the SQL Server 2005 samples. Some step details have been condensed for brevity.
-
Open SQL Server Business Intelligence development studio and create a new Analysis Server project.
-
Right click the “Data Source” and create a data source connection to the AdventureWorksDW database.
-
Right click on data source views and create a new data source view. Select the view “vTargetMail” from the data source.
-
Convert the “vTargetMail” into a training set:
-
Right click on the table and select “Replace table -> With new named query”.
-
In the query builder, add a column CHECKSUM (*) AS CHECKSUM to the SELECT list and a where clause (CHECKSUM (*) % 5 <> 0). This selects approximately 80% of the data from the table and since CHECKSUM is used in hashing, it’s random with respect to. the data columns. If you have any non base type in the view, modify the SQL statement to CHECKSUM (base1, base2, baseN), including only the base data type columns. You can modify the number “5” to change the split ratio.
-
Save the Named Query as “vwTargetMailTraining”.
-
Right click and select “Add New Table/View” and select “vwTargetMail” again. This time we’ll convert it into a test data set using the steps below:
-
Right click on the table and select “Replace table -> With new named query”.
-
In the query builder, add a column "CHECKSUM (*) AS CHECKSUM" to the SELECT list and a WHERE clause "(CHECKSUM (*) % 5 = 0)".
-
Save the Named Query as “vwTargetMailTest”.
-
We’ll also create a Named query that will help us measure the randomness between the training and test set using the steps below:
-
Add another instance of the view “vwTargetMail” into the Data Source View.
-
Right click on the table and select “Replace table -> With new named query”.
-
In the query builder, add the following columns to the data view:
i. CHECKSUM(*) as Checksum
ii. CASE WHEN (CHECKSUM(*) % 5 = 0) THEN 'Test' ELSE 'Training' END AS DataUsage
-
Save the Named Query as “vwTargetMailRandomnessTest”
-
Build the model and select the “vwTargetMailTraining” as the case table.
-
In the Mining Accuracy Chart tab, select the “vwTargetMailTest” table as the case table and generate the lift chart and the classification matrix. The accuracy numbers obtained are a good measure of how the model will perform on unseen data.
-
In order to test randomness of the two sets, follow the steps below. For more details, see the tip on verifying the randomness between training and testing sets.
-
Create a new mining structure with the following parameters:
i. From existing relational database or data warehouse
ii. “Microsoft Decision Trees” as the data mining technique
iii. Select the “vwTargetMailRandomnessTest” table and select all columns used in the model above, and also the column DataUsage. Make all columns an Input column and the DataUsage column a PredictOnly column.
iv. Set Algorithm Parameters MINIMUM_SUPPORT=0.001 and COMPLEXITY_PENALTY=0.001 to increase the number of splits.
v. Generate the decision tree model. If the split was close to random, the number of nodes generated in the tree would be low since the input columns have a very low dependency on the output column (DataUsage). For a true random split, the information gain for any split condition would be zero and a single root decision tree will be learned.
vi. In the viewer, select Background=Test and Background=Training alternately. The shading on all the nodes should be of the same color for a good random split which indicates that the percentage of Test and Training cases are similar on all nodes.
The following modifications can be used to change the split:
-
Including or excluding columns in the CHECKSUM () computation.
-
Changing the "Mod 5" to a different number.
-
Using a User-Defined Function (UDF) instead of MOD (%) to filter.