Typically, training is done by visually describing column
bindings in BI Developer Studio. Filtering can be performed by using a named query
which uses relational (SQL) WHERE clauses to filter the cases.
In DMX, this can be achieved by using INSERT INTO and adding
the WHERE clauses to the OPENQUERY part of the statement. We will present here
a method to perform advanced filtering on the training cases. In the first tip
in this series, we will explain and exemplify how to use a DMX query instead of
the OPENQUERY component of the INSERT INTO statement. In the next tip, we’ll
show how to use a stored procedure result as training set and the last tip of
the series will present a method for measuring the accuracy of such models, by
computing the lift chart.
We’ll start with a sample database. The SampleProject.xmla
file contains an XMLA deployment script which creates a clustering model based
on the sample AdventureWorksDW database coming with SQL Server 2005. We’ll show
how to build and train a Decision Trees mining model on top of one of the clusters
detected by the initial model.
To keep things simple, only a few of the attributes are used
in both the clustering and the decision trees model. Once the SampleProject.xmla
script is deployed on the server, and the clustering model is built, one can
select the cases belonging to Cluster 1 with a query like below:
SELECT
T.*
From
[TargetMailCL]
PREDICTION JOIN
OPENQUERY([Adventure
Works DW],
'SELECT
…
FROM
[dbo].[vTargetMail]
ORDER BY CustomerKey') AS
t
ON
…
WHERE Cluster() =
'Cluster 1'
This query can be used by itself as a data source for an
INSERT INTO statement.
Start by defining the decision trees model:
-- Create Mining model
CREATE MINING MODEL Cluster1BikeBuyer
(
[Customer Key] LONG
KEY,
Age LONG CONTINUOUS,
[Date First Purchase] TEXT
DISCRETE,
[Number Cars Owned] LONG
CONTINUOUS,
[Number Children at Home] LONG
CONTINUOUS,
[Bike Buyer] LONG DISCRETE PREDICT
)USING Microsoft_Decision_Trees
Then train the model by appending the cluster filtering
query described above:
-- Train only with cases in Cluster 1
INSERT INTO
Cluster1BikeBuyer
(
[Customer Key],
Age,
[Date First Purchase],
[Number Cars Owned],
[Number Children at Home],
[Bike Buyer]
)
SELECT
T.CustomerKey,
T.Age,
T.DateFirstPurchase,
T.NumberCarsOwned,
T.NumberChildrenAtHome,
T.BikeBuyer
From
[TargetMailCL]
PREDICTION JOIN
OPENQUERY([Adventure
Works DW],
'SELECT
[CustomerKey],
Age,
[DateFirstPurchase],
[NumberCarsOwned],
[NumberChildrenAtHome],
[BikeBuyer]
FROM
[dbo].[vTargetMail]
ORDER BY CustomerKey') AS
t
ON
[TargetMailCL].[Date First Purchase] =
t.[DateFirstPurchase] AND
[TargetMailCL].[Age] = t.[Age] AND
[TargetMailCL].[Number Cars Owned] =
t.[NumberCarsOwned] AND
[TargetMailCL].[Number Children At Home] =
t.[NumberChildrenAtHome]
WHERE Cluster() =
'Cluster 1'
Now, the Cluster1BikeBuyer mining model is created and
trained only with cases that belong to ‘Cluster 1’. This kind of filtering cannot
be applied directly at the relational level.
This procedure can be used for various purposes, including:
- Building predictive models for various clusters
- Data cleanup before training (by filtering out outliers with another clustering
model)
- Generating cluster rules in a Decision Tree style (by marking training a
decision tree and predicting the cluster based on the other attributes)
The next tip in this series will explain how to use a stored
procedure as a data source for INSERT INTO statements.