If you build a
Microsoft_Decision_Trees model with a continuous predictable attribute and one
or more continuous input attributes marked as regressors, the algorithm produces
regression trees with nodes that include a regression formula. You can see this
formula in the mining legend when you select a node in the decision tree viewer.
But how do you make use of this formula in your own app? That is the goal of
this article.
A
little background
Let's provide some
background first though, before we dive into the details.
A
regression tree is a mining model created using the Microsoft Decision Trees
algorithm, where the predictable attribute is continuous. In some cases, one or
more input attributes might also be continuous and can be used as
regressors for the output attribute. This creates decision
tree nodes that include a regression formula, which is a linear function of
the output attribute as a function of all the input attributes. The regression
formula of output Y and inputs X1, X2, X3 is expressed in the format:
Y = Ym + m1(X1 –
X1m) + m2(X2 – X2m) + …..
Where:
Ym: Mean
of Y
Xtm: Mean of Xt
mt: Slope of the
variable Xt with respect to Y
Take the example of
a model created from the AdventureWorksDW sample database
that ships with Analysis Services 2005, where “Age” is the
predictable column (continuous) and “Yearly Income” is an input column
(continuous). When the tree is viewed in the decision tree viewer, the resulting
regression tree has a node that includes the following regression formula:
Age = 33.348+0.00008*(Yearly
Income-51,481.123)
The
formula is interpreted as follows:
·
Mean for Age: 33.348
·
Mean for Yearly Income: 51,481.123
·
Gradient of the line: 0.00008
The
following content query for the node shows the distribution:
SELECT
FLATTENED
(
SELECT
ATTRIBUTE_NAME,
ATTRIBUTE_VALUE,
[SUPPORT],
[PROBABILITY],
VALUETYPE
FROM
NODE_DISTRIBUTION
)
as D
FROM
[Model].CONTENT
WHERE
NODE_UNIQUE_NAME= '<>'
Distribution:
|
D.ATTRIBUTE_NAME
|
D.ATTRIBUTE_VALUE
|
D.SUPPORT
|
D.PROBABILITY
|
D.VALUETYPE
|
|
Age
|
Missing
|
0
|
0.000193536
|
1
|
|
Age
|
33.5595353339787
|
5165
|
0.999806464
|
3
|
|
Yearly Income
|
8.4100185198634E-05
|
0
|
0
|
7
|
|
Yearly Income
|
689.231274498812
|
0
|
0
|
8
|
|
Yearly Income
|
51481.1229428848
|
0
|
0
|
9
|
|
29.2299633602484
|
0
|
0
|
11
|
The
VALUETYPE is defined as follows:
|
Missing |
1 |
|
Existing |
2 |
|
Continuous |
3 |
|
Discrete |
4 |
|
Discretized
|
5 |
|
Boolean |
6 |
|
Coefficient |
7 |
|
ScoreGain
|
8 |
|
RegressorStatistics
|
9 |
|
NodeUniqueName
|
10 |
|
Intercept |
11 |
|
Other |
12 |
Mean
for age will be calculated as follows:
For
each attribute with a coefficient (VALUETYPE=7), multiply the value of
co-efficient with the value of mean (VALUETYPE=9) for that attribute and sum
all the products.
51481.1229428848 * 8.4100185198634E-05 = 4.329571974
Add
this to the value of Intercept (VALUETYPE=11)
4.329571974 + 29.2299633602484 = 33.55953533
Mean
for each regressor attribute, “Yearly Income” in this
case is the value of mean (VALUETYPE=9) for that attribute.
51481.1229428848
Gradient
for each regressor attribute, “Yearly Income” in this
case is the value of coefficient (VALUETYPE=7) for that attribute.
8.4100185198634E-05
Using these three values and the sign
(+/-), we can construct the regression formula:
Age= 33.55953533 + 8.4100185198634E-05 * (Yearly Income -51481.1229428848)
Extract the regression formula programmatically
Now
that we know what numbers we want and where to get them from, let's apply
the corresponding DMX query below to build the regression formula in a
program:
|
SELECT FLATTENED
(
SELECT
ATTRIBUTE_NAME
AS
TARGET,
ATTRIBUTE_VALUE
AS
COEFFICIENT
FROM
NODE_DISTRIBUTION
WHERE
VALUETYPE= 3
) as D,
(
SELECT
ATTRIBUTE_NAME + ' '
AS
VARIABLE,
ATTRIBUTE_VALUE
AS
MEAN
FROM
NODE_DISTRIBUTION
WHERE
VALUETYPE=9
OR
VALUETYPE=7
) as D
FROM
[Target Mail Decision Tree].CONTENT
WHERE
NODE_UNIQUE_NAME='00000000000'
|
using System;
using
System.Collections.Generic;
using
System.Text;
using
Microsoft.AnalysisServices.AdomdClient;
namespace RegressionFormula
{
class Program
{
static void Main(string[]
args)
{
AdomdConnection
connection = null;
try
{
//Regression
Formula Variables
string
yName;
double
yMean;
bool
yMeanSign;
List<string> xName = new
List<string>();
List<double> xMean = new
List<double>();
List<bool> xMeanSign = new
List<bool>();
List<double> xSlope = new
List<double>();
List<bool> xSlopeSign = new
List<bool>();
//Open
a connection to the Analysis Server
connection = new AdomdConnection();
connection.ConnectionString = "Data Source=local;Initial
Catalog=SampleModels;";
connection.Open();
//Create
a command object to execute the DMX
string
dmxStatement = "SELECT FLATTENED "
+
" ( " +
" SELECT " +
" ATTRIBUTE_NAME AS TARGET, "
+
"
ATTRIBUTE_VALUE AS COEFFICIENT " +
" FROM " +
"
NODE_DISTRIBUTION " +
" WHERE " +
"
VALUETYPE= 3 "
+
") as D, " +
"( " +
" SELECT " +
"
ATTRIBUTE_NAME + ' ' AS VARIABLE, " +
"
ATTRIBUTE_VALUE AS MEAN " +
" FROM " +
"
NODE_DISTRIBUTION " +
" WHERE "
+
"
VALUETYPE=9 OR VALUETYPE=7 " +
" ) as D " +
"FROM " +
" [Target
Mail Decision Tree].CONTENT " +
"
WHERE " +
"
NODE_UNIQUE_NAME='00000000000'";
AdomdCommand
command = new AdomdCommand();
command.CommandText =
dmxStatement;
command.CommandType =
System.Data.CommandType.Text;
command.Connection =
connection;
//Execute
the DMX and get the reader
AdomdDataReader
reader = command.ExecuteReader();
//Obtain
the Y variable name and the mean
reader.Read();
yName = reader.GetString(0);
yMean = reader.GetDouble(1);
yMeanSign = (yMean >= 0);
yMean = Math.Abs(yMean);
//Obtain
all X Variables (Regressors)
while
(reader.Read())
{
xName.Add(reader.GetString(2));
xSlope.Add(reader.GetDouble(3));
xSlopeSign.Add(xSlope[xSlope.Count-1]>=0);
xSlope[xSlope.Count-1] = Math.Abs(xSlope[xSlope.Count-1]);
reader.Read();
xMean.Add(reader.GetDouble(3));
xMeanSign.Add(xMean[xMean.Count-1]<0);
xMean[xMean.Count-1] = Math.Abs(xMean[xMean.Count-1]);
}
//Construct
and print the formula
string
Formula = string.Empty;
Formula += yName;
Formula += " = ";
Formula += yMeanSign ? "" : " -
";
Formula += yMean;
for
(int iter = 0; iter < xName.Count; ++iter)
{
Formula += xSlopeSign[iter]
? " + " : "
- ";
Formula += xSlope[iter];
Formula += " *(";
Formula += xName[iter];
Formula += xMeanSign[iter]?" + ":" -
";
Formula += xMean[iter];
Formula += ")";
}
Console.WriteLine(Formula);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
finally
{
if
(connection != null)
{
connection.Dispose();
}
}
}
}
}
The
output for this sample shows the constructed regression formula:
Age
= 33.5595353339787 + 8.41001851986346E-05 *(Yearly Income - 51481.1229428848)