|
|
 |
|
|
|
|
|
|
|
|
|
Last updated by DMTeam on Mon 04/27/2009 @ 04:45
|
|
This articles explores how you can work with filtered models, a powerful new feature introduced in SQL Server 2008 that allows you build models on slices of a mining structure.
|
|
Explore Filtered Models in SQL Server 2008
Introduction
This
article describes how to use model filtering, a powerful new feature in SQL Server 2008
Data Mining. Filtering allows you to take full advantage of the mining
structure / mining model dichotomy, separating the data staging (in the structure)
from the modeling (in the mining model). This means that you can define and
process a single mining structure for your problem space and then build mining
models on specific slices of interest within that space.
SQL
Server 2008 Data Mining allows specifying, for each mining model, a filter to
be applied on the training data. The filter acts as a partitioning mechanism
inside the mining structure and it is applied on top of any training/testing
partitioning already existing at the structure level. Furthermore,
filters can be applied to data in nested tables. Filters allow you to build
different models for various partitions of the data with minimal effort, either
to obtain better overall accuracy (think of it as manually specifying the first
split in a decision tree), or to compare the patterns between different
partitions.
We will cover a lot of ground in
this article. Use the links below to jump straight to the sections you're most
interested in:
Creating Filtered Models
So,
let’s start with a simple scenario, using the same old
Adventure Works data -
building a mining model that predicts how likely a customer is to purchase a
bike based on various demographic information. Let’s also assume that you know -
based on your experience with the data - that the Geography factor has an
important role and you don’t want your North American patterns to hide (because
of volume) the peculiarities of the Pacific market.
In
this section, we will show you both the DMX and UI ways of working with
filtered models. Pick your path and follow it!
Suppose
you start with a mining structure defined like below:
|
DMX:
CREATE MINING STRUCTURE TestStructure
(
CustomerKey LONG KEY,
Region TEXT DISCRETE,
YearlyIncome DOUBLE DISCRETIZED,
EnglishOccupation TEXT DISCRETE,
CommuteDistance TEXT DISCRETE,
BikeBuyer BOOLEAN DISCRETE
) WITH HOLDOUT( 30 PERCENT )
Then,
train the mining structure with an INSERT INTO statement:
INSERT INTO MINING STRUCTURE TestStructure
(
CustomerKey,
Region,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer
)
OPENQUERY([Adventure
Works DW], ‘SELECT
CustomerKey,
Region,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer FROM dbo.vTargetMail‘)
|
UI:
To create the structure using BI
Development Studio, you just need to select “New Mining Structure”. On the
“Create the Mining Structure” page of the wizard select “Create mining
structure with no models”. Assuming you already have a data source view for
Adventure Works 2008, select it and then select vTargetMail in the “Specify table type” in the wizard. Select the
following columns and change the content and data types to match those as shown
below:

Now,
the modeling part. As usual, you would create a mining model as follows:
|
DMX:
ALTER MINING STRUCTURE TestStructure
ADD MINING MODEL BikeBuyerClass
(
CustomerKey,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH DRILLTHROUGH
|
UI:
In the UI, you can create the mining
model by selecting create related mining model. Name the model BikeBuyerClass
and select Decision Trees as the algorithm. Once the model is selected you will
have to set the BikeBuyer column to Predict
and the Region column to ignore as
shown below:

Now
you can add a mining model to the mining structure using only the data for the
Pacific area:
|
DMX:
ALTER MINING STRUCTURE TestStructure
ADD MINING MODEL DT_Pacific
(
CustomerKey,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH FILTER( Region=’Pacific’ ), DRILLTHROUGH
The
new syntax element is the FILTER construct, which instructs the model to use
in training only those cases where the Region column contains the ‘Pacific’
value.
|
UI:
To create the filtered model in the UI,
you will need to create the model first and then add the filter. To use fewer
steps, you can create the new model based on BikeBuyerClass model we just
created. Just right click on the BikeBuyerModel name and select “New Mining
Model” as shown below.

This
can also be accomplished by picking the mining model you want to clone and
using the Mining Model drop down from the file menu and selecting “New Mining
Model”.
Now,
to add the filter you have to select “Set Model Filter” from the Mining Model
menu while the mining model is selected or from the right click contextual
menu. This will bring the Model Filter dialog.
The
dialog presents data in tabular form. Each row is a condition. Each condition
can have up to four columns:
·
And
/ Or: This will do a logical And/Or between the different conditions. Valid
values are “And” and “Or”. The value can only be set from the second condition
on. Current version doesn’t allow condition grouping from the grid.
·
Mining
Structure Column: This can be any column from the mining structure, values are
prefilled so you only need to select a column name from a drop down. Only
columns that will not show are key columns since you can’t filter on keys.
·
Operator:
This is a drop down with valid operators based on the data type of the selected
mining structure column (more on this below).
·
Value:
This is the value used to compare against the structure column using the operator. Value is not always necessary
(more on this below).

As
you can see above, the same expression used in the DMX version of the filter is
displayed in the Expression portion
of the dialog.
Note:
An expression can be edited manually by selecting “Edit Query” in the dialog.
Be careful: once the expression is edited you will not be able to use the grid
to add elements to the filter. Everything will need to be added by hand in the
Expression text box.
Similarly,
add a new model for the North America region:
|
DMX:
ALTER MINING STRUCTURE TestStruct
ADD MINING MODEL DT_NorthAmerica
(
CustomerKey,
YearlyIncome,
EnglishOccupation,
CommuteDistance,
BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH FILTER(
Region=’North America’ ), DRILLTHROUGH
|
UI:
Clone
DT_Pacific and edit the filter to be North America instead of Pacific as shown below.

Note that if you clone the DT_Pacific
model to create the DT_NorthAmerica model using the steps described above, the
filters in the base mining model will also be cloned. This can be a real time-saver when creating multiple models with similar filters
- the time-saving adds up when you start building complex filters.
|
DMX Deep-Dive on
Filter Syntax
As you
notice, FILTER is syntactically a function, taking a Boolean expression.
The Boolean expression may be a simple predicate (of the
Column=Value kind), an EXISTS predicate or a Boolean combination
(using AND, OR and NOT operators) of multiple
predicates.
The
Column part of a simple predicate is always the name of a mining
structure column.
The
corresponding mining structure column does not have to be included in the
mining model. In the examples above, Region is no longer an interesting
attribute when all the training cases seen by a model have the same
value (e.g. Pacific), therefore it is not part of the model.
A
simple predicate is defined as <Column> , where
the accepted set of operators depends on the content type of the column,
and Value must always be a constant. Value generally has
to have the same type as the column but can also be the NULL
constant.
UI:
In the Filter dialog in BI Studio the NULL constant is
denoted by two added values in the operator drop downs as IS NULL and IS
NOT NULL to demote <Column> = NULL or <Column> <> NULL
respectively.
If the
column is:
-
A
case level KEY - then no operator can be applied (no key filters are
supported) . Nested KEY on the other hand is just a mapping between
case and nested tables and can be used to filter (more on nested
table filtering later)
-
DISCRETE or DISCRETIZED - supports the = and <> operators and also
IS NULL and IS NOT NULL in the Filter dialog
-
CONTINUOUS - supports the =, <>, <, <=, >, >= operators and also IS
NULL and IS NOT NULL in the Filter dialog
XML
serialization of the floating point double numbers as well as rounding
errors may result in losses of less significant digits, therefore using
an = or <> operator with a double value may result in unexpected
results, so it is safer to use a combination of other predicates.
In case
of discretized (numeric or date/time) values, the = and <> operators
have a semantic similar with the PREDICTION JOIN operator: they evaluate
the bucket that contains the value and then apply the predicate to the
bucket index. Therefore, for a discretized YearlyIncome column with
buckets (0, 10000], (10000, 20000] etc. the filters below mean exactly
the same thing:
-
(YearlyIncome=5000)
-
(YearlyIncome=9000)
-
(YearlyIncome=999.999)
Applying a filter on a mining model means, effectively, training the
mining model with the data returned by a query like below:
SELECT
… FROM MINING STRUCTURE TestStructure.CASES
WHERE IsTrainingCase() AND ()
For
those of you who use OLAP mining models, filters are somewhat similar
to defining a cube slice on a mining structure - but at the model level.
If your
filtered model has drillthrough enabled, then all the
structure cases that match the filter will be linked with a content
node, not only the training cases from the structure. Of course, only
the structure training cases matching the filter will be
used in training.
Therefore, the following queries are equivalent
SELECT
… FROM MINING MODEL DT_NorthAmerica.CASES
WHERE IsTrainingCase()
SELECT
… FROM MINING STRUCTURE TestStructure.CASES
WHERE IsTrainingCase()AND (Region=’North America’) |
Model Accuracy in Filtered Models
Accuracy
stored procedures can be applied on various data partitions (test or training).
Filters add a new dimension to the data partitions, which can now be filtered
test data or filtered training data. Accuracy stored procedures have syntax
like below:
CALL SystemGetLiftTable(BikeBuyerClass, 2, ‘BikeBuyer’, true)
// - to get the lift chart for the TRUE state of the BikeBuyer attribute
CALL SystemGetClassificationMatrix(BikeBuyerClass, 2,
‘BikeBuyer’) // - to get the full classification matrix for the BikeBuyer
attribute
The
second parameter (2) is a bitmask identifying the data partition(s) to be used
in the accuracy stored procedure. Here are some bit values:
- 1 identifies the training
partition
- 2 identifies the test
partition
- 3 (bitwise OR combination of
1 and 2) identifies ALL the data in the mining structure (training + test)
Filters
introduce another bit in the bitmask, with a value of 4. Therefore, here are
some new values for the second parameter of the accuracy stored procedures:
-
5 (4 bitwise OR 1) identifies the filtered training data seen by the model in
training
- 6 (4 bitwise OR 2) identifies the filtered test data - those test cases
from the mining structure where the model filter applies
-
7 (4 bitwise OR 1 bitwise OR 2) identifies all the cases (training or test) in
the structure where the model filter applies
Therefore
the following calls might return different lift tables:
CALL SystemGetLiftTable(DT_Pacific, 2, ‘BikeBuyer’, true) // -
to get the lift chart for the TRUE state of the BikeBuyer attribute on all the
test cases in the structure
CALL SystemGetLiftTable(DT_Pacific, 6, ‘BikeBuyer’, true) // -
to get the lift chart for the TRUE state of the BikeBuyer attribute on the test
cases in the structure that match the filter defined in DT_Pacific
Filtering Nested Tables
The
most common use of nested tables in SQL Server Data Mining is to model
transaction tables. That is, model Customer entities together with the “bag” of
products they purchased.
Nested
table filters primarily serve two purposes:
-
Use (in the
model) only those nested table rows with a certain property (e.g. consider
only products that are not “Coca Cola”, because most people do buy “Coca
Cola”, so it is not that interesting)
-
Use (in the
model) only those cases where the associated nested tables have certain
properties (e.g. - build a prediction model on customers that have
purchased Coca Cola)
Notice
the difference between the filters: while the first filter removes products
from shopping baskets before modeling, the second removes customers from
the model (those that have not purchased Coca Cola).
Let’s
start with a mining structure containing customers, some demographic
information and a list of products they bought:
CREATE MINING STRUCTURE TestStructure2
(
OrderNumber TEXT KEY,
IncomeGroup TEXT DISCRETE,
Products TABLE
(
Product TEXT KEY,
LineNumber LONG
DISCRETE
)
)
Next
step, train the mining structure:
INSERT INTO MINING STRUCTURE TestStructure2
(
OrderNumber,
IncomeGroup,
Products
(
SKIP,
Product,
LineNumber
)
)
SHAPE {OPENQUERY([Adventure
Works DW], ‘SELECT
OrderNumber,
IncomeGroup
FROM dbo.vAssocSeqOrders ORDER BY OrderNumber‘)
}
APPEND ({ OPENQUERY([Adventure
Works DW], ‘SELECT
OrderNumber,
Model,
LineNumber
FROM dbo.vAssocSeqLineItems ORDER BY
OrderNumber‘) }
RELATE OrderNumber TO
OrderNumber) AS Products
Now,
the modeling part: for the first kind of filter described in the beginning, the
goal is to build a model that predicts some, for example, demographics
(IncomeGroup) based on the products a customer purchased, without considering a
certain product (say, the ‘Mountain-500′ bike):
ALTER MINING STRUCTURE TestStructure2
ADD MINING MODEL IncomeGroupFromProducts
(
OrderNumber,
IncomeGroup PREDICT,
Products
(
Product
) WITH FILTER(Product<>’Mountain-500′)
)USING Microsoft_Decision_Trees
The
syntax of the nested table filter is very similar with the syntax of model
filters - it effectively specifies a WHERE clause to be applied on the rows of
the nested table. The training data for the model can be obtained with the
following drillthrough query:
SELECT
OrderNumber, IncomeGroup,
(SELECT Product FROM
Products WHERE Product<>’Mountain-500′)
FROM MINING STRUCTURE TestStructure2.CASES WHERE IsTrainingCase()
For
the UI create a structure with no models and add a new Model setting the column
LineNumber to ignore (as described in
the previous section). The main difference you will notice from the previous
example is that now if you select the nested table column a new element is selectable
in the Mining Model / Contextual menu: Set
Nested Table Row Filter. Select this option to open the Nested table filter
dialog. You will now you opened the correct dialog because it will say Nested
Table Row Filter for <Model>.
The only items that will be available in the filter are nested table columns.
The filter for IncomeGroupFromProducts is shown below.

The
syntax changes for the second filter described before. Now the goal is to build
a model predicting IncomeGroup only for those customers that did
buy a certain product:
ALTER MINING STRUCTURE TestStructure2
ADD MINING MODEL IncomeGroupFromMountain500
(
OrderNumber,
IncomeGroup PREDICT,
Products
(
Product
)
)USING Microsoft_Decision_Trees WITH
FILTER(
EXISTS (SELECT * FROM
Products WHERE Product=’Mountain-500′)
)
The
new syntax element is the EXISTS predicate, which evaluates to TRUE if the
query specified as argument returns at least one row.
Note
that the list of columns for the EXISTS query is never used, so * is as good as
anything else. The WHERE clause of the EXISTS predicate supports, in the case
of nested table filters, only the simple
predicates (or Boolean combinations of such predicates)
described before.
The
same filter can be applied using the Model Filter dialog in BI Studio. The main
difference will be the available mining structure columns where you can select
any of the structure’s columns or the nested table. You need to select the
nested table as the Structure Column. Valid operators for nested tables are: Contains which translates into
Exists and Not Contains which is the equivalent of Not Exists.
If
you select the nested table then a ‘plus’ sign will appear to the left of that
condition, this means you can expand the nested table filter. The nested table
conditions work pretty much as regular conditions with the same <Column> <Operator> <Value>
rules. Everything inside the nested conditions will be used to create an Exists clause (with or without the
Not). You can have multiple of these
clauses and use AND / OR to create Boolean combinations. You can see the filter
for IncomeGroupFromMountain500 below:

The
training data for the model can be obtained with the following drillthrough
query:
SELECT
OrderNumber, IncomeGroup,
(SELECT Product FROM
Products)
FROM MINING STRUCTURE TestStructure2.CASES
WHERE
IsTrainingCase() AND
EXISTS( SELECT * FROM Products
WHERE Product=’Mountain-500′)
The
query actually works in DMX, which implies that DMX contains now the EXISTS
function which evaluates to TRUE if the sub-query argument returns at least one
row. Note that, when used in a regular DMX query (i.e. not in a filter), the
EXISTS sub-query can be applied to any nested table (including function
results, such as PredictHistogram) and can take any WHERE clause that is valid
in that context, including UDFs (therefore, outside of a filter, the EXISTS
clause is not limited to structure table columns and simple predicates).
Combinations
of filters and other SQL Server 2008 DMX features can be used to have very
specialized data views, allowing complex modeling scenarios.
Multiple Filters on
Nested Tables
Notice
how the structure above contains a LineNumber column in the nested table, a
numeric field indicating the category of the product on the same line. The
model below detects cross-sales rules that lead from products in line 1 (bikes)
to products in any other category:
ALTER MINING STRUCTURE TestStructure2
ADD MINING MODEL LineNumber1Recommendations
(
OrderNumber,
Products AS LineNumber1Products
(
Product
) WITH FILTER(LineNumber=1),
Products AS OtherLineNumberProducts PREDICT_ONLY
(
Product
) WITH FILTER(LineNumber<>1)
)USING Microsoft_Association_Rules
WITH FILTER(
EXISTS(SELECT * FROM Products WHERE
LineNumber=1) AND
EXISTS(SELECT * FROM Products WHERE
LineNumber<>1)
)
Here
is a breakout of the filters and constructs:
-
LineNumber1Products
is an input nested table, based on data in the Products structure column,
and containing only products in the category denoted by LineNumber 1
-
OtherLineNumberProducts
is a predictable (PREDICT_ONLY) nested table, also based on data in the
Products structure column, but containing only products in the categories
other than the one denoted by LineNumber 1
-
The model is
trained only on those customers that have both at least one product in the
LineNumber 1category and at least one product in a different category,
because any other customers are not relevant for the problem the model is
trying to solve.
-
All the filters
(both the nested table row filters and the model case filter) are applied
on mining structure columns which do not necessarily show in the model
(the case of LineNumber). When they do show in the model under an alias,
the filter is applied on the source structure column name.
UI:
The same can be accomplished in BI Studio by
creating Nested Filters for each nested table and then on Model filter with
both Exist clauses.
[Acknowledgement: Most of the content for this
article comes from a two-part
post on this topic on Bogdan
Crivat’s blog.]
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Mon 02/09/2009 @ 11:57
|
|
This tip provides insight into how the Attribute Discrimination pane in the Naive-Bayes viewer uses a system stored procedure to get its data
|
|
Under the Hood: How the Naive-Bayes Attribute Discrimination Viewer Gets Its Data
If you have ever wondered about how
the DM viewers get the data to display on the screen, read on.
In many cases, what is displayed in the data mining viewers is the result of
built-in stored procedures which allow the processing required for the view to
be done on the server without requiring all of the model content to be brought
to the server. In this "Under the hood" tip, we will dive into one of
those stored procedures.
Naive-Bayes Attribute Discrimination View
The attribute discrimination view for the Naive-Bayes algorithm shows the differences in the input attributes across the states of
an output attribute. It generally looks like this:

The viewer doesn’t download all of the correlations in the Naive-Bayes content,
rather it calls the stored procedure GetAttributeDiscrimination like this:
CALL System.GetAttributeDiscrimination
('Classify CollegePlans NB',
'100000005',
'Plans to attend',
1,
'All other states',
2,
0.0005,
true)
The not-so-obvious parameters are, in order strModel, strPredictableNode,
strValue1, iValType1, strValue2, iValType2, dThreshold, and bNormalize.
Let’s go through these parameters:
strModel – The name of the model.
strPredictableNode – This one is a bit difficult, as it takes
the Node Unique Name of the target attribute instead
of just the string you see in the viewer. The Node Unique Name identifies
the attribute in the content rowset generated by the model. You can get
the list of predictable attributes and their Node Unique Names by calling
another stored procedure – like this CALL
System.GetPredictableAttributes('ModelName'). This stored
procedure returns two columns – one for the attribute name and one for the Node
Unique Name.
strValue1 – The name of the value you want to compare on the
left hand side. The usage of this parameter depends on the value of the
next parameter.
iValType1 – This parameter indicates how to treat strValue1.
It can have values 0,1, or 2. If this parameter is a 1, the value in
strValue1 is the actual state of the attribute. However, if this parameter
is a 0 or 2, the value in strValue1 is ignored. If the value is 0, the
left-hand value is considered to be the “missing state”. If the value is
2, the left hand value is considered to be “all other states.” In the
example above, “All other states” is specified only because it looks nice (and
it’s easier to just drop the combo box value into the function call even if it
will just be ignored).
strValue2 – Like strValue1, but for the right hand side.
iValType2 – Like iValType2, but for the right hand side.
dThreshold – A threshold value used to filter results, such
that small correlations don’t come back in the results.
Usually you set it to a really small number like 0.0005 in the example above.
bNormalize - Whether or not the result is normalized.
If this value is true, the results are normalized to a maximum absolute value of
100, giving a possible range of –100 to 100. All this does is take the
largest absolute value in the result and divide that into 100, and then multiple
all the other numbers by that amount. If set to false, the numbers are
whatever they are and you can figure it out yourself – it’s up to you, but the
NB viewer
always sets this to true.
The Results
Calling this routine returns a row for every differentiating attribute/value
pair with a score higher than the specified threshold. The row contains
the differentiating pair along with the score and some other columns and looks
somewhat like this:
The score column is the “important” one and is best explained
as if you did something like a C language compare routine e.g
int Compare(int v1,int v2) { return v1-v2; }
. That is, if the value is positive, it favors value1 and if the value is
negative, it favors value2. The other
columns are the actual counts of the correlations of
the discriminator against the inputs. The best way to understand them is
to look at the Mining Legend as you browse the model and click on rows.
For example, if you clicked on the first row of the result above (in either
picture), the Mining Legend would look like this:
Of course, once you have the result set you can use it wherever you want – in
Reporting Services, Integration Services, or in your custom program.
|
Sidebar: How Do I See What Queries Are Being Sent by the Viewers?
-
Run SQL Server Profiler.
-
Start a New Trace from the File Menu and connect to Analysis Services.
-
Leave all the defaults on for the Trace Properties dialog that appears.
-
Go to the data mining viewer you're interested in digging into, and browse a
model.
-
Profiler will show you the queries that are being sent. Click on a "Query Begin"
event line in the top pane to see the full query text in the bottom pane.
|
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Mon 12/29/2008 @ 06:04
|
|
Learn how to use a new SQL Server 2008 feature to get more accurate predictions when you don't have enough historical data for a series.
|
|
Predicting future steps when you do not have enough history
No History? No Worries!
Say you’re launching a new product and you want to
predict what sales might look like in the next few months. Classical time series prediction does not
work in this scenario because you don’t have historical sales data for the
product. However, new SQL Server 2008
extensions to the Microsoft_Time_Series algorithm and DMX allow you to easily
apply the patterns from another similarly-behaving time series to solve this
conundrum.
Predicting Post-IPO Stock Values for VISA
In this
example, we illustrate how to use the PredictTimeSeries method with the
parameter REPLACE_MODEL_CASES to obtain predictions for a time series for which
we do not have enough historic data. The power of this combination comes into
play when we have a time series with not enough historic data points to build a
realistic model, but we know that this series follows a pattern similar to
another time series for which we have enough historic data to build a model.
Here’s an Excel 2007 workbook
that has 73 historic data points representing post-IPO daily closing values for
the MasterCard stock and just 8 corresponding values for Visa (since we’re
doing this 8 days after the Visa IPO). Our goal is to use the MasterCard stock
history to derive better predictions for the Visa stock.
We will use the SQL
Server 2008 Data Mining Client Add-in for Excel to build and query the time series model:
1.
Make sure you have Excel 2007 with the DM
Client add-in installed.
2. Save the
workbook with
the MasterCard/Visa stock data to your local disk and open in Excel 2007.
3. To create a model for the MasterCard stock data
on the first worksheet, click on the “Data Mining” tab and select the
“Forecast” task.
4. Select “Next” on the first page of the forecast
wizard “Getting Started with the Forecast Wizard”. (Note: This page might not
appear if you previously selected the option to skip the welcome page.)
5. On the second page “Select Data Source”, select
the table we created previously and click on “Next” button.
6. On the “Forecasting” page, select the time stamp
column to be the first column, named “TimeStamp”.
7. In the input columns grid, de-select the
“TimeStamp” column and select the “MasterCard” column, then click “Next”.
8. On the last page of the wizard, rename the
structure “MasterCardStructure” and the model “MasterCardModel”, leave the
default selections to browse the model after it is created and to allow drill
through, and click “Finish” to end the wizard and proceed to build the model.
The MasterCard model historic data and the first 10
predicted values are illustrated in the following graph:

Now, use the same steps to create a time series model for
the Visa stock using the 8 historical data points on the second workbook sheet.
You will see right away that the model will not
generate meaningful predictions due to the lack of sufficient historic data
points. The VisaModel historic data and the next 10 predicted values are
illustrated in the following graph:

Better Predictions Using REPLACE_MODEL_CASES
A better approach is to use the knowledge that the Visa
and MasterCard stocks have a similar pattern and to use the model built for MasterCard
to obtain predictions for the Visa stock values. Here’s how (again using the
Data Mining Client Add-in for Excel):
1. Select the “Query” task from the “Data Mining”
ribbon and click "Next" on the
introductory page.
2. Select the “MasterCardModel” model and click the
“Advanced” button.
3. On the “Data Mining Advance Query Editor” page,
click on the button “Edit Query”, select Yes on the dialog asking to confirm
that “Any changes to the query text will not be preserved when you switch back
to the design mode.”
4. Type the following query:
SELECT
(SELECT $Time, [MasterCard] as
[Visa] FROM
PredictTimeSeries([MasterCardModel].[MasterCard],
10, REPLACE_MODEL_CASES)) as Predictions
From [MasterCardModel]
NATURAL PREDICTION
JOIN
(SELECT 1 AS [TimeStamp], 64.35 as
[MasterCard]
UNION
SELECT 2 AS
[TimeStamp], 62.76 as [MasterCard]
UNION
SELECT 3 AS
[TimeStamp], 64.48 as [MasterCard]
UNION
SELECT 4 AS
[TimeStamp], 66.11 as [MasterCard]
UNION
SELECT 5 AS
[TimeStamp], 69 as [MasterCard]
UNION
SELECT 6 AS
[TimeStamp], 75.1 as [MasterCard]
UNION
SELECT 7 AS
[TimeStamp], 82.75 as [MasterCard]
UNION
SELECT 8 AS
[TimeStamp], 82.86 as [MasterCard])
as t
5. Click “Finish” and select the results of the
query to be copied into a new worksheet.
The results should look like this:

When the REPLACE_MODEL_CASES parameter is used, the
PredictTimeSeries method will return the requested number of prediction
obtained by replacing the last historic points of the given model with the new
values provided in the query. In our case, the last 8 data points for the
MasterCardModel are replaced with the values we generate on the fly using the
SELECT and UNION options in the input set specified after the “NATURAL
PREDICTION JOIN” keywords. Then, the MasterCardModel equations are used to
predict the next 10 values for the Visa stock series.
To see the power of this method, we can compare the
predictions obtained using the MasterCard model (Predictions.Visa), with the
predictions generated by the VisaModel model obtained using only the limit sets
of 8 data points of the Visa stock values (Predictions.Visa2). The results are illustrated
in the following graph:

So there you go - you have a new tool in your arsenal when
you don’t have enough data to make accurate time series predictions. Enjoy!
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Wed 11/19/2008 @ 12:44
|
|
This tip shows you how to take the tedium out of putting together and verifying the correctness of DMX prediction queries for models with multiple nested tables - by generating them programmatically!
|
|
Scenario
Imagine you built a model with multiple nested tables and multiple predictable columns inside each nested table. In your application you want to run a prediction query against the model and gather prediction statistics for each predictable attribute. Such queries are usually long and tedious to write, plus they are hard to debug if you made a minor slip. This tip shows you how to programmatically generate such a query using C# and ADOMD.NET.
Let us consider a database schema with the following tables:
· Products (ProductID, ProductColor, ProductCategory, ProductPrice, ProductWeight)
· Stores ( StoreId, State, Size, Type)
· Manufacturers (ManufacturerId, Country, Price, ProductId)
· ProductToStores (StoreId, ProductId)
We construct the following model from the above schema:
CREATE MINING MODEL [Products] (
[Product ID] LONG KEY,
[Product Color New] TEXT DISCRETE,
[Product Category] TEXT DISCRETE PREDICT,
[Product Price] DOUBLE DISCRETIZED PREDICT,
[Product Weight] DOUBLE CONTINUOUS PREDICT,
[Stores With Products] TABLE
(
[Store ID] LONG KEY,
[State] TEXT DISCRETE,
[Size] LONG CONTINUOUS PREDICT,
[Type] LONG DISCRETE PREDICT
),
[Manufacturers] TABLE
(
[Manufacturer ID] LONG KEY,
[Country] TEXT DISCRETE,
[Price] DOUBLE CONTINUOUS PREDICT
)
)
USING Microsoft_Decision_Trees
Notice that the column names in the mining model have spaces (typical if you created the model using BI Development Studio). The column [Product Color New] has been renamed from the original column name. The nested table [Stores With Products] is created from a join of the two source tables. Our goal is to generate a PREDICTION JOIN statement which in the most generic form looks as follows:
SELECT [FLATTENED] [TOP n] select_expression_list
FROM model | sub_select [NATURAL] PREDICTION JOIN
source_data_query [ON join_mapping_list]
[WHERE condition_expression]
[ORDER BY expression [DESC|ASC]]
Building the Prediction Query
First let us consider the list of columns we want to generate in our select_expression_list for the Prediction Join query:
1. For each predictable column in the case table, we want to include the equivalent column from the data source if it exists. This enables us to compare the actual value to the predictable value in our application.
2. For each predictable column in the case table, we want to make available the following statistics:
a. Discrete Column: The histogram of state with probability for each state.
b. Continuous Column: The mean, STDEV and the probability
c. Discretized Column: The Min, Mid, Max and the probability.
From our example above, the query fragment for #1 and #2 would look as follows:
/*Discrete*/
T.[ProductCategory],
(SELECT Predict([Product Category]) as [Product Category],
PredictProbability([Product Category]) as [Prob]
FROM PredictHistogram([Product Category])) AS [Product Category Predicted]
/*Discretized*/
,T.[ProductPrice]
,(SELECT RangeMin([Product Price]) as [Min],
RangeMax([Product Price]) as [Max],
RangeMid([Product Price]) as [Mid],
PredictProbability([Product Price]) as [Prob]
FROM PredictHistogram([Product Price])) AS [Product Price Predicted]
/*Continuous*/
,T.[ProductWeight]
,(SELECT Predict([Product Weight]) as [Product Weight]
,PredictSTDEV([Product Weight]) as [STDEV]
,PredictProbability([Product Weight]) as [Prob]
FROM PredictHistogram([Product Weight])) AS [Product Weight Predicted]
3. For each nested table and for each predictable, we want to include the equivalent column from the data source if it exists. This enables us to compare the actual value to the predictable value in our application.
4. For each predictable column in the nested table and for each unique nested table key, we want to output the same data as #2.
From our example above, the query fragment for #3 and #4 would look as follows:
,(SELECT [ManufacturerId]
,[Price]
FROM T.[Manufacturers]) AS [Manufacturers]
,(SELECT [Manufacturer Id]
,(SELECT Predict([Price]) as [Price],
PredictSTDEV([Price]) as [STDEV],
PredictProbability([Price]) as [Prob]
FROM PredictHistogram([Price])) AS [Price Predicted]
,$PROBABILITY FROM Predict([Manufacturers], INCLUDE_STATISTICS, INCLUSIVE)) AS [Manufacturers_Predicted]
,(SELECT [StoreId]
,[Size]
,[Type]
FROM T.[Stores With Products]) AS [Stores With Products]
,(SELECT [Store Id]
,(SELECT Predict([Size]) as [Size],
PredictSTDEV([Size]) as [STDEV],
PredictProbability([Size]) as [Prob]
FROM PredictHistogram([Size])) AS [Size Predicted]
,(SELECT Predict([Type]) as [Type],
PredictProbability([Type]) as [Prob]
FROM PredictHistogram([Type])) AS [Type Predicted]
,$PROBABILITY
FROM Predict([Stores With Products], INCLUDE_STATISTICS, INCLUSIVE)) AS [Stores With Products_Predicted]
The last piece is the part source_data_query [ON join_mapping_list] . For our example, this looks as follows:
PREDICTION JOIN
SHAPE {
OPENQUERY([Test],'SELECT
[ProductCategory]
,[ProductColor]
,[ProductID]
,[ProductPrice]
,[ProductWeight]
FROM [Products] ')
}
APPEND
({
OPENQUERY([Test],'SELECT [ProductID]
,[Country]
,[ManufacturerId]
,[Price]
FROM [Manufacturers]
ORDER BY [ProductID]')
}
RELATE [ProductID] TO [ProductID]) AS [Manufacturers]
,({
OPENQUERY([Test],'SELECT [ProductID]
,[Size]
,[State]
,[StoreId]
,[Type]
FROM (SELECT dbo.Stores.StoreId, dbo.Stores.State, dbo.Stores.[Size], dbo.Stores.Type, dbo.ProductToStores.ProductId
FROM dbo.Stores INNER JOIN dbo.ProductToStores
ON dbo.Stores.StoreId = dbo.ProductToStores.StoreId)
AS [Stores With Products]
ORDER BY [ProductID]')
}
RELATE [ProductID] TO [ProductID]) AS [Stores With Products]
AS T
ON
T.[ProductCategory] = [Products].[Product Category]
AND T.[ProductColor] = [Products].[Product Color New]
AND T.[ProductID] = [Products].[Product ID]
AND T.[ProductPrice] = [Products].[Product Price]
AND T.[ProductWeight] = [Products].[Product Weight]
AND T.[Manufacturers].[Country]=[Products].[Manufacturers].[Country]
AND T.[Manufacturers].[ManufacturerId]=[Products].[Manufacturers].[Manufacturer Id]
AND T.[Manufacturers].[Price]=[Products].[Manufacturers].[Price]
AND T.[Stores With Products].[Size]=[Products].[Stores With Products].[Size]
AND T.[Stores With Products].[State]=[Products].[Stores With Products].[State]
AND T.[Stores With Products].[StoreId]=[Products].[Stores With Products].[Store Id]
AND T.[Stores With Products].[Type]=[Products].[Stores With Products].[Type]
Helper Class Outline
The helper class included with this tip generates all the above queries by reading the model metadata and sending PREPARE queries to the source to read the relational schema.
The usage pattern of the class is shown below:
string connString = "Data Source=localhost;Initial Catalog=Test";
PredictionQueryGenerator queryGen = new PredictionQueryGenerator(connString, "Test", "Products");
//Maps the model column [Product Color New] to DB Column [ProductColor]
queryGen.AddCaseTableColumnMapping("Product Color New", "ProductColor");
//Maps the nested table [Stores With Products] to a Query
queryGen.AddTableMapping("Stores With Products",
@"(SELECT dbo.Stores.StoreId, dbo.Stores.State, dbo.Stores.[Size], dbo.Stores.Type, dbo.ProductToStores.ProductId
FROM dbo.Stores INNER JOIN dbo.ProductToStores
ON dbo.Stores.StoreId = dbo.ProductToStores.StoreId) as [Stores With Products]");
//Initialized auto mapping
queryGen.InitAutoMapping();
//Generates the prints the query
Console.WriteLine(queryGen.DMXQuery());
Show Me the Code
Download the full source code for the helper class and play with it. The download also includes a SQL Server 2008 backup of the source database.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Fri 10/24/2008 @ 02:46
|
|
Learn how to query mining models from the SQL Server relational database engine using T-SQL and save the results into SQL tables.
|
|
Getting Data Mining Results into SQL Tables
Let Us Count the Ways
A common
customer question is “How do I get data mining results into SQL tables?” All
of the following methods work:
1.
Use the Save button from the Prediction
Query Builder in BI Development Studio. Note that this only works in the interactive,
single-user case - it cannot be used programmatically.
2.
Build an SSIS package containing
a data flow task that terminates in a SQL Server Destination, with a Data
Mining Query Transform in the middle.
3.
Create a linked server on the
target SQL Server relational database server that points to the Analysis Server
hosting your mining models.
This tip
will dive into the third way. We will show you how to use a linked server to
execute DMX queries from the SQL Server relational engine; once you can do
that, it’s simple to manipulate the data to your heart’s content using T-SQL
functions and save it to a table using SELECT-INTO.
Linking SQL Server to Analysis Services
Establish
a link to an AS server as follows:
EXEC sp_addlinkedserver
@server='LINKED_AS', -- local SQL name
given to the linked server
@srvproduct='',
-- not used (any value will do)
@provider='MSOLAP', -- Analysis
Services OLE DB provider
@datasrc='localhost', -- Analysis Server
name (machine name)
@catalog='MovieClick' -- default
catalog/database
Alternatively,
you can use SQL Server Management Studio to create the linked server:
-
Connect to your
SQL Server Database Engine instance using SQL Server Management Studio.
-
Expand the
Server Objects node under the top level node for your server in Object
Explorer.
-
Right-click on
Linked Servers and select “New Linked Server …”.
-
Enter a name for
the Linked server (“LINKED_AS”).
-
The “Other data
source” radio button should be selected by default for Server type.
-
Select the
latest version of the Microsoft OLE DB Provider for Analysis Services (10.0
for SQL Server 2008) for Provider.
-
Enter any name
in the Product name field (“DM” will work).
-
Enter the name
of the Analysis Server instance hosting your mining model(s) in the Data
source field.
-
Enter the name
of the AS database containing the model(s) you want to query in the
Catalog field.
-
Click OK.
(Note:
You can generate the “EXEC spaddlinkedserver”
statement by right-clicking on the newly-added linked server and selecting “Script
Linked Server as …”.)
Running DMX Queries from T-SQL
Now you can
do data mining queries from the SQL Server relational engine with T-SQL and insert the
results into a SQL table using OPENQUERY like this:
SELECT *
INTO DMResults FROM
OPENQUERY(LINKED_AS,
'SELECT Cluster() AS [Cluster], ClusterProbability() AS
[Prob]
FROM
[Customers - Clustering]
NATURAL
PREDICTION JOIN
OPENQUERY([Movie Click],''SELECT * FROM Customers'') AS t')
Of course,
you can also do all kinds of manipulations on the results that may not have been
possible in straight DMX. For example, finding the average cluster probability
of each cluster becomes simple. Well, almost – since the data type returned by
the Cluster() function is something that GROUP BY does
not support, you have to do some casting first. The actual query would
look like this:
SELECT Cluster,
AVG(Prob) FROM
(SELECT CAST(Cluster
AS Char(30)) AS Cluster, Prob FROM OPENQUERY(LINKED_AS,
'SELECT Cluster() AS [Cluster],
ClusterProbability() AS [Prob] FROM [Customers - Clustering]
NATURAL
PREDICTION JOIN
OPENQUERY([Movie Click],''SELECT * FROM Customers'') AS t')
) AS t
GROUP BY
Cluster
That will
give you a nice result showing you, in a way, the affinity of each cluster
based on the input set. That is, if you ran such a query against the
training data, you could say that the clusters with a higher probability are
"tighter" than the ones with low probabilities.
And this is
is just the starting point. We are sure you will unleash your creativity and
SQL skills to come up with page-long T-SQL queries on top of DMX.
Key Things to Remember
When you’re putting together the DMX to embed in
the OPENQUERY from SQL Server, keep the following in mind:
1. Double
your single quotes.
2. Flatten
nested results. Even if you are not explicitly including a sub-select or
table-returning expression in your query, remember that a simple “SELECT *”
might include a nested table. Example: “SELECT * FROM model.CONTENT”.
3. Alias
function calls like PredictProbability(). If you have multiple function
calls in your DMX statement, they will all have a default column name of
"Expression" and T-SQL will complain about that.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Fri 08/22/2008 @ 09:33
|
|
This article highlights a new downloadable sample that allows you to explore all the new Time Series enhancements in SQL Server 2008.
|
|
SQL Server 2008 has powerful new Time Series features that you will definitely want to explore and put to use. We have built the SQL Server 2008 Forecasting Sample app to help you get up to speed. The tool features readily-accessible time series data, an easy-to-use interface to model and play with the algorithm features and an explanation window with a link to the server-side DMX queries so you can understand what's going on under the hood.
Time Series Enhancements in SQL Server 2008
The sample highlights the following SQL Server 2008 Time Series enhancements:
-
Building time series models using either the ARTXP algorithm carried over from SQL Server 2005, the new ARIMA algorithm in SQL Server 2008, or MIXED mode that blends the two algorithms using a smoothing parameter
-
Prediction with new data using a NATURAL PREDICTION JOIN on a time series model
-
Adding new training data on the fly during prediction using EXTEND_MODEL_CASES prediction flag
-
Predicting a new series based on the model for a similar series using the REPLACE_MODEL_CASES flag
You can learn more about these features from SQL Server 2008 Books Online. Or if you prefer video, watch the Mastering Time Series with SQL Server 2008 Data Mining TechNet webcast.
Forecasting Sample Feature Highlights
-
Get easy access to external time series data (stock quotes) and built-in sample data
-
Upload your own data
-
Easily build, predict and compare time series models
-
Visually display dependencies of predictions on past data from the ARTXP and ARIMA equations
-
Explore various properties of predictions (standard deviation, ARIMA equation)
-
Perform what-if analysis by changing future predictions
-
Predict a data series by applying different models
-
Copy the historic and predicted data into the clipboard for external analysis
-
Dynamically zoom into any chart area
-
View detailed text explanations for all actions, plus the actual DMX statements associated with each action
-
Just click the Help button if you're lost

So, download the SQL Server 2008 Forecasting Sample and dig into these features.
We will be posting periodic articles on fun things to do with this sample, as well as updates.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Mon 04/07/2008 @ 07:30
|
|
This tip shows you how to create server data sources directly from DMX by calling a stored procedure (that you can build and deploy using the accompanying source code).
|
|
Creating a data source from DMX
As we know, DMX is not just a query
language - it includes syntax for creating and
processing mining structures and models as well. So
you could choose build your entire data mining
project using DMX, except for one thing: to train
your mining structure/model using INSERT-INTO, you need a
data source object and DMX doesn't give you a way to
create one!
Fortunately, this is easy to fix - just
write a stored procedure that creates data source
objects for you, deploy it to the server and viola,
you can create as many data sources as you want
directly from your queries without having to write
C# code each time to do so.
Deploying the Stored Procedure
You can create a C# class library
project in VS and replace the contents of the
Class1.cs file with the code below or you can
download the project and just fix up the references.
Follow these steps to build and
deploy the project:
-
Add a reference to the msmgdsrv.dll (Microsoft.AnalysisServices.AdomdServer.dll)
class library from the Browse tab.
This class library is available in the location where Analysis Services
2005 is installed By default, this location has the form: C:\Program Files\Microsoft SQL Server\MSSQL.2\OLAP\bin. Please note that, depending on your installation options, MSSQL.2 might be
MSSQL.1 or MSSQL.3 or so.
-
Add a reference to "Analysis
Management Objects" from the .NET tab.
-
Build the project.
-
Deploy the stored procedure on
the server:
-
Open SQL
Server Management Studio.
-
Connect to
the target Analysis Services server instance.
-
In the Object
Explorer, select the 'Assemblies' node at the server level.
-
Right-click
and select 'New Assembly'.
-
For the
FileName, field, use the
Browse (...) button to navigate to the location where the stored
procedure was built. A file named CreateDataSourceSP.dll should be available
at that location, under the bin\debug folder. Select that
file.
-
Assembly Name
should be automatically filled with CreateDataSourceSP.
-
Set Permissions to
Unrestricted.
-
Set Impersonation to "Use the
credentials of the current user".
-
Click OK to
deploy the stored procedure.
Using the Stored Proc from DMX
You can now call the stored procedure to create a
new data source using the following syntax in DMX:
CALL
CreateDataSourceSP.CreateDataSource('NewDS',
'Provider=SQLNCLI;Location=localhost','ImpersonateCurrentUser','','')
Source Code
And finally, here's the
source code if you want to create your own project:
using
System;
using
System.Collections.Generic;
using
System.Text;
using
Microsoft.AnalysisServices;
using
Microsoft.AnalysisServices.AdomdServer;
using
System.Diagnostics;
namespace
SSDM
{
public class
SSDM
{
[SafeToPrepare(true)]
public void
CreateDataSource(
string DataSourceName,
string ConnectionString,
string
sImpersonationMode,
string Account,
string Password)
{
if (Context.ExecuteForPrepare)
return;
// Connect to the calling session
Microsoft.AnalysisServices.Server svr = new
Microsoft.AnalysisServices.Server();
svr.Connect("*");
// Get the current database
Database db = svr.Databases.GetByName(Context.CurrentDatabaseName);
// Create a new datasource and set connection string
DataSource ds = db.DataSources.AddNew(DataSourceName, DataSourceName);
ds.ConnectionString = ConnectionString;
// Set the impersonation info
bool FoundMode = false;
foreach (ImpersonationMode mode
in Enum.GetValues(typeof(ImpersonationMode)))
{
if (mode.ToString() == sImpersonationMode)
{
FoundMode = true;
ds.ImpersonationInfo = new
ImpersonationInfo(mode, Account, Password);
break;
}
}
if
(!FoundMode)
throw (new
SystemException("Invalid
Impersonation Mode"));
// Update the server with the new datasource
ds.Update();
}
}
}
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Wed 08/06/2008 @ 08:48
|
|
This tip shows you how to get all the pieces you need to try out the new SQL Server 2008 DM Add-Ins for Office 2007.
|
|
The final release for SQL Server 2008 Data Mining Add-ins for Office 2007 is available for download. Here are the steps to setup the add-ins on your machine and play with the new features:
-
Download the evaluation version of SQL Server 2008 and install Analysis Services (client and server components). This should install .NET Framework 2.0 and ADOMD.NET, both prerequisites for the add-ins.
-
If you do not have Microsoft Office 2007 installed on your machine, download and install the trial version of Microsoft Office Professional 2007 (Windows Live sign-in required).
-
If you do not have Microsoft Visio 2007 installed on your machine, download and install the trial version of Microsoft Visio Professional 2007 (Windows Live sign-in required).This is required only if you want to use the SQL Server 2008 Data Mining Add-in for Visio.
-
Download and install the add-ins. Only the Table Analysis Tools add-in is installed by default so make sure you select all components during setup if you would like to test-drive all three add-ins (there is no reason not to).
-
Follow the Getting Started wizard (which launches the first time you run Excel 2007 or Visio 2007 after installing the SQL Server 2008 DM Add-ins) to configure your Analysis Services instance correctly for using the add-ins. If you miss it for some reason when starting Excel 2007 or Visio 2007, the Getting Started wizard is available from the Help button under the Data Mining tab (ribbon) in Excel 2007 as well as from Start menu -> Programs -> Microsoft SQL Server 2008 DM Add-ins.
-
Open up the sample Excel workbook provided with the add-ins (Start menu -> Programs -> Microsoft SQL Server 2008 DM Add-ins -> Sample Excel Data), click within the table on the Table Analysis Tools Sample sheet and select the Analyze tab under Table Tools to see the Table Analysis Tools ribbon. Or alternatively, you can create your own table in Excel by selecting a range and clicking the Format as Table button from the Home tab (ribbon).
You can get more information about the new features in this version of the add-ins on this page. You can learn more about all the cool features carried over from the SQL Server 2005 version here.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Tue 01/15/2008 @ 04:31
|
|
This tip shows you how you can apply the cross-validation feature in the upcoming SQL Server 2008 release to estimate if the training set size is sufficient for a given model.
|
|
How much training data is enough?
This is a question that frequently
comes up: when do you know that feeding more data to the mining
model will stop improving accuracy? We think the accuracy
procedures in
SQL Server 2008 might be able to help you get some
reasonable estimates.
Background
Algorithms learn from data and,
generally, feeding them with more data will improve the
accuracy. This improvement, however, is asymptotic. If the
training data is statistically representative of the problem
space, then a classifier will stop learning after a certain
limit: new data points will support existing patterns and will
help reducing the risk of over-training, but will not add more
information to the model. The solution we suggest for SQL Server
2008 allows you to figure out if your mining model is likely to
gain from adding new data or is likely to maintain the same
accuracy.
The solution is based on the
new cross-validation feature in SQL Server 2008, so we
will start with a small presentation of this mechanism.
Cross-validation takes a data set, splits it into a number of
(roughly equal size) partitions, called folds, then builds a set
of mining models (as many as there are folds). Each mining model
is built on top of the whole data set minus one
of the folds and then the accuracy of the model is measured on
the fold that was left out. Effectively, all data is used in
training and all data is left out (by at least one mining
model). Also, each model is validated against data that was
not seen during training. The accuracy measures
for each model are then evaluated across all the folds. A good
mining solution would provide compact accuracy measurements
(i.e. no big differences between the measurements on different
folds). Spikes in the measurements may indicate problems with
the data. Note that cross validation does not improve the
accuracy of the model, it just provides an “honest” estimation
of that, reducing the risk of over-training.
In SQL Server 2008, cross
validation allows the user to specify how many cases should be
used for the procedure. This is mainly for performance purposes
— it allows a quick comparison between different mining
models/algorithms on the same data set.
Using Cross-Validation To Determine
Data Sufficiency
Now back to the original
problem: would adding new data benefit accuracy?
The cross-validation stored
procedure supports multiple syntax flavors. The one that is
interesting for our purpose is described below:
CALL
SystemGetCrossValidationResults( ,
, , , ‘’)
The first two parameters are
straightforward: the mining structure containing the data and the mining
model to be cross validated. (Note that the stored procedure
does not require either the mining structure or the model to be
processed; if mining structure is not processed, the stored
procedure will process it first, but not its contained models).
As we mentioned, the procedure creates a number of models and
the source model is used as a template.
The Number of Folds
parameter specifies how many folds should be used in the
procedure while the Number Of Cases parameter specifies
how much is the data set to be used by the procedure. The Target
Attribute parameter is used for:
- ensuring stratified sampling
for the folds (the distribution of the Target Attribute’s states
is almost the same in all the folds)
- directing the accuracy
measurements to be performed on the mining models built by the
procedure.
A typical call is would set
NumberOfFolds to 10 and NumberOfCases to, say,
1000. Such an invocation builds a 1000-case sample (using random
selection)with ten 100-case folds over the sample. 10
models are built, each trained with 900 cases and evaluated
against the remaining 100.
To see if new data would
improve the accuracy, we could build models starting with a
small training set, then incrementally increase (for example,
double) the training set size until the accuracy flattens.
The Algorithm
Here are the actual steps of
the algorithm:
Step 1: Start with an
existing mining structure, using a reasonably large data set
We started with a mining
structure, MyStruct, built on top of the [v Target Mail] view of
the
Adventure Works BI sample database.
Step 2: Define a mining
model in the structure.
We created a simple model,
ClassifyBikeBuyer, predicting Bike Buyer and also English
Occupation. We used Microsoft Decision Trees with all the wizard’s
default suggestions, with the exception that we changed Bike Buyer to
"Discrete
Boolean".
Step 3: Start with a
rather small training set size (say, 10 — it is highly unlikely
that 10 is a good training set size, but it makes the charts
look good :-) )
Step 4: Execute the
cross validation procedure using 2 as the number of folds and 2*TrainingSetSize
as the number of cases. This step is the interesting part of the
algorithm: we have to use at least 2 folds, or the procedure
would fail. On the other hand, we want to use
at least 2 folds, so that the accuracy measurements are not
biased.
CALL
SystemGetCrossValidationResults( MyStruct,
ClassifyBikeBuyer, 2, 20, ‘Bike Buyer’)
Step 5: Record the
accuracy results for the current training size. The stored
procedure result contains multiple accuracy measurements for
each model built by the procedure. As the number of folds
argument was 2, there are two sets of results:

Now, there are a few things to
be recorded in the result set:
- First, the values of the
accuracy measurements for each fold. we used a single measure,
the Log Likelihood Score, a value that, by itself, does not
depend directly on the training or test set size (Pass/Fail
classification scores do depend on these)
- Then, the partition size for
each fold. This is very important, because it is the termination
condition for the algorithm (we do not know ahead of time how
much data is in our mining structure). The folds are almost the
same size; the stratified sampling mechanism may lead to minor
differences. If at least one fold has a size exceeding or equal
to our training set size, the procedure had enough data (see more
details on Partition Size in Step 6).
To simplify the results for
this execution (training set size = 10 cases), we stored the μ
(average) and σ (standard deviation) for the Likelihood Log
Score measurement
Step 6: Double up the
training set size and repeat from step 4 until not enough data
is found.
The Partition Size in the
result is not the size of the training data for
the model indicated by Partition Index. Every row marked with
Partition Index = 1 is about the model trained leaving partition
1 out (and, therefore, evaluated on partition 1).
The procedure above will stop
when step 4 asks for a data set size which exceeds the data
available in the mining structure.
The Output
Let's examine the results of
this process.
Our sample mining structure has
around 13k cases (the view contains 18k and we preserved 30% as
holdout). Therefore, the procedure worked correctly for data
sets of: 20, 40, 80, 160, 320, 640, 1280, 2560, 5120 and 10240
cases (so it stopped with a training size of 5120 cases —
remember, data size is split in 2 folds!). When asking for 20480
cases, the procedure only found enough data to create 2
partitions of 6470 cases. To evaluate the accuracy of (almost)
the whole training set in the structure, we added one more call:
CALL
SystemGetCrossValidationResults( MyStruct,
ClassifyBikeBuyer, 10, 0, ‘Bike buyer’)
Effectively, we asked the
procedure to perform 10-fold cross validation, so we can
evaluate the accuracy of a model using 90% of the training data.
Not exactly the whole data set, but close enough!
We repeated the same steps for
the ‘English Occupation’ target and plotted the results in
Excel. The chart below presents the accuracy evolution
(Likelihood Log Score) with the size of the training set, and
the error associated with each data point is the standard
deviation for the likelihood log score among the folds of that
size.

So, what does the chart tell
you?
Likelihood Log Score is a
measure that cannot exceed 0. The closer it is to 0, the better
is the accuracy of the model. The chart above suggests that:
- Our structure data is
pretty good for predicting Bike Buyer. The accuracy grows
minimally after 1280 training cases
- However, the data does not
seem enough to predict the English Occupation. As you can
see, the slope is still pretty steep on the last segment of
the English Occupation line.
And Now, The Code
The algorithm seems interesting
enough to apply it later, so we encapsulated it in a small
C# Analysis Services stored procedure.
All the code is available in
this file:
DataSufficiency_AccuracyEx.cs. To use it, you will need to download the
file, create a new C# class library project, add a reference to
Microsoft.AnalysisServices.AdomdServer library and include the
file in the project.
Once you build and deploy the
stored procedure (Safe permission set, impersonate current
user), you can invoke it with a call like this:
CALL
AccuracyEx.EvaluateModelingData(’ClassifyBikeBuyer’, ‘English
Occupation’, ‘Log Score’)
- The first parameter is the
mining model (which does not need to be processed, but must
belong to a processed structure)
- The second parameter is
the target predictable attribute.
- The last parameter is the
measure to be used in the result. If an empty string is
passed as argument, then all the measures will be returned.
Here is the output we used to
build
the chart shown earlier:

The procedure can be easily
modified to work for multiple predictable targets (no input
target attribute) and even for Clustering models.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Tue 11/27/2007 @ 10:36
|
|
This tip provides insight into the inner workings of the Microsoft Naive Bayes algorithm, showing how the algorithm computes the score used to filter out correlations.
|
|
[Note: This article is the first in a new "Under the Hood" series that explores the inner workings of the SQL Server Data Mining algorithms and infrastructure. Enjoy the deep dive!] Minimum Dependency Probability? When you create a Microsoft_Naive_Bayes model in SQL Server 2005 Data Mining, you can set an algorithm parameter MINIMUM_DEPENDENCY_PROBABILITY that specifies the minimum dependency probability between the input and output attributes. The algorithm then filters out input attributes that have a dependency probability lower than the set threshold. In this article, we show how exactly to compute the dependency probability between an input and an output attribute. The main idea behind this comes from the research paper titled “A Bayesian Approach to Learning Bayesian Networks with Local Structure”. A Naïve Bayes model can be treated as a special case of a Bayesian network which has conditional probabilities at every node. For a Naïve Bayes model, the dependency is just between input attributes and the output attribute only. Computation Walkthrough Let us consider a simple dataset with one input attribute “Number of Children” and one output attribute “Bike Buyer”. | Bike Buyer | Number of Children | Count | | No | 0 | 1 | | No | 1 | 9 | | No | 2 | 35 | | Yes | 0 | 40 | | Yes | 1 | 10 | | Yes | 2 | 5 | Let X = Score of the Bayesian Network without the split on “Number of Children” Y = Score of the Bayesian Network with split on “Number of Children” P = dependency probability of the node By definition the dependency probability is proportional to the score which gives us: P/ (1-P) = Y / X Taking natural LOG on both sides: LOG (P / (1 – P)) = LOG Y – LOG X For computing the score for a split, we use a function LG which is the Log Gamma function. See the end of this article for an explanation on how to compute the log gamma function. We first evaluate X using the following table: Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’) Number of cases = 100 | Bike Buyer = missing | LG(1 + 0) – LG(1) = 0 | | Bike Buyer = 0 | LG(1 + 45) – LG(1) = 129.124 | | Bike Buyer = 1 | LG(1 + 55) – LG(1) = 168.327 | | Bike Buyer | LG(3) – LG(3 + 100) = -375.82 | LOG X = -74.835 We then evaluate the value of Y by splitting the data set using the three different values of Number of Children and computing the partition score in all the three cases. Number of Children = 0 Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’) Number of cases = 41 | Bike Buyer = missing | LG(1 + 0) – LG(1) = 0 | | Bike Buyer = 0 | LG(1 + 1) – LG(1) = -4.4e-11 | | Bike Buyer = 1 | LG(1 + 40) – LG(1) = 110.32 | | Bike Buyer | LG(3) – LG(3 + 41) = -123.526 | Partition Score = -10.519 Number of Children = 1 Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’) Number of cases = 19 | Bike Buyer = missing | LG(1 + 0) – LG(1) = 0 | | Bike Buyer = 0 | LG(1 + 9) – LG(1) = 12.801 | | Bike Buyer = 1 | LG(1 + 10) – LG(1) = 15.104 | | Bike Buyer | LG(3) – LG(3 + 19) = -46.679 | Partition Score = -16.781 Number of Children = 2 Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’) Number of cases = 40 | Bike Buyer = missing | LG(1 + 0) – LG(1) = 0 | | Bike Buyer = 0 | LG(1 + 35) – LG(1) = 92.136 | | Bike Buyer = 1 | LG(1 + 5) – LG(1) = 4.787 | | Bike Buyer | LG(3) – LG(3 + 41) = -119.741 | Partition Score = 20.155 LOG Y = -47.455 Node Score = LOG y – LOG X = 27.379 Node Probability = 0.99999 The above example shows a very high probability for attribute “Number of Children” predicting the value of “Bike Buyer” which is apparent from the data set. Let us consider another example where “Number of Children” is a very poor predictor for “Bike Buyer”: | Bike Buyer | Number of Children | Count | | No | 0 | 10 | | No | 1 | 15 | | No | 2 | 25 | | Yes | 0 | 10 | | Yes | 1 | 15 | | Yes | 2 | 25 | Using the same score computation method, the values obtained are as follows: Node Score = LOG y – LOG X = -6.965 Node Probability = 0.000943 We see that the attribute “Number of Children” has a very low probability of predicting the Bike Buyer state. How Do I Verify This? Simple - use the accompanying download for this tip. It includes an Analysis Services project that uses an Access .mdb file with two data tables, “BikeBuyer1” for the first case and “BikerBuyer2” for the second, and creates two mining structures with a Naïve Bayes model for each case. It sets the algorithm parameter MINIMUM_DEPENDENCY_PROBABILITY to 1e-7 in the second case so that the node “Number of Children” shows up and also sets each attribute type to “Discrete”. Once the models are processed, you can use the Mining Content viewer under the “Model Viewer” tab in Business Intelligence Development Studio to look up the node score. The Mining Content viewer will have a column MSOLAP_NODE_SCORE which shows the value of the node score for Number of Children where the Node Type is 10. | Sidebar: Computing Log Gamma There are several numerical methods available to compute the gamma function and hence the natural logarithm of the value which we term as the log gamma function. The series is represented as follows: Gamma(z + 1) = (z + d + .5)^(z + .5) e^-(z + d +.5) * sqrt(2*pi) * [c0 + c1/(z+1) + ... + Cn/(z+n) Ln[Gamma(z + 1)] = (z + .5) * ln [z + d + .5] - (z + d + .5) + ln [sqrt(2*pi) * SERIES] |
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Tue 11/13/2007 @ 01:59
|
|
This tip shows you how to apply new DMX syntax to build a False-Positive/False-Negative report using Reporting Services.
|
|
In a previous tip, we explored new syntax introduced in SQL Server 2005 SP2 that allows you to pass column references in place of parameters in DMX queries. One of the nice scenarios this enables is generation of accuracy reports using Reporting Services. We will show you how to build one of these in this tip.
The Scenario
Let’s imagine that we are working for a bank that wants to detect possibly fraudulent transactions and investigate them. It might cost the bank just a few dollars to investigate a transaction that is flagged as fraudulent. On the other hand, the bank may lose several thousand dollars on every transaction that was in fact fraudulent but was not reported as such. Our goal is to minimize the bank’s expenses by detecting the probability threshold at which transactions should be reported as fraudulent.
The Cost of False Positives and False Negatives
A prediction is considered a true-positive (TP) if the probability of the prediction is above a certain threshold and the predicted state is the correct state. In DMX terms, this can be expressed as follows:
PredictProbability([Column],@State) > @Threshold AND t.[Column] = @State
A Prediction is true-negative (TN) if the probability of the prediction is below the threshold and the column’s state is not the predicted state, that is
PredictProbability([Column],@State) <= @Threshold AND t.[Column] <> @State
A prediction is a false-positive (FP) if
PredictProbability([Column],@State) > @Threshold AND t.[Column] <> @State
And finally, a prediction is a false-negative (FN) if
PredictProbability([Column],@State) <= @Threshold AND t.[Column] = @State
Computing the Cost
The bank’s expenses are equal to:
Number of false negative predictions * Cost of false-negative prediction + Number of false positive predictions * Cost of false-positive prediction
Putting Together the Report
Let’s generate a report that shows the number of true-positive, true-negative, false-positive and false-negative predictions and the bank’s expenses associated with all of the transactions. The report allows the user to model various thresholds and pick one that minimizes the associated expenses.
The accompanying download for this tip includes the complete BI Development Studio solution for the report as well as a backup file of the Analysis Services database containing the model that’s used by the report.
The following steps walk you through the process of creating the same report from scratch.
Step 1: Restore the database backup
Go to SQL Server Management Studio, connect to your Analysis Services instance and restore the database from the BankDB.abf file that is part of the download.
Step 2: Use the Report Server Project Wizard to create the report
1. Open Visual Studio and create a project using the Report Server Project Wizard template.
2. Click Next button on the “Welcome to the Report Wizard” page.
3. To define a new data source, select “Microsoft SQL Server Analysis Services” from the list of data source types, and click the “Edit…” button. Type in the server name and select the database name. Click OK and Next.
4. Click the “Query Builder…” button on the “Design the Query” wizard page.
a. When the Query Builder is opened, click the “Query Parameters” button in the toolbar and define the following query parameters:
· The State parameter will be used to specify the state that is being predicted. Fraudulent transactions have a value of 1 in the Fraudulent column, that is why parameter value is set to 1.
· Threshold parameter will be used to specify probability threshold.
· Cost_Per_FP parameter specifies cost of investigating false-positive transactions
· Cost_Per_FN parameter specifies average bank’s losses on fraudulent transaction.
Click Ok when all query parameters are defined.
b. Click “Design Mode” button in the toolbar of the Query Builder to turn off the design mode. Copy and paste following query and click Ok.
SELECT
vba!cint(PredictProbability([Fraudulent],@State)>@Threshold and t.[Fraudulent]=@State)*(-1) as [TP],
vba!cint(PredictProbability([Fraudulent],@State)<=@Threshold and t.[Fraudulent]<>@State)*(-1) AS [TN],
vba!cint(PredictProbability([Fraudulent],@State)>@Threshold and t.[Fraudulent]<>@State)*(-1) AS [FP],
vba!cint(PredictProbability([Fraudulent],@State)<=@Threshold and t.[Fraudulent]=@State)*(-1) AS [FN]
From
[BankModel]
NATURAL PREDICTION JOIN
(SELECT * FROM BankModel.CASES) AS T
5. Click Next button in the wizard
6. Click Finish button twice.
Step 3: Modify the report in the report designer
1. Change the report title from “Report1” to “Bank Report”.
2. Remove the table with the TP, TN, FP and FN values; we will not need it.
3. Open the Toolbox window and drag and drop 10 textboxes. Type the text for 5 of the textboxes as shown below:

4. In the report, the number of true-positive, true-negative, false-positive and false-negative transactions can be reported as
=SUM(Fields!TP.Value),
=SUM(Fields!TN.Value),
=SUM(Fields!FP.Value),
=SUM(Fields!FN.Value)
respectively.
5. Total expenses can be calculated as
=Parameters!Cost_Per_FP.Value*SUM(Fields!FP.Value)+Parameters!Cost_Per_FN.Value*SUM(Fields!FN.Value)
6. Change the font and color of the text boxes to match your aesthetic preferences.
The Final Output
Now we can estimate the bank’s expenses by plugging in different combinations of values for the Threshold, Cost_Per_FP and Cost_Per_FN parameters. Switch to the Preview tab to see the results.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Sat 07/28/2007 @ 11:16
|
|
This tip shows you how to leverage the programmability component of the Visio DM Add-in to render a model as an interactive web page from your app.
|
|
Programmatically generating HTML visualizations of models using DM
Add-ins
Let's say you have an
application that builds mining models on the fly for your business users. How do
you quickly generate a visualization that they can easily share as a interactive
web page? No worries - the Visio Add-in from the
SQL Server 2005 DM Add-ins
for Office 2007 package has a programmability component that will let you
accomplish this with a dozen or so lines of C# code.
The
component connects to an Analysis Services instance, starts the Visio
application in silent mode, uses the Visio engine in conjunction with the Add-in
to render the data mining diagram in Visio and finally takes advantage of the Save as Web
feature in Visio to convert the drawing to a interactive HTML web page.
The
Visio Add-in supports the following visualizations:
Setting up the C# Project
Here's what you need to do
before jumping into the actual code:
-
Install Visio 2007 and the Visio component of
the SQL Server 2005 DM Add-ins for Office 2007 on your machine.
-
Since the required
add-in
assemblies are in the GAC, you have to copy them to a local folder before you
can add the programmability component as a project reference. The steps below show how to locate and copy the files from
the GAC:
- Open a command prompt and navigate to
the folder %windir%\assembly\GAC_MSIL.
- Locate the file
Microsoft.SqlServer.DataMining.Office.Visio.Programability.dll (dir /s
Microsoft.SqlServer.DataMining.Office.Visio.Programability.dll).
- Copy the file to a folder on your
machine.
- Locate the file
Microsoft.SqlServer.DataMining.Office.Visio.dll (dir /s
Microsoft.SqlServer.DataMining.Office.Visio.dll).
- Copy the file to a folder on your
machine.
-
In the C# project in Visual Studio, add an assembly reference to the
Visio Interop library under COM -> “Microsoft
Visio 12.0 Type Library” and “Microsoft Visio 12.0 Save As
Web Type Library”.
-
Add a reference to the add-in
assembly Microsoft.SqlServer.DataMining.Office.Visio.Programability.dll copied in step #2.
And Finally, the Code
The following code snippet shows you how to
generate an
HTML visualization for a decision tree model built using the AdventureWorks dataset.
An example of the output from this can be found
here.
using System;
using
System.Collections.Generic;
using
System.Text;
using System.IO;
using
Microsoft.SqlServer.DataMining.Office.Visio.Programability;
namespace
DMDigramUsingViso
{
class Program
{
//This is the
folder where the data mining addins have been installed.
public const string
DMAddinInstallPath = @"C:\Program Files
(x86)\Microsoft SQL Server 2005 DM Add-Ins";
//This is the
name of the Visio template file for the data mining addins.
public const string
VisioTemplateName = @"Microsoft Data
Mining.vst";
//This is the
name of the target HTML file.
public const string
WebFileName = @"TargetMail.htm";
static void Main(string[]
args)
{
try
{
string
visioFile = DMAddinInstallPath + "\\"
+ VisioTemplateName;
string
webFile = System.IO.Directory.GetCurrentDirectory()
+ "\\" + WebFileName;
Console.WriteLine("DM Addin install path : {0}",
DMAddinInstallPath);
Console.WriteLine("Visio Template : {0}",
VisioTemplateName);
Console.WriteLine("Output : {0}", webFile);
//----------------------------------------------------
// Create the parameter object for the decision tree
// drawing
and populate the required parameters
//----------------------------------------------------
ParameterDecisionTree
parameter = new ParameterDecisionTree();
//Friendly
name for the connection
parameter.ConnectionName = "AdventureWorksConnection";
//Name
of the Analysis Server
parameter.DataSourceName = "localhost";
//Name
of database
parameter.CatalogName = "AdventureWorks";
//Name
of the mining model
parameter.ModelName = "Target Mail";
//Name
of the decision tree predictable
parameter.TreeName = "Bike Buyer";
//Flag
to hide the progress dialog
parameter.Silent = true;
//----------------------------------------------------
//
Optional properties for the parameter object
//----------------------------------------------------
//Shade
tree nodes using support
parameter.GradientType = GradientType.SupportGradient;
//Show
support in the nodes
parameter.ShowSupport = false;
//Fill
color for the nodes
parameter.FillColorArgb =
System.Drawing.Color.Aqua.ToArgb();
//Fill
pattern color for the nodes
parameter.PatternColorArgb =
System.Drawing.Color.Azure.ToArgb();
//----------------------------------------------------
//
Drawing helper object which performs the actual rendering
//----------------------------------------------------
DMDrawingHelper
helper = new DMDrawingHelper(visioFile,
false);
//Start
the Visio application
helper.Start();
/*
Helper object allows access
to Visio object model using the following methods:
helper.CreateNewDocument();
helper.CreateNewPage();
helper.ActiveDocument;
helper.ActivePage;
helper.Application;
*/
//Render
the decision tree on the current active page
Console.WriteLine("Rendering decision tree using the visio
engine...");
helper.DrawDecisionTree(helper.ActivePage, parameter);
//Convert
all the pages of the drawing to HTML
Console.WriteLine("Converting to HTML using the visio engine...");
helper.SaveDocumentAsWebPage(helper.ActiveDocument,
-1,
-1,
webFile,
DMDrawingHelper.ShowNavigationBar
| DMDrawingHelper.ShowPanAndZoom | DMDrawingHelper.ShowPropertiesWindow | DMDrawingHelper.ShowSearchTool);
//Close
the Visio App
helper.Close();
Console.WriteLine("HTML File is generated at : {0}",
webFile);
}
catch
(Exception ex)
{
Console.WriteLine("DMDigramUsingViso Error : {0}",
ex.Message);
}
}
}
}
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Fri 07/27/2007 @ 08:01
|
|
This tip explores a DMX extension introduced in SQL Server 2005 SP2 that can be used to render accuracy reports directly in Reporting Services.
|
|
One of the nice improvements we made to SQL Server 2005 DMX was the ability to predict the probability of a particular state. That is, in SQL Server 2000, you could get the probability of the predicted state using the following query:
SQL Server 2000: SELECT PredictProbability([Gender]) FROM Customers PREDICTION JOIN …
This query gives you the probability of whatever state is predicted for Gender for each row of the input. However if you wanted to get the probability that a customer was ‘Female’, for example, you would have to write a query like this:
SQL Server 2000: SELECT FLATTENED (SELECT $Probability FROM PredictHistogram([Gender]) WHERE [Gender]='Female') FROM Customers PREDICTION JOIN …
Which, to put in polite terms, is a tad bit cumbersome. Not only that, but you get a column called “Expression.$PROBABILITY” as your output – not nice.
Luckily in SQL Server 2005 Data Mining, we added new syntax that allowed you to simply indicate the state for which you wanted the probability directly in the function call, like this:
SQL Server 2005: SELECT PredictProbability([Gender], 'Female') FROM Customers PREDICTION JOIN …
And life was good. We even made it nicer by allowing the specification of a parameter for the state, so you could write queries like the following:
SQL Server 2005 SP2: SELECT PredictProbability([Gender], @Gender) FROM Customers PREDICTION JOIN …
And specify the desired state at run-time. Wonderful! However, there was a little piece missing. It turns out we even allowed arbitrary expressions like ‘Fe’+’male’ there, but we left out the ability to determine what the probability of a known state from the input. That means that if you wanted to determine what the probability was of a state and the state happened to be stored in the database, you would have to first issue a query to get the state and then issue another query to get the probability of that state. Not fun.
In SQL Server 2005 SP2 Data Mining we generalized the language so that essentially, every where you can place a parameter, you can also place a column reference. That means that you can solve the above problem with a query like this:
SELECT PredictProbability([Gender], t.Gender) FROM Customers NATURAL PREDICTION JOIN OPENQUERY([Movie Click], 'SELECT * FROM CUSTOMERS') AS t
Now, where is this useful? We’re sure all of the creative DMXers out there will find innumerable uses for this new flexibility, but the most obvious application is for accuracy and lift calculations. Now, in one query, you can generate lift figures for your entire testing set without caching the data, allowing, for example, the generation of accuracy reports in Reporting Services! In the next tip, we will show you an example of this.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Mon 03/26/2007 @ 11:22
|
|
This tip shows you how to compute the R-Squared and Adjusted R-Squared metrics for Microsoft regression models using a stored procedure.
|
|
Calculating R-Squared for Microsoft regression models
R-Squared
(a.k.a. Coefficient of Determination) is a well-known metric that measures
the goodness fit of your regression model. Its value lies between 0 and 1. The
closer R-Squared is to 1, the better your model is with respect to the training
data.
R-Squared
can be calculated as:
R-Squared = 1 – RSS / TSS
where RSS is the
residual sum of squares, and TSS is the total sum of squares.
Another
interesting measure is Adjusted
R-Squared, which adjusts the R-Squared according to the number of
explanatory terms in a model.
This
article shows you how to compute both these measures for Microsoft mining
models using our favorite mechanism – stored procedures!
Calculate R-Squared
with a stored procedure
We
have put together an accompanying
package for this article that contains the following:
- A sample dataset that you
should attach to your SQL Server 2005 Database Engine instance.
- A sample SQL Server Analysis
Services project that helps you build a linear
regression model. You need to deploy the project to your Analysis Services
2005 server instance.
- A stored procedure to calculate
R-Squared for data mining models.
- A Windows application that
demonstrates how to call the stored procedure. To build this project, you
need to add a reference to Microsoft.AnalysisServices.AdomdClient.dll
located in “%ProgramFiles% \Microsoft.NET\ADOMD.NET\90”.
Let's walk you through the process of using the pieces in the
package to explore the R-Squared sample.
Step 1:
First,
build and deploy the stored procedure (note that this version only
supports models that allow drill through):
-
Open the project in the
RSquareHelper with
Visual Studio 2005.
-
Add a reference to the
msmgdsrv.dll (Microsoft.AnalysisServices.AdomdServer.dll) class library
This class library is available in the location where Analysis Services
2005 is installed By default, this location has the form: C:\Program Files\Microsoft SQL Server\MSSQL.2\OLAP\bin Please note that, depending on your installation options, MSSQL.2 might be
MSSQL.1 or MSSQL.3 or so
-
Build the project
-
Deploy the stored procedure on
the server:
-
Open SQL
Server Management Studio
-
Connect to
the target Analysis Services server instance
-
In the Object
Explorer, select the 'Assemblies' node at the server level
-
Right-click
and select 'New Assembly'
-
For the
FileName, field, use the
Browse (...) button to navigate to the location where the stored
procedure was built. A file named RSquareHelper.dll should be available
at that location, under the bin\debug or bin\release folder. Select that
file
-
Assembly Name
should be automatically filled with
RSquareHelper
-
Set the
permission of this assembly to unrestricted.
-
Select OK and
deploy the stored procedure
To
call the stored procedure, you can execute the following command in SQL Server
Management Studio:
CALL
RSquareHelper.RSquareHelper.RSquare.CalculateRSquareFromDrillThroughModel (<modelName>, <targetAttributeName>)
o
<modelName> is the name of the target model
o
<targetAttributeName> is the name of the target
attribute
Step 2:
Now, go ahead and deploy the "RSquareLinearRegression"
Analysis Services project. This project creates a simple linear
regression model with two integer attributes x and y, where x is used as input
and y as output.
Step 3:
Finally, you can build and run the C# application CalculateRSquare in
the package. This application demonstrates how to call the stored procedure to calculatethe R-Squared metric.
The
following figure shows the interface of this application:

From
the interface, you can specify a server name and press the Connect to server
button to establish the connection to the Analysis Services server. The
application will display a list of valid catalogs, models and target attributes.
After selecting the desired model and target attribute, you can then click Calculate
R-Squared button to check the result. The application calls the stored
procedure to calculate the statistics. After fetching the result,
it shows the statistics in the text boxes at the bottom of the screen.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Tue 02/20/2007 @ 04:01
|
|
This tip shows you how to get all the pieces you need to try out the new SQL Server 2005 DM Add-Ins for Office 2007.
|
|
Try out the SQL Server 2005 DM Add-ins for Office 2007
So, the SQL Server 2005 Data Mining Add-ins for Office 2007 are
finally available for *free* public download and you can't wait to check them
out. However, you haven't purchased a copy of Office 2007 (yet) and you don't
have access to an installed instance of SQL Server 2005 Analysis Services. How
then do you go about playing with the add-ins? Let's walk you through the steps:
-
Download
SQL Server 2005 Enterprise Evaluation Edition (180-day Trial Software)
and install Analysis Services (client and server components). This should
install .NET Framework 2.0 and ADOMD.NET, both prerequisites for the
add-ins.
-
Download and install the
trial version of Microsoft Office Professional 2007 (Windows Live
sign-in required).
-
Download and install the
trial version of Microsoft Visio Professional 2007 (Windows Live
sign-in required).
-
Download
SQL Server 2005 SP2 and install the components for Analysis Services.
-
Download and install the
add-ins. Only the
Table Analysis Tools add-in is installed by default so make sure you select
all components during setup if you would like to test-drive all three
add-ins (there is no reason not to).
-
Follow the Getting Started wizard (which launches the
first time you run Excel 2007 or Visio 2007 after installing the SQL Server
2005 DM Add-ins) to configure your Analysis Services instance correctly for
using the add-ins. If you miss it for some reason when starting Excel 2007
or Visio 2007, the Getting Started wizard is available from the Help
button under the Data Mining tab (ribbon) in Excel 2007 as well as from
Start menu -> Programs -> Microsoft SQL Server 2005 DM Add-ins.
-
Open up the sample Excel workbook provided with the add-ins
(Start menu -> Programs -> Microsoft SQL Server 2005 DM Add-ins -> Sample
Excel Data), click within the table on the Table Analysis Tools Sample
sheet and select the Analyze tab under Table Tools to see the
Table Analysis Tools ribbon. Or alternatively, you can create your
own table in Excel by selecting a range and clicking the Format as Table
button from the Home tab (ribbon).
You can get more information about the add-ins, including
pointers to additional resources like tutorials
on this page.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Wed 01/24/2007 @ 08:17
|
|
This tip explains how to programmatically construct the regression formula on a node from the node distribution content for a regression tree.
|
|
Extracting the regression formula for a regression tree node
If you build a
Microsoft_Decision_Trees model with a continuous predictable attribute and one
or more continuous input attributes marked as regressors, the algorithm produces
regression trees with nodes that include a regression formula. You can see this
formula in the mining legend when you select a node in the decision tree viewer.
But how do you make use of this formula in your own app? That is the goal of
this article.
A
little background
Let's provide some
background first though, before we dive into the details.
A
regression tree is a mining model created using the Microsoft Decision Trees
algorithm, where the predictable attribute is continuous. In some cases, one or
more input attributes might also be continuous and can be used as
regressors for the output attribute. This creates decision
tree nodes that include a regression formula, which is a linear function of
the output attribute as a function of all the input attributes. The regression
formula of output Y and inputs X1, X2, X3 is expressed in the format:
Y = Ym + m1(X1 –
X1m) + m2(X2 – X2m) + …..
Where:
Ym: Mean
of Y
Xtm: Mean of Xt
mt: Slope of the
variable Xt with respect to Y
Take the example of
a model created from the AdventureWorksDW sample database
that ships with Analysis Services 2005, where “Age” is the
predictable column (continuous) and “Yearly Income” is an input column
(continuous). When the tree is viewed in the decision tree viewer, the resulting
regression tree has a node that includes the following regression formula:
Age = 33.348+0.00008*(Yearly
Income-51,481.123)
The
formula is interpreted as follows:
·
Mean for Age: 33.348
·
Mean for Yearly Income: 51,481.123
·
Gradient of the line: 0.00008
The
following content query for the node shows the distribution:
SELECT
FLATTENED
(
SELECT
ATTRIBUTE_NAME,
ATTRIBUTE_VALUE,
[SUPPORT],
[PROBABILITY],
VALUETYPE
FROM
NODE_DISTRIBUTION
)
as D
FROM
[Model].CONTENT
WHERE
NODE_UNIQUE_NAME= '<>'
Distribution:
|
D.ATTRIBUTE_NAME
|
D.ATTRIBUTE_VALUE
|
D.SUPPORT
|
D.PROBABILITY
|
D.VALUETYPE
|
|
Age
|
Missing
|
0
|
0.000193536
|
1
|
|
Age
|
33.5595353339787
|
5165
|
0.999806464
|
3
|
|
Yearly Income
|
8.4100185198634E-05
|
0
|
0
|
7
|
|
Yearly Income
|
689.231274498812
|
0
|
0
|
8
|
|
Yearly Income
|
51481.1229428848
|
0
|
0
|
9
|
|
29.2299633602484
|
0
|
0
|
11
|
The
VALUETYPE is defined as follows:
|
Missing |
1 |
|
Existing |
2 |
|
Continuous |
3 |
|
Discrete |
4 |
|
Discretized
|
5 |
|
Boolean |
6 |
|
Coefficient |
7 |
|
ScoreGain
|
8 |
|
RegressorStatistics
|
9 |
|
NodeUniqueName
|
10 |
|
Intercept |
11 |
|
Other |
12 |
Mean
for age will be calculated as follows:
For
each attribute with a coefficient (VALUETYPE=7), multiply the value of
co-efficient with the value of mean (VALUETYPE=9) for that attribute and sum
all the products.
51481.1229428848 * 8.4100185198634E-05 = 4.329571974
Add
this to the value of Intercept (VALUETYPE=11)
4.329571974 + 29.2299633602484 = 33.55953533
Mean
for each regressor attribute, “Yearly Income” in this
case is the value of mean (VALUETYPE=9) for that attribute.
51481.1229428848
Gradient
for each regressor attribute, “Yearly Income” in this
case is the value of coefficient (VALUETYPE=7) for that attribute.
8.4100185198634E-05
Using these three values and the sign
(+/-), we can construct the regression formula:
Age= 33.55953533 + 8.4100185198634E-05 * (Yearly Income -51481.1229428848)
Extract the regression formula programmatically
Now
that we know what numbers we want and where to get them from, let's apply
the corresponding DMX query below to build the regression formula in a
program:
|
SELECT FLATTENED
(
SELECT
ATTRIBUTE_NAME
AS
TARGET,
ATTRIBUTE_VALUE
AS
COEFFICIENT
FROM
NODE_DISTRIBUTION
WHERE
VALUETYPE= 3
) as D,
(
SELECT
ATTRIBUTE_NAME + ' '
AS
VARIABLE,
ATTRIBUTE_VALUE
AS
MEAN
FROM
NODE_DISTRIBUTION
WHERE
VALUETYPE=9
OR
VALUETYPE=7
) as D
FROM
[Target Mail Decision Tree].CONTENT
WHERE
NODE_UNIQUE_NAME='00000000000'
|
using System;
using
System.Collections.Generic;
using
System.Text;
using
Microsoft.AnalysisServices.AdomdClient;
namespace RegressionFormula
{
class Program
{
static void Main(string[]
args)
{
AdomdConnection
connection = null;
try
{
//Regression
Formula Variables
string
yName;
double
yMean;
bool
yMeanSign;
List<string> xName = new
List<string>();
List<double> xMean = new
List<double>();
List<bool> xMeanSign = new
List<bool>();
List<double> xSlope = new
List<double>();
List<bool> xSlopeSign = new
List<bool>();
//Open
a connection to the Analysis Server
connection = new AdomdConnection();
connection.ConnectionString = "Data Source=local;Initial
Catalog=SampleModels;";
connection.Open();
//Create
a command object to execute the DMX
string
dmxStatement = "SELECT FLATTENED "
+
" ( " +
" SELECT " +
" ATTRIBUTE_NAME AS TARGET, "
+
"
ATTRIBUTE_VALUE AS COEFFICIENT " +
" FROM " +
"
NODE_DISTRIBUTION " +
" WHERE " +
"
VALUETYPE= 3 "
+
") as D, " +
"( " +
" SELECT " +
"
ATTRIBUTE_NAME + ' ' AS VARIABLE, " +
"
ATTRIBUTE_VALUE AS MEAN " +
" FROM " +
"
NODE_DISTRIBUTION " +
" WHERE "
+
"
VALUETYPE=9 OR VALUETYPE=7 " +
" ) as D " +
"FROM " +
" [Target
Mail Decision Tree].CONTENT " +
"
WHERE " +
"
NODE_UNIQUE_NAME='00000000000'";
AdomdCommand
command = new AdomdCommand();
command.CommandText =
dmxStatement;
command.CommandType =
System.Data.CommandType.Text;
command.Connection =
connection;
//Execute
the DMX and get the reader
AdomdDataReader
reader = command.ExecuteReader();
//Obtain
the Y variable name and the mean
reader.Read();
yName = reader.GetString(0);
yMean = reader.GetDouble(1);
yMeanSign = (yMean >= 0);
yMean = Math.Abs(yMean);
//Obtain
all X Variables (Regressors)
while
(reader.Read())
{
xName.Add(reader.GetString(2));
xSlope.Add(reader.GetDouble(3));
xSlopeSign.Add(xSlope[xSlope.Count-1]>=0);
xSlope[xSlope.Count-1] = Math.Abs(xSlope[xSlope.Count-1]);
reader.Read();
xMean.Add(reader.GetDouble(3));
xMeanSign.Add(xMean[xMean.Count-1]<0);
xMean[xMean.Count-1] = Math.Abs(xMean[xMean.Count-1]);
}
//Construct
and print the formula
string
Formula = string.Empty;
Formula += yName;
Formula += " = ";
Formula += yMeanSign ? "" : " -
";
Formula += yMean;
for
(int iter = 0; iter < xName.Count; ++iter)
{
Formula += xSlopeSign[iter]
? " + " : "
- ";
Formula += xSlope[iter];
Formula += " *(";
Formula += xName[iter];
Formula += xMeanSign[iter]?" + ":" -
";
Formula += xMean[iter];
Formula += ")";
}
Console.WriteLine(Formula);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
finally
{
if
(connection != null)
{
connection.Dispose();
}
}
}
}
}
The
output for this sample shows the constructed regression formula:
Age
= 33.5595353339787 + 8.41001851986346E-05 *(Yearly Income - 51481.1229428848)
This program can be
easily modified to apply the formula to input variable values and estimate the
output variable instead of just printing out the formula.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Thu 12/21/2006 @ 10:55
|
|
This tip shows you how to extract a set of rules that indicate a certain product from an Association Rules model
|
|
SQL Server Data Mining contains an Association Rules
browser, which displays the most important rules found inside a mining model.
However, for certain pattern exploration tasks, it may be useful to visualize
only those rules that point to a certain item.
An example: for a model* that predicts movie associations, extract
those rules that point to ‘Star Wars’. Even more, extract only rules with a probability
larger than .25 and containing ‘Indiana Jones’ or ‘Blade Runner’ on the left
hand side.
All the rules (and item-sets) are exposed in the model
content, the rowset containing all the patterns detected by the model.
Content nodes that represent rules have a NODE_TYPE value of
8 (Item sets have NODE_TYPE = 7). Inside a rule node, the NODE_DISTRIBUTION
field contains the rule’s right hand side item (with a value type of 2, “Existing”)
as well as the identifier of the node representing the left hand side itemset
(value type of 10, NODE_UNIQUE_NAME).
Therefore, to get the results requested in the beginning, we
need a query that:
-
Selects the rule nodes (NODE_TYPE=8)
-
Filters only those that contain ‘Star Wars’ in NODE_DISTRIBUTION
-
Extracts the support, probability and other information about the rule
-
Applies a probability threshold ( >.25)
-
Filters only those that contain ‘Indiana Jones’ or ‘Blade Runner’
-
Sorts the results in descending order, by probability
And here is the query:
SELECT
top 100 FROM
(
select FLATTENED
NODE_CAPTION AS [Rule],
NODE_PROBABILITY
AS [Rule_Probability],
(SELECT ATTRIBUTE_NAME
FROM NODE_DISTRIBUTION
WHERE
VALUETYPE=2 AND
ATTRIBUTE_NAME='Movies(Star Wars)'
)
AS D
FROM
[MovieClick].CONTENT
WHERE
NODE_TYPE=8
)
AS A
WHERE
[D.ATTRIBUTE_NAME]
<> null AND
Rule_Probability
> 0.25 AND
(
VBA!InStr(Rule, 'Blade Runner') > 0 OR
VBA!InStr(Rule, 'Indiana Jones') > 0
)
ORDER BY
[Rule_Probability] DESC
The result contains all the requested rules, with their
probability.
* The model created for this example
used the MovieClick data available on this site. The algorithm was
Association Rules with the parameters MINIMUM_PROBABILITY set to 0.1 and
MINIMUM_SUPPORT set to 3.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Wed 11/21/2007 @ 11:57
|
|
This article describes two stored procedures for the Analysis Server, procedures that compute the Covariance and Correlation matrices
|
|
This article describes two stored procedures for the Analysis Server. The first one is to calculate the covariance matrix between all continuous attributes in the selected data mining model. The second one is to calculate the correlation matrix between all continuous attributes in the selected data mining model. In both cases it does not matter if the attributes are inputs or outputs of the model, the stored procedures ignore that fact. Preconditions The stored procedures are to be used with any model as the only parameter. The stored procedures will not work if any of the following is true: - One or more columns contain nested tables
- The model does not allow drill through
- The model is not processed
Even if those conditions are met the stored procedures will only return a result for the continuous attributes. If none of the attributes are continuous the resulting matrix will be empty. Interpreting Results The stored procedures will return a table with three columns. The first two columns indicate the name of the attributes that were used to calculate the covariance/correlation. This table is a flatten version of a square matrix on size nxn where n is the number of attributes. Since covariance/correlation matrixes are symmetric the result only contains one half of the matrix (including the diagonal). This can be done because COV(X, Y) = COV(Y, X) so we only need one result. The diagonal or COV(X, X) represents the variance of X and is included in the results. Deploying the stored procedure The stored procedures are created using C# as a class library project. The project contains only 1 file: “MatrixRelationships.cs”. To create the library file with the stored procedures follow the next steps: - 1. Create a new C# class library Project called MatrixRelationships
- 2. Download the MatrixRelationships.cs file into the MatrixRelationships directory
- 3. Add the source file to the project by pressing Shift+Alt+A (Add Existing Item) and delete the default Class1.cs element from the Solution Explorer
- 4. Add msmgdsrv to the references, select Project->Add Reference->.NET->Microsoft.AnalysisServices.AdomdServer
- 5. Build the project (Ctrl+Shift+B by default)
- 6. The dll file will be inside the bin\debug directory in the project
Once the project is built and the dll created we are ready to deploy the stored procedures into the Analysis Server. To deploy the stored procedures follow the next steps: - 1. Open the Microsoft SQL Server Management Studio
- 2. Connect to the Analysis Server where you wish to deploy
- 3. On the object explorer (F8) and expand the server branch, you will see Databases and Assemblies, right click Assemblies
- 4. Select New Assembly and the Register Server Assembly will pop up, set the parameters as follows:
- a. Type: .NET Assembly
- b. File Name: Location of dll file with stored procedures
- c. Assembly Name: You can use the default name (MatrixRelationships) or you can set a new name here (The name should not be a duplicated assembly name)
- d. Include Debug Information: Check
- e. Permissions: Safe (by default)
- f. Impersonation: Use the credentials of the current user
- 5. The stored procedures are ready to be used
Executing the stored procedure To execute the stored procedures you need to open a new Analysis Services DMX Query. To do this select File->New->Analysis Services DMX Query. Once the DMX query window is open type the following command: CALL MatrixRelationships.GetCovarianceMatrix( model name) or CALL MatrixRelationships.GetCorrelationMatrix( model name)
Note: MatrixRelationships will be substituted with whatever name was selected in Assembly Name. The list of available models will show in the DMX query window under mining model. Debugging the stored procedure If the user wants to debug the stored procedures follow the next steps: - 1. Open the project in Visual Studio
- 2. Deploy the stored procedures
- 3. In Visual Studio select Debug->Attach To Process
- 4. From available processes select msmdsrv.exe
Algorithm description Covariance Matrix Covariance matrix algorithm is based on the following formula:  The algorithm works as follows: - 1. Extract continuous attributes from the model
- 2. Verify that:
- a. the model is processed
- b. the model supports drill through
- c. the model does not contain nested tables
- 3. Extract mean and variance for each attribute
- 4. Obtain the data from the model (all the Xi and Yi)
- 5. Create a triangular matrix to store the partial results for each X,Y combination
- 6. For each row in the data set from the model:
- a. Update the number of rows processed
- b. Obtain the values for each column and subtract the mean
 - c. Update all combinations of (X,Y) by adding the appropriate
 - d. If the value of a column is null for a given row, a 0.0 will be inserted (no covariance)
- 7. Divide every combination of (X,Y) in the triangular matrix by the total number of rows processed
- 8. Flatten the triangular matrix into a 3xM matrix where M is the number of continuous attribute combinations, given by
where n is the number of continuous attributes. Correlation Matrix Correlation matrix algorithm is based on the Pearson product-moment coefficient which is given by the following formula:  The algorithm to calculate the correlation is the same as the covariance plus it divides every (X,Y) combination in the triangular matrix by the product of X and Y standard deviation and the standard deviation is the square root of the variance.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Mon 11/13/2006 @ 05:45
|
|
This sample shows the results of rendering data mining viewers with Visio 2007 and saving as a web page
|
|
The SQL Server 2005 Data Mining Addins for Office 2007 includes a delightful model rendering template for Visio 2007. With this template, you can render your graphical views from SQL Server Data Mining as Visio diagrams for annotation, enhancement, and presentation. The Data Mining Template for Visio 2007 allows you to render decision trees, regression trees, cluster diagrams, and dependency nets. While most of the behavior of the built-in viewers are preserved in the Visio environment, you also get additional functionality such as the ability to move sub-trees to new pages, grow dependency networks starting at an arbitrary node, or create a cluster diagram using discrimination charts. Once rendered, you can use the power of Visio to annotate your diagrams with text, add supplemental graphics supporting your model, and even apply color themes that match your presentation needs. Finally the models can be presented as Visio documents, embedded into other Office documents or simply saved as a web page.
The links here are to models saved as web pages as examples of some of the possibilities. New links will be added as they are created.
Home Ownership Decision Tree
Customer Clusters
Movies Dependency Network
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Tue 08/29/2006 @ 12:34
|
|
This tip shows you how to prepare training and test data sets for mining models with nested tables.
|
|
Sampling Nested Tables
SQL
Server Data Mining is unique in its ability to mine multiple tables of data
simultaneously through the nested table feature. This functionality,
however, adds some complexity when performing basic data manipulation tasks
required for data mining, such as sampling. How does one split a data set
with nested tables into training and testing sets? Well, this is how.
SQL
Server 2005 Integration Services (SSIS) provides all the tools necessary in its
data pipeline to sample data and sample nested data. The basic idea is
that you need to simultaneously read from your case data and nested data, sort
them*, sample the case data, and merge in the nested data. Such a data
flow is shown below.

To create
this package, follow these steps:
1.
Create
a new SSIS package.
2.
Add a Data Flow task for your
pipeline.
3.
Add
data sources for your case and nested tables.
4.
Use
a sort transform* to sort all tables by the case identifier. This should
be the key of the case table, and the foreign key of the nested tables that
relates them to the case table.
5.
Sample
the sorted case table using a Percentage Sampling transform, label outputs to
“Testing” and “Training”. A rule of thumb is to use 70% of the data for
training and 30% for testing.
6.
Use
the Multicast transform on each of the outputs of the nested sorting and the
case sampling (both training and testing).
At this
point (assuming a single nested table) you will have six data streams.
For convenience, we will call these Training1, Training2, Testing1, Testing2,
and Nested1, Nested2.
7.
Output
Training1 and Testing1 into destination transforms – these will be your
training and testing case tables. In the diagram above, the destinations
are row count transforms for illustrative purposes. In practice, you
would use a table destination such as the SQL Server Destination transform.
8.
Use
a Merge Join transform to join Training2 and Nested1. Join the streams
using the foreign key, and select all of the columns of Nested1 as output.
9.
Output
the result of the Merge Join to a destination. This will be your training
nested table.
10.
Use
a Merge Join transform to join Testing2 and Nested2. Join the streams
using the foreign key, and select all of the columns of Nested2 as output.
11.
Output
the result of the Merge Join to a destination. This will be your testing
nested table.
Once you
have completed the package, run it to perform the sampling and accurately
divide your data into training and testing sets. You can then add these
tables to a Data Source View in an Analysis Services project for data mining.
The
package described above and supporting data is available for
download.
* Sorting
If you are reading
from a sorted data source, you can take advantage of this fact and omit the
sorting step. In general, your relational database will be much better at
sorting than SSIS, so you can use an ORDER BY clause in your source data query
to take advantage of this. Once you have a sorted data source, however,
you need to inform SSIS what the sort keys are. To do this, follow these
steps:
1.
Right click on the Data Flow Source transform and select “Show
Advanced Editor.”
2.
Go to the Input and Output Properties page of the editor
3.
Open the “Source Output” node of the tree
4.
Open the “Output Columns” node of the tree
5.
Select the column by which the data is sorted
6.
Set the “SortKeyPosition” for that column to
“1”
7.
Click OK to close the dialog
Make sure you do
this for all the tables that are sorted, and make sure they are sorted by the
same key.
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Tue 07/18/2006 @ 04:36
|
|
This tip shows you how to execute and use the results of prediction queries in multiple ways within the SQL Server relational database engine.
|
|
Executing Predictions from the Relational Server
A common
request from the relational database gurus in the SQL Server Data Mining community
is to execute predictions from the SQL Server relational database engine – either
in batch mode or ‘on-the-fly’ while rows are being inserted into a table. In
this article, we’ll start with the basics of executing DMX queries in T-SQL and
walk you all the way to predicting values in real-time during the INSERT
operation.
Executing
prediction queries from the relational server
Before we
start, we'll state the assumptions we are making:
-
The
SQL Server 2005 Database Engine and SQL Server 2005 Analysis Services are installed as default instances on the same
machine
-
A
database named TestDB is deployed on the Analysis
Services instance
-
A
mining model named Iris is deployed in the [Test DB] Analysis Services database
-
The
mining model is intended to classify Iris flowers based on their sepal/petal
dimensions and has 5 columns:
o
Petal
Length (double, input)
o
Petal
Width (double, input)
o
Sepal
Length (double, input)
o
Sepal
Width (double, input)
o
Class
(text, predictable)
Let's now
go through the basic steps for setting up prediction query execution from T-SQL.
Step 1: Create a linked
server
The first
step is to create a linked server inside SQL Server pointing to the Analysis
Services instance. The linked server object is created with a statement like
below:
EXEC master.dbo.sp_addlinkedserver
@server =
N'DMServer',
@srvproduct=N'Analysis Services 2005',
@provider=N'MSOLAP',
@datasrc=N'localhost',
@catalog=N'TestDB'
GO
Upon
executing this statement, a linked server named DMServer
is registered on the database server. The linked server uses the MSOLAP OLE DB
provider to connect to the “localhost” instance of
Analysis Services.
SQL
Server 2005 Management Studio provides a friendly UI that allows you to create
a linked server and to set various linked server options, such as query
timeouts, impersonation rules and others. This is accessible under Server
Objects -> Linked Servers -> New Linked Server …
Step 2: Execute DMX
queries from T-SQL using OPENQUERY
Once the
linked server is created, it can be used for data mining (DMX) queries like this:
SELECT * FROM OPENQUERY(DMServer,
'select node_caption,
node_type from iris.content')
A
potential use of such queries is to use data mining to assign predicted labels
to all rows in a table and then fetch the rows together with the new label and
save them inside the relational server. Such a query typically looks like
below:
INSERT INTO
QueryResultsTable SELECT * FROM OPENQUERY(DMServer,
'select … FROM Modell
PREDICTION JOIN OPENQUERY…')
Executing prediction
queries from inside a trigger
Now that
we know how to execute basic data mining queries from the SQL Server 2005
relational server, let's look at the more advanced “on-the-fly” scenario.
For this
purpose, we’ll build a trigger that is launched whenever a row is added to a
certain table. The trigger will execute a data mining prediction and append the
prediction result to the newly inserted row.
Trigger
objects are well documented in the SQL Server 2005 books online. The only
problem for our task is executing the prediction query and using the result.
The problem resides in the fact that:
-
The
prediction query must usually be parameterized with the values of the other row
columns
-
The
OPENQUERY statement which is used for data mining queries cannot take
parameters or even string variables as the statement text
To
overcome this problem, we will use the EXEC statement.
We’ll
assume that a relational table named NewIris is
created on the SQL Server instance and the table has a set of columns matching
the mining model, i.e.:
-RowKey int (the identity column)
-PLength float (to be mapped to
model’s Petal Length column)
-PWidth float (to be mapped to
model’s Petal Width column)
-SLength float (to be mapped to
model’s Sepal Length column)
-SWidth float (to be mapped to
model’s Sepal Width column)
-PredictedClass – varchar(50) (to keep the prediction result)
The
trigger will be launched on inserting a new row and has to
a)
Collect
the values for the columns
b)
Author
a data mining prediction statement for the current row (a singleton prediction)
c)
Execute
the prediction
Inside
the trigger’s scope, the current row can be identified using the
@@identity variable.
Note that this variable is only available when the table has an identity column
(RowKey above). The column values for the current row
should be collected into a set of variables, so that they can be used later to
author the DMX query:
DECLARE
@newPLength varchar(15)
DECLARE @newPWidth
varchar(15)
DECLARE
@newSLength varchar(15)
DECLARE @newSWidth
varchar(15)
-- Select the newly added values
SELECT @newPLength =
dbo.FormatFloatForDMX(PLength),
@newPWidth = dbo.FormatFloatForDMX(PWidth),
@newSLength =
dbo.FormatFloatForDMX(SLength),
@newSWidth =
dbo.FormatFloatForDMX(SWidth)
FROM NewIris WHERE $identity=@@identity
The
$identity column represents the key column
of the table.
FormatFloatForDMX is a user-defined function which formats a float number
(or a NULL) so that it can be used inside a DMX statement. The function is
defined as shown below:
CREATE FUNCTION FormatFloatForDMX
(
@Value float
)
RETURNS varchar(20)
AS
BEGIN
DECLARE
@ResultVar varchar(20)
SELECT
@ResultVar = CASE
WHEN
@Value IS NULL
THEN
'NULL'
ELSE
CONVERT(VARCHAR(15), @value)
END
RETURN
@ResultVar
END
GO
Now, with
the column values collected, we can author the prediction query.
-- compose the Analysis Services
DMX query
DECLARE @DMQuery varchar(256)
SET @DMQuery = 'select [Class] from [Iris]
NATURAL PREDICTION JOIN' +
'(SELECT '+
@newPLength +
' AS [Petal
Length], '+
@newPWidth + ' AS [Petal Width], '+
@newSLength +
' AS [Sepal
Length], '+
@newSWidth + ' AS [Sepal Width] '+
') AS T'
The data
mining prediction query is ready to be executed against Analysis Services. Remember, the OPENQUERY
statement does not allow variables as arguments, so this query cannot be
executed directly. Besides, we need to collect the predicted result and insert
it into the NewIris table in the PredictedClass
column. To achieve this, we will use the SQL EXEC instruction.
--compose the OPENQUERY statement
DECLARE @FullQuery
varchar(512)
SET @FullQuery
=
'DECLARE
@predictedValue VARCHAR(50) ;'+
'SELECT
@predictedValue=[Class] FROM OPENQUERY(DMServer, ''' + @DMQuery +''');' +
'UPDATE
[NewIris] SET PredictedClass=@predictedValue WHERE $identity=' + CONVERT(varchar(15), @@IDENTITY)
-- execute the OPENQUERY statement
EXEC (@FullQuery)
As you
notice, the SQL statements above declare a string variable, @predictedValue, then populate the
value from the execution of an OPENQUERY statement against the DMServer linked server. After that, it sets the value of
the variable as the PredictedClass column value in
the NewIris table.
The whole
set of SQL instructions is grouped in a single string and executed in an EXEC instructions.
With
this, all the blocks required for defining the trigger are available and the
full trigger definition can be put together as shown below:
CREATE TRIGGER PredictClassForNewIris
ON NewIris
AFTER INSERT
AS
DECLARE
@newPLength varchar(15)
DECLARE @newPWidth
varchar(15)
DECLARE
@newSLength varchar(15)
DECLARE @newSWidth
varchar(15)
-- Select the newly added values
SELECT @newPLength =
dbo.FormatFloatForDMX(PLength),
@newPWidth =
dbo.FormatFloatForDMX(PWidth),
@newSLength =
dbo.FormatFloatForDMX(SLength),
@newSWidth =
dbo.FormatFloatForDMX(SWidth)
FROM NewIris WHERE $identity=@@identity
-- compose the Analysis Services query
DECLARE @DMQuery varchar(256)
SET @DMQuery = 'select [Class] from [Iris]
NATURAL PREDICTION JOIN' +
'(SELECT '+
@newPLength +
' AS [Petal
Length], '+
@newPWidth + ' AS [Petal Width], '+
@newSLength +
' AS [Sepal
Length], '+
@newSWidth + ' AS [Sepal Width] '+
') AS T'
--compose the OPENQUERY statement
DECLARE @FullQuery
varchar(512)
SET @FullQuery
=
'DECLARE
@predictedValue VARCHAR(50) ;'+
'SELECT
@predictedValue=[Class] FROM OPENQUERY(DMServer, ''' + @DMQuery +''');' +
'UPDATE
[NewIris] SET PredictedClass=@predictedValue WHERE $identity=' + CONVERT(varchar(15), @@IDENTITY)
-- execute the OPENQUERY statement
EXEC (@FullQuery)
GO
We can now
test the new trigger with the following statements:
INSERT INTO NewIris(PWIdth, SLength)
VALUES(2.5, 1)
GO
SELECT * FROM newIris
GO
And here's
the
result of the second query confirming the real-time prediction that occurs
during INSERT, as promised:
|
PLength
|
PWidth
|
SLength
|
SWidth
|
RowKey
|
PredictedClass
|
|
NULL
|
2.5
|
1
|
NULL
|
1
|
Iris-virginica
|
|
|
|
| |
|
|
|
|
Last updated by DMTeam on Tue 06/20/2006 @ 05:06
|
|
This article presents a way of executing multiple DMX statements from inside SQL Server Management Studio.
|
|
Analysis Services doesn't support IMultipleResults, so it is not actually possible to execute multiple statements at once. However, SQL Server Management Studio has a way to work around this issue.
If you typed in multiple DMX statements seperated
| | | | | | |