Explore filtered models in SQL Server 2008

Introduction

This article describes how to use model filtering, a powerful new feature in SQL Server 2008 Data Mining. Filtering allows you to take full advantage of the mining structure / mining model dichotomy, separating the data staging (in the structure) from the modeling (in the mining model). This means that you can define and process a single mining structure for your problem space and then build mining models on specific slices of interest within that space.

SQL Server 2008 Data Mining allows specifying, for each mining model, a filter to be applied on the training data. The filter acts as a partitioning mechanism inside the mining structure and it is applied on top of any training/testing partitioning already existing at the structure level. Furthermore, filters can be applied to data in nested tables. Filters allow you to build different models for various partitions of the data with minimal effort, either to obtain better overall accuracy (think of it as manually specifying the first split in a decision tree), or to compare the patterns between different partitions.

Creating Filtered Models

So, let’s start with a simple scenario, using the same old Adventure Works data – building a mining model that predicts how likely a customer is to purchase a bike based on various demographic information. Let’s also assume that you know – based on your experience with the data – that the Geography factor has an important role and you don’t want your North American patterns to hide (because of volume) the peculiarities of the Pacific market.

In this section, we will show you both the DMX and UI ways of working with filtered models. Pick your path and follow it!

Suppose you start with a mining structure defined like below:

DMX:

CREATE MINING STRUCTURE TestStructure
(
 CustomerKey LONG KEY,
 Region TEXT DISCRETE,
 YearlyIncome DOUBLE DISCRETIZED,
 EnglishOccupation TEXT DISCRETE,
 CommuteDistance TEXT DISCRETE,
 BikeBuyer BOOLEAN DISCRETE
) WITH HOLDOUT( 30 PERCENT )

Then, train the mining structure with an INSERT INTO statement:

INSERT INTO MINING STRUCTURE TestStructure
(
 CustomerKey,
 Region,
 YearlyIncome,
 EnglishOccupation,
 CommuteDistance,
 BikeBuyer
)

OPENQUERY([Adventure Works DW], ‘SELECT
 CustomerKey,
 Region,
 YearlyIncome,
 EnglishOccupation,
 CommuteDistance,
 BikeBuyer  FROM dbo.vTargetMail‘)

UI:
To create the structure using BI Development Studio, you just need to select “New Mining Structure”. On the “Create the Mining Structure” page of the wizard select “Create mining structure with no models”. Assuming you already have a data source view for Adventure Works 2008, select it and then select vTargetMail in the “Specify table type” in the wizard.

Now, the modeling part. As usual, you would create a mining model as follows:

DMX:

ALTER MINING STRUCTURE TestStructure
ADD MINING MODEL BikeBuyerClass 
(
 CustomerKey,
 YearlyIncome,
 EnglishOccupation,
 CommuteDistance,
 BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH DRILLTHROUGH

UI:

In the UI, you can create the mining model by selecting create related mining model. Name the model BikeBuyerClass and select Decision Trees as the algorithm. Once the model is selected you will have to set the BikeBuyer column to Predict and the Regioncolumn to ignore.

Now you can add a mining model to the mining structure using only the data for the Pacific area:

DMX:

ALTER MINING STRUCTURE TestStructure
ADD MINING MODEL DT_Pacific
(
 CustomerKey,
 YearlyIncome,
 EnglishOccupation,
 CommuteDistance,
 BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH FILTER( Region=’Pacific’ ), DRILLTHROUGH

The new syntax element is the FILTER construct, which instructs the model to use in training only those cases where the Region column contains the ‘Pacific’ value.

UI:
To create the filtered model in the UI, you will need to create the model first and then add the filter. To use fewer steps, you can create the new model based on BikeBuyerClass model we just created. Just right click on the BikeBuyerModel name and select “New Mining Model”.

This can also be accomplished by picking the mining model you want to clone and using the Mining Model drop down from the file menu and selecting “New Mining Model”.

Now, to add the filter you have to select “Set Model Filter” from the Mining Model menu while the mining model is selected or from the right click contextual menu. This will bring the Model Filter dialog.

The dialog presents data in tabular form. Each row is a condition. Each condition can have up to four columns:

And / Or: This will do a logical And/Or between the different conditions. Valid values are “And” and “Or”. The value can only be set from the second condition on. Current version doesn’t allow condition grouping from the grid.
Mining Structure Column: This can be any column from the mining structure, values are prefilled so you only need to select a column name from a drop down. Only columns that will not show are key columns since you can’t filter on keys.
Operator: This is a drop down with valid operators based on the data type of the selected mining structure column (more on this below).
Value: This is the value used to compare against the structure column using the operator. Value is not always necessary (more on this below).

As you can see above, the same expression used in the DMX version of the filter is displayed in the Expression portion of the dialog.

Note: An expression can be edited manually by selecting “Edit Query” in the dialog. Be careful: once the expression is edited you will not be able to use the grid to add elements to the filter. Everything will need to be added by hand in the Expression text box.

Similarly, add a new model for the North America region:

DMX:

ALTER MINING STRUCTURE TestStruct
ADD MINING MODEL DT_NorthAmerica
(
 CustomerKey,
 YearlyIncome,
 EnglishOccupation,
 CommuteDistance,
 BikeBuyer PREDICT
) USING Microsoft_Decision_Trees WITH FILTER( Region=’North America’ ), DRILLTHROUGH

UI:
Clone DT_Pacific and edit the filter to be North America instead of Pacific as shown below.

Note that if you clone the DT_Pacific model to create the DT_NorthAmerica model using the steps described above, the filters in the base mining model will also be cloned. This can be a real time-saver when creating multiple models with similar filters – the time-saving adds up when you start building complex filters.

DMX Deep-Dive on Filter Syntax

As you notice, FILTER is syntactically a function, taking a Boolean expression. The Boolean expression may be a simple predicate (of the Column=Value kind), an EXISTS predicate or a Boolean combination (using AND, OR and NOT operators) of multiple predicates.

The Column part of a simple predicate is always the name of a mining structure column.

The corresponding mining structure column does not have to be included in the mining model. In the examples above, Region is no longer an interesting attribute when all the training cases seen by a model have the same value (e.g. Pacific), therefore it is not part of the model.

A simple predicate is defined as , where the accepted set of operators depends on the content type of the column, and Value must always be a constant. Value generally has to have the same type as the column but can also be the NULLconstant.

UI: In the Filter dialog in BI Studio the NULL constant is denoted by two added values in the operator drop downs as IS NULL and IS NOT NULL to demote = NULL or <> NULL respectively.

If the column is:

A case level KEY – then no operator can be applied (no key filters are supported) . Nested KEY on the other hand is just a mapping between case and nested tables and can be used to filter (more on nested table filtering later)
DISCRETE or DISCRETIZED – supports the = and <> operators and also IS NULL and IS NOT NULL in the Filter dialog
CONTINUOUS – supports the =, <>, <, <=, >, >= operators and also IS NULL and IS NOT NULL in the Filter dialog

XML serialization of the floating point double numbers as well as rounding errors may result in losses of less significant digits, therefore using an = or <> operator with a double value may result in unexpected results, so it is safer to use a combination of other predicates.

In case of discretized (numeric or date/time) values, the = and <> operators have a semantic similar with the PREDICTION JOIN operator: they evaluate the bucket that contains the value and then apply the predicate to the bucket index. Therefore, for a discretized YearlyIncome column with buckets (0, 10000], (10000, 20000] etc. the filters below mean exactly the same thing:

(YearlyIncome=5000)
(YearlyIncome=9000)
(YearlyIncome=999.999)

Applying a filter on a mining model means, effectively, training the mining model with the data returned by a query like below:

SELECT … FROM MINING STRUCTURE TestStructure.CASES WHERE IsTrainingCase() AND ()

For those of you who use OLAP mining models, filters are somewhat similar to defining a cube slice on a mining structure – but at the model level.

If your filtered model has drillthrough enabled, then all the structure cases that match the filter will be linked with a content node, not only the training cases from the structure. Of course, only the structure training cases matching the filter will be used in training.

Therefore, the following queries are equivalent

SELECT … FROM MINING MODEL DT_NorthAmerica.CASES WHERE IsTrainingCase()

SELECT … FROM MINING STRUCTURE TestStructure.CASES WHERE IsTrainingCase()AND (Region=’North America’)

Model Accuracy in Filtered Models

Accuracy stored procedures can be applied on various data partitions (test or training). Filters add a new dimension to the data partitions, which can now be filtered test data or filtered training data. Accuracy stored procedures have syntax like below:

CALL SystemGetLiftTable(BikeBuyerClass, 2, ‘BikeBuyer’, true) // – to get the lift chart for the TRUE state of the BikeBuyer attribute

CALL SystemGetClassificationMatrix(BikeBuyerClass, 2, ‘BikeBuyer’) // – to get the full classification matrix for the BikeBuyer attribute

The second parameter (2) is a bitmask identifying the data partition(s) to be used in the accuracy stored procedure. Here are some bit values:
– 1 identifies the training partition
– 2 identifies the test partition
– 3 (bitwise OR combination of 1 and 2) identifies ALL the data in the mining structure (training + test)

Filters introduce another bit in the bitmask, with a value of 4. Therefore, here are some new values for the second parameter of the accuracy stored procedures:

– 5 (4 bitwise OR 1) identifies the filtered training data seen by the model in training
– 6 (4 bitwise OR 2) identifies the filtered test data – those test cases from the mining structure where the model filter applies

– 7 (4 bitwise OR 1 bitwise OR 2) identifies all the cases (training or test) in the structure where the model filter applies

Therefore the following calls might return different lift tables:

CALL SystemGetLiftTable(DT_Pacific, 2, ‘BikeBuyer’, true) // – to get the lift chart for the TRUE state of the BikeBuyer attribute on all the test cases in the structure

CALL SystemGetLiftTable(DT_Pacific, 6, ‘BikeBuyer’, true) // – to get the lift chart for the TRUE state of the BikeBuyer attribute on the test cases in the structure that match the filter defined in DT_Pacific

Filtering Nested Tables

The most common use of nested tables in SQL Server Data Mining is to model transaction tables. That is, model Customer entities together with the “bag” of products they purchased.

Nested table filters primarily serve two purposes:

Use (in the model) only those nested table rows with a certain property (e.g. consider only products that are not “Coca Cola”, because most people do buy “Coca Cola”, so it is not that interesting)
Use (in the model) only those cases where the associated nested tables have certain properties (e.g. – build a prediction model on customers that have purchased Coca Cola)

Notice the difference between the filters: while the first filter removes products from shopping baskets before modeling, the second removes customers from the model (those that have not purchased Coca Cola).

Let’s start with a mining structure containing customers, some demographic information and a list of products they bought:

CREATE MINING STRUCTURE TestStructure2
(
 OrderNumber TEXT KEY,
 IncomeGroup TEXT DISCRETE,
 Products TABLE
 (
      Product TEXT KEY,
      LineNumber LONG DISCRETE
 )
)

Next step, train the mining structure:

INSERT INTO MINING STRUCTURE TestStructure2
(
 OrderNumber,
 IncomeGroup,
 Products
 (
      SKIP,
      Product,
      LineNumber
 )
)
SHAPE {OPENQUERY([Adventure Works DW], ‘SELECT 
 OrderNumber,
 IncomeGroup
 FROM dbo.vAssocSeqOrders ORDER BY OrderNumber‘)
}
APPEND ({ OPENQUERY([Adventure Works DW], ‘SELECT 
 OrderNumber,
 Model,
 LineNumber
 FROM dbo.vAssocSeqLineItems ORDER BY OrderNumber‘) }
RELATE OrderNumber TO OrderNumber) AS Products

Now, the modeling part: for the first kind of filter described in the beginning, the goal is to build a model that predicts some, for example, demographics (IncomeGroup) based on the products a customer purchased, without considering a certain product (say, the ‘Mountain-500′ bike):

ALTER MINING STRUCTURE TestStructure2
ADD MINING MODEL IncomeGroupFromProducts
(
 OrderNumber,
 IncomeGroup PREDICT,
 Products
 (
      Product
 ) WITH FILTER(Product<>’Mountain-500′)
)USING Microsoft_Decision_Trees

The syntax of the nested table filter is very similar with the syntax of model filters – it effectively specifies a WHERE clause to be applied on the rows of the nested table. The training data for the model can be obtained with the following drillthrough query:

SELECT OrderNumber,  IncomeGroup,
 (SELECT Product FROM Products WHERE Product<>’Mountain-500′)
FROM MINING STRUCTURE TestStructure2.CASES WHERE IsTrainingCase()

For the UI create a structure with no models and add a new Model setting the columnLineNumber to ignore (as described in the previous section). The main difference you will notice from the previous example is that now if you select the nested table column a new element is selectable in the Mining Model / Contextual menu: Set Nested Table RowFilter. Select this option to open the Nested table filter dialog. You will now you opened the correct dialog because it will say Nested Table Row Filter for . The only items that will be available in the filter are nested table columns. The filter for IncomeGroupFromProducts is shown below.

The syntax changes for the second filter described before. Now the goal is to build a model predicting IncomeGroup only for those customers that did buy a certain product:

ALTER MINING STRUCTURE TestStructure2
ADD MINING MODEL IncomeGroupFromMountain500
(
 OrderNumber,
 IncomeGroup PREDICT,
 Products
 (
      Product
 )
)USING Microsoft_Decision_Trees WITH
FILTER(
EXISTS (SELECT * FROM Products WHERE Product=’Mountain-500′)
)

The new syntax element is the EXISTS predicate, which evaluates to TRUE if the query specified as argument returns at least one row.

Note that the list of columns for the EXISTS query is never used, so * is as good as anything else. The WHERE clause of the EXISTS predicate supports, in the case of nested table filters, only the simple predicates (or Boolean combinations of such predicates) described before.

The same filter can be applied using the Model Filter dialog in BI Studio. The main difference will be the available mining structure columns where you can select any of the structure’s columns or the nested table. You need to select the nested table as the Structure Column. Valid operators for nested tables are: Contains which translates intoExists and Not Contains which is the equivalent of Not Exists.

If you select the nested table then a ‘plus’ sign will appear to the left of that condition, this means you can expand the nested table filter. The nested table conditions work pretty much as regular conditions with the same rules. Everything inside the nested conditions will be used to create an Exists clause (with or without the Not). You can have multiple of these clauses and use AND / OR to create Boolean combinations. You can see the filter for IncomeGroupFromMountain500 below:

The training data for the model can be obtained with the following drillthrough query:

SELECT OrderNumber,  IncomeGroup,
 (SELECT Product FROM Products)
FROM MINING STRUCTURE TestStructure2.CASES 
WHERE 
IsTrainingCase() AND
EXISTS( SELECT * FROM Products WHERE Product=’Mountain-500′)

The query actually works in DMX, which implies that DMX contains now the EXISTS function which evaluates to TRUE if the sub-query argument returns at least one row. Note that, when used in a regular DMX query (i.e. not in a filter), the EXISTS sub-query can be applied to any nested table (including function results, such as PredictHistogram) and can take any WHERE clause that is valid in that context, including UDFs (therefore, outside of a filter, the EXISTS clause is not limited to structure table columns and simple predicates).

Combinations of filters and other SQL Server 2008 DMX features can be used to have very specialized data views, allowing complex modeling scenarios.

Multiple Filters on Nested Tables

Notice how the structure above contains a LineNumber column in the nested table, a numeric field indicating the category of the product on the same line. The model below detects cross-sales rules that lead from products in line 1 (bikes) to products in any other category:

ALTER MINING STRUCTURE TestStructure2
ADD MINING MODEL LineNumber1Recommendations
(
 OrderNumber,
 Products AS LineNumber1Products
 (
      Product
 ) WITH FILTER(LineNumber=1),
 Products AS OtherLineNumberProducts PREDICT_ONLY
 (
      Product
 ) WITH FILTER(LineNumber<>1)
)USING Microsoft_Association_Rules
WITH FILTER(
      EXISTS(SELECT * FROM Products WHERE LineNumber=1) AND
      EXISTS(SELECT * FROM Products WHERE LineNumber<>1)
)

Here is a breakout of the filters and constructs:

LineNumber1Products is an input nested table, based on data in the Products structure column, and containing only products in the category denoted by LineNumber 1
OtherLineNumberProducts is a predictable (PREDICT_ONLY) nested table, also based on data in the Products structure column, but containing only products in the categories other than the one denoted by LineNumber 1
The model is trained only on those customers that have both at least one product in the LineNumber 1category and at least one product in a different category, because any other customers are not relevant for the problem the model is trying to solve.
All the filters (both the nested table row filters and the model case filter) are applied on mining structure columns which do not necessarily show in the model (the case of LineNumber). When they do show in the model under an alias, the filter is applied on the source structure column name.

UI: The same can be accomplished in BI Studio by creating Nested Filters for each nested table and then on Model filter with both Exist clauses.

[Acknowledgement: Most of the content for this article comes from a two-part post on this topic on Bogdan Crivat’s blog.]

Under the Hood: How the Naive-Bayes Attribute Discrimination Viewer Gets Its Data

This tip provides insight into how the Attribute Discrimination pane in the Naive-Bayes viewer uses a system stored procedure to get its data.

If you have ever wondered about how the DM viewers get the data to display on the screen, read on.

In many cases, what is displayed in the data mining viewers is the result of built-in stored procedures which allow the processing required for the view to be done on the server without requiring all of the model content to be brought to the server. In this “Under the hood” tip, we will dive into one of those stored procedures.

Naive-Bayes Attribute Discrimination View

The attribute discrimination view for the Naive-Bayes algorithm shows the differences in the input attributes across the states of an output attribute.

The viewer doesn’t download all of the correlations in the Naive-Bayes content, rather it calls the stored procedure GetAttributeDiscrimination like this:

CALL System.GetAttributeDiscrimination
  ('Classify CollegePlans NB', 
   '100000005', 
   'Plans to attend', 
   1, 
   'All other states', 
   2, 
   0.0005, 
   true)

The not-so-obvious parameters are, in order strModel, strPredictableNode, strValue1, iValType1, strValue2, iValType2, dThreshold, and bNormalize. Let’s go through these parameters:

strModel – The name of the model.

strPredictableNode – This one is a bit difficult, as it takes the Node Unique Name of the target attribute instead of just the string you see in the viewer. The Node Unique Name identifies the attribute in the content rowset generated by the model. You can get the list of predictable attributes and their Node Unique Names by calling another stored procedure – like this CALL System.GetPredictableAttributes(‘ModelName’). This stored procedure returns two columns – one for the attribute name and one for the Node Unique Name.

strValue1 – The name of the value you want to compare on the left hand side. The usage of this parameter depends on the value of the next parameter.

iValType1 – This parameter indicates how to treat strValue1. It can have values 0,1, or 2. If this parameter is a 1, the value in strValue1 is the actual state of the attribute. However, if this parameter is a 0 or 2, the value in strValue1 is ignored. If the value is 0, the left-hand value is considered to be the “missing state”. If the value is 2, the left hand value is considered to be “all other states.” In the example above, “All other states” is specified only because it looks nice (and it’s easier to just drop the combo box value into the function call even if it will just be ignored).

strValue2 – Like strValue1, but for the right hand side.

iValType2 – Like iValType2, but for the right hand side.

dThreshold – A threshold value used to filter results, such that small correlations don’t come back in the results. Usually you set it to a really small number like 0.0005 in the example above.

bNormalize – Whether or not the result is normalized. If this value is true, the results are normalized to a maximum absolute value of 100, giving a possible range of –100 to 100. All this does is take the largest absolute value in the result and divide that into 100, and then multiple all the other numbers by that amount. If set to false, the numbers are whatever they are and you can figure it out yourself – it’s up to you, but the NB viewer always sets this to true.

The Results

Calling this routine returns a row for every differentiating attribute/value pair with a score higher than the specified threshold. The row contains the differentiating pair along with the score and some other columns and looks.

The score column is the “important” one and is best explained as if you did something like a C language compare routine e.g int Compare(int v1,int v2) { return v1-v2; } . That is, if the value is positive, it favors value1 and if the value is negative, it favors value2. The other columns are the actual counts of the correlations of the discriminator against the inputs. The best way to understand them is to look at the Mining Legend as you browse the model and click on rows. For example, if you clicked on the first row of the result above (in either picture), the Mining Legend.

Of course, once you have the result set you can use it wherever you want – in Reporting Services, Integration Services, or in your custom program.

Sidebar: How Do I See What Queries Are Being Sent by the Viewers?

Run SQL Server Profiler.
Start a New Trace from the File Menu and connect to Analysis Services.
Leave all the defaults on for the Trace Properties dialog that appears.
Go to the data mining viewer you’re interested in digging into, and browse a model.
Profiler will show you the queries that are being sent. Click on a “Query Begin” event line in the top pane to see the full query text in the bottom pane.