Microsoft.public.sqlserver.datamining FAQ

General topics

What is the microsoft.public.sqlserver.datamining group?

This is a newsgroup dedicated to discuss issues on the Microsoft Analysis Services data mining functionality available with the Microsoft SQL Server product. The topics include, but not limited to,

Data mining general issues,
Using and understanding the Microsoft Analysis Services data mining component,
OLE DB for DM specification,
Using the sample provider source code,
Application development.

How to access microsoft.public.sqlserver.datamining group?

Visit http://www.microsoft.com/sql/community/newsgroups/default.asp and select “Data Mining” group. You could use Web-based client or NNTP news reader which is available from Microsoft Outlook Express.

What is data mining?

Simply put, data mining is the process of exploring large quantities of data in order to discover meaningful information about the data, in the form of patterns and rules. In this process, various forms of analysis can be used to discern such patterns and rules in historical data for a given business scenario, and the information can then be stored as an abstract mathematical model of the historical data, referred to as a data mining model. After a data mining model is created, new data can be examined through the model to see if it fits a desired pattern or rule. From this information, actions can be taken to improve results in the given business scenario.

Is there a data mining community web site?

Yes, there is a Data Mining MSN Community for posting articles and binaries that tend to expire on newsgroups. The address to the MSN community is: http://communities.msn.com/AnalysisServicesDataMining. You can find sample applications and many useful add-ins. And, they are increasing.

What is Microsoft Analysis Service data mining functionality?

The data mining functionality in SQL Server 2000 Analysis Services are integrated tightly with both relational and OLAP data sources. SQL Server 2000 Analysis Services includes two of data-mining algorithms developed by Microsoft Research: Microsoft Decision Trees and Microsoft Clustering. It supports SQL-like data mining language that is specified in OLE DB for Data Mining. In addition, starting from SQL Server 2000 SP1, it supports third-party data mining provider aggregation.

Visit a Microsoft TechNet page for Data Mining for an excellent introduction to Microsoft Analysis Service data mining feature. Also, the online tutorial “Analysis Manager Concepts & Tutorial” on the first “Getting Started” page from the Analysis Manager is a good start to use the data mining functionality.

What are the case studies of Microsoft Analysis Services data mining available?

Note that the following list is never meant to be comprehensive, but to be useful as starting point of reference.

ComputerFleet – A case study for data-mining solution promoting efficiency in managed leasing business.
Web Usage Mining With SQL Server 2000 – An E-book that describes how to build a collaborative filtering and clustering for portal.
Disco S.A – A case study for customer understanding in frequent buyer program.
Alinea® Forecast Engine – optimizing process for promotion coupons.

What are new algorithms available in the next release?

Beta1 of Yukon (next release) has been made available to selected list of beta customers. Lot of communications on Yukon beta are being exchanged through http://www.sqljunkies.com. Beta2 of Yukon is expected to be announced to public some time this year. The Yukon data mining component will have a huge number of new features including new data mining algorithms. The new data mining algorithms already available in the beta 1 are the following:

– Decision trees with regression
– Association rules
– Sequence clustering
– Time series
– Neural network
– Naive Bayesian

General Usage and DMX

Do we have a data mining query tool, similar to the SQL Server query analyzer?

Yes. Visit Data Mining MSN Community where you can download DMSample (look for “Files”–>”DMSamp”). Note that you have to have Microsoft Analysis Services 2000 installed in order for DMSample to work. Also, you will have to download MSHFLXGD.* from the same folder to your system folder (say, C:\Windows\Systems32) and register MSFLXGD.OCX by using regsvr32.exe.

DMSample is a Visual Basic application with which you can build/execute any prediction query against server or local mining models. You can also create/train local mining models. DMSample is available as source code. You can easily modify the code for your own purpose. In the same folder where you located DMSample, you can also find a zip file (DMXQuerySamples.zip) that contains lots of DMX examples. You can load them into DMSample appl and run.

Can I run a data mining query from SQL server (e.g., query analyzer)?

Yes and No. In order to use DMX (OLE DB for DM syntax) through the SQL server, you could make MSOLAP provider allowed in-proc in SQL Server and use OPENROWSET() in the SQL Server query analyzer. Making MSOLAP allowed in-proc can be done setting the registry as follows:

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\Providers\MSOLAP]

“AllowInProcess”=dword:00000001

You could also do it in SQL Server Enterprise Manager when you add a new linked server under provider options (enable allowinproc).

Note that even if you added MSOLAP as in-proc provider of SQL server, it doesn’t mean that you can run DMX (DM query) against the SQL server. It just means that you can run a SQL server query that can include OPENROWSET through which you can issue DMX query. See the following example.

SELECT ClusterNo, SUM(One)

FROM OPENROWSET(‘MSOLAP’, ‘location=;Initial Catalog=;…’,

‘SELECT $Cluster AS ClusterNo, 1 AS One

FROM MyModel PREDICTION JOIN

OPENROWSET(”Microsoft.Jet.OLEDB.4.0”,

”data source= People.mdb”,

”SELECT [Key], Age, Gender, CarColor FROM People”) AS t

ON MyModel.TheKey = t.[Key] AND

MyModel.Age = t.Age AND

MyModel.Gender = t.Gender AND

MyModel.CarColor = t.CarColor’)

GROUP BY ClusterNo

Also, note that DMX could return a nested table column (e.g., PredictHistogram()) that SQL Server query analyzer cannot handle. You can avoid this problem by using FLATTENED in the SELECT-list.

You would find DMSample (described in the previous question) much more useful for DMX query tool.

I have models trained. How do I perform predictions?

There are two ways of performing predictions after you have models trained. One is to use the DTS data mining prediction task. The other is to use ADO in your favorite programming language (e.g., Visual Basic, C++, or C#). The following piece of C# codes connects to the Analysis Server through MSOLAP provider and executes a DMX query to get the result Recordset. Note that this example uses ADODB reference.

using System;

namespace DMCSharp

{

class Class1

{

[STAThread]

static void Main(string[] args)

{

ADODB.Connection conn = new ADODB.Connection();

conn.Open("Provider=MSOLAP;Data Source=localhost;InitialCatalog=Foodmart 2000",null,null,0);

string query ="SELECT * FROM [Member Card RDBMS].CONTENT";

ADODB.Recordset rst = new ADODB.Recordset();

rst.Open(query, conn, ADODB.CursorTypeEnum.adOpenForwardOnly,

ADODB.LockTypeEnum.adLockReadOnly, 0);

// do what you want to do with the result record set

}

}

}

How do I create a nested table column using Analysis Manager?

Let’s suppose you have two tables: Customers (case table), Purchases (transaction table). Then, the following steps will create a mining model with a nested table column you want:

– In the Analysis Manager, start “New mining model”
– When you’re asked to select case table, check “A single table contains the data” and go ahead to select “Customers” as the case table
– At the final dialog, check “Save, but don’t process now”
– When it brings up the mining model editor, add a nested column as follows:
– Right click at the schema pane and select “Insert tables…”
– Select “Purchases”
– In the tool bar, click “Insert nest table”
– Select a column as the key of the nest table column. Note that this column is not the key of “Purchases”, but the key of the rows that belong to the same customer. For instance, if the key of “Purchases” is , then you should select ProductName as key of the nested table.
– You could add more columns inside the nested column by right click on the nested column created

How do I use a nested table input in prediction?

In the prediction query, you will need to ‘shape’ your input tables (case table and nested table) into a hierarchical structure that the model has. Suppose the model looks like the following:

CREATE MINING MODEL SiteAffiliation

( CustomerID TEXT KEY,

Gender TEXT DISCRETE,

WebClick TABLE PREDICT

(

URL TEXT KEY

)

) USING Microsoft_Decision_Trees

Suppose we have two tables, Customers for the customer’s demographics, Clicks for the URLs that the customer has visited. The following DMX is using the SHAPE syntax to input the URLs to the prediction engine:

SELECT input.CustomerID, Predict(SiteAffiliation.WebClick)

FROM SiteAffiliation PREDICTION JOIN

SHAPE {OPENROWSET('SQLOLEDB', …,

'SELECT CustomerID, Gender FROM Customers ORDER BY CustomerID')}

APPEND ({OPENROWSET('SQLOLEDB', …,

'SELECT CustomerID, URLCategory FROM Clicks ORDER BY CustomerID')}

RELATE CustomerID to CustomerID) AS WebClick AS input

ON SiteAffiliation.Gender = input.Gender AND

SiteAffiliation.WebClick.UrlCategory = input.WebClick.UrlCategory

Algorithm Details

Give me an example of collaborative filtering using Microsoft decision trees.

There is a sample for shopping items recommendation system (collaborative filtering) using Microsoft Analysis Services decision trees. The document and source codes are found at Duwamish Online in MSDN (Setup, Design, Development).

The following is a simple example to analyze web site affiliation based on web click. You may try similar models for other applications.

Creation of the model (this would be typically done with the mining model wizard from the Analysis Manager unless you’re creating a local mining model):

CREATE MINING MODEL SiteAffiliation

(              CustomerID            TEXT KEY,

WebClick               TABLE PREDICT

(

UrlCategory TEXT KEY

)

) USING Microsoft_Decision_Trees

Training (again, this would be also typically part of the mining model wizard from the Analysis Manager unless you’re training a local mining model):

INSERT INTO SiteAffiliation (CustomerID, WebClick (SKIP, UrlCategory))

SHAPE {OPENROWSET('SQLOLEDB', …,

'SELECT CustomerID FROM Customer ORDER BY CustomerID')}

APPEND ({OPENROWSET('SQLOLEDB', …,

'SELECT CustomerID, URLCategory FROM WebClick ORDER BY CustomerID')}

RELATE CustomerID to CustomerID) AS WebClick

By this, we’re actually building a separate tree for each UrlCategory (KEY) value. Each tree will tell how likely the URL will be visited based on other URLs that the customer visited.

Now, predicting top 5 most likely UrlCategory items given history:

SELECT FLATTENED

TopCount(Predict(SiteAffiliation.WebClick, INCLUDE_STATISTICS), $Probability, 5)

FROM SiteAffiliation PREDICTION JOIN

(SELECT (SELECT 'url-1' AS UrlCategory UNION

(SELECT 'url-2' AS UrlCategory) AS WebClick) AS input

ON SiteAffiliation.WebClick.UrlCategory = input.WebClick.UrlCategory

The Predict(SiteAffiliation.WebClick, INCLUDE_STATISTICS) will examine each tree (say, URL-3 tree, URL-3 tree, and so on) to figure out the probability that the URL would be visited based on the input (i.e., URL-1, URL2). Then, TopCount() will take top 5 based on the order of the probability. Although several optimization is being kicked in our implementation, this is how it works conceptually.

Give me an example of customer segmentation using Microsoft clustering.

This is a simple example to show how you can find segmentation of people based on demographics. This could be easily modified to your own purpose. Creation of the model (this would be typically done with the mining model wizard from the Analysis Manager unless you’re creating a local mining model):

CREATE MINING MODEL MyModel

(

TheKey              LONG KEY,

Age                    DOUBLE CONTINUOUS,

Gender               TEXT DISCRETE,

HairColor           TEXT DISCRETE,

CarColor            TEXT DISCRETE

) USING Microsoft_Clustering;

Training (again, this would be also typically part of the mining model wizard from the Analysis Manager unless you’re training a local mining model):

INSERT INTO MyModel (TheKey, Age, Gender, HairColor, CarColor)

OPENROWSET('Microsoft.Jet.OLEDB.4.0',

'data source=People.mdb',

'SELECT [Key], Age, Gender, HairColor, CarColor

FROM People ORDER BY [Key]');

And, finding out which cluster each case (people) belongs to:

SELECT t.Age, t.Gender, t.CarColor, Cluster(), ClusterProbability(), PredictHistogram(Cluster())

FROM MyModel PREDICTION JOIN

OPENROWSET('Microsoft.Jet.OLEDB.4.0',

'data source= People.mdb',

'SELECT [Key], Age, Gender, CarColor FROM People') AS t

ON  MyModel.TheKey = t.[Key] AND

MyModel.Age = t.Age AND

MyModel.Gender = t.Gender AND

MyModel.CarColor = t.CarColor;

Why is the support of a cluster in the content is different from Cluster()?

When you perform customer segmentation using Microsoft_Clustering just as described above, you might find that the number of customers for a cluster may be different from the support of the cluster in the content. Or, sometimes, you may find a cluster with no customers belong to. Here is the explanation why it happens.

The Microsoft Clustering algorithm performs what is known as “soft clustering.” This means that any particular case could belong to many or even all clusters simultaneously, with a percent likelihood of the case belonging to any one cluster. Continuous variables are stored as a Gaussian distribution, basically stating that “cases with values that fall in this range have a certain likelihood of falling within this cluster.” The support of a cluster you see in the model’s content is calculated based on this “soft clustering”.

However, when you do Cluster() in PREDICTION JOIN query to find out what cluster each case belongs to, it will return the closest one, which is essentially “hard clustering” where you are assigning each case to one and only one cluster. In some cases, you may find a cluster with no cases belong to in terms of this hard clustering.

Where do I get the details of the two algorithms?

For decision tree (classification):
– Correlation counting:
Surajit Chaudhuri, Usama M. Fayyad, Jeff Bernhardt,
Scalable Classification over SQL Databases. ICDE 1999: 470-479
Found in ftp://ftp.research.microsoft.com/users/surajitc/icde99.pdf

– The default scoring methods (Bayesian Dirichlet Equivalent with Uniform prior):
David M. Chickering; Dan Geiger; David Heckerman,
Learning Bayesian Networks: The Combination of Knowledge and
Statistical Data, MSR-TR-94-09, 1994
Found in http://www.research.microsoft.com/research/pubs/view.aspx?tr_id=81

For clustering:
Paul S. Bradley; Usama M. Fayyad; Cory A. Reina
Scaling EM (Expectation Maximization) Clustering to Large Databases,
MSR-TR-98-35, 1998
Found in http://www.research.microsoft.com/scripts/pubs/view.asp?TR_ID=MSR-TR-98-35.

What kind of pruning method is used in Microsoft decision trees?

We don’t do pruning in Microsoft Decision Trees. Instead, we allow users to be able to control overfitting/underfitting by specifying COMPLEXITY_PENALTY parameter. Here is the description from the BOL:

A floating point number with a range between 0 and 1. Used to inhibit the growth of the decision tree, the value is subtracted from 1 and used as a factor in determining the likelihood of a split. The deeper the branch of a decision tree, the less likely a split becomes; the complexity penalty influences that likelihood. A low complexity penalty increases the likelihood of a split, while a high complexity penalty decreases the likelihood of a split. The effect of this mining parameter is dependent on the mining model itself; some experimentation and observation may be required to accurately tune the data mining model. The default value is based on the number of attributes for a given model:

For 1 to 9 attributes, the value is 0.5.
For 10 to 99 attributes, the value is 0.9.
For 100 or more attributes, the value is 0.99.

How is the dependency network generated?

Imagine all the decision trees are learned at the same time (we don’t actually learn all the trees at once, but this is the way to see how the slider works). At any one time during the learning process, each target variable for which we are building a tree has a set of possible splits and corresponding score gains.

To grow the dependency network, we accept the split with the highest score gain across all variables. We repeat this process until all possible splits for all variables have negative score gains. When you move the slider up, you’re seeing the order in which the dependency network was grown with this (hypothetical) procedure.

The Dependency Network is generated using the split score gains for the splits in a family of related trees. One model can contain many trees, especially in the case where there is a nested table being predicted, and each tree can have any number of splits. Each of these splits is given a ‘score gain’ during the training, and these scores can be compared to find the ‘best’ one. Our content schema rowset (also available as SELECT * FROM model.CONTENT) should have a column called MSOLAP_NODE_SCORE which gives the score gain of the split that happened to the node. The input attribute of the split in the node is found from NODE_RULE column.

What is $AdjustedProbability or PredictAdjustedProbability()?

The MS Analysis Service DM has a formula to penalize popular items in prediction. Suppose the predicted probability of A and B being purchased are the same, say, 10% based on the customer’s basket. Now, suppose A is so popular that 10% of all customers buy while B is not so popular that only 1% of them buy (this is what we call marginal probability that is what we can tell even without looking at that particular customer’s basket). In this case, you probably want to recommend B more than A because B would be more interesting to that particular customer. The $AdjustedProbability is a “lift” of the predicted probability over marginal probability. The formula is our internal, undocumented, which may change in the next release, but it looks something like the following:

AdjustedProbability = PredProb * (1 – MargProb) ^ Constant

As you can see, AdjustedProbability is not a probability per se any more. It’s intended to be interpreted as “lift” of probability.

Why is the probability of a zero-support state greater than 0 in prediction?

It is by design. It is the result of some smoothing we apply to learned probabilities to take into account our prior belief. For example: How many US presidents have been women? 0. So does that mean that the odds of a woman ever being president are 0? No it does not.

Can I apply Microsoft Analysis Services to text mining?

Although Microsoft Analysis Services 2000 doesn’t support text mining directly, you could still do lots of things with the nested table and associative prediction capability that it supports. For instance, you could define a mining model as something like this:

CREATE MINING MODEL TextCategory

(

DocID    LONG KEY,

Category    TEXT DISCRETE PREDICT,

WordVectors TABLE

(

Word    TEXT KEY,

Frequency LONG DISCRETIZED

)

) USING Microsoft_Decision_Trees

And, train it with your documents. You will have to extract the word vectors from each document to train the model. The extraction process would involve stemming, getting rid of stop words, etc that are outside of data mining functionality.

Now, you can predict category for each document using PREDICTION JOIN query. I haven’t tried this, but certainly it will be different from other typical text mining because decision tree finds pattern based on scoring (e.g., similar to Bayesian) while most of text mining or IRS systems uses its own similarity or distance functions.

In addition, if you wanted to know how each word is related (i.e., what words show up together in a document), you might try to create a mining model something like following and check out the dependency network:

CREATE MINING MODEL TextCategory
(
DocID    LONG KEY,
WordVectors TABLE PREDICT
(
Word    TEXT KEY
)
) USING Microsoft_Decision_Trees

Can I do a Time-series analysis with Microsoft Analysis DM?

Yes, indirectly using the Microsoft Decision Tree. You can create a model that will do predictions on a time series like the one below, but it requires some data manipulation.

What you need to do is to create cases that using relative time steps as input values and use decision trees to predict the current year’s value. You can use however many previous data steps you wish in your prediction. As our decision tree doesn’t predict continuous values you need to discretize the output column with a high number to get a lot of small buckets. You will also want to play with the COMPLEXITY_PENALTY parameter to experiment with the generated trees.

For example:

CREATE MINING MODEL myTSModel
{
Year       LONG    KEY,
Value      LONG    DISCRETIZED(50)
Minus1    LONG    CONTINUOUS,
Minus2    LONG    CONTINUOUS,
Minus3    LONG    CONTINUOUS,
Minus4    LONG    CONTINUOUS
}
USING Microsoft_Decision_Trees (COMPLEXITY_PENALTY=0.5)

and your input table would look like:

Year     Value     Minus1     Minus2    Minus3    Minus4
1997      100      NULL      NULL     NULL     NULL
1998      200       100         NULL     NULL     NULL
1999      400       200         100         NULL     NULL
2000      800       400         200         100         NULL
2001     1600      800         400         200         100
2002     3200    1600         800         400         200

When you predict values for 2003, you would use the data from 1999-2002, for 2004, you would use 2000-2002 plus the predicted value for 2003.

One problem is guaranteeing you have enough data for the trees to create something meaningful. You aren’t going to get any splits in your tree if you don’t have enough input cases.

Can I perform association rules discovery using Microsoft Analysis DM?

Although the Microsoft Analysis Service 2000 DM component doesn’t implement association rules algorithm per se, the functionality that you’re looking for might be already available with our decision trees.

1) You can do certain type of association discovery with the Microsoft_Decision_Trees algorithm. If you create a model with multiple predictable columns, or especially a nested table that is marked “Input and Predictable”, a tree is generated for every predictable attribute. From these trees, associations can be derived and are displayed in the dependency network viewer provided with the Analysis Manager (right click on the mining model). Information on how the network is derived is available in this FAQ as a separate item (look for “dependency network”). Basically, the dependency networks tells you which attribute is the best predictor for which.

2) With the model above, you can also perform associative prediction (aka, recommedation). Say, you could predict what items a customer would buy together with items that the customer already bought. An example of this type of task is described in this FAQ as a separate item (look for “collaborative filtering”).

3) DBMiner (www.dbminer.com) provides a package that adds an association rules algorithm to SQL Server 2000 Analysis Services. You can view their
website for details.

OLAP Mining Models

Can I use calculated members to create a mining model?

This is a known limitation we are working toward resolving in the next version.

Although many new features of the OLAP mining model won’t be available before the next release, you can build sophisticated models with OLAP source programmatically in SQL Server 2000. The trick is to do everything in a relational way! For example, you may create a regular mining model (relational), then train it with data get from cubes using MDX query. If the model has nested tables, you may need to use SHAPE syntax after you getting the result back from OLAP server. This way, you can train a model using calculated members, measures, etc.

Can I create a dimension for the clusters of the cases?

In order to see the members (cases) for each cluster, you need to do a prediction query using cluster() prediction function. Cluster() will give the cluster id (see the previous question). In order to reorganize a regular dimension based on the OLAP clustering mining model, you need to do the following steps:

Create an OLAP cluster mining model and train the model.
Use SQL and MDX queries to get the input data for each case.
Predict the cluster id for each member in the case dimension using singleton prediction query based on the trained OLAP mining model.
Create a regular dimension based on the predicted cluster id

The sample singleton query could be:

SELECT cluster()

FROM CustomerSegment PREDICTION JOIN

(SELECT 'F' AS gender, 'S' AS marital_status, 'Partial College' AS

education, '$30K - $50K' AS yearly_income, 'Silver' AS member_card,

1.08 AS store_sales ) AS t

ON CustomerSegment.[Customers.Name.Gender] = t.gender AND

CustomerSegment.[Customers.Name.Marital Status] = t.marital_status AND

CustomerSegment.[Customers.Name.Education] = t.education AND

CustomerSegment.[Customers.Name.Yearly Income] = t.yearly_income AND

CustomerSegment.[Customers.Name.Member Card] = t.member_card AND

CustomerSegment.[measures.Store Sales] = t.store_sales

How do I make

prediction

from an OLAP mining model?

You can use the same data mining prediction queries. For instance, the following example is a valid DMX prediction query on [Customer Pattern Discovery] model which is OLAP type model, available when you install Analysis Services 2000.

SELECT [customer_input].[customer_id],

[Customer Pattern Discovery].[Customers.Name.Member Card],

PredictProbability([Customer Pattern Discovery].[Customers.Name.Member Card]) AS Prob

FROM [Customer Pattern Discovery] PREDICTION JOIN

OPENROWSET

(

'Microsoft.Jet.OLEDB.4.0',

'Provider=Microsoft.Jet.OLEDB.4.0;Data Source=Y:\foodmart 2000.mdb;Persist Security Info=False',

'select * from customer') AS [customer_input]

ON [Customer Pattern Discovery].[Customers.Name.Gender] = [customer_input].[gender] AND

[Customer Pattern Discovery].[Customers.Name.Marital Status] = [customer_input].[marital_status] AND

[Customer Pattern Discovery].[Customers.Name.Education] = [customer_input].[education] AND

[Customer Pattern Discovery].[Customers.Name.Yearly Income] = [customer_input].[yearly_income]

Why does processing my DM dimension or virtual cube fail?

There is a limitation in AS 2000 DM dimensions that they can only be formed on models that contain one tree. Therefore you would have to make one (and only-one) case dimension property or measure predictable to create a data mining dimension and a virtual cube. Besides, the underlying DM model has to have more nodes than just “All” root node, which would be meaningless since it would create a dimension with 0 members. Otherwise, you’ll get the following processing error:

Analysis Server Error: Processing error [Source data contains no rows]
Note that you can still create a mining model from a cube to perform cross-selling predictions, just don’t check the create dimension/virtual cube sections. The best (easiest) cross sell models I’ve created with this data is by using the Orders table as the case table, with the “Order Details” view as the nested table from the Northwind database.

Select:
OrderID – case key
ProductName – nested key

make the product table “Input and Predictable” and use whatever other columns you wish.

How can I use my cube as prediction input to my relational mining model?

For instance, the Member Card RDBMS is a relational mining model (as opposed to OLAP mining model), you cannot do predictions in MDX using the MDX function Predict. Instead, you have to predict using “regular” DM syntax (a PREDICTION JOIN with either an OPENROWSET or a singleton).
It is possible to get the data from the cube. One way is to use the flattened rowset of MDX and run an MDX query that will return the results as a rowset (see more about flattened rowset in Books OnLine=BOL) or using our SQL syntax (see more about that in BOL).

For example use the following query (substitute your server name in the Data Source property and note the SQL Compatibility=3 for the order by to work)

SELECT
[DataRowset].[Customers:Name!KEY] as [Customer ID],
[DataRowset].[Customers:Name!NAME] as [Full Name],
Predict([Member Card RDBMS].[member card], INCLUDE_NULL) as [Prediction]
FROM
[Member Card RDBMS] PREDICTION JOIN
OPENROWSET (
'MSOLAP.2',
'Provider=MSOLAP.2;Data Source=localhost;Initial Catalog=FoodMart 2000;SQL Compatibility=3',
'SELECT DISTINCT [Customers:Name!KEY],
[Customers:Name!NAME],
[Customers:Name!Gender],
[Customers:Name!Marital Status],
[Customers:Name!Education]
FROM [FoodMart 2000].[Sales] ORDER BY [Customers:Name!NAME]'
) as [DataRowset]
ON
[DataRowset].[Customers:Name!KEY] = [Member Card RDBMS].[customer id] AND
[DataRowset].[Customers:Name!Gender] = [Member Card RDBMS].[gender] AND
[DataRowset].[Customers:Name!Marital Status] = [Member Card RDBMS].[marital status] AND
[DataRowset].[Customers:Name!Education] = [Member Card RDBMS].[education]

However the flattened rowset is a much better approach. The following query gives also expected ordering etc (the only problem seems to be a bug that
prevents anything but the key to be selected in the select clause (see the commented line – can’t uncomment it):

SELECT
[DataRowset].[[Customers]].[Key]]] as [Customer ID],
//[DataRowset].[[Customers]].[Name]]] as [Full Name],
Predict([Member Card RDBMS].[member card], INCLUDE_NULL) as [Prediction]
FROM
[Member Card RDBMS] PREDICTION JOIN
OPENROWSET (
'MSOLAP.2',
'Provider=MSOLAP.2;Data Source=localhost;Initial Catalog=FoodMart 2000',
'SELECT {measures.defaultmember} on 0,
Order([Customers].[Name].members,[Customers].currentmember.Name, BASC)
dimension properties
Customers.[Name].Properties("Name"),
Customers.[Name].Properties("Key"),
Customers.[Gender],
Customers.[Marital Status],
Customers.[Education] on 1 from
[Sales]'
) as [DataRowset]
ON
[DataRowset].[[Customers]].[Key]]] = [Member Card RDBMS].[customer id] AND
[DataRowset].[[Customers]].[Gender]]] = [Member Card RDBMS].[gender] AND
[DataRowset].[[Customers]].[Marital Status]]] = [Member Card RDBMS].[marital status] AND
[DataRowset].[[Customers]].[Education]]] = [Member Card RDBMS].[education]

Third-party Providers and Tools

Can I create my own (third-party) provider and plug-in to the Analysis Services?

Certainly. A white paper discussing the implementation of aggregated data mining providers for SQL Server Analysis Services SP1 and later is available.

What are the third-party providers and tools available?

DBMiner – various algorithms plugged in to Analysis Services
Angoss resource kit for OLE DB for DM – Free visualization tools (decision trees, clustering, lift-chart) from Angoss, based on OLE DB for DM
Currito’s paper – creates a new provider using the sample provider source code
Also check the data mining community web site where we have several samples including a cluster viewer, thin-client, etc.

Where can I find a web-based thin-client browsers?

Some free codes for xml-based decision tree browser are available at the MSN DM group site (http://groups.msn.com/AnalysisServicesDataMining).

If you want users to do predictions through thin-client, you can easily write ASP pages that collect input, generate a prediction query (DMX) and return the results to the user. The MSN DM group site has links to articles and samples that could help you out.

Can my (third-party) provider create OLAP mining models?

Creating OLAP mining models with an aggregated provider for Analysis Server 2000 SP1 is not supported. Creating such models is not part of the OLE DB for Data Mining specification.

The Microsoft Decision Tree and Clustering algorithms are part of Analysis Services and not an aggregated provider. It is possible for you to create your own syntax/method for data mining across a cube; however this functionality of your provider will not be accessible through the Analysis Manager UI.

If you want to extend the functionality of your provider beyond that of the OLE DB for Data Mining spec, you can create your own UI as an Analysis Manager Plug In, and use the syntax

EXECUTE ON

to send non-standard commands to your provider through the aggregator interface. The aggregator will strip away the “EXECUTE ON ” portion of the command and send the directly to your provider.

Can I export my trained model into PMML format or vice versa?

You can obtain the PMML representation of a mining model content from the DMSCHEMA_MINING_MODEL_XML schema rowset. The GUID
for this schema rowset is in Samples\include\oledbdm.h under your Analysis Services install directory.

You can also obtain the same result by executing “SELECT * FROM model.PMML”.

You can also create a mining model from a PMML (i.e., importing a PMML) by using CREATE MINING MODEL FROM PMML . But, there are several issues that you have to keep in mind. First of all, the only way to create server objects in Microsoft Analysis Service is through DSO, which uses its own object model. But, CREATE MINING MODEL FROM PMML will simply create a “session” model that will go away when you close the session, or you could create a “local” mining model by setting the “Mining Location” parameter on your connection string.

Also, note that Microsoft Analysis Services 2000 DM component was written using PMML 1.0, which was very incomplete and required many “extensions” just to make the format usable for us. The PMML generated by other vendors isn’t likely to be directly importable and will likely require annotations before importing.

Other Usage Issues

Why do the models created in DMSample or through ADO disappear when I restart the application?

The mining models created through DM Sample application are session mining models (unless you specified a path for “Mining Location” in the connection dialog). Their life time is the current session; they will disappear once you restart (or disconnect) the DM Sample application. Similarly, you can create session mining models or local mining models by executing CREATE statement with ADO. The session mining models or local mining models are different from server mining models that you can see and manage from the Analysis Manager. You should use Analysis Manager or DSO to create a server mining model.

At http://groups.msn.com/AnalysisServicesDataMining there is an example of data mining in Excel that uses Session Mining Models. There are also sample
Analysis Manager Add-ins that use DSO to modify Server Mining Models.

Can I redistribute DM browsers (e.g., decision tree browser) with my application?

Unfortunately, the license for Analysis Services does not include the ability to reuse the controls in other applications, which is why you won’t
find any documentation for it.

The mining model wizard shows no algorithms saying “Unable to get list of data mining algorithms”. What’s wrong?

Microsoft Analysis Services data mining functionality is supported through several separate dlls and rlls, which are loaded when its functionality is actually being used. The symptom that you describe may happen when loading of such dlls/rlls fails for some reason (such as corruption of the registry entry or incomplete setup, etc). In most cases, installing (re-installing if necessary) Microsoft SQL Server SP3solves this problem.

Why am I getting “The cube ‘my model’ does not exist” error for a data mining query?

You are seeing the misleading error because the server tries to execute the query against both DM and OLAP and if both fail, it returns the error for the default dialect (which is OLAP if not specified). You can set the “Default GUID Dialect” (DBPROP_MSMD_DEFAULT_GUID_DIALECT) connection property to the dialect guid for DM to get the correct errors. See msmd.h under Samples\include in your Analysis Services install tree for the guid defined for MDGUID_DM.

Using the Sample Provider Source Code

What is the sample provider and where to get?

SQL Server Analysis Services SP1 supports aggregated provider functionality so that other algorithm vendors could develop their own OLE DB for DM provider and plug it into Analysis Services platform. A white paper discussing the implementation of aggregated data mining providers is available.

In order to leverage the effort of ISVs to build data mining providers, Microsoft has made the data mining sample provider with source code available. The sample code includes the complete implementation of an aggregated provider as well as the following:

All required OLE DB objects, such as session, command, and rowset
The OLE DB for Data Mining syntax parser
Tokenization of input data
Query processing engine
A sample Naïve Bayes algorithm
Model archiving in XML and binary formats

Can I compile the sample provider with trial version of Visual Parse++?

Maybe. The sample provider code uses several functions of Visual Parse to initialize the parser data table and those functions are not available in the trial version. You will receive the following error messages when you link with the trial version.

… error LNK2001: unresolved external symbol “public: __thiscall SSYaccTable::SSYaccTable(structHINSTANCE__ *,char const *,char const *)” ([email protected]@[email protected][email protected]@[email protected])

… error LNK2001: unresolved external symbol “public: __thiscall SSLexTable::SSLexTable(structHINSTANCE__ *,char const *,char const *)” ([email protected]@[email protected][email protected]@[email protected])

……

There may be some way to get around this. But, it will certainly require changes of the sample provider source code.

How do I debug my sample provider?

You should set the Analysis Server executable (msmdsrv.exe found in bin folder of your Analysis Services installation folder) as your executable for debugging. Under Project\Settings\Debug, put msmdsrv.exe as the executable for debug session and “-c” (console mode) for the Program arguments. Of course you will have to stop your MS OLAP service before you do this, but this should be the easiest way to debug your DLL.

Also, note that DMSProv.dll (your sample provider) will not be loaded until its data mining functionality is being requested unless you ask the debugger to load its symbols initially.

How is AttributeSet generated in the sample provider?

Attributes are slightly different than mining columns. For instance, a flat mining model,

CREATE MINING MODEL foo

(

CustID   LONG     KEY,

Age        LONG     CONTINUOUS,

Gender   TEXT      DISCRETE

) ...

would give you attributes “Age” and “Gender.”

However, if you have a nested table mining model,

CREATE MINING MODEL foo

(

CustID   LONG     KEY,

Age        LONG     CONTINUOUS,

Gender   TEXT      DISCRETE

Products    TABLE

(

ProductName TEXT KEY

)

) ...

you have the attributes Age, Gender, PLUS an entire tree of attribute rooted at “Products” for every ProductName that is inserted into the model, e.g. “Products.Coke,” “Products.Pepsi,” etc. Therefore, if you have only 2 product names in the products table, this would give you 5 attributes

0 – Age

1 – Gender

2 – Products (root of hierarchy)

3 – Products(Coke)

4 – Products(Pepsi)

FIsCase() for Products would return false, as that attribute # could never be found in a training or prediction case. For all other attributes it would return true.

To populate the attribute hierarchy you need to train it correctly. Let’s assume you have 2 tables

Transactions – TransId

Purchases – TransId, ProductName

You would have to insert the data as such

INSERT INTO MyAssociationModel(Transaction_id, [Product purchases](SKIP, [Product Name))

SHAPE { OPENROWSET(“…”, “SELECT TransId FROM Transactions ORDER BY TransId” ) }

APPEND { OPENROWSET (“…”, “SELECT TransId, ProductName FROM Purchases ORDER BY TransId”) }

RELATE TransId to TransId

Note the “SKIP” in the insertion column list skips the TransId in the nested table. The shape implementation in the sample provider requires that the keys be ordered the same in the case and nested tables. If you use the Shape Provider, you do not need the ORDER BY’s, but you are constrained to the amount of data that will fit into memory.

If you don’t have a separate table that lists the TransId’s you can SELECT DISTINCT from your “Purchases” table, e.g. “SELECT DISTINCT TransId FROM Purchases ORDER BY TransId”

Note that only distinct value of the KEY column in the nested table is considered to generate attribute set. Also, adding additional column other than KEY to the nested table doesn’t necessarily introduce new attribute. Suppose you have 100 distinct ProductID for the customer (case).

Purchases TABLE (ProductID TEXT KEY) will give you 100 attribute each of which tells “the product exists”.

Purchases TABLE (ProductID TEXT KEY, OnSale TEXT DISCRETE) will give you again the same 100 attribute each of which tells OnSale of the product as well as it exists.

Purchases TABLE (ProductID TEXT KEY, OnSale TEXT DISCRETE, Coupon TEXT DISCRETE) will give you 200 attribute, 100 of which would the same as previous, the other 100 of which tells Coupon of each product.

Why isn’t error codes from my sample provider propagated?

If the provider raises an appropriate error (HRESULT) and posts an ErrorInfo record per OLE DB spec, the error should get propagated up to Analysis Manager.

Note, however, that there is a known issue as follows:

> I am using Windows 2000 Adv. Svr SP2. Exist three variables in Windows Registry used to store locale that are connected with our problem:

> VAR1. HKEY_USERS\.DEFAULT\Control Panel\International\Locale initialized when Windows is installed and changed manually by editing the Registry;

> VAR2. HKEY_CURRENT_USER\Control Panel\International\Locale changed by Settings for the current user in Regional Options of Control Panel;

> VAR3. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\Language\Default changed by Language Settings for the system/Set default in Regional Options of Control Panel;

>

> DMErrorLookup::GetErrorDescription function of DMProv/DMErrorLookup.cpp:

>

>   // Check the Locale

>   if ( in_lcid != GetUserDefaultLCID() )

>    return DB_E_NOLOCALE;

>

> When DMPostErrorResID calls this function it uses GetUserDefaultLCID() as in_lcid parameter. So this function will never return DB_E_NOLOCALE error. When RowsetViewer calls this function, it use GetSystemDefaultLCID() (VAR3) as in_lcid parameter. So only when VAR2 = VAR3 correct error messages are displayed. VAR1 does not matter.

Note that the server side uses GetSystemDefaultLCID(). You may have to change your provider code appropriately if this is an issue for you.

How is my aggregated provider registered?

When you build your provider, DMSProv.dll from the Visual Studio, the VS should auto-register the DLL as part of the build. But if you move it to a new machine, you’ll need to register it manually using regsvr32.exe. Aggregated data mining providers are registered like regular OLE DB providers. In addition to the standard OLE DB provider registry keys, the following are also required for aggregatable DM providers:

‘OLE DB DM Provider’ = s ”
‘Microsoft DM Aggregator Participation’ = s ‘SERVER’

Please see DMProv\DMSProv.rgs in the SampleProvider source tree for complete details of all the keys and where they go in the registry.

Also, note that newly registered DM providers are only picked up from aggregation at Analysis Server startup so you’ll need to restart the server after you register the new provider for it to show up as available in Analysis Services.

Books and Useful Links

Claude Seidman, Data Mining with Microsoft SQL Server 2000 Technical Reference, MS Press, 2001.
Barry de Ville, Microsoft Data Mining: Integrated Business Intelligence for e-Commerce and Knowledge Management, Butterworth-Heinemann, 2001.
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.
Microsoft-offered Training: Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000 (this contains a module for DM introduction).

Some useful links for the Microsoft Analysis Services data mining component or data mining in general include:

Data Mining MSN Community: data mining community web site
Microsoft TechNet for Data Mining: an excellent introduction to Microsoft Analysis Service data mining feature.
MSDN resources for SQL Server 2000 data mining: excellent resources for DM from MSDN.
Third-Party Data Mining Provider Aggregation: a white paper on how a third-party OLE DB for Data Mining provider can be integrated into SQL Server 2000 Analysis Services.
Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis Services: demonstrates how to apply data mining to a real-world situation using Microsoft SQL Server 2000 components.

OLE DB for Data Mining Sample Provider: sample provider source code
Data mining visualization controls from Angoss.

OLE DB for Data Mining Specification: defines DMX (data mining extension for SQL) language.
XML for Analysis: an XML-based standard interface for OLAP and data mining.
Performance Study of Microsoft Data Mining Algorithms.
KDNuggets: a comprehensive resources for data mining, web mining & knowledge discovery
DMG (Data Mining Group): an independent, vendor led group which develops data mining standards (e.g., PMML).
CRISP-DM: an industry- and tool-neutral data mining process model.
ACM-SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data and Data Mining.
ACM-SIGMOD: ACM Special Interest Group on Management of Data.

Copyrights and Disclaimer

Most of the information on this FAQ is collected (and edited if necessary) from the newsgroup postings made by many contributors. Although I compiled this FAQ in good faith to be as accurate as possible, there may be errors. As such, all the information in this FAQ is provided “AS IS” with no warranties, and confers no rights. Microsoft Corporation does not officially sponsor, approve, or endorse this FAQ or its content.

General topics

What is the microsoft.public.sqlserver.datamining group?

How to access microsoft.public.sqlserver.datamining group?

What is data mining?

Is there a data mining community web site?

What is Microsoft Analysis Service data mining functionality?

What are the case studies of Microsoft Analysis Services data mining available?

What are new algorithms available in the next release?

Do we have a data mining query tool, similar to the SQL Server query analyzer?

Can I run a data mining query from SQL server (e.g., query analyzer)?

I have models trained. How do I perform predictions?

How do I create a nested table column using Analysis Manager?

How do I use a nested table input in prediction?

Give me an example of collaborative filtering using Microsoft decision trees.

Give me an example of customer segmentation using Microsoft clustering.

Why is the support of a cluster in the content is different from Cluster()?

Where do I get the details of the two algorithms?

What kind of pruning method is used in Microsoft decision trees?

How is the dependency network generated?

What is $AdjustedProbability or PredictAdjustedProbability()?

Why is the probability of a zero-support state greater than 0 in prediction?

Can I apply Microsoft Analysis Services to text mining?

Can I do a Time-series analysis with Microsoft Analysis DM?

Can I perform association rules discovery using Microsoft Analysis DM?

Can I use calculated members to create a mining model?

Can I create a dimension for the clusters of the cases?

How do I make

from an OLAP mining model?

Why does processing my DM dimension or virtual cube fail?

How can I use my cube as prediction input to my relational mining model?

Can I create my own (third-party) provider and plug-in to the Analysis Services?

What are the third-party providers and tools available?

Where can I find a web-based thin-client browsers?

Can my (third-party) provider create OLAP mining models?

Can I export my trained model into PMML format or vice versa?

Why do the models created in DMSample or through ADO disappear when I restart the application?

Can I redistribute DM browsers (e.g., decision tree browser) with my application?

The mining model wizard shows no algorithms saying “Unable to get list of data mining algorithms”. What’s wrong?

Why am I getting “The cube ‘my model’ does not exist” error for a data mining query?

What is the sample provider and where to get?

Can I compile the sample provider with trial version of Visual Parse++?

How do I debug my sample provider?

How is AttributeSet generated in the sample provider?

Why isn’t error codes from my sample provider propagated?

How is my aggregated provider registered?

Related posts

Mining the Flow

Plugging Into SQL Server Data Mining

Microsoft Denmark’s “Momentum” Magazine Article on SQL Server 2005 Data Mining