Scenario
Imagine you built a model with multiple nested tables and multiple predictable columns inside each nested table. In your application you want to run a prediction query against the model and gather prediction statistics for each predictable attribute. Such queries are usually long and tedious to write, plus they are hard to debug if you made a minor slip. This tip shows you how to programmatically generate such a query using C# and ADOMD.NET.
Let us consider a database schema with the following tables:
· Products (ProductID, ProductColor, ProductCategory, ProductPrice, ProductWeight)
· Stores ( StoreId, State, Size, Type)
· Manufacturers (ManufacturerId, Country, Price, ProductId)
· ProductToStores (StoreId, ProductId)
We construct the following model from the above schema:
CREATE MINING MODEL [Products] (
[Product ID] LONG KEY,
[Product Color New] TEXT DISCRETE,
[Product Category] TEXT DISCRETE PREDICT,
[Product Price] DOUBLE DISCRETIZED PREDICT,
[Product Weight] DOUBLE CONTINUOUS PREDICT,
[Stores With Products] TABLE
(
[Store ID] LONG KEY,
[State] TEXT DISCRETE,
[Size] LONG CONTINUOUS PREDICT,
[Type] LONG DISCRETE PREDICT
),
[Manufacturers] TABLE
(
[Manufacturer ID] LONG KEY,
[Country] TEXT DISCRETE,
[Price] DOUBLE CONTINUOUS PREDICT
)
)
USING Microsoft_Decision_Trees
Notice that the column names in the mining model have spaces (typical if you created the model using BI Development Studio). The column [Product Color New] has been renamed from the original column name. The nested table [Stores With Products] is created from a join of the two source tables. Our goal is to generate a PREDICTION JOIN statement which in the most generic form looks as follows:
SELECT [FLATTENED] [TOP n] select_expression_list
FROM model | sub_select [NATURAL] PREDICTION JOIN
source_data_query [ON join_mapping_list]
[WHERE condition_expression]
[ORDER BY expression [DESC|ASC]]
Building the Prediction Query
First let us consider the list of columns we want to generate in our select_expression_list for the Prediction Join query:
1. For each predictable column in the case table, we want to include the equivalent column from the data source if it exists. This enables us to compare the actual value to the predictable value in our application.
2. For each predictable column in the case table, we want to make available the following statistics:
a. Discrete Column: The histogram of state with probability for each state.
b. Continuous Column: The mean, STDEV and the probability
c. Discretized Column: The Min, Mid, Max and the probability.
From our example above, the query fragment for #1 and #2 would look as follows:
/*Discrete*/
T.[ProductCategory],
(SELECT Predict([Product Category]) as [Product Category],
PredictProbability([Product Category]) as [Prob]
FROM PredictHistogram([Product Category])) AS [Product Category Predicted]
/*Discretized*/
,T.[ProductPrice]
,(SELECT RangeMin([Product Price]) as [Min],
RangeMax([Product Price]) as [Max],
RangeMid([Product Price]) as [Mid],
PredictProbability([Product Price]) as [Prob]
FROM PredictHistogram([Product Price])) AS [Product Price Predicted]
/*Continuous*/
,T.[ProductWeight]
,(SELECT Predict([Product Weight]) as [Product Weight]
,PredictSTDEV([Product Weight]) as [STDEV]
,PredictProbability([Product Weight]) as [Prob]
FROM PredictHistogram([Product Weight])) AS [Product Weight Predicted]
3. For each nested table and for each predictable, we want to include the equivalent column from the data source if it exists. This enables us to compare the actual value to the predictable value in our application.
4. For each predictable column in the nested table and for each unique nested table key, we want to output the same data as #2.
From our example above, the query fragment for #3 and #4 would look as follows:
,(SELECT [ManufacturerId]
,[Price]
FROM T.[Manufacturers]) AS [Manufacturers]
,(SELECT [Manufacturer Id]
,(SELECT Predict([Price]) as [Price],
PredictSTDEV([Price]) as [STDEV],
PredictProbability([Price]) as [Prob]
FROM PredictHistogram([Price])) AS [Price Predicted]
,$PROBABILITY FROM Predict([Manufacturers], INCLUDE_STATISTICS, INCLUSIVE)) AS [Manufacturers_Predicted]
,(SELECT [StoreId]
,[Size]
,[Type]
FROM T.[Stores With Products]) AS [Stores With Products]
,(SELECT [Store Id]
,(SELECT Predict([Size]) as [Size],
PredictSTDEV([Size]) as [STDEV],
PredictProbability([Size]) as [Prob]
FROM PredictHistogram([Size])) AS [Size Predicted]
,(SELECT Predict([Type]) as [Type],
PredictProbability([Type]) as [Prob]
FROM PredictHistogram([Type])) AS [Type Predicted]
,$PROBABILITY
FROM Predict([Stores With Products], INCLUDE_STATISTICS, INCLUSIVE)) AS [Stores With Products_Predicted]
The last piece is the part source_data_query [ON join_mapping_list] . For our example, this looks as follows:
PREDICTION JOIN
SHAPE {
OPENQUERY([Test],'SELECT
[ProductCategory]
,[ProductColor]
,[ProductID]
,[ProductPrice]
,[ProductWeight]
FROM [Products] ')
}
APPEND
({
OPENQUERY([Test],'SELECT [ProductID]
,[Country]
,[ManufacturerId]
,[Price]
FROM [Manufacturers]
ORDER BY [ProductID]')
}
RELATE [ProductID] TO [ProductID]) AS [Manufacturers]
,({
OPENQUERY([Test],'SELECT [ProductID]
,[Size]
,[State]
,[StoreId]
,[Type]
FROM (SELECT dbo.Stores.StoreId, dbo.Stores.State, dbo.Stores.[Size], dbo.Stores.Type, dbo.ProductToStores.ProductId
FROM dbo.Stores INNER JOIN dbo.ProductToStores
ON dbo.Stores.StoreId = dbo.ProductToStores.StoreId)
AS [Stores With Products]
ORDER BY [ProductID]')
}
RELATE [ProductID] TO [ProductID]) AS [Stores With Products]
AS T
ON
T.[ProductCategory] = [Products].[Product Category]
AND T.[ProductColor] = [Products].[Product Color New]
AND T.[ProductID] = [Products].[Product ID]
AND T.[ProductPrice] = [Products].[Product Price]
AND T.[ProductWeight] = [Products].[Product Weight]
AND T.[Manufacturers].[Country]=[Products].[Manufacturers].[Country]
AND T.[Manufacturers].[ManufacturerId]=[Products].[Manufacturers].[Manufacturer Id]
AND T.[Manufacturers].[Price]=[Products].[Manufacturers].[Price]
AND T.[Stores With Products].[Size]=[Products].[Stores With Products].[Size]
AND T.[Stores With Products].[State]=[Products].[Stores With Products].[State]
AND T.[Stores With Products].[StoreId]=[Products].[Stores With Products].[Store Id]
AND T.[Stores With Products].[Type]=[Products].[Stores With Products].[Type]
Helper Class Outline
The helper class included with this tip generates all the above queries by reading the model metadata and sending PREPARE queries to the source to read the relational schema.
The usage pattern of the class is shown below:
string connString = "Data Source=localhost;Initial Catalog=Test";
PredictionQueryGenerator queryGen = new PredictionQueryGenerator(connString, "Test", "Products");
//Maps the model column [Product Color New] to DB Column [ProductColor]
queryGen.AddCaseTableColumnMapping("Product Color New", "ProductColor");
//Maps the nested table [Stores With Products] to a Query
queryGen.AddTableMapping("Stores With Products",
@"(SELECT dbo.Stores.StoreId, dbo.Stores.State, dbo.Stores.[Size], dbo.Stores.Type, dbo.ProductToStores.ProductId
FROM dbo.Stores INNER JOIN dbo.ProductToStores
ON dbo.Stores.StoreId = dbo.ProductToStores.StoreId) as [Stores With Products]");
//Initialized auto mapping
queryGen.InitAutoMapping();
//Generates the prints the query
Console.WriteLine(queryGen.DMXQuery());
Show Me the Code
Download the full source code for the helper class and play with it. The download also includes a SQL Server 2008 backup of the source database.