One of the nice improvements we made to SQL Server 2005 DMX was the ability to predict the probability of a particular state. That is, in SQL Server 2000, you could get the probability of the predicted state using the following query:
SQL Server 2000:
SELECT PredictProbability([Gender]) FROM Customers
PREDICTION JOIN …
This query gives you the probability of whatever state is predicted for Gender for each row of the input. However if you wanted to get the probability that a customer was ‘Female’, for example, you would have to write a query like this:
SQL Server 2000:
SELECT FLATTENED
(SELECT $Probability FROM PredictHistogram([Gender])
WHERE [Gender]='Female')
FROM Customers PREDICTION JOIN …
Which, to put in polite terms, is a tad bit cumbersome. Not only that, but you get a column called “Expression.$PROBABILITY” as your output – not nice.
Luckily in SQL Server 2005 Data Mining, we added new syntax that allowed you to simply indicate the state for which you wanted the probability directly in the function call, like this:
SQL Server 2005:
SELECT PredictProbability([Gender], 'Female') FROM Customers
PREDICTION JOIN …
And life was good. We even made it nicer by allowing the specification of a parameter for the state, so you could write queries like the following:
SQL Server 2005 SP2:
SELECT PredictProbability([Gender], @Gender) FROM Customers
PREDICTION JOIN …
And specify the desired state at run-time. Wonderful! However, there was a little piece missing. It turns out we even allowed arbitrary expressions like ‘Fe’+’male’ there, but we left out the ability to determine what the probability of a known state from the input. That means that if you wanted to determine what the probability was of a state and the state happened to be stored in the database, you would have to first issue a query to get the state and then issue another query to get the probability of that state. Not fun.
In SQL Server 2005 SP2 Data Mining we generalized the language so that essentially, every where you can place a parameter, you can also place a column reference. That means that you can solve the above problem with a query like this:
SELECT PredictProbability([Gender], t.Gender) FROM Customers
NATURAL PREDICTION JOIN
OPENQUERY([Movie Click], 'SELECT * FROM CUSTOMERS') AS t
Now, where is this useful? We’re sure all of the creative DMXers out there will find innumerable uses for this new flexibility, but the most obvious application is for accuracy and lift calculations. Now, in one query, you can generate lift figures for your entire testing set without caching the data, allowing, for example, the generation of accuracy reports in Reporting Services! In the next tip, we will show you an example of this.