[Note: This article is the first in a new "Under the Hood" series that explores the inner workings of the SQL Server Data Mining algorithms and infrastructure. Enjoy the deep dive!]
Minimum Dependency Probability?
When you create a Microsoft_Naive_Bayes model in SQL Server 2005 Data Mining, you can set an algorithm parameter MINIMUM_DEPENDENCY_PROBABILITY that specifies the minimum dependency probability between the input and output attributes. The algorithm then filters out input attributes that have a dependency probability lower than the set threshold. In this article, we show how exactly to compute the dependency probability between an input and an output attribute. The main idea behind this comes from the research paper titled “A Bayesian Approach to Learning Bayesian Networks with Local Structure”.
A Naïve Bayes model can be treated as a special case of a Bayesian network which has conditional probabilities at every node. For a Naïve Bayes model, the dependency is just between input attributes and the output attribute only.
Computation Walkthrough
Let us consider a simple dataset with one input attribute “Number of Children” and one output attribute “Bike Buyer”.
| Bike Buyer | Number of Children | Count |
| No | 0 | 1 |
| No | 1 | 9 |
| No | 2 | 35 |
| Yes | 0 | 40 |
| Yes | 1 | 10 |
| Yes | 2 | 5 |
Let
X = Score of the Bayesian Network without the split on “Number of Children”
Y = Score of the Bayesian Network with split on “Number of Children”
P = dependency probability of the node
By definition the dependency probability is proportional to the score which gives us:
P/ (1-P) = Y / X
Taking natural LOG on both sides:
LOG (P / (1 – P)) = LOG Y – LOG X
For computing the score for a split, we use a function LG which is the Log Gamma function. See the end of this article for an explanation on how to compute the log gamma function.
We first evaluate X using the following table:
Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’)
Number of cases = 100
| Bike Buyer = missing | LG(1 + 0) – LG(1) = 0 |
| Bike Buyer = 0 | LG(1 + 45) – LG(1) = 129.124 |
| Bike Buyer = 1 | LG(1 + 55) – LG(1) = 168.327 |
| Bike Buyer | LG(3) – LG(3 + 100) = -375.82 |
LOG X = -74.835
We then evaluate the value of Y by splitting the data set using the three different values of Number of Children and computing the partition score in all the three cases.
Number of Children = 0
Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’)
Number of cases = 41
| Bike Buyer = missing | LG(1 + 0) – LG(1) = 0 |
| Bike Buyer = 0 | LG(1 + 1) – LG(1) = -4.4e-11 |
| Bike Buyer = 1 | LG(1 + 40) – LG(1) = 110.32 |
| Bike Buyer | LG(3) – LG(3 + 41) = -123.526 |
Partition Score = -10.519
Number of Children = 1
Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’)
Number of cases = 19
| Bike Buyer = missing | LG(1 + 0) – LG(1) = 0 |
| Bike Buyer = 0 | LG(1 + 9) – LG(1) = 12.801 |
| Bike Buyer = 1 | LG(1 + 10) – LG(1) = 15.104 |
| Bike Buyer | LG(3) – LG(3 + 19) = -46.679 |
Partition Score = -16.781
Number of Children = 2
Number of unique values of “Bike Buyer” = 3 (including the value ‘missing’)
Number of cases = 40
| Bike Buyer = missing | LG(1 + 0) – LG(1) = 0 |
| Bike Buyer = 0 | LG(1 + 35) – LG(1) = 92.136 |
| Bike Buyer = 1 | LG(1 + 5) – LG(1) = 4.787 |
| Bike Buyer | LG(3) – LG(3 + 41) = -119.741 |
Partition Score = 20.155
LOG Y = -47.455
Node Score = LOG y – LOG X = 27.379
Node Probability = 0.99999
The above example shows a very high probability for attribute “Number of Children” predicting the value of “Bike Buyer” which is apparent from the data set. Let us consider another example where “Number of Children” is a very poor predictor for “Bike Buyer”:
| Bike Buyer | Number of Children | Count |
| No | 0 | 10 |
| No | 1 | 15 |
| No | 2 | 25 |
| Yes | 0 | 10 |
| Yes | 1 | 15 |
| Yes | 2 | 25 |
Using the same score computation method, the values obtained are as follows:
Node Score = LOG y – LOG X = -6.965
Node Probability = 0.000943
We see that the attribute “Number of Children” has a very low probability of predicting the Bike Buyer state.
How Do I Verify This?
Simple - use the accompanying download for this tip. It includes an Analysis Services project that uses an Access .mdb file with two data tables, “BikeBuyer1” for the first case and “BikerBuyer2” for the second, and creates two mining structures with a Naïve Bayes model for each case. It sets the algorithm parameter MINIMUM_DEPENDENCY_PROBABILITY to 1e-7 in the second case so that the node “Number of Children” shows up and also sets each attribute type to “Discrete”. Once the models are processed, you can use the Mining Content viewer under the “Model Viewer” tab in Business Intelligence Development Studio to look up the node score. The Mining Content viewer will have a column MSOLAP_NODE_SCORE which shows the value of the node score for Number of Children where the Node Type is 10.
| Sidebar: Computing Log Gamma There are several numerical methods available to compute the gamma function and hence the natural logarithm of the value which we term as the log gamma function. The series is represented as follows: Gamma(z + 1) = (z + d + .5)^(z + .5) e^-(z + d +.5) * sqrt(2*pi) * [c0 + c1/(z+1) + ... + Cn/(z+n) Ln[Gamma(z + 1)] = (z + .5) * ln [z + d + .5] - (z + d + .5) + ln [sqrt(2*pi) * SERIES] |