SQL Server Data Mining contains an Association Rules
browser, which displays the most important rules found inside a mining model.
However, for certain pattern exploration tasks, it may be useful to visualize
only those rules that point to a certain item.
An example: for a model* that predicts movie associations, extract
those rules that point to ‘Star Wars’. Even more, extract only rules with a probability
larger than .25 and containing ‘Indiana Jones’ or ‘Blade Runner’ on the left
hand side.
All the rules (and item-sets) are exposed in the model
content, the rowset containing all the patterns detected by the model.
Content nodes that represent rules have a NODE_TYPE value of
8 (Item sets have NODE_TYPE = 7). Inside a rule node, the NODE_DISTRIBUTION
field contains the rule’s right hand side item (with a value type of 2, “Existing”)
as well as the identifier of the node representing the left hand side itemset
(value type of 10, NODE_UNIQUE_NAME).
Therefore, to get the results requested in the beginning, we
need a query that:
-
Selects the rule nodes (NODE_TYPE=8)
-
Filters only those that contain ‘Star Wars’ in NODE_DISTRIBUTION
-
Extracts the support, probability and other information about the rule
-
Applies a probability threshold ( >.25)
-
Filters only those that contain ‘Indiana Jones’ or ‘Blade Runner’
-
Sorts the results in descending order, by probability
And here is the query:
SELECT
top 100 FROM
(
select FLATTENED
NODE_CAPTION AS [Rule],
NODE_PROBABILITY
AS [Rule_Probability],
(SELECT ATTRIBUTE_NAME
FROM NODE_DISTRIBUTION
WHERE
VALUETYPE=2 AND
ATTRIBUTE_NAME='Movies(Star Wars)'
)
AS D
FROM
[MovieClick].CONTENT
WHERE
NODE_TYPE=8
)
AS A
WHERE
[D.ATTRIBUTE_NAME]
<> null AND
Rule_Probability
> 0.25 AND
(
VBA!InStr(Rule, 'Blade Runner') > 0 OR
VBA!InStr(Rule, 'Indiana Jones') > 0
)
ORDER BY
[Rule_Probability] DESC
The result contains all the requested rules, with their
probability.
* The model created for this example
used the MovieClick data available on this site. The algorithm was
Association Rules with the parameters MINIMUM_PROBABILITY set to 0.1 and
MINIMUM_SUPPORT set to 3.