Sometimes it’s tricky to set the right parameters for the Association Rules mining algorithm. Depending on your data, you may find nothing very quickly, or more than you can handle after quite some time. The nature of the Association Rules algorithm is such that the size and quality of the resultant model is highly sensitive to the modeling parameters. The algorithm counts combinations of items that occur together, called frequent itemsets. The potential number of itemsets equals the number of possible combinations of attribute/value pairs in your data – a huge number. Therefore, the algorithm uses its parameters to limit the number of combinations it will consider. Once the itemsets are determined, the algorithm uses them to derive rules, and additional parameters determine what conditions have to be met for a rule to be valid.
For example, let’s say we have 1,000 cases in which the items Apples, Pears, and Bananas occur with the following counts, along with the correspondingly sized itemsets:
Apples – 500
Pears – 200
Bananas – 100
Apples & Pears – 150
Apples & Bananas – 50
Pears & Bananas – 100
Apples & Bananas & Pears – 50
From these itemsets we can derive the following rules, each of which can be read “A implies B with probability x”:
Apples -> Pears 30%
Apples -> Bananas 10%
Apples, Pears -> Bananas 33%
Apples, Bananas -> Pears 100%
Pears, Bananas -> Apples 50%
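These probabilities are just ratios of the itemset counts above: the probability of A -> B is count(A and B) divided by count(A). A quick sketch in Python (the `confidence` helper and data layout are illustrative, not part of any Microsoft API):

```python
# Itemset counts from the example (1,000 cases total).
counts = {
    frozenset(["Apples"]): 500,
    frozenset(["Pears"]): 200,
    frozenset(["Bananas"]): 100,
    frozenset(["Apples", "Pears"]): 150,
    frozenset(["Apples", "Bananas"]): 50,
    frozenset(["Pears", "Bananas"]): 100,
    frozenset(["Apples", "Pears", "Bananas"]): 50,
}

def confidence(antecedent, consequent):
    """Probability of the rule antecedent -> consequent:
    count(antecedent and consequent) / count(antecedent)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return counts[both] / counts[a]

print(confidence(["Apples"], ["Pears"]))             # 150/500 = 0.3
print(confidence(["Pears", "Bananas"], ["Apples"]))  # 50/100 = 0.5
```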
Using these parameters, you can control how the itemsets and rules are created. All parameters are set by right-clicking the model in the Mining Models pane of the Data Mining Designer and selecting “Set Algorithm Parameters…”
MINIMUM_SUPPORT is the key parameter for reducing the problem set size; it changes the number of itemsets the algorithm will generate. If you process your model and get too many itemsets, increase this number; if you get none or too few, decrease it. For instance, setting it to 10% (0.1) in the example above would eliminate the Apples & Bananas and Apples & Bananas & Pears itemsets. A value between 0 and 1 is interpreted as a percentage – e.g. 0.03 means that only itemsets occurring in at least 3 percent of the cases will be counted. A value of 1 or higher is an absolute count – e.g. 3 means that only itemsets occurring in at least 3 cases will be counted.
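The dual percentage/count semantics can be sketched like this (the `meets_minimum_support` helper is illustrative, not part of any Microsoft API):

```python
TOTAL_CASES = 1000

# Itemset counts from the running example.
itemsets = {
    ("Apples",): 500,
    ("Pears",): 200,
    ("Bananas",): 100,
    ("Apples", "Pears"): 150,
    ("Apples", "Bananas"): 50,
    ("Pears", "Bananas"): 100,
    ("Apples", "Bananas", "Pears"): 50,
}

def meets_minimum_support(count, minimum_support, total_cases=TOTAL_CASES):
    """Values below 1 are fractions of all cases; 1 or above is an absolute count."""
    if minimum_support < 1:
        return count >= minimum_support * total_cases
    return count >= minimum_support

# MINIMUM_SUPPORT = 0.1 keeps only itemsets occurring in at least 10% of cases.
kept = {s for s, c in itemsets.items() if meets_minimum_support(c, 0.1)}
```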
MAXIMUM_SUPPORT controls the maximum number of times an itemset may occur before it is ignored. It defaults to 100%, but can be used to eliminate items that occur in every, or almost every, case. Setting this parameter may cause non-obvious results. For example, if I set MAXIMUM_SUPPORT to 40%, the Apples itemset will not be in the model. This means the Apples & Pears and Apples & Bananas itemsets, and therefore the associated rules, also will not be in the model, since they depend on the presence of Apples, which you have already eliminated. This parameter uses the same percentage/count semantics as MINIMUM_SUPPORT.
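A sketch of this cascading effect, assuming (as the example above implies) that an itemset is dropped whenever it contains an item whose own support exceeds the cap; the helper name is illustrative:

```python
TOTAL_CASES = 1000
item_counts = {"Apples": 500, "Pears": 200, "Bananas": 100}
itemsets = [
    ("Apples",), ("Pears",), ("Bananas",),
    ("Apples", "Pears"), ("Apples", "Bananas"),
    ("Pears", "Bananas"), ("Apples", "Bananas", "Pears"),
]

def survives_maximum_support(itemset, maximum_support):
    """Drop any itemset containing an item whose own support exceeds the cap.
    As with MINIMUM_SUPPORT, values below 1 are fractions, 1+ absolute counts."""
    cap = maximum_support * TOTAL_CASES if maximum_support < 1 else maximum_support
    return all(item_counts[item] <= cap for item in itemset)

# With a 40% cap, Apples (50% of cases) is over the limit, so every
# itemset containing Apples disappears along with it.
kept = [s for s in itemsets if survives_maximum_support(s, 0.4)]
```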
MAXIMUM_ITEMSET_SIZE allows you to control the size of the resultant model by enforcing a limit on how long itemsets can be. Potentially you can have itemsets of length equal to the number of attributes in your problem, but typically you will run out of memory long before you get that far. Luckily this parameter defaults to a conservative 3. Microsoft Association Rules will actually reduce this number automatically when it detects memory pressure; looking at the mining model’s schema rowset after training will show you what value was actually used in case this occurs.
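To see why a cap on itemset length matters, consider how fast the candidate space grows. A quick back-of-the-envelope count in Python (pure combinatorics, not the algorithm’s actual candidate generation):

```python
from math import comb

def itemsets_up_to(n_items, max_size):
    """Number of distinct itemsets of size 1..max_size drawn from n_items items."""
    return sum(comb(n_items, k) for k in range(1, max_size + 1))

# With just 100 distinct items, raising the size cap explodes the search space:
for k in (2, 3, 5, 10):
    print(k, itemsets_up_to(100, k))
```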
MINIMUM_ITEMSET_SIZE reduces the number of itemsets and rules by filtering out smaller itemsets. Note that this does not have the same impact as MAXIMUM_SUPPORT – the smaller itemsets are still generated in order to generate the larger ones, but they are never displayed in the content or used in rule generation. NOTE: The dependency network only displays pairwise rules. Setting MINIMUM_ITEMSET_SIZE to a number larger than 2 means that you won’t see any rules in the dependency network view.
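A sketch of that display-time filtering, using the example itemsets (the names are illustrative; the real filtering happens inside the algorithm):

```python
# All itemsets are still generated, because the larger ones are built from
# the smaller ones and need their counts.
all_itemsets = {
    ("Apples",): 500, ("Pears",): 200, ("Bananas",): 100,
    ("Apples", "Pears"): 150, ("Apples", "Bananas"): 50,
    ("Pears", "Bananas"): 100, ("Apples", "Bananas", "Pears"): 50,
}

MINIMUM_ITEMSET_SIZE = 2
# Only itemsets at or above the minimum size appear in the model content
# or take part in rule generation.
displayed = {s: c for s, c in all_itemsets.items() if len(s) >= MINIMUM_ITEMSET_SIZE}
```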
MINIMUM_PROBABILITY changes the number of rules and is expressed as a percentage. For example, setting this parameter to 40% in the above example would eliminate any rule whose probability falls below 40%, such as Apples -> Pears (30%) and Apples -> Bananas (10%).
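Filtering rules by probability is a straightforward threshold; a sketch using some of the example rules (the rule strings and `kept` name are just for illustration):

```python
# A few rule probabilities from the example above.
rules = {
    "Apples -> Pears": 0.30,
    "Apples -> Bananas": 0.10,
    "Apples, Bananas -> Pears": 1.00,
    "Pears, Bananas -> Apples": 0.50,
}

MINIMUM_PROBABILITY = 0.4
# Only rules at or above the threshold survive.
kept = {r: p for r, p in rules.items() if p >= MINIMUM_PROBABILITY}
```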
MINIMUM_IMPORTANCE controls the number of rules generated and is a number. The default is set such that all rules are allowed. The “importance” of a rule is based on the lift provided by applying that rule. For example, say we add Oranges to the above dataset, and Oranges occur in all 1,000 cases. We would end up with a rule Oranges -> Apples with a 50% probability. This really doesn’t provide us any information, since we get Apples with a 50% probability anyway; Apples -> Pears, on the other hand, gives us additional information and will have a higher importance.
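The intuition can be sketched as a simple lift calculation: the rule’s probability divided by the consequent’s baseline frequency. (The exact importance formula SQL Server uses is more involved, but the idea is the same; the names below are illustrative.)

```python
TOTAL_CASES = 1000
item_counts = {"Apples": 500, "Pears": 200, "Oranges": 1000}
# Co-occurrence counts: Oranges appear in every case, so they co-occur
# with Apples exactly as often as Apples appear at all.
pair_counts = {("Oranges", "Apples"): 500, ("Apples", "Pears"): 150}

def lift(antecedent, consequent):
    """Rule probability divided by the consequent's baseline frequency.
    A lift of 1 means the rule adds no information."""
    conf = pair_counts[(antecedent, consequent)] / item_counts[antecedent]
    baseline = item_counts[consequent] / TOTAL_CASES
    return conf / baseline

print(lift("Oranges", "Apples"))  # 0.5 / 0.5: uninformative, low importance
print(lift("Apples", "Pears"))    # 0.3 / 0.2: Apples raise the odds of Pears
```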