Suppose you are building a decision tree to detect fraudulent
transactions. You would have a predictable column, say,
[IsFraudulent], and many other input columns. This is a typical
classification task: classifying each new user’s behavior into one of two groups,
Fraudulent or Not-Fraudulent. However, it may be difficult for a decision tree to
learn a pattern for fraudulent transactions if there is not enough support,
for example when only one out of a million transactions is fraudulent. To make matters worse,
most of the patterns that appear in fraudulent transactions can also be
found in non-fraudulent transactions.
Here are a couple of techniques that can help you solve such
problems using the Microsoft_Decision_Trees algorithm:
1. Decreasing MINIMUM_LEAF_CASES
The default value for the MINIMUM_LEAF_CASES parameter is 10. This keeps a split
from creating a decision tree node with support less than 10, which means that
the model will never discover a distinctive pattern that appears in fewer than
10 cases. You can reduce this parameter if you want to learn such patterns.
However, note that reducing this number may result in a model with too many
patterns, many of which might not be interesting. In particular, if you have
more than two values in the classifier, i.e., the predictable column
(e.g., CreditCardType = {Copper, Silver, Gold, Platinum}), and
you’re interested in small-support patterns for only one state (e.g.,
Platinum), you might want to prepare the training data with only two values
(e.g., Platinum, Non-Platinum). Otherwise, the model will also learn all the
small-support patterns for the other values.
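
The two-value preparation described above can be sketched outside the mining model. The Python snippet below is only an illustration of the data preparation step: the row layout, the helper name collapse_to_binary, and the sample rows are assumptions made for this example and are not part of the Microsoft_Decision_Trees algorithm or its tooling.

```python
def collapse_to_binary(rows, column, keep_value):
    """Return copies of rows where `column` is keep_value or 'Non-<keep_value>'."""
    other = "Non-" + keep_value
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the source rows are left untouched
        if new_row[column] != keep_value:
            new_row[column] = other
        result.append(new_row)
    return result


# Hypothetical training rows with a four-valued predictable column.
training_rows = [
    {"Income": 40000, "CreditCardType": "Copper"},
    {"Income": 65000, "CreditCardType": "Silver"},
    {"Income": 90000, "CreditCardType": "Gold"},
    {"Income": 250000, "CreditCardType": "Platinum"},
]

binary_rows = collapse_to_binary(training_rows, "CreditCardType", "Platinum")
```

After this step the predictable column has only the states Platinum and Non-Platinum, so a lower leaf-size threshold would surface small-support patterns for Platinum without also splitting on rare Copper, Silver, or Gold patterns.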
2. Over-sampling
Another way to make the model learn patterns with very small support is to
over-sample the patterns that you’re interested in. For instance, you could
simply replicate the fraudulent transactions so that they have enough support.
Unlike reducing MINIMUM_LEAF_CASES, this approach won’t introduce any problem
when you have more than two values in the classifier. However, it involves an
additional data preparation step for over-sampling, and the prediction
probability that comes out of the model should be adjusted accordingly (i.e.,
divided by the multiple, which is a reasonable approximation when the predicted
probability is small).
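
As a rough illustration of replication and of the probability adjustment, here is a minimal Python sketch. The function names oversample and adjust_probability are hypothetical, and the adjustment shown is the general odds-based correction rather than anything specific to Microsoft_Decision_Trees. It assumes the model's probabilities track class frequencies in the training data, which may hold only approximately for a real model.

```python
def oversample(rows, is_target, multiple):
    """Repeat every row for which is_target(row) is true `multiple` times."""
    result = []
    for row in rows:
        copies = multiple if is_target(row) else 1
        result.extend([row] * copies)
    return result


def adjust_probability(p_model, multiple):
    """Map a probability learned on over-sampled data back to the original data.

    Replicating the target class `multiple` times multiplies its odds by
    `multiple`, so the correction divides the odds, not the probability.
    """
    return p_model / (p_model + multiple * (1.0 - p_model))


# Hypothetical data: one fraudulent row among three.
rows = [
    {"Id": 1, "IsFraudulent": True},
    {"Id": 2, "IsFraudulent": False},
    {"Id": 3, "IsFraudulent": False},
]
training = oversample(rows, lambda r: r["IsFraudulent"], multiple=3)

small = adjust_probability(0.01, 100)  # very close to 0.01 / 100
large = adjust_probability(0.5, 100)   # noticeably different from 0.5 / 100
```

For small predicted probabilities the result is nearly the same as dividing by the multiple; the two diverge as the predicted probability grows.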
One way to accomplish over-sampling is to use SQL Server 2005 Integration
Services. Say you had a column in the source data named [IsFraudulent]. You would:
- Use a conditional split on the [IsFraudulent] column to separate your cases.
- Then use a sampling transform to randomly reduce the number of non-fraudulent cases.
- Finally, use a merge transform to bring the data back into a single dataset.
Note that this over-samples the fraudulent cases only in relative terms: their
count is unchanged, but their share of the training data grows.
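
The same three steps can be sketched in Python to show the logic. This is not Integration Services code: the function name rebalance, the keep_count parameter, and the generated rows are assumptions made purely for illustration.

```python
import random


def rebalance(rows, keep_count, seed=0):
    """Split on IsFraudulent, sample the non-fraudulent rows, and recombine."""
    # Step 1: conditional split on the [IsFraudulent] column.
    fraudulent = [r for r in rows if r["IsFraudulent"]]
    non_fraudulent = [r for r in rows if not r["IsFraudulent"]]

    # Step 2: randomly keep at most keep_count non-fraudulent cases.
    rng = random.Random(seed)
    sampled = rng.sample(non_fraudulent, min(keep_count, len(non_fraudulent)))

    # Step 3: bring both streams back into a single dataset.
    combined = fraudulent + sampled
    rng.shuffle(combined)
    return combined


# Hypothetical source data: 10 fraudulent rows out of 1000.
rows = [{"Id": i, "IsFraudulent": i % 100 == 0} for i in range(1000)]
balanced = rebalance(rows, keep_count=90)
```

In this made-up data set, fraudulent rows rise from 10 in 1000 to 10 in 100, while every fraudulent case is preserved.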