This is a question that frequently
comes up: when do you know that feeding more data to the mining
model will stop improving accuracy? We think the accuracy
procedures in
SQL Server 2008 might be able to help you get some
reasonable estimates.
Background
Algorithms learn from data and,
generally, feeding them with more data will improve the
accuracy. This improvement, however, is asymptotic. If the
training data is statistically representative of the problem
space, then a classifier will stop learning after a certain
limit: new data points will support existing patterns and will
help reducing the risk of over-training, but will not add more
information to the model. The solution we suggest for SQL Server
2008 allows you to figure out if your mining model is likely to
gain from adding new data or is likely to maintain the same
accuracy.
The solution is based on the
new cross-validation feature in SQL Server 2008, so we
will start with a small presentation of this mechanism.
Cross-validation takes a data set, splits it into a number of
(roughly equal size) partitions, called folds, then builds a set
of mining models (as many as there are folds). Each mining model
is built on top of the whole data set minus one
of the folds and then the accuracy of the model is measured on
the fold that was left out. Effectively, all data is used in
training and all data is left out (by at least one mining
model). Also, each model is validated against data that was
not seen during training. The accuracy measures
for each model are then evaluated across all the folds. A good
mining solution would provide compact accuracy measurements
(i.e. no big differences between the measurements on different
folds). Spikes in the measurements may indicate problems with
the data. Note that cross validation does not improve the
accuracy of the model, it just provides an “honest” estimation
of that, reducing the risk of over-training.
In SQL Server 2008, cross
validation allows the user to specify how many cases should be
used for the procedure. This is mainly for performance purposes
— it allows a quick comparison between different mining
models/algorithms on the same data set.
Using Cross-Validation To Determine
Data Sufficiency
Now back to the original
problem: would adding new data benefit accuracy?
The cross-validation stored
procedure supports multiple syntax flavors. The one that is
interesting for our purpose is described below:
CALL
SystemGetCrossValidationResults( ,
, , , ‘’)
The first two parameters are
straightforward: the mining structure containing the data and the mining
model to be cross validated. (Note that the stored procedure
does not require either the mining structure or the model to be
processed; if mining structure is not processed, the stored
procedure will process it first, but not its contained models).
As we mentioned, the procedure creates a number of models and
the source model is used as a template.
The Number of Folds
parameter specifies how many folds should be used in the
procedure while the Number Of Cases parameter specifies
how much is the data set to be used by the procedure. The Target
Attribute parameter is used for:
- ensuring stratified sampling
for the folds (the distribution of the Target Attribute’s states
is almost the same in all the folds)
- directing the accuracy
measurements to be performed on the mining models built by the
procedure.
A typical call is would set
NumberOfFolds to 10 and NumberOfCases to, say,
1000. Such an invocation builds a 1000-case sample (using random
selection)with ten 100-case folds over the sample. 10
models are built, each trained with 900 cases and evaluated
against the remaining 100.
To see if new data would
improve the accuracy, we could build models starting with a
small training set, then incrementally increase (for example,
double) the training set size until the accuracy flattens.
The Algorithm
Here are the actual steps of
the algorithm:
Step 1: Start with an
existing mining structure, using a reasonably large data set
We started with a mining
structure, MyStruct, built on top of the [v Target Mail] view of
the
Adventure Works BI sample database.
Step 2: Define a mining
model in the structure.
We created a simple model,
ClassifyBikeBuyer, predicting Bike Buyer and also English
Occupation. We used Microsoft Decision Trees with all the wizard’s
default suggestions, with the exception that we changed Bike Buyer to
"Discrete
Boolean".
Step 3: Start with a
rather small training set size (say, 10 — it is highly unlikely
that 10 is a good training set size, but it makes the charts
look good :-) )
Step 4: Execute the
cross validation procedure using 2 as the number of folds and 2*TrainingSetSize
as the number of cases. This step is the interesting part of the
algorithm: we have to use at least 2 folds, or the procedure
would fail. On the other hand, we want to use
at least 2 folds, so that the accuracy measurements are not
biased.
CALL
SystemGetCrossValidationResults( MyStruct,
ClassifyBikeBuyer, 2, 20, ‘Bike Buyer’)
Step 5: Record the
accuracy results for the current training size. The stored
procedure result contains multiple accuracy measurements for
each model built by the procedure. As the number of folds
argument was 2, there are two sets of results:

Now, there are a few things to
be recorded in the result set:
- First, the values of the
accuracy measurements for each fold. we used a single measure,
the Log Likelihood Score, a value that, by itself, does not
depend directly on the training or test set size (Pass/Fail
classification scores do depend on these)
- Then, the partition size for
each fold. This is very important, because it is the termination
condition for the algorithm (we do not know ahead of time how
much data is in our mining structure). The folds are almost the
same size; the stratified sampling mechanism may lead to minor
differences. If at least one fold has a size exceeding or equal
to our training set size, the procedure had enough data (see more
details on Partition Size in Step 6).
To simplify the results for
this execution (training set size = 10 cases), we stored the μ
(average) and σ (standard deviation) for the Likelihood Log
Score measurement
Step 6: Double up the
training set size and repeat from step 4 until not enough data
is found.
The Partition Size in the
result is not the size of the training data for
the model indicated by Partition Index. Every row marked with
Partition Index = 1 is about the model trained leaving partition
1 out (and, therefore, evaluated on partition 1).
The procedure above will stop
when step 4 asks for a data set size which exceeds the data
available in the mining structure.
The Output
Let's examine the results of
this process.
Our sample mining structure has
around 13k cases (the view contains 18k and we preserved 30% as
holdout). Therefore, the procedure worked correctly for data
sets of: 20, 40, 80, 160, 320, 640, 1280, 2560, 5120 and 10240
cases (so it stopped with a training size of 5120 cases —
remember, data size is split in 2 folds!). When asking for 20480
cases, the procedure only found enough data to create 2
partitions of 6470 cases. To evaluate the accuracy of (almost)
the whole training set in the structure, we added one more call:
CALL
SystemGetCrossValidationResults( MyStruct,
ClassifyBikeBuyer, 10, 0, ‘Bike buyer’)
Effectively, we asked the
procedure to perform 10-fold cross validation, so we can
evaluate the accuracy of a model using 90% of the training data.
Not exactly the whole data set, but close enough!
We repeated the same steps for
the ‘English Occupation’ target and plotted the results in
Excel. The chart below presents the accuracy evolution
(Likelihood Log Score) with the size of the training set, and
the error associated with each data point is the standard
deviation for the likelihood log score among the folds of that
size.

So, what does the chart tell
you?
Likelihood Log Score is a
measure that cannot exceed 0. The closer it is to 0, the better
is the accuracy of the model. The chart above suggests that:
- Our structure data is
pretty good for predicting Bike Buyer. The accuracy grows
minimally after 1280 training cases
- However, the data does not
seem enough to predict the English Occupation. As you can
see, the slope is still pretty steep on the last segment of
the English Occupation line.
And Now, The Code
The algorithm seems interesting
enough to apply it later, so we encapsulated it in a small
C# Analysis Services stored procedure.
All the code is available in
this file:
DataSufficiency_AccuracyEx.cs. To use it, you will need to download the
file, create a new C# class library project, add a reference to
Microsoft.AnalysisServices.AdomdServer library and include the
file in the project.
Once you build and deploy the
stored procedure (Safe permission set, impersonate current
user), you can invoke it with a call like this:
CALL
AccuracyEx.EvaluateModelingData(’ClassifyBikeBuyer’, ‘English
Occupation’, ‘Log Score’)
- The first parameter is the
mining model (which does not need to be processed, but must
belong to a processed structure)
- The second parameter is
the target predictable attribute.
- The last parameter is the
measure to be used in the result. If an empty string is
passed as argument, then all the measures will be returned.
Here is the output we used to
build
the chart shown earlier:

The procedure can be easily
modified to work for multiple predictable targets (no input
target attribute) and even for Clustering models.