SQL
Server Data Mining is unique in its ability to mine multiple tables of data
simultaneously through the nested table feature. This functionality,
however, adds some complexity when performing basic data manipulation tasks
required for data mining, such as sampling. How does one split a data set
with nested tables into training and testing sets? Well, this is how.
SQL
Server 2005 Integration Services (SSIS) provides all the tools necessary in its
data pipeline to sample data and sample nested data. The basic idea is
that you need to simultaneously read from your case data and nested data, sort
them*, sample the case data, and merge in the nested data. Such a data
flow is shown below.

To create
this package, follow these steps:
1.
Create
a new SSIS package.
2.
Add a Data Flow task for your
pipeline.
3.
Add
data sources for your case and nested tables.
4.
Use
a sort transform* to sort all tables by the case identifier. This should
be the key of the case table, and the foreign key of the nested tables that
relates them to the case table.
5.
Sample
the sorted case table using a Percentage Sampling transform, label outputs to
“Testing” and “Training”. A rule of thumb is to use 70% of the data for
training and 30% for testing.
6.
Use
the Multicast transform on each of the outputs of the nested sorting and the
case sampling (both training and testing).
At this
point (assuming a single nested table) you will have six data streams.
For convenience, we will call these Training1, Training2, Testing1, Testing2,
and Nested1, Nested2.
7.
Output
Training1 and Testing1 into destination transforms – these will be your
training and testing case tables. In the diagram above, the destinations
are row count transforms for illustrative purposes. In practice, you
would use a table destination such as the SQL Server Destination transform.
8.
Use
a Merge Join transform to join Training2 and Nested1. Join the streams
using the foreign key, and select all of the columns of Nested1 as output.
9.
Output
the result of the Merge Join to a destination. This will be your training
nested table.
10.
Use
a Merge Join transform to join Testing2 and Nested2. Join the streams
using the foreign key, and select all of the columns of Nested2 as output.
11.
Output
the result of the Merge Join to a destination. This will be your testing
nested table.
Once you
have completed the package, run it to perform the sampling and accurately
divide your data into training and testing sets. You can then add these
tables to a Data Source View in an Analysis Services project for data mining.
The
package described above and supporting data is available for
download.