Server Data Mining is unique in its ability to mine multiple tables of data
simultaneously through the nested table feature. This functionality,
however, adds some complexity when performing basic data manipulation tasks
required for data mining, such as sampling. How does one split a data set
with nested tables into training and testing sets? Well, this is how.
Server 2005 Integration Services (SSIS) provides all the tools necessary in its
data pipeline to sample data and sample nested data. The basic idea is
that you need to simultaneously read from your case data and nested data, sort
them*, sample the case data, and merge in the nested data. Such a data
flow is shown below.
this package, follow these steps:
a new SSIS package.
Add a Data Flow task for your
data sources for your case and nested tables.
a sort transform* to sort all tables by the case identifier. This should
be the key of the case table, and the foreign key of the nested tables that
relates them to the case table.
the sorted case table using a Percentage Sampling transform, label outputs to
“Testing” and “Training”. A rule of thumb is to use 70% of the data for
training and 30% for testing.
the Multicast transform on each of the outputs of the nested sorting and the
case sampling (both training and testing).
point (assuming a single nested table) you will have six data streams.
For convenience, we will call these Training1, Training2, Testing1, Testing2,
and Nested1, Nested2.
Training1 and Testing1 into destination transforms – these will be your
training and testing case tables. In the diagram above, the destinations
are row count transforms for illustrative purposes. In practice, you
would use a table destination such as the SQL Server Destination transform.
a Merge Join transform to join Training2 and Nested1. Join the streams
using the foreign key, and select all of the columns of Nested1 as output.
the result of the Merge Join to a destination. This will be your training
a Merge Join transform to join Testing2 and Nested2. Join the streams
using the foreign key, and select all of the columns of Nested2 as output.
the result of the Merge Join to a destination. This will be your testing
have completed the package, run it to perform the sampling and accurately
divide your data into training and testing sets. You can then add these
tables to a Data Source View in an Analysis Services project for data mining.
package described above and supporting data is available for