Mining the Flow
by Jamie MacLennan
SQL Server 2005 Analysis Services (SSAS) provides nine state-of-the-art data mining algorithms that can be used to solve a range of problems limited only by your imagination. Over the years, as data mining technology has evolved, miners have faced a variety of hurdles that have limited users' ability to leverage data mining in their solutions. The difficulty and expense of efficiently preparing data for mining is often a barrier. Additionally, once the core data mining operations are complete and models have been generated, there has typically been no clear-cut, easy way to bring these models from implementation to deployment. The combination of SSAS and SQL Server 2005 Integration Services (SSIS) alleviates these difficulties, and new opportunities arise to deploy predictive power into parts of the enterprise where it previously would have been unlikely to be considered.
SQL Server 2000 shipped with a DTS task that allowed you to specify a data mining query and dump the results into a relational database. SQL Server 2005 provides a similar task with enhanced functionality, but it also comes with powerful data mining transforms that allow you to perform mining operations directly in your operational data flow. In this article I will describe how SSIS can be used to prepare data for data mining, and also how you can use data mining to enhance your SSIS packages in simple but impactful ways that will get you into the data mining groove before you know it.
Flowing into the mine
The majority of data preparation for data mining is similar to data preparation for any other analysis or warehousing operation – that is, you have to make sure the data is clean. SSIS provides many transforms to help in this regard, ranging from the standard Lookup to the more exotic Fuzzy Grouping and Fuzzy Lookup. Cleaning the data and creating new variables are essential for a data mining project to be successful.
In addition to standard data preparation, data mining typically requires sampling to split the data into one set for training a model and another for validating the model's accuracy. SSIS provides two sampling transforms, Percentage Sampling and Row Sampling, for this purpose. These transforms split the data and can be used to perform more interesting sampling operations such as oversampling or stratified sampling. This is useful when the patterns you are looking for are very uncommon, such as when analyzing returns on mailing campaigns or in fraud scenarios.
Take for example a mail campaign that has a historical return rate of 2% that you want to improve with data mining. In order to make an effective model to predict whether a customer will respond, you need to increase the frequency of success in the data you show to your model, and then leave some data with approximately the original frequency for testing. To accomplish this task you would follow steps like the following:
- Create an SSIS package with a data flow and a data source transform connected to your source data.
- Use a Row Sampling transform to split 20,000 rows from the data to use for testing.
- Save the Sampling Selected rows to a table.
- Use a Conditional Split transform on the Sampling Unselected output to separate the successful mailings from the rest of the data. The split condition will be on the column indicating whether each customer responded to the mailing or not.
- Use two Row Sampling transforms on the outputs of the conditional split to sample 10,000 rows from each output.
- Use a Union All transform to join together the Sampling Selected outputs of each of the Row Sampling transforms.
- Take the result of the union and send to a destination transform.
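To make the row accounting concrete, the steps above can be sketched in plain Python outside of SSIS (the data set, response rate, and sample sizes here are illustrative stand-ins for your own data):

```python
import random

random.seed(42)

# Hypothetical source data: 100,000 customers with a ~2% response rate.
rows = [{"id": i, "responded": random.random() < 0.02} for i in range(100_000)]

# Step 1: hold out 20,000 rows for testing (the Row Sampling transform).
random.shuffle(rows)
test_set, remainder = rows[:20_000], rows[20_000:]

# Step 2: conditional split on the response column.
responders = [r for r in remainder if r["responded"]]
non_responders = [r for r in remainder if not r["responded"]]

# Step 3: sample up to 10,000 rows from each branch (two Row Sampling transforms).
train_pos = random.sample(responders, min(10_000, len(responders)))
train_neg = random.sample(non_responders, min(10_000, len(non_responders)))

# Step 4: Union All -- the training set now shows the model a much higher
# frequency of success than the raw 2%, while test_set keeps the original rate.
training_set = train_pos + train_neg
success_rate = sum(r["responded"] for r in training_set) / len(training_set)
```

Because there are far fewer than 10,000 responders, the positive branch contributes all of its rows, boosting the success frequency in the training set by several times while the held-out test set preserves roughly the original 2% rate.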
The resulting training set can be saved to a table in a SQL Server database, from where the data can be mined in the BI Development Studio, or you can use the Data Mining Training transform to mine the data directly from the pipeline. This destination transform chunks the streaming data into batches that are pushed from the pipeline to your SSAS server, where they are fed to the data mining algorithm of your choice. Since the algorithms supported by Analysis Services may require multiple passes over the data, model training doesn't actually begin until all of the data has been pushed to the server. Once the data transfer is complete, the algorithm processes the data to generate a model, and then the SSIS package completes. Figure 1 shows a package that implements this process.
Figure 1: Using SSIS for stratified sampling and model training
The Data Mining Training transform allows you either to train an existing model using pipelined data or to train a model created on the spot inside the SSIS interface. The steps for creating a model inside SSIS are identical to those you would use to create the same model in an Analysis Services project. Once the model is trained, you can browse the model and test it using the lift and profit charts in the BI Development Studio (BIDS) or SQL Server Management Studio (SSMS). Figure 2 shows the tree created by the model in the above package. From this diagram, you can see that customers who don't own cars and are 30 years old or younger have a high probability of responding to the marketing campaign, whereas people who have more than two cars are not so likely.
Figure 2: Browsing the TargetMailDT Model
To test a model from SSMS, connect to your Analysis Services database, right-click the created model in the Object Explorer, and select "View Lift Chart." The Lift Chart allows you to compare your trained model to the testing set saved by the package above to determine how well the model performs and to maximize the profitability achieved by using the model in production. Figure 3 shows a profit chart created against 50,000 candidates by assuming that it costs $5,000 to launch the marketing campaign, $3 to contact each customer, and you get a $50 return on each successful contact. The chart shows a maximum profit of over $48,000. More importantly, by clicking at the peak of the chart and looking at the mining legend, you can see that maximum profitability is achieved by contacting every candidate whom the model determines to have a 52.36% or greater probability of responding.
Figure 3: TargetMail Profit Chart
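The profit figure behind the chart is straightforward arithmetic. Here is a sketch using the numbers above ($5,000 fixed cost, $3 per contact, $50 per response); the contact and responder counts are illustrative, not the chart's actual figures:

```python
def campaign_profit(contacts, responders,
                    fixed_cost=5_000, cost_per_contact=3, revenue_per_response=50):
    """Net profit of mailing to `contacts` people, of whom `responders` respond."""
    return responders * revenue_per_response - contacts * cost_per_contact - fixed_cost

# Contacting everyone at the raw 2% response rate loses money...
profit_all = campaign_profit(contacts=50_000, responders=1_000)

# ...while contacting only the model-selected candidates can turn a profit
# (illustrative counts for a targeted subset with a much higher response rate).
profit_targeted = campaign_profit(contacts=5_000, responders=800)
```

This is why the profit chart peaks at a probability cutoff rather than at "contact everyone": each additional low-probability contact costs $3 but returns, in expectation, less than that.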
Flowing from the mine
As cool as the ability to take streaming data and directly populate a mining model is, by far the most interesting scenarios arising from integrating data mining and SSIS come from the Data Mining Query transform. The Data Mining Query transform allows you to join the contents of the pipeline with a mining model and retrieve predicted results. In short, it can be considered the ultimate "fuzzy lookup" transform, using any arbitrary algorithm and returning any arbitrary result. For example, continuing with the marketing campaign from above, you can use the Data Mining Query transform to predict whether a customer will respond to the campaign, or, to maximize your profit potential, you can "look up" the probability that a customer will respond and act accordingly. Since you are in the SSIS pipeline, these potential customers can come from any source – be it a SQL Server database, flat file extracts from your mainframe servers, or even web services you subscribe to that constantly stream candidates to you via XML.
Because the Data Mining Query transform transfers data to an Analysis Services server, it handles data in a manner optimized to reduce data movement. First, only the input columns that are required for the data mining query are transferred to the server; second, all of the columns on the input of the query are copied to the output. This process is illustrated in Figure 4. This allows you to select only the supplemental columns from the data mining model that you need in the pipeline. For instance, Figure 5 shows the data mining query builder user interface joining our TargetMailDT model to the pipeline. Since the goal is to determine the probability that a candidate will respond to a bicycle offer, the only thing you need to select is the probability that the column "Bike Buyer" has the value "1". All of the columns already in the pipeline, such as name and phone number, are passed through to the output of the transform. Furthermore, since those columns are not bound to any columns in the mining model, they are not sent to the Analysis Services server.
Figure 4: Data Transfer in the Data Mining Query Transform
Figure 5: Prediction Query Builder in SSIS
If you examine the query created by the prediction query builder, you will see the following query, in which only a single column value is returned from Analysis Services.
SELECT
  (PredictProbability([TargetMailDT].[Bike Buyer],1)) as [ProbSuccess]
FROM
  [TargetMailDT]
PREDICTION JOIN
  @InputRowset AS t
ON
  [TargetMailDT].[Marital Status] = t.[MaritalStatus] AND
  [TargetMailDT].[Gender] = t.[Gender] AND
  [TargetMailDT].[Total Children] = t.[TotalChildren] AND
  [TargetMailDT].[Number Children At Home] = t.[NumberChildrenAtHome] AND
  [TargetMailDT].[Education] = t.[Education] AND
  [TargetMailDT].[Occupation] = t.[Occupation] AND
  [TargetMailDT].[Home Owner] = t.[HomeOwner] AND
  [TargetMailDT].[Number Cars Owned] = t.[NumberCarsOwned] AND
  [TargetMailDT].[Commute Distance] = t.[CommuteDistance] AND
  [TargetMailDT].[Age] = t.[Age]
One peculiar aspect of this query is how the source data is referenced. The "@InputRowset" syntax to the right of the PREDICTION JOIN clause indicates that the source data is supplied as a rowset parameter. SQL Server Data Mining in Analysis Services allows entire rowsets to be specified as input parameters, which permits, for instance, client data to be mined on the server. Depending on the API that you choose, the rowset parameter can be an IRowset for OLE DB implementations, a DataTable for .NET implementations, or an XMLA rowset for completely thin-client implementations. In the OLE DB and .NET cases, the client pieces of Analysis Services take care of converting the rowset parameters to the appropriate XMLA format and remoting them to the server. This rowset parameter allows the Data Mining Query transform to work in a manner similar to the Training transform, in that it chunks data from the input into reasonably sized batches to send to the Analysis Services server. In the query case, however, the transform does not need to collect all of the data before continuing – as soon as the server has finished with a batch, the rows it returns are joined to the rows in the pipeline and allowed to continue down the pipe. Note that WHERE clauses in the data mining query, or the flattened results of queries returning nested tables, can respectively reduce or increase the number of rows in the pipeline, so the number of rows flowing into the Data Mining Query transform is not necessarily the same as the number flowing out.
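The batching behavior can be mimicked with a simple generator: rows are grouped into fixed-size batches, each batch is scored as a unit, and the scored rows flow onward without waiting for the rest of the input. The score_batch function here is a hypothetical stand-in for the Analysis Services round trip:

```python
from itertools import islice

def batches(rows, size):
    """Yield fixed-size chunks from an iterable of rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

def score_batch(batch):
    # Stand-in for the server round trip: one prediction result per input row.
    return [{"prob": 0.5} for _ in batch]

def mining_query_transform(rows, batch_size=1000):
    """Stream rows through a batched 'prediction join'."""
    for batch in batches(rows, batch_size):
        for row, result in zip(batch, score_batch(batch)):
            # Input columns pass through untouched; prediction columns are added.
            yield {**row, **result}

scored = list(mining_query_transform([{"id": i} for i in range(2500)],
                                     batch_size=1000))
```

Because mining_query_transform is a generator, downstream transforms can consume the first 1,000 scored rows while later batches are still in flight, which is exactly the pipelining benefit described above.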
Figure 6 shows the resulting query package. The conditional split divides the data based on whether the ProbSuccess column output by the Data Mining Query transform is greater than 0.52, as indicated by the profit chart test above. In the end you have over 4,700 customers to call, and over 8,000 that you've filtered out, which simply represents money saved. Of course, if you are not interested in the unlikely candidates, you can simply filter them out using the WHERE clause in the query transform.
Figure 6: SSIS package using the Data Mining Query Transform
In this example you saw just one application of data mining in SSIS – one where SSIS serves as a mechanism to "do data mining." There are many other ways to leverage data mining inside SSIS that use the same mechanisms but serve the ETL operations SSIS was designed for, such as data cleansing. For example, assume your source data had many missing values in the 'Occupation' or 'Age' columns. You can easily change the model built in the first step from predicting 'Bike Buyer' to predicting the columns where you have missing values. In the second step, while loading data into your warehouse, you can split out the rows where the values are missing and use the Data Mining Query transform to impute values for those columns by prediction. You can even add an additional column containing the confidence of the predicted values – on the branch where the value is known, you simply set the confidence to 100%, and on the predicted branch you fetch the PredictProbability of the missing column from the model into the pipeline.
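A sketch of that imputation flow: rows with a known value keep it at 100% confidence, while rows with a missing value receive a predicted value and its probability. The predict_occupation function is a hypothetical stand-in for the model's PredictProbability lookup, and the column names are illustrative:

```python
def predict_occupation(row):
    # Stand-in for the trained model: return (predicted value, probability).
    return ("Professional", 0.73)

def impute(rows, column="Occupation"):
    """Split rows into known/missing branches and merge them back with a confidence column."""
    out = []
    for row in rows:
        if row.get(column) is not None:
            # Known-value branch: keep the value, confidence is 100%.
            out.append({**row, "Confidence": 1.0})
        else:
            # Predicted branch: impute the value and carry its probability.
            value, prob = predict_occupation(row)
            out.append({**row, column: value, "Confidence": prob})
    return out

rows = [{"id": 1, "Occupation": "Clerical"}, {"id": 2, "Occupation": None}]
imputed = impute(rows)
```

Downstream consumers of the warehouse can then filter or weight records by the Confidence column rather than treating imputed values as ground truth.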
Another example is when your data may not have missing values, but may simply be dirty or even fraudulent. The Microsoft Clustering algorithm included with Analysis Services has the function PredictCaseLikelihood, which returns a number indicating how likely each row is to be valid based on the combination of values across that row. This can be used in a query transform to separate outliers, i.e. bad data, before they get into your warehouse. Techniques similar to those described above can then be used to determine exactly which values are the most suspect.
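The outlier split reduces to a threshold test on a per-row likelihood score. In this sketch, case_likelihood is a crude hypothetical stand-in for PredictCaseLikelihood (using a single cross-column rule), and the threshold is illustrative:

```python
def case_likelihood(row):
    # Stand-in for Microsoft Clustering's PredictCaseLikelihood: a crude
    # plausibility score based on one sanity check over the row's values.
    return 0.9 if 0 <= row["age"] <= 110 else 0.01

def split_outliers(rows, threshold=0.05):
    """Separate suspect rows from clean rows before they reach the warehouse."""
    clean = [r for r in rows if case_likelihood(r) >= threshold]
    suspect = [r for r in rows if case_likelihood(r) < threshold]
    return clean, suspect

rows = [{"id": 1, "age": 34}, {"id": 2, "age": 431}]
clean, suspect = split_outliers(rows)
```

The real function scores each case against the learned cluster structure, so it catches implausible combinations of values (not just single out-of-range columns) that a hand-written rule would miss.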
The flexibility and functionality present in SQL Server 2005 Integration Services give analysts, DBAs, and developers the ability to mine data from any source without limitation. Data can be mined directly from any database, from flat files, Excel files, or even straight off the web, without any prior staging. The data pulled from these sources can even be massaged into any arbitrary form along the way, as necessary and appropriate for mining operations to occur.
The ability to access predictive power from the ETL pipeline gives data warehouse developers new and unprecedented tools for data manipulation. Using data mining to determine the validity of data, or even to fill in missing data, will create better, more accurate, and more robust warehouses for downstream use.
The marriage of ETL and data mining tools between SSIS and SSAS allows you to take advantage of operations typically difficult, unwieldy, or impossible using a single tool and implement them easily and effectively using the combination of both.
To get more information about the data mining functionality in SQL Server 2005, check out the website www.sqlserverdatamining.com.