Mining in-memory data
Besides being a powerful data analysis tool, SQL Server Data Mining is also a platform for developing intelligent applications. This article presents a methodology for using SQL Server Data Mining to perform avanced analysis over data collected by a client application.
An obvious approach is to stage the data in a table and then follow the Data Source/Data Source View standard procedure. This raises some problems, including the fact that the staging database has to be visible for both the client and the server. Besides, the data is moved a few times from one place to another, which negatively impacts the performance. The approach we will take is to use a SQL Server data mining feature that allows you to pass table parameters to remote in-memory or other client data from the client to server
This article will show how to compute the trend for a set of simple (x,y) points using Microsoft_Linear_Regression. The code can be easily changed to solve other data mining problems, such as clustering or classification. The article has 3 sections: modeling, predicting (or scoring) and additional notes. If you do not need to create new models based on in –memory data (but rather use existing server models) you can skip the first section. The last section contains pointers to related information as well as some advanced topics (such as in-memory nested tables)
Modeling
Let’s assume that a desktop application contains a set of data points included in an array of Point structures:
// Array of points to be analyzed
System.Collections.ArrayList arPoints =
new System.Collections.ArrayList();
Start by adding a reference to Microsoft.AnalysisServices.AdomdClient.dll and use this reference in your code as below:
using Microsoft.AnalysisServices.AdomdClient;
The first step in using SQL Server Data Mining is connecting to a server:
AdomdConnection conn = new AdomdConnection();
conn.ConnectionString = "Data Source=localhost";
Next, we’ll create a mining model on the server. The model will analyze the trend of the y point coordinate function of the x coordinate. To do this, we’ll use a DMX statement such as below:
using (AdomdCommand cmd = new AdomdCommand())
{
cmd.Connection = conn;
cmd.CommandText = "CREATE MINING MODEL ClientData" +
"( [Key] LONG KEY, " +
" X LONG REGRESSOR CONTINUOUS, " +
" Y LONG REGRESSOR CONTINUOUS PREDICT " +
") USING Microsoft_Linear_Regression";
cmd.ExecuteNonQuery();
}
Note that the DMX statement for creating the mining model specifies the algorithm to be used, the usage for each column (Y is predictable, while X is implicitly input) and modeling flags for each column (notably REGRESSOR for X, as X will be used as in regression for computing the trend of Y).
With the model created, the next step is to train it using the collection of points mentioned in the beginning. To do this, we’ll use the DMX INSERT INTO statement, which (very much like its SQL equivalent) will insert into a mining model or structure the tabular data set that is passed as an argument. The AdomdCommand object supports executing commands with parameters, and it also supports table parameters for operations such as INSERT INTO or PREDICTION JOIN
We’ll start by creating a DataTable that will hold all the point information:
// Create a table containing the points
System.Data.DataTable tbl = new DataTable();
// Add columns to the table
tbl.Columns.Add("Key", typeof(int));
tbl.Columns.Add("X", typeof(int));
tbl.Columns.Add("Y", typeof(int));
// Populate the table from the points
object[] dataRow = new object[3];
for (iIndex = 0; iIndex < arPoints.Count; iIndex++)
{
Point pt = (Point)arPoints[iIndex];
dataRow[0] = iIndex; // Use the point index as a key
dataRow[1] = pt.X;
dataRow[2] = pt.Y;
tbl.Rows.Add(dataRow);
}
Now, with execute the INSERT INTO DMX passing the newly created table as an argument:
using (AdomdCommand cmd = new AdomdCommand())
{
// Create a parameterized command to train the model
cmd.Connection = conn;
cmd.CommandText = "INSERT INTO ClientData([Key], X, Y)" +
"@InputRowset";
cmd.Parameters.Add(new AdomdParameter("InputRowset", tbl));
cmd.ExecuteNonQuery();
}
When the command execution is complete, we have a trained model on the server
Prediction
There are two ways of executing predictions with client data against SQL Server Data Mining. When the data is small, it can be specified inside the prediction query (a singleton query).
The Movie Click sample is based on such singleton predictions. However, this approach is not very efficient when a large number of predictions have to be performed, (too many queries, and the parsing times for each query add up). The second way of executing predictions is to ask the server to predict against a batch of data. Usually, such a batch is described by an OPENQUERY (or OPENROWSET) statement. We'll present now a way to programatically send a batch of data from the client application to the server.
The mechanism for predicting against in-memory data is very similar with the one used in training. The DMX PREDICTION JOIN operator allows using a model to score a tabular data set. In our case, the tabular data set is the in-memory data and we will use the same parameter mechanism as above to pass it to the server (the tbl DataTable is assumed to be similar in structure with the one built above, just before training the model):
using (AdomdCommand cmd = new AdomdCommand())
{
// Create a parameterized command to score the points
cmd.Connection = conn;
// selecting the Key allows us to join the prediction result
// back to the original data
cmd.CommandText = "SELECT T.[Key], Y FROM ClientData" +
"NATURAL PREDICTION JOIN @InputRowset AS T";
cmd.Parameters.Add(new AdomdParameter("InputRowset", tbl));
AdomdDataReader rdr = cmd.ExecuteReader();
while (rdr.Read())
{
// read one prediction result, use it
int pointIndex = rdr.GetInt32(0); // the index of the point
int y = rdr.GetInt32(1); // the predicted value of the point
// use the prediction here – e.g. draw the
// trend line at the coordinates of the pointIndex
// point,using the predicted value
}
rdr.Close();
}
Notes
- Multiple kinds of analysis can be performed simultaneously by creating a Mining Structure first, then a few mining models using different algorithms. This way, a single training pass can create a classification model (to detect anomalies or explain certain values) and a clustering model (to analyze groups of data points sharing similar characteristics)
- See SQL Server Books online for details on the DMX syntax for creating / training /predicting from a mining model.
- Two or more tables can be shaped together to model a nested table. For this, the tables should be in a 1-to-many relationship. For instance, assume one table containing User ID and some demographics and the second table containing the list of movies rented by one user (therefore, UID and Movie Title). Each row in the first table will correspond to zero or more rows in the second table. The relation between these rows is that all the rows in the nested table will have the UID column value identical to the User ID column value in the first table’s row:
| Table 1 (System.Data.DataTabletable1)
|
|
Table 2: (System.Data.DataTabletable2)
| UID |
Movie Title |
| 110223 |
Star Wars |
| 110223 |
A Beautiful Mind |
| 110224 (different user) |
Star Wars | |
Shaping for this kind of data can be done like below:
INSERT INTO MiningModel
(
[Key],
[Movies]
(
SKIP,
[Movie Title]
)
)
SHAPE {
@Table1Parameter
} APPEND
( { @Table2Parameter }
RELATE [User ID] TO [UID]
)
And then, the AdomdCommand has to be populated with the two table parameters:
cmd.Parameters.Add(new AdomdParameter("Table1Parameter", table1));
cmd.Parameters.Add(new AdomdParameter("Table2Parameter", table2));
Please keep in mind that, when SHAPE is used, the two tables must be both sorted by the key they share ([User ID]/UID in the case above)