Plugging Into SQL Server Data Mining – SQL Server Data MIning

Add your own data mining algorithms to SQL Server 2005

Summary

This article provides a high-level overview and describes the benefits of the extensibility framework in SQL Server 2005 Analysis Services that allows independent software developers to easily integrate new data mining algorithms into the product.

Introduction

Data Mining in SQL Server 2005 is a platform that independent software developers can build on in many ways. One facet of the platform vision of the architects of the SQL Server Data Mining is the ability of third parties to develop their own specialized data mining algorithms and plug them into the Analysis Services infrastructure in SQL Server 2005.

Benefits

As an independent data mining algorithm developer, you not only have to design and implement the complex logic for building and navigating your models, you also need to worry about reading raw data from various data sources, transforming it into a format that is usable by the mining algorithm code and finally presenting the results to the user in a form that they can comprehend. Note that we have not even talked about common enterprise requirements like deployment to multiple users, secure storage and access control, multi-user querying and programmability. This is where building on top of a platform like SQL Server 2005 Data Mining proves hugely advantageous.

The benefits are not just limited to the resource savings from not having to develop, test and support the infrastructure pieces described above – you automatically gain a slew of features simply by virtue of being a plug-in algorithm:
A complete design and development environment, in the form of Business Intelligence Development Studio (BIDS), which includes tools for all phases of data mining model development: data exploration widgets, model creation wizards, designers, model visualizers, model scoring and comparison tools and a prediction query builder. BIDS allows developers and analysts to develop and refine mining models iteratively and then deploy them for multi-user access to a production server.

A management environment in the form of SQL Server Management Studio. SSMS allows administrators to centrally manage storage and security for mining models using a familiar database-oriented paradigm.
Deep integration with OLAP that provides features like the ability to directly mine OLAP cubes and to build data mining dimensions.

Data mining components in SQL Server 2005 Integration Services for complex data preparation and text mining.
The ability to reach the large community of database developers through familiar programming models like ADO and ADO.NET. Plug-in algorithms are also accessible at the same level as native SQL Server 2005 Data Mining algorithms through the new client and server-side ADOMD.NET programming models as well as through AMO, the new object model for managing Analysis Services objects.

The ability to query mining models based on your plug-in algorithm via DMX, a powerful SQL-based language that also supports stored procedures and user-defined functions written in a .NET language of your choice.
The ability to easily produce and distribute data mining results through SQL Server 2005 Reporting Services.
The ability to allow users to access your mining models as a web service, since Analysis Services 2005 is a native XMLA (XML for Analysis) server that can be accessed via TCP or HTTP.

The ability to plug your custom viewers for both native and plug-in algorithms into the SQL Server 2005 development and management environments.

Interoperability with other data mining products via PMML.

Access to the large installed base of SQL Server licensees.

Where DM Plug-Ins Fit Into Analysis Services 2005
Analysis Server
Data Mining Engine
Plug-in
Algorithm
On-Disk OLAP/DM Storage
COM Interfaces
Parsing, Security and Session Management

Storage Engine
Native Data Mining Algorithms

OLAP Formula Engine
External Data Sources
Algorithm Manager
BI Development Studio
(Wizards, Designers, Viewers)
Plug-in Algorithm Viewer
SQL Management Studio
(Security, Querying, Browsing)
Client Programming Models:
ADOMD.NET, AMO, OLE DB [ADO, ADO.NET]
Custom Data Mining Application
XMLA
Server Object Model
Stored Procs
Case Processor
DMX Query Processor

Plug-In Architecture

In the following sections, we will drill down into the various implementation pieces that make it possible for SQL Server Data Mining users to not only seamlessly build, train, browse and deploy models based on your plug-in data mining algorithms, but also embed them into their custom applications.

Communication Protocol

The Data Mining Engine in SQL Server 2005 Analysis Services communicates with plug-in algorithms via a set of COM interfaces that are available in a public header file. These are grouped into two categories: interfaces implemented by plug-in algorithms and interfaces implemented by the Data Mining Engine and consumed by plug-in algorithms.

This means that a plug-in algorithm is implemented as a COM DLL that is loaded into the Analysis Server process.

The interaction between the Data Mining Engine and a plug-in algorithm via these COM interfaces falls into one of the five types –

Metadata Query: discovering the capabilities of the plug-in algorithm.
Training: iterating a set of cases and extracting knowledge, rules or patterns.
Content Browsing: navigating the rules/patterns learned by the algorithm.
Prediction: applying the rules/patterns learned by the algorithm to new input cases. This is commonly known as scoring.
Persistence: loading/saving learned content.

On-Disk Storage

All mining model and related storage objects live in on-disk databases managed by the Data Mining Engine (Analysis Server) storage engine subsystem. The Data Mining Engine allows you to store your algorithm-specific content as part of its corresponding mining model object using persistence interfaces that it makes available to you. This means that you simply write to and read from these streaming interfaces without needed to worry about developing your own transactional storage system. You are of course free to define and manage your own in-memory structures that you load from the Analysis Server storage via these interfaces.

Memory Management

Memory objects – such as strings, variants and arrays – must be allocated and freed using per-algorithm or per-request service provider interfaces passed to plug-in algorithms by the Data Mining Engine. This allows the Data Mining Engine to efficiently manage memory resources and balance them across multiple requests. As a plug-in algorithm developer, this is another complex area that you no longer need to spend development effort on.

Error-Handling

Error reporting is done through the standard COM mechanism of creating and populating errorinfo objects. The Data Mining Engine consumes your errorinfo objects and raises your errors in the same way that it reports errors from other external components. Like native errors, your errors may be logged to the console, administrator-specified server log files or the Windows event log.

Algorithm Registration

Your algorithm is registered like a standard COM DLL, using the Windows regsvr32 utility. In addition, you need to edit the Analysis Server configuration file to add an entry for your algorithm and point it your COM server’s ProgID. This allows Analysis Server administrators to control which plug-in algorithms can be loaded into the Analysis Server process.

Plug-In Viewers

In addition to server-side plug-in algorithms, the SQL Server Data Mining Plug-in Framework allows you build custom viewers that plug into Business Intelligence Development Studio and into SQL Server Management Studio.

Plug-in viewers are Winforms UserControls that implement a simple mining viewer interface. They can be implemented in any .NET language.

Your plug-in viewer can be associated with one or more plug-in or native Microsoft algorithms via the Windows registry.

The User Experience

The deep, low-level integration of plug-in algorithms and viewers enables a uniform user experience at all levels of interaction with the SQL Server Data Mining platform. This means that each of the following classes of users sees no difference between a native (Microsoft-supplied) algorithm/viewer and a plug-in algorithm/viewer:

users of BI Development Studio (analysts, developers)
users of SQL Management Studio (administrators/DBAs)
users of client and server side programming models (developers)

For example, a registered plug-in algorithm would simply show up in the list of algorithms displayed by the data mining wizard as shown below:

Once the user picks an algorithm from this list, the rest of the wizard follows the same path that is taken for a built-in algorithm.

Resources

The following resources are available to help you develop plug-in algorithms quickly:
A technical white paper that describes the implementation architecture including interface descriptions as well as control and data flow in detail.

Tutorials that step you through the process of creating your plug-in algorithm and viewer, including shell code that you can re-use.

Complete C++ source code for a sample plug-in algorithm implementation.

Links to these resources are included in the References section below.

Conclusion

The plug-in algorithm architecture in SQL Server 2005 Data Mining enables you to invest your development resources in new, innovative mining algorithms instead of infrastructure work. By plugging into SQL Server 2005 Data Mining, you can take your single-user, niche product to a new market – the enterprise – and share in the growth of Microsoft’s database and business intelligence platform.

References

SQL Server Data Mining: Plug-In Algorithms by Raman Iyer and Bogdan Crivat, Microsoft Corporation
A Tutorial for Constructing a Plug-In Algorithm by Max Chickering and Raman Iyer, Microsoft Corporation,
A Tutorial for Constructing a Plug-In Viewer by Jesper Lind and Scott Oveson, Microsoft Corporation