Introducing the SQL Server 2005 Data Mining Add-ins for Office 2007
The SQL Server 2005 Data Mining Add-ins for Office 2007 allow you to uncover hidden patterns and relationships in your data and then put them to work to enhance the quality of your analysis. To use the Microsoft SQL Server 2005 Data Mining Add-Ins for Office 2007, you must be connected to a SQL Server 2005 Analysis Services database. However, you don’t need to have prior experience with any SQL Server 2005 Analysis Services components.
The Data Mining Add-ins for Office 2007 package is available as a free download that includes three add-ins:
· Table Analysis Tools for Excel – Allows you to analyze your spreadsheet data in powerful ways with just a few mouse clicks, by employing SQL Server Data Mining behind the scenes.
· Data Mining Client for Excel – Enables you to go through the entire lifecycle of a data mining project, including preparing data, building, evaluating and managing mining models, and predicting results using either spreadsheet data or external data accessible through your Analysis Services database.
· Data Mining Templates for Visio – Enables you to render, annotate and share your mining model patterns as Visio diagrams with colleagues and decision-makers in your company.
The rest of this article will give you an overview of the system requirements, installation process and functionality of the add-ins.
Before you can dive into the add-ins, you need to have the right pieces in place. Here are the requirements:
1. Microsoft .NET Framework 2.0 – The add-ins are built on the capabilities of the .NET Framework 2.0 and the .NET programmability support provided by Office 2007.
2. Microsoft Office 2007 – You must have Excel 2007 installed for the Table Analysis Tools and Data Mining Client add-ins. You need Visio Professional 2007 for the Data Mining Templates add-in for Visio. The installs must include .NET Programmability Support (this feature is included in the default install for Excel 2007 and Visio Professional 2007).
3. Microsoft SQL Server 2005 Data Mining Add-ins for Office 2007 – The add-ins themselves are installed by downloading and running the appropriate free package from this page, which also includes pointers for downloading the prerequisites. The add-ins are available in ten languages. Note: The default install only includes Table Analysis Tools. Be sure to select all the components during the install to get the other two add-ins.
4. Connectivity to SQL Server 2005 Analysis Services – You need to be connected to an instance of SQL Server 2005 Analysis Services to use all three add-ins. This may be on your local machine or a remote server that you have access to. In either case, the Analysis Services instance must be correctly configured to support the add-ins. The add-ins installation includes a Getting Started wizard that walks you through the process of getting connected to and configuring the Analysis Services instance.
With the requirements and setup out of the way, we can now explore the functionality provided by each of the add-ins.
Table Analysis Tools for Excel
This add-in provides you with the ability to perform rich analyses on your spreadsheet data with a couple of mouse clicks. Although it leverages the power of SQL Server 2005 Data Mining, as an end-user you don’t need to have any knowledge of data mining to make effective use of these tasks.
Since these are called Table Analysis Tools, you do need to know how to create a table or convert your existing spreadsheet data into a table in Excel. (Don’t worry if you don’t know how to do this though – the Getting Started with Table Analysis Tools video tutorial on this page will show you.)
If you click anywhere inside a table in Excel after installing the Table Analysis Tools add-in, you will see the following ribbon under Table Tools:
Each of the tools on this ribbon brings up a simple UI like the one below for Analyze Key Influencers and produces one or more rich, easy-to-understand reports that provide deeper insight into the source data:
Let us now look at each of the tools on the Table Analysis Tools ribbon.
Analyze Key Influencers
This tool analyzes the patterns in data that have the strongest influence on a certain outcome. For example, if you have a list of sales for all customers, Analyze Key Influencers can analyze the factors that are key to determining which customers will spend the most money. It may not be obvious which fields are the key influencers. For example, a field representing annual income may not be the biggest factor for customers who spend the most money. It could be other factors, such as number of children in the household, geographic location, or a combination of these fields.
Here’s an example of the output produced by this tool:
The Detect Categories tool detects the rows in the data table that contain similar characteristics and groups them into categories. Each category identified is described by the distinguishing characteristics of the rows assigned to the category. Based on those characteristics, you may choose to give the category a more understandable name. For example, a category that contains Age=45-60 and Income>100000 could be renamed to Affluent Baby Boomers. Breaking the data into categories enables you to quickly identify the natural groupings in your data. This identification can be useful in helping you run targeted marketing campaigns.
The tool also labels each row in the source sheet with the category that it was assigned to, in addition to producing a report like one below:
Fill From Example
This tool automatically fills in missing data in a specific column for all rows in the selected table, based on exemplary values provided by you for some of those rows. The tool employs SQL Server Data Mining to detect patterns in the sample you provide and apply those to the remaining missing values.
In addition to filling in the missing values, the Fill From Example tool also produces a patterns report similar to the Analyze Key Influencers tool that describes the rules that were applied to fill in the missing values.
The Forecast tool predicts future values based on trends in existing time series data. For example, you can forecast what annual sales will be two years from now, based on current conditions and factors in your data. The tool appends the forecast values as new rows to the table selected for forecasting.
The tool also produces a chart that graphically plots the existing and forecast values as shown below:
This tool analyzes and highlights data rows that do not appear to match the general patterns found in the rest of the table. These exceptions could be either be due to data entry errors or they may be genuinely unusual values that require further analysis. Highlighting exceptions can be useful because these values can significantly skew averages and trends. If exceptions are caused by data entry errors, it is likely that you will want to fix these errors before doing additional analysis or trending.
Highlight Exceptions not only highlights exception rows, it also highlights the specific column value in each row that is likely to be the cause of the exception. You can change the column value in-place and the row is immediately re-evaluated to determine if it is still an exception.
In addition, the tool produces an interactive exceptions report that can be used to dynamically vary the exception threshold so you can choose to see more or less exceptions.
Here’s sample output from the Highlight Exceptions tool, with the source data and the exceptions report sheets displayed side-by-side:
This tool allows you to model two types of scenarios and report the effect on either a single row or the entire table of input data:
· Goal Seek – Analyzes and reports the factors that need to change to achieve a specific goal. For example, if a company wants to grow from ten million dollars to fifty million dollars, this tool can help to determine how to reach this goal, based on factors that it analyzes in your data. The sample output below shows the output produced by the tool for a single row in the sample data:
· What-If – Analyzes and reports the impact of a proposed or hypothetical change. For example, you can see the effect of change in sales for your target customers if the price of a product increases. With “what-if” scenarios, you can determine, for example, that increasing price does not achieve the results you need, but increasing the warranty period might. The sample output below shows the output produced by the tool for a single row in the sample data:
The Connection button allows to create and manage connections to Analysis Services databases – a connection to an Analysis Services database is required before you can run any of the Table Analysis Tools as described in the Getting Started section earlier.
The Help button provides access to the documentation included with the add-ins as well as the Getting Started wizard and online tutorials.
Data Mining Client for Excel
The Data Mining Client add-in enables you to go through the full data mining lifecycle within Excel by using your spreadsheet data, or by using external data accessible through your Analysis Services database. If you have the Data Mining Client add-in installed, you should see the following ribbon when you launch Excel 2007:
The left-to-right organization and grouping of buttons on the Data Mining ribbon reflects the typical order of tasks a user would go through when working on a data mining project. Each of the sections (known in Office parlance as chunks) is described below.
Selecting the right attributes from the source data and getting them into the right format for mining typically takes up a large percentage of the time in a typical data mining project. This section provides tools that address common data preparation needs for data mining:
· Explore Data - Visually plot the distribution of discrete and continuous values and possibly add groupings back to the source data.
· Clean Data - Remove outliers and re-label discrete state values (for example, your source data may contain “M” and “F” as states for the Gender column and you prefer to use “Male” and “Female” for clarity when presenting the model results.)
· Partition Data - Split the source data into training and test sets, take a random sample of the input data or perform oversampling to adjust for skewed distributions.
Here is a sample screen shot that shows how you can remove values beyond a certain threshold:
This section covers the actual model definition and processing. It provides wizards that help you easily build common types of mining models without worrying about the actual mining algorithms and associated parameters supported on the server.
Also included in this section are advanced options that allow the user to pick the exact mining algorithm and tweak additional parameters.
Here’s a page from the Associate wizard that helps you build a model to find associations in transactional data:
Accuracy and Validation
This section includes charts for validating and verifying the accuracy of your mining models. The three charts provided are:
1. Accuracy Chart – Evaluates the performance of your model against test data by drawing a lift chart for classification models and a scatter plot for estimation models.
2. Classification Matrix - Displays a matrix of correct and incorrect classifications by evaluating your model against test data
3. Profit Chart – Graphically models profit for targeted campaigns based on user-supplied parameters for cost.
Here’s a sample accuracy chart generated by the Data Mining Client in Excel, showing a typical model whose performance lies between the “random guess” model and the ideal model:
This section covers the two standard tasks that you would perform with trained mining models:
· Browse – Explore the patterns and rules learned by the mining algorithm from the training data. This “mining model content” is visualized in different ways depending on the type of model you’re browsing. Here is an example of the visualization of a decision tree model:
· Query – Use the trained model to make predictions on new data. The Query task supports a wizard for building simple queries as well as an advanced editor where you can use DMX templates to build queries or manually type in the DMX statement.
This gives you the ability to manage existing models in the Analysis Services database you are connected to. You can rename, delete, clear, reprocess, export, or import mining structures and models as shown below:
The Connection and Help buttons are identical in function to those on the Table Analysis Tools ribbon described earlier. There is an additional Trace button that allows you to trace the commands that are sent by the data mining add-in(s) to the Analysis Services instance you’re connected to.
Data Mining Templates for Visio
With the Data Mining Templates add-in installed, you have the ability to render and share your mining model content as annotatable Visio diagrams. You can also publish these diagrams as interactive web pages.
If you launch Visio 2007 after installing the Data Mining Template add-in, you should see the Microsoft Data Mining template listed under Recent Templates as shown below:
Once you open the template, you can drag one of the following Visio shapes onto the design surface – the drag-and-drop action launches a dialog that allows you to connect to an existing mining model in your Analysis Services database and customize the diagram:
1. Decision Tree – Creates diagrams from mining models that were created using the Microsoft Decision Trees, Microsoft Linear Regression and Microsoft Logistical Regression algorithms. The diagram support different node schemes for classification and regression trees.
Here’s an example of the output of the Decision Tree shape for a classification tree:
For a regression tree, the nodes show the average and standard deviation with a distribution graph instead of the state histograms.
2. Dependency Network – Creates diagrams from mining models that were created using the Microsoft Naive Bayes, Microsoft Decision Trees, and Microsoft Association Rules algorithms.
The user interface allows you to fetch an initial set of nodes that match a query condition and then bring related items for selected nodes later. This is useful for interactive drilldown with the final goal of presenting the most interesting segments of a dependency network to others.
3. Cluster – Creates diagrams from mining models that were created using the Microsoft Clustering algorithm.
You can choose to render the clusters with just the shapes or with cluster characteristics or discrimination charts. The sample below shows the output with cluster characteristics included:
In addition to supporting standard and custom color schemes and providing the ability to annotate the visualization using Visio drawing and text tools, all the DM Template shapes allow you to
· expand and collapse sub-trees
· copy or move a subset of nodes to a new page
· modify color schemes
· replace standard shapes with custom shapes
· preview shading and shape options
Of course, you can render the content of mining models created with the Data Mining Client for Excel 2007 as well using the above shapes, allowing you complete the full data mining cycle, including presentation of data mining results, entirely within Office 2007.
Regardless of your familiarity with data mining concepts and practice, you can use one or more of the SQL Server 2005 Data Mining Add-Ins for Office 2007 to enrich your data analysis tasks using Office 2007.
· DM Add-Ins Download Page:
· Online Tutorials for the Add-Ins: