### A scorecard for Logistic Regression models

Scorecards are a common way of displaying the patterns found by a logistic regression model. They present the regression coefficients in a clear, intuitive way and can be used to perform risk-evaluation operations (simplified predictions). This article shows how to generate and interpret a logistic regression scorecard with Microsoft SQL Server Data Mining. The code and sample data for this article can be downloaded from here.

#### How Microsoft Logistic Regression works (skip this if you already know)

First, we'll review the behavior of the logistic regression algorithm for discrete (multinomial) variables.

A mining model contains a set of input attributes (features), X1, X2, etc., that are used to predict a target attribute, Y. Logistic regression works by modeling the probability of each state of the target attribute as a function of the states of the input features. In SQL Server Data Mining, the logistic regression algorithm treats each input attribute state as a separate, boolean feature. Therefore, if X1 has 3 states, X2 has 2 states, and so on, the Microsoft Logistic Regression algorithm models the cases as:

- x1, x2, x3 - for X1's 3 states
- x4, x5 - for X2's 2 states
- etc.
- y1, y2 etc.- the possible states of the target attribute, Y

Example: let's assume that X1 is Marital Status, with 3 states (Single, Married, Divorced), and X2 is Gender, with 2 states, M and F. Each case observed in training has a specific value for the Gender attribute, either M or F. Each of these two states has a score associated with it (a z-score) representing the normalized weight of the state. The z-score maps the raw distribution into a space where the mean is represented by 0 and the standard deviation by 1. In this implementation, positive scores are used when a state is present and negative scores when a state is not present, and so on for all the other attributes/states. As a result, a training case like (Marital Status=Married, Gender=M) is translated into:

- Marital Status:
- x1 = -0.03 (score for NOT Single)
- x2 = 0.92 (score for Married)
- x3 = -0.05 (score for NOT Divorced)

- Gender:
- x4 = 0.89 (score for M)
- x5 = -0.11 (score for NOT F)

For each state of the target attribute (y0, y1 and so on) logistic regression computes the coefficients c0, c1, c2 etc. of a function

z = c0 + c1 * x1 + c2 * x2 +...

The algorithm then transforms z into an S-shaped curve: it takes the linear regression above, which can produce any value between minus infinity and plus infinity, and transforms it with:

p = Exp(z) / ( 1 + Exp(z) )

which produces p-values between 0 (as z moves towards minus infinity) and 1 (as z moves towards plus infinity).

The coefficients are determined so that the p function models the probability of the target state as closely as possible to what is observed in the training set.
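The two formulas above can be combined into a short sketch. The coefficients c0..c5 below are hypothetical; the x values are the encoded scores from the (Married, M) example earlier.

```python
import math

def logistic(z):
    """Map a linear score z in (-inf, +inf) to a probability in (0, 1)."""
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical coefficients for one target state (not taken from a
# trained model), applied to the encoded case from the example.
c = [0.1, -0.2, 0.7, -0.1, 0.4, -0.3]     # c0 is the intercept
x = [-0.03, 0.92, -0.05, 0.89, -0.11]

z = c[0] + sum(ci * xi for ci, xi in zip(c[1:], x))
p = logistic(z)   # probability of the target state, strictly in (0, 1)
```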

The p probability is enough to make predictions if the target attribute is binary (i.e. it has only two states, y0 and y1, where y1 can also be considered NOT y0). A slightly modified formula uses the coefficients for each target state to perform predictions on multinomial attributes.
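The article does not spell out the multinomial formula. A common generalization of the logistic function to several target states (which may or may not be the exact form the algorithm uses internally) is the softmax: compute one z score per target state, then turn the scores into probabilities that sum to 1.

```python
import math

def softmax(zs):
    """Turn one z score per target state into probabilities summing to 1."""
    m = max(zs)                        # subtract the max for numeric stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical z scores for three target states y0, y1, y2
probs = softmax([1.2, -0.4, 0.3])
# probs sums to 1; the first state has the highest probability
```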

#### How the scorecard works

The scorecard can be computed for each state of the predicted attribute. For one particular state, y1, we start by extracting the coefficients (c0, c1, ...) that describe the logistic regression formula for that state. These coefficients are then normalized between, say, 0 and 1000, giving an intuitive perspective on the relative importance of each coefficient.

As each coefficient corresponds to a state of an input attribute, the normalized values also describe the relative importance of each input attribute state. The scorecard presented here computes these relative-importance scores.
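The exact scaling used by the stored procedure is in the download package; a plausible sketch is plain min-max scaling of the coefficients into the requested range (the coefficient values below are made up for illustration):

```python
def normalize(coeffs, lo=0.0, hi=1000.0):
    """Linearly rescale a list of coefficients into [lo, hi] (min-max scaling)."""
    c_min, c_max = min(coeffs), max(coeffs)
    if c_max == c_min:                 # degenerate case: all coefficients equal
        return [lo for _ in coeffs]
    span = (hi - lo) / (c_max - c_min)
    return [lo + (c - c_min) * span for c in coeffs]

# Hypothetical raw coefficients for four input attribute states
card = normalize([-0.42, 0.15, 1.08, 0.77])
# the smallest coefficient maps to 0, the largest to 1000
```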

As a result of training, all the coefficients of the regression formulae are stored inside the nodes of the model content. The download package, available here, contains a stored procedure that:

- traverses the content of the mining model
- extracts the coefficients of the target state
- normalizes the values of these coefficients between 0 and a user-specified limit (by default, 1000)
- returns a table containing the coefficients and their normalized values

The package also contains the following:

- A sample dataset and an Analysis Services deployment script, building a Logistic Regression mining model over the sample dataset
- A sample Reporting Services project, using the attached stored procedure to display a scorecard
- A file containing various DMX statements used in building the report

The resulting report looks like the one below:

To build and deploy the stored procedure:

- Open the "Stored Procedure Project" project in Visual Studio .Net (Whidbey)
- Add a reference to the msmgdsrv.dll (Microsoft.AnalysisServices.AdomdServer.dll) class library

This class library is available in the location where Analysis Services 2005 is installed. By default, this location has the form:

C:\Program Files\Microsoft SQL Server\MSSQL.2\OLAP\bin

Please note that, depending on your installation options, MSSQL.2 might instead be MSSQL.1, MSSQL.3, and so on
- Build the project
- Deploy the stored procedure on the server:
- Open SQL Server Workbench
- Connect to the target Analysis Services server
- In the Object Explorer, select the 'Assemblies' node at the server level
- Right-click and select 'New Assembly'
- For the File Name field, use the Browse (...) button to navigate to the location where the stored procedure was built. A file named LogRegHelper.dll should be available at that location, under the bin\debug or bin\release folder. Select that file
- Assembly Name should be automatically filled with LogRegHelper
- Select OK and deploy the stored procedure

To use the stored procedure:

- Open SQL Server Workbench
- Connect to the target Analysis Services server
- Open a new query of type Analysis Services DMX Query
- Select the current database for the query (use the database that contains the logistic regression model, e.g. the sample TestlogReg)
- Invoke the stored procedure:

CALL LogRegHelper.ScoreCard('College Plans', 'College Plans', 'Plans to attend', 1, 0, 1000, true)

In the call above:
- Param1: 'College Plans' is the model name
- Param2: 'College Plans' is the target attribute name
- Param3: 'Plans to attend' is the target attribute value
- Param4: 1 is a value descriptor: use 1 if Param3 describes a regular state, or 0 if Param3 describes the special Missing state. Usually, this should be 1
- Param5: 0 is the scaling minimum for the scores
- Param6: 1000 is the scaling maximum for the scores
- Param7: true: use true for an ascending, false for a descending assignment of scores among one attribute's states

Additional helper stored procedures included in the project:

(These stored procedures are used in the included Reporting Services project to allow selection of models, target attributes and target values)

- CALL LogRegHelper.SimpleScoreCard('College Plans', 'College Plans', 'Plans to attend', 1)

A simpler form of the stored procedure described above; it defaults Min to 0, Max to 1000, and sorting to Ascending
- CALL LogRegHelper.ModelsSupportingScoreCard()

- no parameters; returns the list of models in the current database that support the scorecard procedure (i.e. Neural Network models with HIDDEN_NODE_RATIO=0, or Logistic Regression models)
- CALL LogRegHelper.OutputAttributes('College Plans')

- returns the list of possible target attributes for the specified model
- CALL LogRegHelper.OutputAttributeStates('College Plans', 'College Plans')

- returns the list of possible target values for the specified model and target attribute