Generalized Additive Modeling Using
SCAB34S SPLINES and SCA WorkBench

Generalized Additive Modeling (GAM) is provided by the B34S® ProSeries Econometric System and SCAB34S SPLINES software products.  SCA WorkBench provides the user interface to a GAM modeling and validation environment in the B34S program suite.

SCAB34S SPLINES provides a subset of the capabilities in the B34S® ProSeries Econometric System, and we refer to these products interchangeably within this document.  SCAB34S SPLINES runs as an integrated component of SCA WorkBench.  The WorkBench product is a companion to the SCA Statistical System and SCAB34S software, providing a graphical user interface for GAM models with various link functions and error distribution settings.  The predictive performance of a GAM model may be validated by comparing its in-sample and out-of-sample predictions with those of linear regression models estimated by OLS, MINIMAX, or L1 methods.  For a logistic GAM model, classification performance may be validated by comparing the confusion matrices and lift-gains of the GAM model with those of a linear regression, probit, or logistic model.

The SCAB34S SPLINES product provides a number of procedures to perform common data manipulation tasks, organizational tasks, and statistical/econometric analysis tasks.  It also contains a comprehensive matrix programming language that may be used to customize procedures for specialized use.  No attempt will be made to cover all features of the SCAB34S product in this document or the full range of applications that may be solved using the B34S matrix programming facilities.[1]  Instead, we shall exclusively use the graphical user interface of SCA WorkBench to specify, estimate, and diagnostically test GAM models in SCAB34S SPLINES.  SCA WorkBench automatically specifies the command script executed in the SCAB34S SPLINES product based on menu selections.  A command file is then executed in the SCAB34S engine and the results are read back into WorkBench for examination.  The user may save the program file and modify the command script to address additional analysis requirements that may arise.

A major assumption of any linear process is that the coefficients are stable across all levels of the explanatory variables and, in the case of a time series model, across all time periods.  The GAM model is a very useful method of analysis when it is suspected that certain predictor variables may be nonlinear with respect to the dependent variable.  There are theoretical reasons to expect such nonlinearity in many applications, including energy, finance, economics, medicine, social science, and manufacturing.

GAM models, at the very least, can be used as a diagnostic tool in determining potential nonlinear relationships of predictor variables with respect to the dependent variable.  Here, the user can investigate the curvature of the variable relationship that can later be used in parametric models by adding cubic terms, quadratic terms, et cetera, to capture the functional form of the variable.  Since GAM models are not limited by imposed functional form, the data itself suggests the functional form of the predictors in the final model.  GAMFIT uses nonparametric fitting based on a scatter plot smoother to fit a smooth relationship between two or more variables. The smoother summarizes the trend of the response variable as a function of the predictor variables by iteratively smoothing partial residuals in a process known as back-fitting. Degrees of freedom are approximated as penalties to keep the complexity of the nonparametric curve fitted to the data in check.  By examining the curvature plots of the transformations employed by GAM on the predictor variables, the functional form of the predictor variable entering the model can be evaluated and interpreted. 

 

Nonparametric models can suffer from dimensionality problems: when a large number of predictor variables is used, the data become sparse and the variance of the estimates is inflated.  This is often cited as the “curse of dimensionality”.  Nonparametric regression methods that use kernel estimation or smoothing splines may also be difficult to interpret.  Stone (1985) originally proposed additive models to help overcome these shortcomings.  Hastie and Tibshirani (1990) later proposed generalized additive models, where the mean of the dependent variable depends upon the additive predictor through a nonlinear link function.  A GAM model replaces the coefficients that would otherwise be found in a linear regression model by a linear smoother.  The smoothing technique is based on local averaging of the values of the dependent variable, grouping values of a predictor variable that are near a target value.
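The back-fitting idea described above (iteratively smoothing partial residuals with a local-averaging smoother) can be illustrated with a minimal Python sketch.  This is not the GAMFIT implementation: the running-mean smoother, the window size k, the function names, and the simulated data are all assumptions chosen only to keep the example short.

```python
import numpy as np

def running_mean_smoother(x, r, k=15):
    """Smooth r against x by averaging the k nearest points in x order."""
    order = np.argsort(x)
    r_sorted = r[order]
    n = len(x)
    half = k // 2
    fitted_sorted = np.empty(n)
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        fitted_sorted[i] = r_sorted[lo:hi].mean()
    fitted = np.empty(n)
    fitted[order] = fitted_sorted  # map back to the original row order
    return fitted

def backfit(X, y, k=15, max_iter=100, tol=1e-8):
    """Additive fit y ~ alpha + sum_j f_j(x_j) by back-fitting."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))
    for _ in range(max_iter):
        f_old = f.copy()
        for j in range(p):
            # Partial residual: remove everything except the j-th smooth.
            partial = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = running_mean_smoother(X[:, j], partial, k)
            f[:, j] -= f[:, j].mean()  # center so each smooth has mean 0
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f

# Simulated data (hypothetical): two nonlinear effects plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 300)

alpha, f = backfit(X, y)
rss_gam = float(np.sum((y - alpha - f.sum(axis=1)) ** 2))

# Linear OLS fit on the same predictors, for comparison.
A = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
rss_ols = float(np.sum((y - A @ beta) ** 2))
print(rss_gam, rss_ols)
```

Plotting each column of f against the matching column of X gives the kind of curvature plot discussed above.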

 

GAM MODELING USING SCAB34S SPLINES AND WORKBENCH

Assume a nonlinear model of the form

$$y = f(x_1, x_2, \ldots, x_k) + e$$

where $x_i$ and $y$ are one-dimensional vectors.  A GAM model (see Hastie-Tibshirani (1986, 1990), Faraway (2006, 240)) can be written as

$$y = \alpha + \sum_{j=1}^{k} f_j(x_j) + e$$

where the $f_j$ are smoothing functions standardized (to remove free constants) so that $E[f_j(x_j)] = 0$.  The smoothing functions are estimated one at a time using a forward stepwise estimation method.  When $y = \alpha + \sum_{j=1}^{k} \beta_j f_j(x_j) + e$ is estimated with OLS, the expected coefficients $\beta_j$ are all 1.0.

 

The user sets the degree of the smoothing (DF) for each predictor variable. For example, setting DF=3 imposes a cubic fit, DF=2 imposes a quadratic fit, and DF=1 imposes a linear fit. The SCAB34S GAM model summary also provides a significance test (LIN_RES) that measures the difference of the sum of squares of the residuals for the linear restriction case and the transformed case of the GAM model for each predictor variable. An overall diagnostic test   where  = the number of parameters in the model is also provided.

 

The first step in GAM estimation is to remove the means from all right hand side data and add the spline to form the smoothed series that have 0.0 expectation such that

 

                                                                                                                        

 

If there are n observations,  is an n element spline series. When OLS is applied to the model  , the expected value of the coefficients are 1.0 as noted above. This allows the GAM coefficients  to be estimated using OLS in terms of the original right hand side variables such that    where

 

                                                                                                                             

 

 

                                                                                                          

 

In effect, the nonlinear effect is removed from the y series to obtain the model.

 

After estimation  the predicted left hand side values can be recovered as

 

                                                                                                                             

 

If out-of-sample forecasts are desired, one way to proceed is to use a polynomial regression to approximate the spline functions of the GAM model. Given the new x data  a new estimated spline vector  can be obtained. Using the estimated GAM coefficients   , the forecasts  can be calculated as

 

                                                                                                                 

 

For more information on the use of polynomial regression in forecasting GAM models, refer to Stokes (2008, Chapter 14).
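The polynomial-approximation step can be sketched in Python as follows.  This is only an illustration under stated assumptions: x_train, s_train, the degree, and the new x values are hypothetical, and a sine curve stands in for an estimated spline series.  It is not the procedure used inside SCAB34S SPLINES.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Hypothetical in-sample predictor and its estimated smooth (stand-in curve).
rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 10, 200))
s_train = np.sin(x_train / 2.0)

# Approximate the smooth by a polynomial regression of s on x.
degree = 5  # assumed setting, chosen for illustration
coeffs = P.polyfit(x_train, s_train, degree)

# Evaluate the fitted polynomial at new (out-of-sample) x values.
x_new = np.array([2.5, 5.0, 7.5])
s_new = P.polyval(x_new, coeffs)

# Approximation error against the stand-in curve.
max_err = float(np.max(np.abs(s_new - np.sin(x_new / 2.0))))
print(s_new, max_err)
```

A degree that is too low cannot follow the curvature of the smooth, while a very high degree can make the regression numerically unstable, especially for new x values outside the range of the training data.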

 

The GAMFIT procedure, by default, uses Identity as the link function.  However, other link functions may be specified depending upon the underlying problem.  The types of link functions supported in GAMFIT are Identity, Inverse, Logit, and Logarithmic and are defined below.

 

We begin by defining z such that

 

                                                                                                                    

 

The linking functions are then specified as

 

Identity                                                                                                                             

 

Inverse                                                                                                                        

 

Logit                                                                                                                 

 

Logarithmic                                                                                                            
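In their standard generalized linear model form, these four link functions and their inverses can be sketched in Python as below.  The definitions are the textbook ones (link g maps the mean mu to the additive predictor eta); whether GAMFIT parameterizes them in exactly this way is an assumption, and the dictionary name LINKS is hypothetical.

```python
import numpy as np

# Each entry: (link g(mu) -> eta, inverse link g^{-1}(eta) -> mu).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "logit": (
        lambda mu: np.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + np.exp(-eta)),
    ),
    "log": (lambda mu: np.log(mu), lambda eta: np.exp(eta)),
}

# Round trip: applying a link and then its inverse should recover the mean.
mu = np.array([0.1, 0.5, 0.9])
round_trip = {name: inv(link(mu)) for name, (link, inv) in LINKS.items()}
for name, values in round_trip.items():
    print(name, values)
```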

 

In addition, alternative probability error distribution functions may be specified including

 

Gaussian (default)                                                                    

 

Binomial                                                           

 

Poisson                                                                                         

 

Gamma                                                                  

 

SCA WorkBench: A Graphical User Interface

 

SCA WorkBench provides a convenient graphical user interface to SCAB34S SPLINES for GAM modeling.  The WorkBench interface builds the data loading steps and commands based on the user’s menu selections.  The associated commands are then organized as an SCAB34S program file and submitted to the SCAB34S engine. 

The GAM modeling environment in WorkBench is organized by tabs shown below. 

 

The Model tab is used to specify the variables, variable types, and lagged components of the GAM model.  The Options tab sets the estimation limits placed on a GAM model, sets the linking function and error distribution type, and controls the detail of output and graphics that are produced.  The Validation tab provides settings to evaluate the performance of GAM model prediction and to compare the results with a linear regression model (OLS, MINIMAX, L1, LOGIT, or PROBIT estimation).  The Results tab displays the input/output from the model estimation, diagnostics, and forecasting.  The Graphs tab displays a variety of high resolution graphics such as time series plots, residual plots, autocorrelation plots, surface plots, and others.

Once the SCAB34S program file is created by SCA WorkBench, you may save the file for future reference or make changes directly to the commands and re-execute the script from SCA WorkBench.

 

Model Specification Tab

 

This tab is central to specifying the variables and lagged components of the GAM model.  Use the dropdown combo boxes to select your dependent variable and predictor variables.  Click on the Add button to add a predictor variable component to the model.  A categorical variable can be added by checking the Categorical checkbox before clicking on the Add button.  To allow a GAM model to be compared with a linear model, a categorical variable is automatically expanded into 0-1 binary variables, which are then used in both the GAM model and the linear comparison model.  Each component appears in the Model Components grid as it is added.  In the example below, DAYLOAD is selected as the dependent variable.  TEMPERTR is selected as an independent variable with a contemporaneous effect and a lag 1 effect.  The lags are specified in the Lags textbox.  Multiple lags for explanatory variable components can be specified using the word “TO” to separate contiguous lags (e.g., 0 TO 1) or commas to separate non-contiguous lags (e.g., 0, 1, 3).
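The lag syntax can be made concrete with a small Python sketch that expands a specification string into a list of lags.  The function parse_lags is hypothetical and is not WorkBench's own parser; it only illustrates how “TO” ranges and comma-separated lags combine.

```python
import re

def parse_lags(spec):
    """Expand a lag specification such as '0 TO 1, 3' into [0, 1, 3]."""
    lags = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        match = re.fullmatch(r"(\d+)\s+TO\s+(\d+)", part, flags=re.IGNORECASE)
        if match:
            start, stop = int(match.group(1)), int(match.group(2))
            if stop < start:
                raise ValueError(f"descending range in lag spec: {part!r}")
            lags.extend(range(start, stop + 1))  # contiguous lags, inclusive
        else:
            lags.append(int(part))  # a single lag; raises ValueError if invalid
    return sorted(set(lags))

print(parse_lags("0 TO 1"))
print(parse_lags("0, 1, 3"))
print(parse_lags("0 TO 1, 3"))
```

Under this reading, “0, 1, 3” and “0 TO 1, 3” describe the same set of lags.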

 

A component may be deleted or modified by placing your cursor on the specific row of the Model Components grid and then by clicking on the Del or Edit buttons.  If you click on Edit, the Add button will be replaced by the Mod button.  You may make changes using the dropdown combo box for the independent variable and other components in the Specification frame.  Click on the Mod button to complete the modification.

 

 

The features of the Model specification tab are presented below.

 

Menu Item

Description

Specification Frame

This frame organizes various controls that you may use to specify GAM model components including the dependent variable, independent variables, and lag coefficients.

 

If a categorical variable is specified for an independent variable, the GAMFIT routine will automatically identify it as categorical when it is processed and expand it into 0-1 binary variables.

 

 

Dependent Variable

Use this drop-down list to specify the series that you wish to analyze.

 

 

Logit Checkbox

Specifies that the dependent variable is a 0-1 variable.  When specified, the GAM model estimates the probability of success/failure based on the independent variables in the model using the logit linking function.

 

 

Categorical Checkbox

Specifies that the independent variable is a categorical variable.  When specified, the application will automatically determine the number of categories (which must be coded as integers) and expand the categorical variable into binary (0-1) variables.

 

 

Independent Variable

Use this drop-down list to specify a predictor or categorical variable component in the model.

 

 

Lags

Specifies the lag parameters associated with random variables or categorical variables.  A categorical variable may contain more than one lag parameter; however, only one lag specification may be added to the model at a time.  For random variables, multiple lag parameters may be added to the model as a group.  Multiple lags may be specified using the “TO” keyword to separate contiguous lags.  Individual lags may be separated by commas.  For example, the user could specify lags 0, 1, and 3 as “0, 1, 3” or as “0 TO 1, 3”.

 

 

D.F. (NL fit)

Specifies the number of degrees of freedom used to smooth the variable.  Setting the degrees of freedom to 1 restricts the variable to a linear effect.  The default is 3 (cubic).

 

 

Add

Clicking on Add appends a new component to the GAM model which is displayed in the model component grid.  Multiple instances of the same independent variable may be added to the model as long as the lag operators are unique.  For example, in the above form, the user could specify TEMPERTR{0} and TEMPERTR{1} components separately.

 

 

Model Components Frame

The model components frame organizes form controls to display the GAM model components in a grid format, as well as to edit and delete model components.

 

 

Model Component Grid

The components of the GAM model and their attributes are displayed in this grid.  The first column displays the independent variable name, the second column displays the individual or grouped lag operators within braces, and the third column indicates whether the independent variable is predetermined as a predictor or categorical.  The fourth column indicates the number of degrees of freedom for smoothing, and the fifth column indicates that a categorical variable is specified and that the number of unique categories will be determined by the program.

 

 

Edit

The user can modify a model component by first placing the mouse cursor on the grid row of interest and then clicking on the Edit button.  The Specification Frame will reflect the current attributes of the model component and the Add button will be replaced by the Mod button.  Make the necessary changes in the Specification Frame and then click on the Mod button to complete the changes.

 

 

Del

The user can delete a model component by placing the mouse cursor on the grid row of interest and then clicking on the Del button.

 

 

Clear

Clears all model components from the model component grid.

 

 

Save

Saves the information in the model component grid to a specified tab-delimited file.

 

 

Recall

Recalls the model component grid information from a specified tab-delimited file created with the Save option above.

 

 

Set Data Range Frame

This frame organizes form controls related to how the data is indexed (by date or none), and what data span is modeled and analyzed.

 

 

Date Variable

Use this drop-down list to specify the date variable associated with your series.  If your SCA Data Macro contains a variable named "DATE", it is automatically assigned by SCA WorkBench.

 

If you have an alternative index variable or date variable, you may select it from the drop-down list.  If your SCA Data Macro does not contain a DATE variable, leave the dropdown list empty.  WorkBench will then use the observation number as a date index.

 

If your time series has more than 10,000 observations, WorkBench will not use your DATE variable for indexing.  Instead, the observation number will be used.

 

 

Begin Span

Use the Begin drop-down list to omit observations from the beginning of a time series being analyzed. 

 

 

End Span

Use the End drop-down list to omit observations from the back of a time series being analyzed.

 

 

Back

Depending on the tab you are currently working in, clicking on the Back button will move you one tab to the left.  If you are in the Model tab, you will move to the GAM Data Viewer dialog box where you may choose a new SCA data macro or leave the GAM Modeling Environment.

 

 

Exit

Exits the GAM modeling environment.

 

 

Execute

Executes GAM model estimation, validation, linear model comparison, diagnostics, and graphs by submitting a dynamically created program script to SCAB34S SPLINES.  When completed, you will automatically be placed in the Results tab.

 

 

Options Tab

 

The Options tab sets the estimation limits placed on a GAM model, controls the detail of output and graphics that is produced, and allocates the workspace size of the SCAB34S SPLINES product.  More estimation options are available in the GAMFIT matrix subroutines that are not exposed in this GAM Modeling Environment interface.  The user may employ these other options by directly editing the B34S script generated by WorkBench. 

 

 

 

Menu Item

Description

GAM Estimation Limits Frame

This frame organizes various controls that set options in GAM model estimation.  Here, the user specifies the convergence tolerance for inner and outer looping, and the maximum number of iterations for back-fitting and local scoring.

 

 

Convergence Tolerance (Inner Loop)

Sets the convergence tolerance for inner looping in the GAM smoothing algorithm.  The default value is 0.1D-8 (Fortran double-precision notation for 1.0E-9).

 

 

Convergence Tolerance (Outer Loop)

Sets the convergence tolerance for outer looping in the GAM smoothing algorithm.  The default value is 0.1D-8 (Fortran double-precision notation for 1.0E-9).

 

 

Max. Iterations (Back-fitting)

Set the maximum number of iterations for back-fitting. The default is 1000.

 

 

Max. Iterations (Local scoring)

Set the maximum number of iterations for local scoring. The default is 1000.

 

 

Diagnostics and Graphics Frame

This frame organizes controls related to the amount of output produced for GAM estimation and diagnostics.  The diagnostic charts option produces surface (or leverage) charts for all variables that are used in the final model. 

 

 

Display Output for Model

Typically, you want to see the GAM model summary and the OLS model summary. 

 

 

Display Forecast Table

The forecast table displays the original series and the predicted series for both the GAM model and OLS models.  This can slow down the display of output for larger datasets.

 

 

Show Diagnostic Tables

Several diagnostics are available for the dependent variable and the residuals from the estimated models.  Among the diagnostics are statistical description tables, sample autocorrelation tables, and Hinich nonlinear testing.  The Hinich test will only be displayed for residual series with more than 50 cases.

 

 

Show Graphics

Several graphics are created including time plot of the dependent variable, Actual vs. Predicted, ACF and PACF plots, and modified Q-Statistic plot.

 

 

Workspace Size

The SCAB34S SPLINES product requires its workspace size to be set when the program is initiated.  The default workspace of 2000000 is adequate for moderately sized datasets.  The user may increase the workspace size if needed.  Note that the workspace size is limited by the amount of available RAM on the computer.

 

 

GAM Linking Function Frame

The GAM model requires the specification of a nonlinear link function to declare how the mean of the dependent variable is dependent upon the additive predictor.  The error distribution can also be specified.

 

 

GAM Linking Function

Specify the nonlinear link function between the mean of the dependent variable and the additive predictor.  The available options are identity, inverse, logit, logarithm, and Cox. The default is identity.

 

 

GAM Error Distribution

Specify the assumed error distribution for fitting. The available options are Gaussian, Binomial, Poisson, Gamma, and Cox.  The default is Gaussian.

 

 

 

Validation Tab

 

This tab allows you to evaluate the predictive performance of the GAM model and to validate it against a linear model estimated by OLS, MINIMAX, L1, Logit, or Probit.  A common problem with most nonlinear modeling methods is over-fitting.  Models that over-fit the data often perform well within the sample but do substantially worse when predicting out of sample.  Comparing in-sample fit and out-of-sample prediction performance allows the user to detect over-fitting.  If over-fitting is suspected, the number of degrees of freedom for GAM smoothing should be reduced for one or more variables of concern.  Also, since out-of-sample GAM prediction uses a polynomial regression to approximate the smoothing splines, the D.F. setting for the polynomial regression may also affect out-of-sample prediction performance.  A low setting may not approximate the curvature adequately, whereas a high setting may cause an estimation error.  A setting between 3 and 9 is reasonable for most situations.

 

 

 

The GAM modeling approach can be used effectively for both cross-sectional data and time series data.  The GAM user interface offered in WorkBench leverages its utility in time series applications by allowing the dependent variable and predictor variables to be lagged. 

 

The default validation setting compares the in-sample fit of the estimated GAM model against the in-sample fit of a simple OLS regression model.  All available observations are used to evaluate fit using root mean squared error (RMSE) and mean absolute percentage error (MAPE) criteria. 
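The two fit criteria can be written out in a few lines of Python.  The definitions below are the conventional ones for RMSE and MAPE; the function names and the sample numbers are hypothetical, and the exact conventions used in the SCAB34S output (for example, how zero actual values are handled in MAPE) are not documented here.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(100.0 * np.mean(np.abs((actual - predicted) / actual)))

# Hypothetical actual and predicted values.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(rmse(actual, predicted))
print(mape(actual, predicted))
```

Computing both criteria for the GAM predictions and for the linear model predictions on the same observations gives the side-by-side comparison described above.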

 

Other options are available to validate the GAM model.  For example, if the user is primarily interested in evaluating the fit of the model in the later part of the series, a holdout sample can be specified by typing the number of observations (or percentage) to be marked from the back of the series.  After specifying the holdout, the user can evaluate in-sample fit for the “holdout period” only by setting the option “Include holdout in estimation (compare holdout only)”.  The user also has two choices to evaluate the prediction performance of the model where the holdout period is not used in training the model. 
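The relationship between a holdout given as a count and as a percentage can be sketched in Python as follows.  The function split_holdout and its rounding rule are assumptions made for illustration; WorkBench may round differently.

```python
def split_holdout(n_obs, n_holdout=None, pct_holdout=None):
    """Reserve observations from the back of a series.

    Exactly one of n_holdout (a count) or pct_holdout (a percentage of
    the series length) must be given.  Returns the estimation index
    range, the holdout index range, and the holdout percentage.
    """
    if (n_holdout is None) == (pct_holdout is None):
        raise ValueError("give exactly one of n_holdout or pct_holdout")
    if n_holdout is None:
        # Assumed rounding rule: nearest whole observation.
        n_holdout = round(n_obs * pct_holdout / 100.0)
    if not 0 <= n_holdout < n_obs:
        raise ValueError("holdout must leave at least one observation")
    cut = n_obs - n_holdout
    return range(0, cut), range(cut, n_obs), 100.0 * n_holdout / n_obs

train_idx, hold_idx, pct = split_holdout(200, pct_holdout=10)
print(len(train_idx), len(hold_idx), pct)

train_idx2, hold_idx2, pct2 = split_holdout(200, n_holdout=30)
print(len(train_idx2), len(hold_idx2), pct2)
```

For an out-of-sample comparison the model would be estimated on the first index range only and evaluated on the second; for an in-sample comparison of the holdout period, the model is estimated on all observations and evaluated on the second range.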

 

As another validation criterion, the user can compare the improvement of a GAM model versus a regression model with the same right-hand side variables.  Diagnostics are produced for both the GAM and regression models.  If the dependent variable is nonlinear in its response to the regressor variables, the GAM model, which uses transformed (smoothed) regressors, should reveal significant improvement in model fit and out-of-sample forecasting performance.

 

A confusion matrix is produced for the GAM-Logit model and for the comparison linear model to evaluate the classification power of the models.  The user chooses how the probability cut-off value that classifies cases as positive or negative is determined for the final confusion matrix.  The system can set the cut-off automatically, using the maximum G-MEAN value as the criterion, or the user can supply specific cut-off values.  If GMEAN1 is used, the cut-off will slightly favor True-Positive classifications; if GMEAN2 is used, the cut-off will weigh True-Positive and True-Negative classifications equally.  Since the choice of cut-off probability thresholds is subjective, a table of ratio statistics for a range of cut-off probability values is also provided in the output.
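Choosing a cut-off by maximizing a G-mean can be sketched in Python as below.  The exact GMEAN1 and GMEAN2 formulas used by the product are not given in this document, so the sketch assumes one common balanced definition, the square root of sensitivity times specificity.  All function names and data values are hypothetical.

```python
import numpy as np

def confusion_counts(y_true, prob, cutoff):
    """Return (TP, FN, FP, TN) when cases with prob >= cutoff are called positive."""
    pred = prob >= cutoff
    pos = y_true == 1
    tp = int(np.sum(pred & pos))
    fn = int(np.sum(~pred & pos))
    fp = int(np.sum(pred & ~pos))
    tn = int(np.sum(~pred & ~pos))
    return tp, fn, fp, tn

def g_mean(tp, fn, fp, tn):
    """Assumed definition: sqrt(sensitivity * specificity)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity * specificity) ** 0.5

def best_cutoff(y_true, prob, cutoffs):
    """Pick the cut-off with the largest G-mean (first one wins on ties)."""
    scores = [g_mean(*confusion_counts(y_true, prob, c)) for c in cutoffs]
    i = int(np.argmax(scores))
    return cutoffs[i], scores[i]

# Hypothetical outcomes and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.45, 0.90])

cutoff, score = best_cutoff(y_true, prob, [0.25, 0.5, 0.75])
print(cutoff, score)
print(confusion_counts(y_true, prob, cutoff))
```

Tabulating the counts and ratios over a grid of cut-offs, rather than only reporting the maximum, corresponds to the table of ratio statistics mentioned above.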

 

 

Menu Item

Description

Validation Settings Frame

This frame organizes controls for specifying a holdout sample for forecast performance and model validation.  It also provides controls for the user to specify the type of validation for in-sample or out-of-sample forecasting.

 

 

# to holdout

Specifies the number of observations that are to be reserved from the back of the dependent variable for evaluating forecast performance.  The percentage of the holdout sample relative to the series length is computed and is displayed in % to holdout.

 

 

% to holdout

Specifies the size of the holdout sample as a percentage of the length of the dataset.  The actual number of observations reserved from the back of the series is computed and displayed in # to holdout.

 

 

Compare all obs

Evaluate the in-sample fit of the model for all observations.

 

 

Compare holdout only for in-sample fit

Evaluate the in-sample fit of the model for the defined holdout sample only.

 

 

Compare holdout for out-of-sample fit

Evaluate the out-of-sample forecasts defined by the holdout sample.  The model is estimated using observations up to the first forecast origin only.

 

 

OLS Method Comparison Frame

This frame organizes controls to validate the GAM model against a regression model with the same right-hand-side variables used in the GAM model. 

 

 

Logistic Method Comparison Frame

This frame organizes controls to validate the logistic GAM model against a Logit or Probit model with the same right-hand side variables used in the logistic GAM model. 

 

 

Perform comparison

By default, the GAM model is compared with a simple OLS regression model if the dependent variable is random.  A comparison is not automatically performed if the dependent variable is specified as a logit (0-1) variable.

 

 

OLS model

Estimates a regression model using the ordinary least squares (OLS) method.

 

 

MINIMAX model

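Estimates a regression model using the MINIMAX method, which minimizes the maximum absolute residual, $\max_i |e_i|$, where $e_i$ is the residual for observation $i$.  This estimation method is more sensitive to outliers than OLS.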

 

 

L1 model

Estimates a regression model using the L1 method, which minimizes the sum of absolute residuals, $\sum_i |e_i|$, where $e_i$ is the residual for observation $i$.  This estimation method is not as sensitive to outliers as OLS or MINIMAX.

 

 

Logistic model

Estimates a logistic regression model in comparison to a logistic GAM model. 

 

 

Probit model

Estimates a probit regression model in comparison to a logistic GAM model.

 

 

Probability thresholds

The threshold values for classifying a predicted case as a positive or negative instance. 

 

 

 

Results Tab

 

The results tab provides a convenient facility to view output from GAM model estimation.  It also allows you to view the input commands for SCAB34S SPLINES execution.  If there are errors during estimation, you can view the log file for a detailed account of all commands executed and error messages.

 

After the user executes the GAM model application by clicking on the Execute button, SCAB34S SPLINES will display a graph of the actual versus fitted data.  This indicates that the GAMFIT procedure has completed.  The user should click anywhere on the graph (an example is shown below) to close it. 

 

 

After the graph disappears, the user will be placed on the Results tab of the GAM Modeling environment where the output is listed.

 

 

 

Menu Item

Description

View GAM Output File

Displays the GAM modeling results and tabulated diagnostics.

 

 

View GAM Input Commands

Displays the input commands submitted to SCAB34S SPLINES.  You can modify the commands directly in this window and submit the modified command file by clicking on the Execute button.

 

 

View GAM Log File

Displays a detailed command and error log for jobs submitted to SCAB34S SPLINES.

 

 

Print

Send information displayed in the viewer to the printer.

 

 

Save

Saves the information in the viewer to a file.  You may want to use this feature to save the modeling script so that you can execute it later from the System -> Run SCA with Macro menu, or the System -> Run SCAB34S Program File menu.

 

 

Execute

While you are in the Results tab, if you click on Execute, you will send the information in the viewer to SCAB34S SPLINES for processing. 

 

 

Graphs Tab

 

The Graphs tab provides a facility to view the high-resolution plots that were generated.  If you previously selected the Show/Create Graphs option, the individual graphs will initially be displayed on screen.  When you click on a graph, the next generated graph will appear, until all graphs have been shown.  As the graphs are displayed, they are also saved as Windows Metafiles (WMF) using fixed names such as “yvar.wmf” or “acfa.wmf”.

 

 

You can review all created graphic files by selecting the graph from the set of radio buttons provided in the small tabbed area to the left of the viewer control.  In the example above, we are viewing the OLS and GAM residuals overlaid on each other.  The name of the graphic file (gam_res.wmf) is displayed for reference.  Since the graphs are saved to fixed file names, they are overwritten each time you generate a new set of graphs from the GAM modeling environment.  If you wish to keep a graphic file for future reference, use the Save button on this tab to copy the file to a new name.  Do not change the file extension: the Save button only copies the file under a new name and does not convert it to another format.  You can view the renamed files by using the Load Graph from File facility.  You may send the graph to the printer by clicking on the Print command button.  If you double-click on the graph image, it will load in the external program that is associated with WMF files on your computer (e.g., Windows FAX/Picture Viewer).

 

If you elected to create diagnostic charts, curvature plots of the transformed predictor variables in the GAM model are displayed relative to the dependent variable.  Since a variable number of charts may be created based on number of explanatory variables, the file names are sequenced from SCOEF___1 – SCOEF__## and may be viewed by selecting the file name from the list box provided.  An example of a curvature chart is displayed below:

 

 

In the above graph, we are viewing the curvature of smoothed temperature.  The SCOEF*.wmf files are overwritten each time a new set of graphs is generated; a file should therefore be renamed or moved to another location if the graph is to be kept for future reference.

 

 

 



[1] The text Specifying and Diagnostically Testing Econometric Models, by Houston H. Stokes (Greenwood Press, 1997), documents the basic B34S capability.  A comprehensive document covering the B34S matrix command facilities is under preparation.