Multivariate Adaptive Regression Spline Modeling Using
SCAB34S SPLINES and SCA WorkBench

Multivariate Adaptive Regression Spline (MARSPLINE) modeling and estimation is provided by the B34S® ProSeries Econometric System and SCAB34S SPLINES software products.  SCA WorkBench provides the graphical user interface for a MARSPLINE modeling and validation environment within the B34S program suite.

SCAB34S SPLINES provides a subset of the capabilities in the B34S® ProSeries Econometric System, and we refer to these products interchangeably within this document.  SCAB34S SPLINES runs as an integrated component of SCA WorkBench.  The WorkBench product is a companion to the SCA Statistical System and SCAB34S software, providing a graphical user interface for several applications including MARSPLINE modeling for cross-sectional data, MARSPLINE-Probit modeling for ordered data, and TSMARS (or ASTAR) for time series data.  MARSPLINE and TSMARS models may be validated by comparing their in-sample and out-of-sample predicted values with those of linear regression models estimated using OLS, MINIMAX, or L1 methods.  MARSPLINE-Probit models may be validated by comparing their confusion matrices and lift-gains results with those of an OLS, Probit, or Logistic model.

The SCAB34S SPLINES product provides a number of procedures to perform common data manipulation tasks, organizational tasks, and statistical/econometric analysis tasks.  It also contains a comprehensive matrix programming language that may be used to customize procedures for specialized use.  SCA WorkBench automatically specifies the command script executed in the SCAB34S SPLINES product based on menu selections.  A command file is then executed in the SCAB34S engine and the results are read back into WorkBench for examination.  The user may save the program file and modify the command script to address additional analysis requirements that may arise.

A major assumption of any linear process is that the coefficients are stable across all levels of the explanatory variables and, in the case of a time series model, across all time periods.  The MARSPLINE model is a very useful method of analysis when it is suspected that the model's coefficients have different optimal values across different levels of the explanatory variables.  There are theoretical reasons to expect this in many applications, including energy, finance, economics, social science, and manufacturing.

SCAB34S SPLINES integrates the General Public License (GPL) code from Trevor Hastie and Robert Tibshirani for multivariate adaptive regression spline (MARSPLINE) modeling.  Hastie and Tibshirani coded these routines from scratch; they do not use any of Friedman’s routines.

Friedman (1991) introduced the multivariate adaptive regression splines (MARS) methodology.  He shows how to systematically identify and estimate a MARS model whose coefficients differ based on the levels of the explanatory variables.  Lewis and Stevens (1991) discussed the application of MARS models to lagged values of a time series as Adaptive Spline Threshold Autoregression (ASTAR).  In later published work, the ASTAR model is also referred to more generally as Time Series Multivariate Adaptive Regression Splines (TSMARS).  The SCAB34S SPLINES product is applicable to data organized in both cross-sectional and time series formats.

The breakpoints or thresholds that define a change in a model coefficient are termed spline knots, and the resulting model can be thought of as similar to a piecewise regression.  An advantage of the MARSPLINE approach is that the spline knots are determined automatically by the procedure.  In addition, complex nonlinear interactions between variables can be specified.  The MARSPLINE model is particularly powerful in situations where there are large numbers of explanatory variables and low-order interaction effects.  The equation switching model, in which the slope of the model abruptly changes for a given value of the X variable, is a special case of the MARSPLINE model.  The MARSPLINE methodology can detect and fit models in situations where there are distinct breaks in the data, such as are found if there is a change in the underlying probability density function of the coefficients, and where there are complicated variable interactions.  After an overview of MARSPLINE modeling, some examples will be presented to illustrate the use of these procedures.
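The piecewise-regression view of spline knots can be illustrated with a small Python sketch.  It is not the SCAB34S SPLINES implementation; the function names, the knot at 65, and the coefficients are hypothetical values chosen only to show how a hinge-type basis function changes the slope at a knot.

```python
import numpy as np

def hinge(x, knot, direction=1):
    """Return max(0, x - knot) for direction=+1, or max(0, knot - x) for direction=-1."""
    x = np.asarray(x, dtype=float)
    return np.maximum(0.0, direction * (x - knot))

def piecewise_predict(x, intercept, terms):
    """Evaluate intercept + sum of coef * hinge(x, knot, direction) over all terms."""
    y = np.full(np.shape(x), intercept, dtype=float)
    for coef, knot, direction in terms:
        y += coef * hinge(x, knot, direction)
    return y

# Hypothetical model: flat below a knot at 65, slope 2.0 above it.
x = np.array([50.0, 65.0, 70.0, 80.0])
y_hat = piecewise_predict(x, intercept=100.0, terms=[(2.0, 65.0, 1)])
```

An equation switching model corresponds to a single term of this kind; interactions would be represented by products of such hinge terms on different variables.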

Model specification in SCA WorkBench is intuitive and easy to use.  It is also quite flexible, providing options to control knot optimization, perform model validation, and evaluate forecast performance.

SCA WorkBench: A Graphical User Interface

SCA WorkBench provides a convenient graphical user interface to SCAB34S SPLINES for MARSPLINE modeling.  The WorkBench interface builds the data loading steps and MARSPLINE commands based on the user’s menu selections.  The associated commands are then organized as an SCAB34S program file and submitted to the SCAB34S engine. 

The MARSPLINE modeling environment in WorkBench is organized by tabs shown below. 

[Figure MARS017: tabs of the MARSPLINE modeling environment]

 

The Model tab is used to specify the variables, variable types, and lagged components of the MARSPLINE model.  The Options tab sets the estimation limits placed on a MARSPLINE model, and controls the detail of output and graphics that are produced.  The Validation tab provides settings to evaluate the performance of MARSPLINE model prediction and to compare the results with a linear regression model using simple OLS, MINIMAX, or L1 estimation.  The Results tab displays the output from the model estimation, diagnostics, and forecasting.  The Graphs tab displays a variety of high-resolution graphics such as time series plots, residual plots, autocorrelation plots, and others.

Once the SCAB34S program file is created by SCA WorkBench, you may save the file for future reference or make changes directly to the commands and re-execute the script from SCA WorkBench.

Model Specification Tab

 

This tab is central to specifying the variables and lagged components of the MARSPLINE model.  Use the dropdown combo boxes to select your dependent variable and independent variables.  Click on the Add button to add an independent variable component to the model.  The components will appear in the Model Components grid as they are added.  In the example below, DAYLOAD is selected as the dependent variable.  TEMPERTR is selected as an independent variable with a contemporaneous effect and a lag 1 effect.  The lags are specified in the Lags textbox.  Multiple lags for explanatory variable components can be specified using the word “TO” to separate contiguous lags (e.g., 0 TO 1) or commas to separate non-contiguous lags (e.g., 0, 1, 3). 
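As an illustration of how a lag specification of this form can be read, the Python sketch below expands a string such as “0 TO 1, 3” into the individual lags.  It is not WorkBench's own parser; the function name parse_lags and its error handling are assumptions made for the example.

```python
def parse_lags(spec):
    """Expand a lag specification such as '0 TO 1, 3' into a sorted list of lags."""
    lags = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        tokens = part.upper().split()
        if len(tokens) == 3 and tokens[1] == "TO":
            # Contiguous range written as "start TO end" (inclusive).
            start, end = int(tokens[0]), int(tokens[2])
            if end < start:
                raise ValueError(f"descending range in lag spec: {part!r}")
            lags.update(range(start, end + 1))
        elif len(tokens) == 1:
            # A single lag separated from the others by commas.
            lags.add(int(tokens[0]))
        else:
            raise ValueError(f"unrecognized lag spec: {part!r}")
    return sorted(lags)
```

Under this reading, “0 TO 1” gives lags 0 and 1, matching the TEMPERTR example, and “0, 1, 3” and “0 TO 1, 3” give the same three lags.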

 

A component may be deleted or modified by placing your cursor on the specific grid row and then by clicking on the Del or Edit buttons.  If you click on Edit, the Add button will be replaced by the Mod button.  You may make changes using the dropdown combo box for the independent variable and other components in the Specification frame.  Click on the Mod button to complete the modification.

 

 

The features of the Model specification tab are presented below.

 

Menu Item

Description

Specification Frame

This frame organizes various controls that you may use to specify MARSPLINE model components including the dependent variable, independent variables, and lag coefficients.

 

If a categorical variable is specified for an independent variable, the MARSPLINE routine will automatically identify it as categorical when it is processed.

 

 

Dependent Variable

Use this drop-down list to specify the series that you wish to analyze.

 

 

Probit Checkbox

Specifies that the dependent variable is a 0-1 variable.  When specified, the MARSPLINE model estimates the probability of success/failure based on the independent variables in the model.

 

 

Categorical Checkbox

Specifies that the dependent variable is a categorical variable.  When specified, the application will automatically determine the number of categories (must be coded as integer) and expand the categorical variable into multiple binary (0-1) variables.

 

 

Independent Variable

Use this drop-down list to specify a random or categorical regressor variable component in the model.

 

 

Lags

Specifies the lag parameters associated with a random variable or a categorical variable.  A categorical variable may contain more than one lag parameter; however, only one lag specification may be added to the model at a time.  For random variables, multiple lag parameters may be added to the model as a group.  Multiple lags may be specified using the “TO” keyword to separate contiguous lags.  Individual lags may be separated by commas.  For example, the user could specify lags 0, 1, and 3 as “0, 1, 3” or as “0 TO 1, 3”.

 

 

Add

Clicking on Add appends a new component to the MARSPLINE model, which is displayed in the model component grid.  Multiple instances of the same independent variable may be added to the model as long as the lag operators are unique.  For example, in the above form, the user could have opted to specify the TEMPERTR{0} and TEMPERTR{1} components separately.

 

 

Model Components Frame

The model components frame organizes form controls to display the MARSPLINE model components in a grid format, as well as to edit and delete model components.

 

 

Model Component Grid

The components of the MARSPLINE model and their attributes are displayed in this grid.  The first column displays the independent variable name.  The second column displays the individual or grouped lag operators within braces.  The third column indicates whether the independent variable is predetermined as random or categorical.  The fourth and fifth columns are not used at this time.

 

 

Edit

The user can modify a model component by first placing the mouse cursor on the grid row of interest and then clicking on the Edit button.  The Specification Frame will reflect the current attributes of the model component and the Add button will be replaced by the Mod button.  Make the necessary changes in the Specification Frame and then click on the Mod button to complete the changes.

 

 

Del

The user can delete a model component by placing the mouse cursor on the grid row of interest and then clicking on the Del button.

 

 

Clear

Clears all model components from the model component grid.

 

 

Save

Saves the information in the model component grid to a specified tab-delimited file.

 

 

Recall

Recalls the model component grid information from a specified tab-delimited file created with the Save option above.

 

 

Set Data Range Frame

This frame organizes form controls related to how the data are indexed (by a date variable or by observation number) and what data span is modeled and analyzed.

 

 

Date Variable

Use this drop-down list to specify the date variable associated with your series.  If your SCA Data Macro contains a variable named "DATE", it is automatically assigned by SCA WorkBench.

 

If you have an alternative index variable or date variable, you may select it from the drop-down list.  If your SCA Data Macro does not contain a DATE variable, leave the dropdown list empty.  WorkBench will then use the observation number as a date index.

 

If your time series has more than 10,000 observations, WorkBench will not use your DATE variable for indexing.  Instead, the observation number will be used.

 

 

Begin Span

Use the Begin drop-down list to omit observations from the beginning of a time series being analyzed. 

 

 

End Span

Use the End drop-down list to omit observations from the end of a time series being analyzed.

 

 

Back

Depending on the tab you are currently working in, clicking on the Back button will move you one tab to the left.  If you are in the Model tab, you will move to the MARSPLINE Data Viewer dialog box where you may choose a new SCA data macro or leave the MARSPLINE Modeling Environment.

 

 

Exit

Exits the MARSPLINE modeling environment.

 

 

Execute

Executes MARSPLINE model estimation, validation, linear model comparison, diagnostics, and graphs by submitting a dynamically created program script to SCAB34S SPLINES.  When completed, you will automatically be placed in the Results tab.

 

 


Options Tab

 

The Options tab sets the estimation limits placed on a MARSPLINE model, controls the detail of output and graphics that is produced, and allocates the workspace size of the SCAB34S SPLINES product.  More estimation options are available in the MARSPLINE matrix subroutine that are not exposed in this MARSPLINE Modeling Environment interface.  The user may employ these other options by directly editing the B34S script generated by WorkBench. 

 

 

 

Menu Item

Description

MARSPLINE Estimation Limits Frame

This frame organizes various controls that set options in MARSPLINE model estimation.  Here, the user specifies the maximum number of basis functions (or spline knots) that may be included in the MARSPLINE model, the degrees of freedom (or penalty) imposed for knot optimization, and the maximum number of interactions that may occur between regressor variables.  The minimum span between knot placements is automatically determined by MARSPLINE.

 

 

Maximum Number of Knots

The maximum number of knots limits how many knots may be included in the MARSPLINE model during its selection process.  Increasing this setting allows a greater number of basis functions to be evaluated before pruning and sometimes results in a better performing model.  The default is 5.

 

 

DF (knot optimization)

Sets the number of degrees of freedom charged for unrestricted knot optimization.

 

 

Maximum Interactions

Sets the maximum number of interactions between variables for any given basis function.

 

 

Minimum Span between Knots

The minimum span allowed between spline knots is automatically determined by the MARSPLINE subroutine when this option is set to 0.  The number of regressor variables and the number of observations in the series determine the minimum span between knots.  This setting currently cannot be modified by the user.

 

 

Diagnostics and Graphics Frame

This frame organizes controls related to the amount of output produced for MARSPLINE estimation and diagnostics.  The math form option displays the model in a form that is easily transferred into a standard programming language.  The alternative is a model displayed in summation form.  The contribution chart option produces contribution (or leverage) charts for all variables that are used in the final model.  If the variable is additive (no interactions with other variables), the companion variables are set to their median values.  If interactions exist, three charts are produced that set the companion variables to their minimum, median, and maximum values.  The vertical axis displays the predicted values as the target variable ranges from its minimum to its maximum while all other variables are held constant.

 

 

Display Output for Model

Typically, you want to see the MARSPLINE model summary and the OLS model summary. 

 

 

Display Forecast Table

The forecast table displays the original series and the predicted series for both the MARSPLINE model and OLS models.

 

 

Show Diagnostic Tables

Several diagnostics are available for the dependent variable and the residuals from the estimated models.  Among the diagnostics are statistical description tables, sample autocorrelation tables, and Hinich nonlinearity testing.  The Hinich test will only be displayed for residual series with more than 50 cases.

 

 

Show Graphics

Several graphics are created, including a time plot of the dependent variable, an Actual vs. Predicted plot, ACF and PACF plots, and a modified Q-statistic plot.

 

 

Workspace Size

The SCAB34S SPLINES product requires its workspace size to be set when the program is initiated.  The default workspace of 2000000 is adequate to handle moderate-size datasets.  The user may increase the workspace size if needed.  Please note that the workspace limit is imposed by the amount of available RAM on the computer.

 

 

 


Validation Tab

 

This tab allows you to evaluate the performance of MARSPLINE model prediction and validate the MARSPLINE model against a linear regression model estimated using simple OLS, MINIMAX, or L1 methods.  A common problem with most nonlinear modeling methods is over-fitting.  Models that over-fit the data often perform well within the sample, but do substantially worse when predicting out of sample.  The MARSPLINE routine addresses this problem in several ways, including penalizing overly complex models, controlling the minimum span between spline knots, and allowing the user to set the number of degrees of freedom charged for unrestricted knot optimization.

 

 

 

Although the MARSPLINE modeling approach is well suited to cross-sectional data, it can also be employed successfully on time series data.  Lewis and Stevens (1991) discuss the application of MARSPLINE models (ASTAR) to lagged values of Y as an alternative to Self-Exciting Threshold Autoregressive (SETAR) models (see Tong, 1983).  The MARSPLINE user interface in WorkBench supports time series applications by allowing the dependent variable to be lagged in the model, which addresses some issues related to serial correlation in the data.
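The idea of using lagged values of Y as regressors can be sketched in Python as follows.  This is only an illustration of how the data would be aligned; the function name lag_matrix and the sample values are hypothetical, and WorkBench performs the equivalent step internally based on the Lags setting.

```python
import numpy as np

def lag_matrix(series, lags):
    """Build columns series[t - lag] for each lag, dropping rows without full history."""
    series = np.asarray(series, dtype=float)
    max_lag = max(lags)
    rows = len(series) - max_lag
    if rows <= 0:
        raise ValueError("series is too short for the requested lags")
    cols = [series[max_lag - lag : max_lag - lag + rows] for lag in lags]
    return np.column_stack(cols)

y = np.array([10.0, 11.0, 13.0, 12.0, 15.0])
X = lag_matrix(y, lags=[1, 2])   # regressors y[t-1] and y[t-2]
target = y[2:]                   # aligned dependent variable y[t]
```

The first max(lags) observations are lost because they have no complete lag history, so the usable sample is shorter than the original series.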

 

The default validation setting compares the in-sample fit of the estimated MARSPLINE model against the in-sample fit of a simple OLS regression model.  All available observations are used to evaluate fit using root mean squared error (RMSE) and mean absolute percentage error (MAPE) criteria. 
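For reference, the two fit criteria can be computed as in the Python sketch below.  The function names are illustrative, and expressing MAPE in percent is an assumption; the output produced by SCAB34S SPLINES may scale it differently.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape(actual, predicted):
    """Mean absolute percentage error in percent; undefined if any actual value is zero."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if np.any(actual == 0):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return float(100.0 * np.mean(np.abs((actual - predicted) / actual)))
```

RMSE is expressed in the units of the dependent variable and weights large errors more heavily, while MAPE is scale-free, so the two criteria can rank models differently.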

 

Other options are available to validate the MARSPLINE model.  For example, if the user is primarily interested in evaluating the fit of the model in the later part of the series, a holdout sample can be specified by typing the number of observations (or percentage) to be reserved from the end of the series.  After specifying the holdout, the user can evaluate in-sample fit for the “holdout period” only by setting the option “Include holdout in estimation (compare holdout only)”.  The user also has two choices for evaluating the prediction performance of the model when the holdout period is not used in training the model: estimate the model up to the first forecast origin only, or re-estimate the model at each forecast origin.  Note that re-estimating the model at each forecast origin is computationally intensive, and the spline knot placements and model structure may change across the sample period because knot placement and variable inclusion are performed automatically during MARSPLINE estimation.
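The difference between the two out-of-sample choices can be sketched in Python.  The example uses ordinary least squares as a stand-in estimator because it is simple to write down; it does not reproduce MARSPLINE estimation, and all function names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients with an intercept (stand-in for any estimator)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_ols(beta, X):
    """Predictions from coefficients returned by fit_ols."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ beta

def holdout_forecasts(X, y, n_holdout, refit_each_origin=False):
    """Predict the last n_holdout rows using a model trained only on earlier rows."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    first_origin = len(y) - n_holdout
    beta = fit_ols(X[:first_origin], y[:first_origin])
    preds = []
    for t in range(first_origin, len(y)):
        if refit_each_origin:
            # Re-estimate using every observation before the forecast target.
            beta = fit_ols(X[:t], y[:t])
        preds.append(predict_ols(beta, X[t:t + 1])[0])
    return np.array(preds)
```

With refit_each_origin=False the model is estimated once, up to the first forecast origin.  With refit_each_origin=True the estimator runs once per holdout observation, which is why this choice costs more computation.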

 

The MARSPLINE model can be visually examined as a multi-dimensional surface.  The boundaries of this estimated surface are defined by the response of the dependent variable to a range of levels in the regressor variables.  Problems arise from out of sample forecasting when the values of the regressor variables fall outside their minimum and maximum range in model estimation.  When this occurs, the dependent variable’s responses to these new levels are not known.  Consequently, it is important that the MARSPLINE model be re-examined occasionally and the model retrained if new maximums or minimums are present in the updated data.
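A simple check for this situation is sketched below in Python.  It is an illustrative helper, not a WorkBench feature; the function name and the array layout (observations in rows, regressors in columns) are assumptions.

```python
import numpy as np

def out_of_range_columns(X_train, X_new):
    """Return indices of columns in X_new with values outside the training min/max."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    outside = (X_new < lo) | (X_new > hi)
    return np.flatnonzero(outside.any(axis=0)).tolist()
```

A non-empty result indicates that the updated data contain new minimums or maximums for those regressors, which suggests re-examining and possibly retraining the model.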

 

As another validation criterion, the user can compare the improvement of a MARSPLINE model versus a regression model with the same right-hand-side variables.  Diagnostics are produced for both the MARSPLINE and regression models.  If the dependent variable is nonlinear in its response to levels of the regressor variables, the MARSPLINE model should show a clear improvement in model fit and in out-of-sample forecasting performance criteria.

 

A confusion matrix is produced for the MARSPLINE-Probit model and the comparison linear model to evaluate the classification power of the models.  The user may choose how the probability cut-off value is determined for classifying positive and negative cases in the final confusion matrix.  The user can allow the system to set the probability cut-off automatically using the maximum G-MEAN value as the criterion, or can supply specific cut-off values.  If GMEAN1 is used, the cut-off will slightly favor True-Positive classifications, and if GMEAN2 is used, the cut-off will weigh True-Positive and True-Negative classifications equally.  Since the determination of cut-off probability thresholds is subjective, a table of ratio statistics for a range of cut-off probability values is also provided in the output.
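The cut-off selection can be sketched in Python as follows.  The formulas for the two G-means are assumptions based on common usage, since this document does not define them: GMEAN1 is taken as the square root of precision times recall, and GMEAN2 as the square root of sensitivity times specificity.  Classifying a case as positive when its probability is at or above the cut-off is also an assumption.

```python
import math

def confusion_counts(actual, prob, cutoff):
    """Return (TP, FP, TN, FN), classifying prob >= cutoff as positive (assumed rule)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, prob):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and a == 1:
            tp += 1
        elif predicted == 1 and a == 0:
            fp += 1
        elif predicted == 0 and a == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def gmeans(tp, fp, tn, fn):
    """Return (GMEAN1, GMEAN2) under the assumed definitions given in the lead-in."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return math.sqrt(precision * recall), math.sqrt(recall * specificity)

def best_cutoff(actual, prob, cutoffs, which=2):
    """Pick the cut-off with the largest G-mean; ties go to the larger cut-off."""
    scored = [(gmeans(*confusion_counts(actual, prob, c))[which - 1], c) for c in cutoffs]
    return max(scored)[1]
```

Evaluating gmeans over a grid of cut-offs gives a table comparable in spirit to the ratio statistics table mentioned above, which helps when the automatic choice needs to be overridden.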

 

Menu Item

Description

Validation Settings Frame

This frame organizes controls for specifying a holdout sample for forecast performance and model validation.  It also provides controls for the user to specify the type of validation for in-sample or out-of-sample forecasting.

 

 

# to holdout

Specifies the number of observations that are to be reserved from the back of the dependent variable for evaluating forecast performance.  The percentage of the holdout sample relative to the series length is computed and is displayed in % to holdout.

 

 

% to holdout

Specifies the size of the holdout sample as a percentage of the length of the dataset.  The actual number of observations reserved from the back of the series is computed and displayed in # to holdout.

 

 

Compare all obs

Evaluate the in-sample fit of the model for all observations.

 

 

Compare holdout only for in-sample fit

Evaluate the in-sample fit of the model for the defined holdout sample only.

 

 

Compare holdout for out-of-sample fit

Evaluate the out-of-sample forecasts defined by the holdout sample.  The model is estimated using observations up to the first forecast origin only.

 

 

OLS Method Comparison Frame

This frame organizes controls to validate the MARSPLINE model against a regression model with the same right-hand-side variables used in the MARSPLINE model. 

 

 

Logistic Method Comparison Frame

This frame organizes controls to validate the MARSPLINE-Probit model against a Logit or Probit model with the same right-hand-side variables used in the MARSPLINE model.

 

 

Perform comparison

By default, the MARSPLINE model is compared with a simple OLS regression model if the dependent variable is random.  A comparison is not automatically performed if the dependent variable is specified as a 0-1 (Probit) variable.

 

 

OLS model

Estimates a regression model using the ordinary least squares (OLS) method.

 

 

MINIMAX model

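Estimates a regression model using the MINIMAX method, which minimizes the largest absolute residual, $\max_t |e_t|$, where $e_t$ is the residual for observation $t$.  This estimation method is more sensitive to outliers than OLS.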

 

 

L1 model

Estimates a regression model using the L1 method, which minimizes the sum of absolute residuals, $\sum_t |e_t|$, where $e_t$ is the residual for observation $t$.  This estimation method is not as sensitive to outliers as OLS or MINIMAX.

 

 

Logistic model

Estimates a logistic regression model for comparison with a MARSPLINE-Probit model.

 

 

Probit model

Estimates a probit regression model for comparison with a MARSPLINE-Probit model.

 

 

Probability thresholds

The threshold values for classifying a predicted case as a positive or negative instance. 

 

 

Results Tab

The Results tab provides a convenient facility to view output from MARSPLINE model estimation.  It also allows you to view the input commands for SCAB34S SPLINES execution.  If there are errors during estimation, you can view the log file for a detailed account of all commands executed and error messages.

 

After the user executes the MARSPLINE model application by clicking on the Execute button, SCAB34S SPLINES will display a graph of the actual versus fitted data.  This indicates that the MARSPLINE procedure has completed.  The user should click anywhere on the graph to close it.

 

After the graph disappears, the user will be placed on the Results tab of the MARSPLINE Modeling environment where the output is listed.

 

 

 

Menu Item

Description

View MARSPLINE Output File

Displays the MARSPLINE modeling results and tabulated diagnostics.

 

 

View MARSPLINE Input Commands

Displays the input commands submitted to SCAB34S SPLINES.  You can modify the commands directly in this window and submit the modified command file by clicking on the Execute button.

 

 

View MARSPLINE Log File

Displays a detailed command and error log for jobs submitted to SCAB34S SPLINES.

 

 

Print

Send information displayed in the viewer to the printer.

 

 

Save

Saves the information in the viewer to a file.  You may want to use this feature to save the modeling script so that you can execute it later from the System -> Run SCA with Macro menu, or the System -> Run SCAB34S Program File menu.

 

 

Execute

While you are in the Results tab, if you click on Execute, you will send the information in the viewer to SCAB34S SPLINES for processing. 

 

 

 


Graphs Tab

 

The Graphs tab provides a facility to view high-resolution plots that were generated.  If you previously selected the Show/Create Graphs option, the individual graphs will initially be displayed on screen.  Each time you click on a graph, the next generated graph will appear, until all graphics have been created.  As the graphs are displayed, they are also saved as Windows Metafiles (WMF) using fixed names such as “yvar.wmf” or “acfa.wmf”.

 

 

You can review all created graphic files by selecting the graph from the set of radio buttons provided in the small tabbed area to the left of the viewer control.  In the example below, we are viewing the sample autocorrelations of the MARSPLINE model residuals.  The name of the graphic file (acfa.wmf) is displayed for reference.  Since the graphs are saved to fixed file names, they are overwritten each time you generate a new set of graphs from the MARSPLINE modeling environment.  If you wish to save a graphic file for future reference, please use the Save button on this tab to copy the file to a new name.  Please do not change the file extension, because the Save button only copies the file under a new name; it does not convert it to a new format.  You can view those renamed files by using the Load Graph from File facility.  You may send the graph to the printer by clicking on the Print command button.  If you double-click on the graph image, it will load in the external program that is associated with WMF files on your computer (e.g., Windows FAX/Picture Viewer).

 

If you elected to create contribution charts, at least one contribution (leverage) chart will be generated for each variable used in the final model.  Since a variable number of charts may be created based on number of explanatory variables and interactions, the file names are sequenced from CChart01 – CChart## and may be viewed by selecting the file name from the listbox provided.  An example of a contribution chart is displayed below:

 

 

The contribution chart reveals how the predicted value (vertical axis) is influenced by a particular explanatory variable (horizontal axis) in the model when all other variables are held constant.  In the above graph, we are viewing the leverage of temperature at lag=0 when all other variables in the model are set to their median values.  The title in the chart includes information on the lag, the number of interaction terms, and the companion variable value setting.  The CChart*.wmf files are overwritten each time new charts are generated, so a file should be renamed or moved to another location if it is to be kept for future reference.
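The calculation behind a contribution chart of this kind can be sketched in Python.  This is an illustration under stated assumptions rather than the routine used by SCAB34S SPLINES: predict is any hypothetical function that maps a matrix of regressor values to predicted values, and the companion variables are held at their medians.

```python
import numpy as np

def contribution_curve(predict, X, column, n_points=25):
    """Predicted values as one column sweeps its observed range, others at medians."""
    X = np.asarray(X, dtype=float)
    grid = np.linspace(X[:, column].min(), X[:, column].max(), n_points)
    base = np.median(X, axis=0)
    # Start every row at the median profile, then vary only the target column.
    design = np.tile(base, (n_points, 1))
    design[:, column] = grid
    return grid, predict(design)
```

Plotting the returned grid on the horizontal axis against the predictions on the vertical axis gives a curve of the type described above.  Replacing the median with the minimum or maximum of the companion variables would give the additional charts produced when interactions exist.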