`cubist_rules()` is a way to generate a *specification* of a model
before fitting. The main arguments for the model are:

`committees`

: The number of sequential models included in the ensemble (similar to the number of trees in boosting).

`neighbors`

: The number of neighbors in the post-model instance-based adjustment.

These arguments are converted to their specific names at the
time that the model is fit. Other options and arguments can be
set using `parsnip::set_engine()`. If left to their defaults
here (`NULL`), the values are taken from the underlying model
functions. If parameters need to be modified, `update()` can be used
in lieu of recreating the object from scratch.

```r
cubist_rules(
  mode = "regression",
  committees = NULL,
  neighbors = NULL,
  max_rules = NULL
)

# S3 method for cubist_rules
update(
  object,
  parameters = NULL,
  committees = NULL,
  neighbors = NULL,
  max_rules = NULL,
  fresh = FALSE,
  ...
)
```

Argument | Description |
---|---|
`mode` | A single character string for the type of model. The only possible value for this model is `"regression"`. |
`committees` | A non-negative integer (no greater than 100) for the number of members of the ensemble. |
`neighbors` | An integer between zero and nine for the number of training set instances that are used to adjust the model-based prediction. |
`max_rules` | The largest number of rules. |
`object` | A Cubist model specification. |
`parameters` | A 1-row tibble or named list with |
`fresh` | A logical for whether the arguments should be modified in-place or replaced wholesale. |
`...` | Not used for `update()`. |

An updated `parsnip` model specification.

Cubist is a rule-based ensemble regression model. A basic model tree
(Quinlan, 1992) is created that has a separate linear regression model
corresponding to each terminal node. The paths along the model tree are
flattened into rules, and these rules are simplified and pruned. The parameter
`min_n` is the primary method for controlling the size of each tree, while
`max_rules` controls the number of rules.
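To make the idea of rules with attached linear models concrete, here is a minimal base R sketch. It is not the Cubist implementation: the names `toy_rules` and `predict_rules`, the conditions, and the coefficients are all hypothetical, and it assumes that when several rules cover a sample their linear predictions are averaged.

```r
# Each hypothetical rule pairs a condition with a linear model.
toy_rules <- list(
  list(applies = function(d) d$x <= 4, predict = function(d) 1 + 2.0 * d$x),
  list(applies = function(d) d$x > 4,  predict = function(d) 5 + 1.0 * d$x),
  list(applies = function(d) d$z > 0,  predict = function(d) 3 + 1.5 * d$x)
)

# Predict a single sample (a one-row data frame) by averaging the linear
# models of every rule whose condition holds (an assumption of this sketch).
predict_rules <- function(rules, d) {
  active <- Filter(function(r) isTRUE(r$applies(d)), rules)
  if (length(active) == 0) {
    return(NA_real_)
  }
  mean(vapply(active, function(r) r$predict(d), numeric(1)))
}

# Covered by the second and third rules: mean(5 + 6, 3 + 1.5 * 6)
predict_rules(toy_rules, data.frame(x = 6, z = 1))
#> [1] 11.5
```

Because rules come from flattened tree paths that are later simplified, their conditions can overlap, which is why the sketch allows more than one rule to apply to the same sample.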

Cubist ensembles are created using *committees*, which are similar to
boosting. After the first model in the committee is created, the second
model uses a modified version of the outcome data based on whether the
previous model under- or over-predicted the outcome. For iteration *m*, the
new outcome `y*` is computed using

$$y^*_{(m)} = y - \left(\hat{y}_{(m-1)} - y\right)$$

If a sample is under-predicted on the previous iteration, the outcome is adjusted so that the next time it is more likely to be over-predicted to compensate. This adjustment continues for each ensemble iteration. See Kuhn and Johnson (2013) for details.
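The committee adjustment can be sketched in a few lines of base R. This assumes the form given in Kuhn and Johnson (2013), where the new outcome is the observed value minus the previous prediction error; the helper name `adjust_outcome` and the toy values are hypothetical and only illustrate the direction of the adjustment.

```r
# Hypothetical helper: outcome used by the next committee member,
# y* = y - (y_hat - y), i.e. the observed value shifted away from the
# previous prediction by the size of the previous error.
adjust_outcome <- function(y, y_hat) {
  y - (y_hat - y)
}

y     <- c(10, 20, 30)  # observed outcomes (toy values)
y_hat <- c(8, 20, 33)   # predictions from the previous committee member

y_star <- adjust_outcome(y, y_hat)
y_star
#> [1] 12 20 27
```

The first sample was under-predicted (8 versus 10), so its target is raised to 12; the third was over-predicted, so its target is lowered to 27; the second was predicted exactly and is unchanged.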

After the model is created, there is also an option for a post-hoc
adjustment that uses the training set (Quinlan, 1993). When a new sample is
predicted by the model, it can be modified by its nearest neighbors in the
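original training set. For *K* neighbors, the model-based predicted value is
adjusted by the neighbors using:

$$\frac{1}{K}\sum_{\ell=1}^{K} w_\ell \left[t_\ell + \left(\hat{y} - \hat{t}_\ell\right)\right]$$

where $t_\ell$ is the observed outcome of training set neighbor $\ell$,
$\hat{t}_\ell$ is the model prediction for that neighbor, and $w_\ell$ is a
weight that is inverse to the distance to the neighbor.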
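The neighbor-based adjustment can be illustrated with a small base R sketch. It assumes the averaged form described in Kuhn and Johnson (2013), in which each neighbor contributes its observed outcome plus the difference between the new prediction and the neighbor's own prediction. The helper name `adjust_prediction`, the toy values, and the particular weighting (inverse distances rescaled to average 1) are assumptions made for illustration; the weighting used inside Cubist may differ.

```r
# Hypothetical helper: (1 / K) * sum(w * (t_obs + (y_hat - t_hat)))
adjust_prediction <- function(y_hat, t_obs, t_hat, w) {
  stopifnot(length(t_obs) == length(t_hat), length(t_obs) == length(w))
  K <- length(t_obs)
  sum(w * (t_obs + (y_hat - t_hat))) / K
}

y_hat <- 4.20           # model prediction for the new sample
t_obs <- c(4.30, 4.10)  # observed outcomes of the K = 2 neighbors
t_hat <- c(4.25, 4.15)  # model predictions for those neighbors
dists <- c(1, 3)        # distances from the new sample to each neighbor

# Assumption: inverse-distance weights rescaled to have mean 1, so the
# 1 / K factor produces a weighted average.
w <- 1 / dists
w <- w / mean(w)

adjusted <- adjust_prediction(y_hat, t_obs, t_hat, w)
```

With these toy numbers the closer neighbor, which the model under-predicted, pulls the adjusted value slightly above the original prediction of 4.20.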

Note that `cubist_rules()` does not require that categorical predictors be
converted to numeric indicator values. However, `parsnip::fit()` will
*always* create dummy variables so, if there is interest in keeping the
categorical predictors in their original format, `parsnip::fit_xy()` would
be a better choice. When using the `tune` package, using a recipe for
pre-processing enables more control over how such predictors are encoded
since recipes do not automatically create dummy variables.

The only available engine is `"Cubist"`.

Quinlan R (1992). "Learning with Continuous Classes." Proceedings of the 5th Australian Joint Conference On Artificial Intelligence, pp. 343-348.

Quinlan R (1993). "Combining Instance-Based and Model-Based Learning." Proceedings of the Tenth International Conference on Machine Learning, pp. 236-243.

Kuhn M and Johnson K (2013). *Applied Predictive Modeling*. Springer.

```r
library(rules)  # also attaches parsnip, which provides fit() and %>%

cubist_rules()
#> Cubist Model Specification (regression)
#>
#> Computational engine: Cubist
#>

# Parameters can be represented by a placeholder:
cubist_rules(committees = 7)
#> Cubist Model Specification (regression)
#>
#> Main Arguments:
#>   committees = 7
#>
#> Computational engine: Cubist
#>

# ------------------------------------------------------------------------------

data(car_prices, package = "modeldata")

car_rules <-
  cubist_rules(committees = 1) %>%
  fit(log10(Price) ~ ., data = car_prices)

car_rules
#> parsnip model object
#>
#> Fit time:  74ms
#>
#> Call:
#> cubist.default(x = x, y = y, committees = 1)
#>
#> Number of samples: 804
#> Number of predictors: 17
#>
#> Number of committees: 1
#> Number of rules: 3
#>

summary(car_rules$fit)
#>
#> Call:
#> cubist.default(x = x, y = y, committees = 1)
#>
#>
#> Cubist [Release 2.07 GPL Edition]  Wed Jun 10 19:51:04 2020
#> ---------------------------------
#>
#>     Target attribute `outcome'
#>
#> Read 804 cases (18 attributes) from undefined.data
#>
#> Model:
#>
#>   Rule 1: [280 cases, mean 4.113645, range 3.93646 to 4.2505, est err 0.028847]
#>
#>     if
#>         Cylinder <= 4
#>         Saab <= 0
#>     then
#>         outcome = 4.190694 - 3.5e-06 Mileage + 0.07 Saab - 0.072 hatchback
#>                   - 0.035 Chevy + 0.061 wagon + 0.035 Leather - 0.009 sedan
#>
#>   Rule 2: [410 cases, mean 4.362136, range 4.13308 to 4.84976, est err 0.041804]
#>
#>     if
#>         Cylinder > 4
#>     then
#>         outcome = 3.87139 + 0.07 Cylinder + 0.26 Cadillac + 0.154 convertible
#>                   + 0.07 Chevy + 0.107 Buick - 3.5e-06 Mileage + 0.054 Cruise
#>                   + 0.055 Pontiac - 0.015 Doors + 0.018 Saab - 0.008 hatchback
#>                   - 0.004 sedan
#>
#>   Rule 3: [114 cases, mean 4.466658, range 4.34723 to 4.58348, est err 0.022301]
#>
#>     if
#>         Saab > 0
#>     then
#>         outcome = 4.642401 - 3.6e-06 Mileage - 0.033 Doors - 0.022 sedan
#>                   + 0.023 Leather + 0.01 Saab
#>
#>
#> Evaluation on training data (804 cases):
#>
#>     Average  |error|           0.031589
#>     Relative |error|               0.22
#>     Correlation coefficient        0.97
#>
#>
#>     Attribute usage:
#>       Conds  Model
#>
#>        86%    51%    Cylinder
#>        49%   100%    Saab
#>              100%    Mileage
#>              100%    sedan
#>               86%    Chevy
#>               86%    hatchback
#>               65%    Doors
#>               51%    Cruise
#>               51%    Buick
#>               51%    Cadillac
#>               51%    Pontiac
#>               51%    convertible
#>               49%    Leather
#>               35%    wagon
#>
#>
#> Time: 0.0 secs
#>

# ------------------------------------------------------------------------------

model <- cubist_rules(committees = 10, neighbors = 2)
model
#> Cubist Model Specification (regression)
#>
#> Main Arguments:
#>   committees = 10
#>   neighbors = 2
#>
#> Computational engine: Cubist
#>

update(model, committees = 1)
#> Cubist Model Specification (regression)
#>
#> Main Arguments:
#>   committees = 1
#>   neighbors = 2
#>
#> Computational engine: Cubist
#>

update(model, committees = 1, fresh = TRUE)
#> Cubist Model Specification (regression)
#>
#> Main Arguments:
#>   committees = 1
#>
#> Computational engine: Cubist
#>
```