Turn regression rule models into tidy tibbles

# S3 method for cubist
tidy(x, ...)

# S3 method for xrf
tidy(x, penalty = NULL, unit = c("rules", "columns"), ...)

Arguments

x

A Cubist or xrf object.

...

Not currently used.

penalty

A single numeric value for the penalty (lambda).

unit

What data should be returned? For unit = 'rules', each row corresponds to a rule. For unit = 'columns', each row is a predictor column. The latter can be helpful when determining variable importance.
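
As a rough illustration of the variable-importance use mentioned above, the sketch below aggregates a column-level result by predictor. It assumes the unit = 'columns' output has rule_id, term, and estimate columns (as in the Examples output); the col_res tibble and its values are hypothetical stand-ins, and summing absolute estimates is only one possible heuristic, not a method prescribed by the package.

```r
library(dplyr)

# Hypothetical stand-in for tidy(xrf_fit, unit = "columns"); the values are made up.
col_res <- tibble(
  rule_id  = c("r0_1", "r2_4", "r2_4", "r3_5", "r3_5"),
  term     = c("Gr_Liv_Area", "Gr_Liv_Area", "Central_Air_Y", "Longitude", "Latitude"),
  estimate = c(-0.0127, -0.004, -0.004, 0.106, 0.106)
)

# One row per predictor: how many rules use it and the summed absolute estimate.
importance <- col_res %>%
  group_by(term) %>%
  summarize(
    n_rules   = n_distinct(rule_id),
    total_abs = sum(abs(estimate)),
    .groups   = "drop"
  ) %>%
  arrange(desc(total_abs))

importance
```

Predictors that appear in many rules with large coefficients rise to the top of this table; whether that ranking is meaningful depends on how the predictors were scaled.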

Value

The Cubist method returns a tibble with columns committee, rule_num, rule, estimate, and statistic. The latter two are nested tibbles: estimate contains the parameter estimates for each term in the rule's regression model, and statistic has summary statistics about the data selected by the rule and about the model fit.
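
Because the nested columns are list-columns of tibbles, they can be flattened with tidyr::unnest(). The sketch below uses a small hand-built tibble (cb_like, hypothetical, with made-up values) that mimics the shape of the Cubist result, so it runs without fitting a model.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in with the same shape as the Cubist tidy() result.
cb_like <- tibble(
  committee = c(1L, 1L),
  rule_num  = c(1L, 2L),
  rule      = c("( Central_Air == 'N' )", "( Central_Air == 'Y' )"),
  estimate  = list(
    tibble(term = c("(Intercept)", "Gr_Liv_Area"), estimate = c(2.1, 0.7)),
    tibble(term = c("(Intercept)", "Latitude"),    estimate = c(-3.0, 0.2))
  )
)

# Expand the nested coefficients to one row per rule and term.
coefs <- cb_like %>%
  select(committee, rule_num, estimate) %>%
  unnest(cols = estimate)

coefs
```

The long format makes it straightforward to compare coefficients for the same term across rules or committees.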

The xrf results have columns rule_id, rule, and estimate. The rule_id column has the rule identifier (e.g., "r0_21") or the feature column name when the column is added directly into the model. With unit = 'columns', the rule column is replaced by a term column that names the predictor. For multiclass models, a class column is included.

In each case, the rule column has a character string with the rule conditions. These can be converted to an R expression using rlang::parse_expr().
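
A minimal sketch of that conversion follows. The rule string and the toy data frame are hypothetical and only written in the same style as the rule column; the parsed expression is evaluated against the data with rlang::eval_tidy() to find the rows the rule covers.

```r
library(rlang)

# Hypothetical rule string, in the style of the `rule` column.
rule_chr  <- "( Central_Air == 'N' ) & ( Gr_Liv_Area <= 3.03 )"
rule_expr <- parse_expr(rule_chr)

# Toy data with the predictors that the rule refers to.
toy <- data.frame(
  Central_Air = c("N", "Y", "N"),
  Gr_Liv_Area = c(2.95, 3.00, 3.20)
)

# Logical vector: TRUE where a row satisfies every condition in the rule.
covered <- eval_tidy(rule_expr, data = toy)

toy[covered, ]
```

The same expression can also be spliced into dplyr::filter() with !! when working inside a pipeline.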

Examples

library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
# The model functions and tidy() methods used below come from the rules package
library(rules)

data(ames, package = "modeldata")

ames <- ames %>%
  mutate(
    Sale_Price = log10(Sale_Price),
    Gr_Liv_Area = log10(Gr_Liv_Area)
  )

# ------------------------------------------------------------------------------

# \donttest{
cb_fit <-
  cubist_rules(committees = 10) %>%
  set_engine("Cubist") %>%
  fit(
    Sale_Price ~ Neighborhood + Longitude + Latitude + Gr_Liv_Area + Central_Air,
    data = ames
  )

cb_res <- tidy(cb_fit)
cb_res
#> # A tibble: 157 x 5
#>    committee rule_num rule                               estimate    statistic
#>        <int>    <int> <chr>                              <list>      <list>
#>  1         1        1 ( Central_Air == 'N' ) & ( Gr_Liv… <tibble [4… <tibble [1…
#>  2         1        2 ( Gr_Liv_Area <= 3.0326188 ) & ( … <tibble [4… <tibble [1…
#>  3         1        3 ( Neighborhood %in% c( 'Old_Town…  <tibble [3… <tibble [1…
#>  4         1        4 ( Neighborhood %in% c( 'Old_Town…  <tibble [4… <tibble [1…
#>  5         1        5 ( Central_Air == 'N' ) & ( Gr_Liv… <tibble [4… <tibble [1…
#>  6         1        6 ( Longitude <= -93.652023 ) & ( N… <tibble [4… <tibble [1…
#>  7         1        7 ( Gr_Liv_Area > 3.2284005 ) & ( N… <tibble [4… <tibble [1…
#>  8         1        8 ( Neighborhood %in% c( 'North_Am…  <tibble [4… <tibble [1…
#>  9         1        9 ( Latitude <= 42.009399 ) & ( Nei… <tibble [3… <tibble [1…
#> 10         1       10 ( Neighborhood %in% c( 'College_…  <tibble [4… <tibble [1…
#> # … with 147 more rows
cb_res$estimate[[1]]
#> # A tibble: 4 x 2
#>   term        estimate
#>   <chr>          <dbl>
#> 1 (Intercept)  -408.
#> 2 Longitude      -1.43
#> 3 Latitude        6.6
#> 4 Gr_Liv_Area     0.7
cb_res$statistic[[1]]
#> # A tibble: 1 x 6
#>   num_conditions coverage  mean   min   max  error
#>            <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1              2      154  4.94  4.11  5.31 0.0956
# }

# ------------------------------------------------------------------------------

# \donttest{
library(recipes)
#> 
#> Attaching package: ‘recipes’
#> The following object is masked from ‘package:stats’:
#> 
#>     step
xrf_reg_mod <-
  rule_fit(trees = 10, penalty = .001) %>%
  set_engine("xrf") %>%
  set_mode("regression")

# Make dummy variables since xgboost will not
ames_rec <-
  recipe(
    Sale_Price ~ Neighborhood + Longitude + Latitude + Gr_Liv_Area + Central_Air,
    data = ames
  ) %>%
  step_dummy(Neighborhood, Central_Air) %>%
  step_zv(all_predictors())

ames_processed <- prep(ames_rec) %>% bake(new_data = NULL)

set.seed(1)
xrf_reg_fit <-
  xrf_reg_mod %>%
  fit(Sale_Price ~ ., data = ames_processed)
#> [16:18:45] WARNING: amalgamation/../src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.
#> [16:18:45] WARNING: amalgamation/../src/learner.cc:541:
#> Parameters: { nrounds } might not be used.
#> 
#>   This may not be accurate due to some parameters are only used in language bindings but
#>   passed down to XGBoost core. Or some parameters are not used but slip through this
#>   verification. Please open an issue if you find above cases.
#> 
#> 
#> New names:
#> * . -> ....1
#> * . -> ....2
#> * . -> ....3
#> * . -> ....4
#> * . -> ....5
#> * ...
xrf_rule_res <- tidy(xrf_reg_fit)
xrf_rule_res$rule[nrow(xrf_rule_res)] %>% rlang::parse_expr()
#> (Gr_Liv_Area < 3.30210185) & (Gr_Liv_Area < 3.38872266) & (Gr_Liv_Area >=
#>     2.94571471) & (Gr_Liv_Area >= 3.24870872) & (Latitude < 42.0271072) &
#>     (Neighborhood_Old_Town >= -9.53674316e-07)
xrf_col_res <- tidy(xrf_reg_fit, unit = "columns")
xrf_col_res
#> # A tibble: 149 x 3
#>    rule_id term           estimate
#>    <chr>   <chr>             <dbl>
#>  1 r0_1    Gr_Liv_Area   -1.27e- 2
#>  2 r2_4    Gr_Liv_Area   -3.70e-10
#>  3 r2_2    Gr_Liv_Area    7.59e- 3
#>  4 r2_4    Central_Air_Y -3.70e-10
#>  5 r3_5    Longitude      1.06e- 1
#>  6 r3_6    Longitude      2.65e- 2
#>  7 r3_5    Latitude       1.06e- 1
#>  8 r3_6    Latitude       2.65e- 2
#>  9 r3_5    Longitude      1.06e- 1
#> 10 r3_6    Longitude      2.65e- 2
#> # … with 139 more rows
# }