Turn rule models into tidy tibbles

Usage

# S3 method for C5.0
tidy(x, trials = x$trials["Actual"], ...)

# S3 method for cubist
tidy(x, committees = x$committee, ...)

# S3 method for xrf
tidy(x, penalty = NULL, unit = c("rules", "columns"), ...)

Arguments

x

A Cubist, C5.0, or xrf object.

trials

The number of boosting iterations to tidy (defaults to the entire ensemble).

...

Not currently used.

committees

The number of committees to tidy (defaults to the entire ensemble).

penalty

A single numeric value for the lambda penalty.

unit

What data should be returned? For unit = 'rules', each row corresponds to a rule. For unit = 'columns', each row is a predictor column. The latter can be helpful when determining variable importance.

Value

The Cubist method has columns committee, rule_num, rule, estimate, and statistic. The latter two are nested tibbles. estimate contains the parameter estimates for each term in the regression model and statistic has statistics about the data selected by the rules and the model fit.

The C5.0 method has columns trial, rule_num, rule, and statistics. The last of these is a nested tibble with statistics about the data selected by the rules.

The xrf method has columns rule_id, rule, and estimate. The rule_id column has the rule identifier (e.g., "r0_21") or the feature column name when the column is added directly into the model. For multiclass models, a class column is included.

In each case, the rule column has a character string with the rule conditions. These can be converted to an R expression using rlang::parse_expr().
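As a minimal sketch of that conversion, the block below parses a rule string and evaluates it against a small data frame to see which rows the rule selects. The rule text and the `toy` data are made up for illustration (they only mimic the format shown on this page), and using rlang::eval_tidy() for the evaluation step is one possible choice rather than something the package prescribes.

```r
library(rlang)

# Hypothetical rule string, written in the same style as the rule column
rule_chr <- "( Central_Air == 'N' ) & ( Gr_Liv_Area <= 3.03 )"

# Made-up data; the column names match the names used inside the rule
toy <- data.frame(
  Central_Air = c("N", "Y", "N"),
  Gr_Liv_Area = c(2.9, 3.0, 3.2)
)

# Character string -> R expression
rule_expr <- parse_expr(rule_chr)

# Evaluate the expression with the data frame columns in scope;
# the result is one logical value per row
covered <- eval_tidy(rule_expr, data = toy)

# Rows selected by the rule
toy[covered, ]
```

Here only the first row satisfies both conditions, so `covered` is `TRUE` for that row and `FALSE` for the other two.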

Details

An example

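library(rules)  # assumed to supply cubist_rules(), rule_fit(), and the tidy() methods
library(dplyr)

## 
## Attaching package: 'dplyr'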

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data(ames, package = "modeldata")

ames <-
  ames %>%
  mutate(Sale_Price = log10(Sale_Price),
         Gr_Liv_Area = log10(Gr_Liv_Area))

# ------------------------------------------------------------------------------

cb_fit <-
  cubist_rules(committees = 10) %>%
  set_engine("Cubist") %>%
  fit(Sale_Price ~ Neighborhood + Longitude + Latitude + Gr_Liv_Area + Central_Air,
      data = ames)

cb_res <- tidy(cb_fit)
cb_res

## # A tibble: 157 × 5
##    committee rule_num rule                                    estimate statistic
##        <int>    <int> <chr>                                   <list>   <list>   
##  1         1        1 ( Central_Air == 'N' ) & ( Gr_Liv_Area… <tibble> <tibble> 
##  2         1        2 ( Gr_Liv_Area <= 3.0326188 ) & ( Neigh… <tibble> <tibble> 
##  3         1        3 ( Neighborhood  %in% c( 'Old_Town','Ed… <tibble> <tibble> 
##  4         1        4 ( Neighborhood  %in% c( 'Old_Town','Ed… <tibble> <tibble> 
##  5         1        5 ( Central_Air == 'N' ) & ( Gr_Liv_Area… <tibble> <tibble> 
##  6         1        6 ( Longitude <= -93.652023 ) & ( Neighb… <tibble> <tibble> 
##  7         1        7 ( Gr_Liv_Area > 3.2284005 ) & ( Neighb… <tibble> <tibble> 
##  8         1        8 ( Neighborhood  %in% c( 'North_Ames','… <tibble> <tibble> 
##  9         1        9 ( Latitude <= 42.009399 ) & ( Neighbor… <tibble> <tibble> 
## 10         1       10 ( Neighborhood  %in% c( 'College_Creek… <tibble> <tibble> 
## # … with 147 more rows

cb_res$estimate[[1]]

## # A tibble: 4 × 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)  -408.  
## 2 Longitude      -1.43
## 3 Latitude        6.6 
## 4 Gr_Liv_Area     0.7

cb_res$statistic[[1]]

## # A tibble: 1 × 6
##   num_conditions coverage  mean   min   max  error
##            <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1              2      154  4.94  4.11  5.31 0.0956
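The nested estimate and statistic columns can be flattened when a single long table is easier to work with. The sketch below uses a made-up tibble (`toy_cb_res`, with hypothetical predictors `x1` and `x2`) that only imitates the layout of the Cubist result above, and it assumes tidyr::unnest() is an acceptable way to expand the list columns.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in with the same layout as the Cubist tidy() result
toy_cb_res <- tibble(
  committee = c(1L, 1L),
  rule_num  = c(1L, 2L),
  rule      = c("( x1 > 0 )", "( x1 <= 0 )"),
  estimate  = list(
    tibble(term = c("(Intercept)", "x1"), estimate = c(1.5, 0.2)),
    tibble(term = c("(Intercept)", "x2"), estimate = c(-0.5, 3.0))
  ),
  statistic = list(
    tibble(coverage = 40, error = 0.10),
    tibble(coverage = 60, error = 0.08)
  )
)

# One row per model coefficient, keyed by committee and rule
coef_long <-
  toy_cb_res %>%
  select(committee, rule_num, estimate) %>%
  unnest(cols = estimate)

# One row per rule, with the statistics as ordinary columns
rule_stats <-
  toy_cb_res %>%
  select(committee, rule_num, statistic) %>%
  unnest(cols = statistic)

coef_long
rule_stats
```

With a real fit, the same two pipelines could be applied to `cb_res` in place of `toy_cb_res`.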

# ------------------------------------------------------------------------------

library(recipes)

## 
## Attaching package: 'recipes'

## The following object is masked from 'package:stats':
## 
##     step

xrf_reg_mod <-
  rule_fit(trees = 10, penalty = .001) %>%
  set_engine("xrf") %>%
  set_mode("regression")

# Make dummy variables since xgboost will not
ames_rec <-
  recipe(Sale_Price ~ Neighborhood + Longitude + Latitude +
         Gr_Liv_Area + Central_Air,
         data = ames) %>%
  step_dummy(Neighborhood, Central_Air) %>%
  step_zv(all_predictors())

ames_processed <- prep(ames_rec) %>% bake(new_data = NULL)

set.seed(1)
xrf_reg_fit <-
  xrf_reg_mod %>%
  fit(Sale_Price ~ ., data = ames_processed)

## New names:
## • `.` -> `....1`
## • `.` -> `....2`
## • (115 similar messages omitted: `....3` through `....117`)
## • `.` -> `....118`

xrf_rule_res <- tidy(xrf_reg_fit)
xrf_rule_res$rule[nrow(xrf_rule_res)] %>% rlang::parse_expr()

## (Central_Air_Y >= 0.5) & (Gr_Liv_Area < 3.38872266) & (Gr_Liv_Area >= 
##     2.94571471) & (Gr_Liv_Area >= 3.24870872) & (Latitude >= 
##     42.0271072) & (Neighborhood_Old_Town >= 0.5)

xrf_col_res <- tidy(xrf_reg_fit, unit = "columns")
xrf_col_res

## # A tibble: 417 × 3
##    rule_id term          estimate
##    <chr>   <chr>            <dbl>
##  1 r0_1    Gr_Liv_Area    -0.0138
##  2 r2_3    Gr_Liv_Area    -0.0310
##  3 r2_2    Gr_Liv_Area     0.0127
##  4 r2_3    Central_Air_Y  -0.0310
##  5 r3_5    Longitude       0.0859
##  6 r3_6    Longitude       0.0171
##  7 r3_2    Longitude      -0.0109
##  8 r3_5    Latitude        0.0859
##  9 r3_6    Latitude        0.0171
## 10 r3_5    Longitude       0.0859
## # … with 407 more rows
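Since the column-level output is described as helpful for variable importance, here is one hedged way to summarize it. The tibble `toy_col_res` is invented to mirror the three columns above, and scoring a predictor by the summed absolute coefficients of the rules it appears in is an assumption made for this sketch, not a measure defined by the package. The call to distinct() is also a judgment call: it drops exact repeats like the repeated row visible in the printed output.

```r
library(dplyr)

# Made-up stand-in for tidy(xrf_reg_fit, unit = "columns")
toy_col_res <- tibble(
  rule_id  = c("r0_1", "r2_3", "r2_3", "r3_5", "r3_6", "r3_5"),
  term     = c("Gr_Liv_Area", "Gr_Liv_Area", "Central_Air_Y",
               "Longitude", "Latitude", "Longitude"),
  estimate = c(-0.01, -0.03, -0.03, 0.09, 0.02, 0.09)
)

importance <-
  toy_col_res %>%
  # Drop exact duplicate rows (same rule, term, and coefficient)
  distinct(rule_id, term, estimate) %>%
  group_by(term) %>%
  summarize(
    n_rules            = n(),                # number of rules using the predictor
    total_abs_estimate = sum(abs(estimate))  # assumed importance score
  ) %>%
  arrange(desc(total_abs_estimate))

importance
```

Other summaries, such as counting rules per predictor only, follow the same group-and-summarize pattern.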