31 Feature Engineering

Feature engineering is a fancy machine learning way of saying “prepare your data for analysis”. But it’s also more than just getting your data in the right format. It’s about transforming variables to help the model provide more stable predictions – for example, encoding, modeling, or omitting missing data; creating new variables from the available dataset; and otherwise improving model performance through transformations and modifications of the existing data before analysis.

One of the biggest challenges with feature engineering is what’s referred to as data leakage, which occurs when information from the test dataset leaks into the training dataset. Consider the simple example of normalizing or standardizing a variable (subtracting the variable mean from each value and then dividing by the standard deviation). Normalizing your variables is necessary for a wide range of predictive models, including any model using regularization methods, principal components analysis, and \(k\)-nearest neighbor models, among others. But when we normalize the variable, it is critical we do so relative to the training set, not relative to the full dataset. If we normalize the numeric variables in our full dataset and then divide it into test and training datasets, information from the test dataset has leaked into the training dataset. This seems simple enough - just wait to standardize until after splitting - but it becomes more complicated when we consider tuning a model. If we use a process like \(k\)-fold cross-validation, then we have to normalize within each fold to get an accurate representation of out-of-sample predictive accuracy.
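
For example, a minimal sketch of leakage-free standardization might look like the following, assuming (hypothetically) data frames d_train and d_test that each contain a numeric column x:

# compute the centering and scaling values from the training data only
x_mean <- mean(d_train$x, na.rm = TRUE)
x_sd   <- sd(d_train$x, na.rm = TRUE)

# apply the *training* statistics to both the training and test data
d_train$x_z <- (d_train$x - x_mean) / x_sd
d_test$x_z  <- (d_test$x - x_mean) / x_sd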

In this chapter, we’ll introduce feature engineering using the {recipes} package from the {tidymodels} suite of packages which, as we will see, helps you to be explicit about the decisions you’re making, while avoiding potential issues with data leakage.

31.1 Basics of {recipes}

The {recipes} package is designed to replace the stats::model.matrix() function that you might be familiar with. For example, if you fit a model like the one below

library(palmerpenguins)
m1 <- lm(bill_length_mm ~ species, data = penguins)
summary(m1)
## 
## Call:
## lm(formula = bill_length_mm ~ species, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9338 -2.2049  0.0086  2.0662 12.0951 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       38.7914     0.2409  161.05   <2e-16 ***
## speciesChinstrap  10.0424     0.4323   23.23   <2e-16 ***
## speciesGentoo      8.7135     0.3595   24.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.96 on 339 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7078, Adjusted R-squared:  0.7061 
## F-statistic: 410.6 on 2 and 339 DF,  p-value: < 2.2e-16

You can see that our species column, which has the values Adelie, Chinstrap, and Gentoo, is automatically dummy-coded for us, with the first level of the factor variable (Adelie) set as the reference group. The {recipes} package forces you to be a bit more explicit in these decisions. But it also has a much wider range of modifications it can make to the data.

Using {recipes} also allows you to more easily separate the pre-processing and modeling stages of your data analysis workflow. In the above example, you may not have even realized stats::model.matrix() was doing anything for you because it’s wrapped within the stats::lm() modeling code. But with {recipes}, you make the modifications to your data first and then conduct your analysis.

How does this separation work? With {recipes}, you can create a blueprint (or recipe) to apply to a given dataset, without actually applying those operations. You can then use this blueprint iteratively across sets of data (e.g., folds) as well as on new (potentially unseen) data that has the same structure (variables). This process helps avoid data leakage because all operations are carried forward and applied together, and no operations are conducted until explicitly requested.

31.2 Creating a recipe

Let’s read in some data and begin creating a basic recipe. We’ll work with the simulated statewide testing data introduced previously. This is a reasonably large dataset, and since we’re just illustrating concepts here, we’ll pull a random sample of 2% of the total data to make everything run a bit quicker. We’ll also remove the classification variable, which is just a categorical version of score, our outcome.

In the chunk below, we read in the data, sample a random 2% of the data (being careful to set a seed first so our results are reproducible), split it into training and test sets, and extract just the training dataset. We’ll hold off on splitting it into CV folds for now.

library(tidyverse)
library(tidymodels)

set.seed(8675309)
full_train <- read_csv("https://github.com/uo-datasci-specialization/c4-ml-fall-2020/raw/master/data/train.csv") %>% 
  slice_sample(prop = 0.02) %>% 
  select(-classification)

splt <- initial_split(full_train)
train <- training(splt)

As a quick reminder of what the data look like, you can preview the first few rows with head(train). The full data dictionary is available on the Kaggle website.

When creating recipes, we can still use the formula interface to define how the data will be modeled. In this case, we’ll say that the score column is predicted by everything else in the data frame.

rec <- recipe(score ~ ., data = train)

Notice that I still declare the dataset (in this case, the training data), even though this is just a blueprint. It uses the dataset I provide to get the names of the columns, but it doesn’t actually do anything with this dataset (unless we ask it to). Let’s look at what this recipe looks like.

rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38

Notice it just states that this is a data recipe in which we have specified 1 outcome variable and 38 predictors.

We can prep this recipe to learn more.

prep(rec)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
## 
## Training data contained 2841 data points and 2841 incomplete rows.

Notice we now get an additional message about how many rows are in the data, and how many of those rows contain missing (incomplete) data. So the recipe is the blueprint, and we prep the recipe to get it to actually go into the data and conduct the operations. The dataset it has now, however, is just a placeholder that can be substituted for any other dataset with an equivalent structure.

But of course, modeling score as the outcome with everything else predicting it is not a reasonable choice in this case. For example, we have many ID variables, and we also have multiple categorical variables. For some methods (like tree-based models) it might be okay to leave these categorical variables as they are, but for others (like any model in the linear regression family) we’ll want to encode them somehow (e.g., dummy code).

We can address these concerns by adding steps to our recipe. In the first step, we’ll update the role of all the ID variables so they are not included among the predictors. In the second, we will dummy code all nominal (i.e. categorical) variables.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_dummy(all_nominal())

When updating the roles, we can change the variable label (text passed to the new_role argument) to be anything we want, so long as it’s not "predictor" or "outcome".

Notice in the above I am also using helper functions to apply the operations to all variables of a specific type. There are five main helper functions for creating recipes: all_predictors(), all_outcomes(), all_nominal(), all_numeric() and has_role(). You can use these together, including with negation (e.g., -all_outcomes() to specify the operation should not apply to the outcome variable(s)) to select any set of variables you want to apply the operation to.
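
For example, a step like the one sketched below (not part of our actual recipe) would normalize every numeric predictor while leaving the outcome and the ID variables alone:

rec %>% 
  step_normalize(all_numeric(), -all_outcomes(), -has_role("id vars"))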

Let’s try to prep this updated recipe.

prep(rec)
## Error: Only one factor level in lang_cd

Uh oh! We have an error. Our recipe is trying to dummy code the lang_cd variable, but it has only one level. It’s kind of hard to dummy-code a constant!

Luckily, we can expand our recipe to first remove any zero-variance predictors, like so:

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())

The _zv part stands for “zero variance” and should take care of this problem. Let’s try again.

prep(rec)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    id vars          6
##    outcome          1
##  predictor         32
## 
## Training data contained 2841 data points and 2841 incomplete rows. 
## 
## Operations:
## 
## Zero variance filter removed calc_admn_cd, lang_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, ... [trained]

Beautiful! Note we do still get a warning here, but I’ve omitted it in the text (we’ll come back to this in the section on Missing data). Our recipe says we now have 6 ID variables, 1 outcome, and 32 predictors, with 2841 data points (rows of data). The calc_admn_cd and lang_cd variables have been removed because they have zero variance, and several variables have been dummy coded, including gndr and ethnic_cd, among others.

Let’s dig just a bit deeper here though. What’s going on with these zero-variance variables? Let’s look back at the training data.

train %>% 
  count(calc_admn_cd)
## # A tibble: 1 x 2
##   calc_admn_cd     n
##   <lgl>        <int>
## 1 NA            2841
train %>% 
  count(lang_cd)
## # A tibble: 2 x 2
##   lang_cd     n
##   <chr>   <int>
## 1 S          80
## 2 <NA>     2761

So at least in our sample, calc_admn_cd really is fully missing, which means it might as well be dropped because it’s providing us exactly nothing. But that’s not the case with lang_cd. It has two values, NA and "S" (for “Spanish”). This variable represents the language the test was administered in, and the NA values are actually meaningful here because they represent the “default” administration, meaning English. So rather than dropping lang_cd, let’s mutate it to transform the NA values to "E" for English. We could reasonably do this inside or outside the recipe, but a good rule of thumb is: if it can go in the recipe, put it in the recipe. It can’t hurt, and doing operations outside of the recipe risks data leakage.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_mutate(lang_cd = ifelse(is.na(lang_cd), "E", lang_cd)) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())

Let’s take a look at what our data would actually look like when applying this recipe now. First, we’ll prep the recipe

prepped <- prep(rec)
prepped
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    id vars          6
##    outcome          1
##  predictor         32
## 
## Training data contained 2841 data points and 2841 incomplete rows. 
## 
## Operations:
## 
## Variable mutation for lang_cd [trained]
## Zero variance filter removed calc_admn_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, ... [trained]

And we see that lang_cd is no longer being caught by the zero variance filter. Next we’ll bake the prepped recipe to actually apply it to our data. If we specify new_data = NULL, bake() will apply the operation to the data we originally specified in the recipe. But we can also pass new data as an additional argument and it will apply the operations to that data instead of the data specified in the recipe.

bake(prepped, new_data = NULL)
## # A tibble: 2,841 x 106
##        id attnd_dist_inst… attnd_schl_inst… enrl_grd partic_dist_ins…
##     <dbl>            <dbl>            <dbl>    <dbl>            <dbl>
##  1  62576             2083             1353        7             2083
##  2  71424             2180              878        6             2180
##  3 179893             2244             1334        3             2244
##  4 136083             2142             4858        5             2142
##  5 196809             2212             1068        3             2212
##  6  13931             2088              581        8             2088
##  7 103344             1926              102        6             1926
##  8 105122             2142              766        6             2142
##  9 172543             1965              197        4             1965
## 10  45153             2083              542        6             2083
## # … with 2,831 more rows, and 101 more variables: partic_schl_inst_id <dbl>,
## #   lang_cd <fct>, ncessch <dbl>, lat <dbl>, lon <dbl>, score <dbl>,
## #   gndr_M <dbl>, ethnic_cd_B <dbl>, ethnic_cd_H <dbl>, ethnic_cd_I <dbl>,
## #   ethnic_cd_M <dbl>, ethnic_cd_P <dbl>, ethnic_cd_W <dbl>,
## #   tst_bnch_X2B <dbl>, tst_bnch_X3B <dbl>, tst_bnch_G4 <dbl>,
## #   tst_bnch_G6 <dbl>, tst_bnch_G7 <dbl>, tst_dt_X3.21.2018.0.00.00 <dbl>,
## #   tst_dt_X3.22.2018.0.00.00 <dbl>, tst_dt_X3.23.2018.0.00.00 <dbl>,
## #   tst_dt_X3.8.2018.0.00.00 <dbl>, tst_dt_X3.9.2018.0.00.00 <dbl>,
## #   tst_dt_X4.10.2018.0.00.00 <dbl>, tst_dt_X4.11.2018.0.00.00 <dbl>,
## #   tst_dt_X4.12.2018.0.00.00 <dbl>, tst_dt_X4.13.2018.0.00.00 <dbl>,
## #   tst_dt_X4.16.2018.0.00.00 <dbl>, tst_dt_X4.17.2018.0.00.00 <dbl>,
## #   tst_dt_X4.18.2018.0.00.00 <dbl>, tst_dt_X4.19.2018.0.00.00 <dbl>,
## #   tst_dt_X4.2.2018.0.00.00 <dbl>, tst_dt_X4.20.2018.0.00.00 <dbl>,
## #   tst_dt_X4.23.2018.0.00.00 <dbl>, tst_dt_X4.24.2018.0.00.00 <dbl>,
## #   tst_dt_X4.25.2018.0.00.00 <dbl>, tst_dt_X4.26.2018.0.00.00 <dbl>,
## #   tst_dt_X4.27.2018.0.00.00 <dbl>, tst_dt_X4.30.2018.0.00.00 <dbl>,
## #   tst_dt_X4.5.2018.0.00.00 <dbl>, tst_dt_X4.6.2018.0.00.00 <dbl>,
## #   tst_dt_X4.9.2018.0.00.00 <dbl>, tst_dt_X5.1.2018.0.00.00 <dbl>,
## #   tst_dt_X5.10.2018.0.00.00 <dbl>, tst_dt_X5.11.2018.0.00.00 <dbl>,
## #   tst_dt_X5.14.2018.0.00.00 <dbl>, tst_dt_X5.15.2018.0.00.00 <dbl>,
## #   tst_dt_X5.16.2018.0.00.00 <dbl>, tst_dt_X5.17.2018.0.00.00 <dbl>,
## #   tst_dt_X5.18.2018.0.00.00 <dbl>, tst_dt_X5.2.2018.0.00.00 <dbl>,
## #   tst_dt_X5.21.2018.0.00.00 <dbl>, tst_dt_X5.22.2018.0.00.00 <dbl>,
## #   tst_dt_X5.23.2018.0.00.00 <dbl>, tst_dt_X5.24.2018.0.00.00 <dbl>,
## #   tst_dt_X5.25.2018.0.00.00 <dbl>, tst_dt_X5.29.2018.0.00.00 <dbl>,
## #   tst_dt_X5.3.2018.0.00.00 <dbl>, tst_dt_X5.30.2018.0.00.00 <dbl>,
## #   tst_dt_X5.31.2018.0.00.00 <dbl>, tst_dt_X5.4.2018.0.00.00 <dbl>,
## #   tst_dt_X5.7.2018.0.00.00 <dbl>, tst_dt_X5.8.2018.0.00.00 <dbl>,
## #   tst_dt_X5.9.2018.0.00.00 <dbl>, tst_dt_X6.1.2018.0.00.00 <dbl>,
## #   tst_dt_X6.4.2018.0.00.00 <dbl>, tst_dt_X6.5.2018.0.00.00 <dbl>,
## #   tst_dt_X6.6.2018.0.00.00 <dbl>, tst_dt_X6.7.2018.0.00.00 <dbl>,
## #   tst_dt_X6.8.2018.0.00.00 <dbl>, migrant_ed_fg_Y <dbl>, ind_ed_fg_Y <dbl>,
## #   sp_ed_fg_Y <dbl>, tag_ed_fg_Y <dbl>, econ_dsvntg_Y <dbl>, ayp_lep_B <dbl>,
## #   ayp_lep_E <dbl>, ayp_lep_F <dbl>, ayp_lep_M <dbl>, ayp_lep_N <dbl>,
## #   ayp_lep_S <dbl>, ayp_lep_W <dbl>, ayp_lep_X <dbl>, ayp_lep_Y <dbl>,
## #   stay_in_dist_Y <dbl>, stay_in_schl_Y <dbl>, dist_sped_Y <dbl>,
## #   trgt_assist_fg_Y <dbl>, ayp_dist_partic_Y <dbl>, ayp_schl_partic_Y <dbl>,
## #   ayp_dist_prfrm_Y <dbl>, ayp_schl_prfrm_Y <dbl>, rc_dist_partic_Y <dbl>,
## #   rc_schl_partic_Y <dbl>, rc_dist_prfrm_Y <dbl>, rc_schl_prfrm_Y <dbl>,
## #   tst_atmpt_fg_Y <dbl>, grp_rpt_dist_partic_Y <dbl>,
## #   grp_rpt_schl_partic_Y <dbl>, grp_rpt_dist_prfrm_Y <dbl>, …

And now we can actually see the dummy-coded categorical variables, along with the other operations we requested. For example, calc_admn_cd is not in the dataset. Notice the ID variables are output though, which makes sense because they are often necessary for joining with other data sources. But it’s important to realize that they are output (i.e., all variables are returned, regardless of role) because if we passed this dataset directly to a model they would be included as predictors. Note that there may be reasons you would want to include a school and/or district level ID variable in your modeling, but you certainly would not want redundant variables.

We do still have one minor issue with this recipe though, which is pretty evident when looking at the column names of our baked dataset. The tst_dt variable, which specifies the date the test was taken, was treated as a categorical variable because it was read in as a character vector. That means all the dates are being dummy coded! Let’s fix this by just transforming it to a date within our step_mutate.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())

And now when we prep/bake the dataset it’s still a date variable, which is what we probably want (it will be modeled as a numeric variable).

rec %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 2,841 x 55
##        id attnd_dist_inst… attnd_schl_inst… enrl_grd tst_dt             
##     <dbl>            <dbl>            <dbl>    <dbl> <dttm>             
##  1  62576             2083             1353        7 2018-05-16 00:00:00
##  2  71424             2180              878        6 2018-04-24 00:00:00
##  3 179893             2244             1334        3 2018-05-25 00:00:00
##  4 136083             2142             4858        5 2018-05-24 00:00:00
##  5 196809             2212             1068        3 2018-05-16 00:00:00
##  6  13931             2088              581        8 2018-06-06 00:00:00
##  7 103344             1926              102        6 2018-06-04 00:00:00
##  8 105122             2142              766        6 2018-05-08 00:00:00
##  9 172543             1965              197        4 2018-05-23 00:00:00
## 10  45153             2083              542        6 2018-05-10 00:00:00
## # … with 2,831 more rows, and 50 more variables: partic_dist_inst_id <dbl>,
## #   partic_schl_inst_id <dbl>, ncessch <dbl>, lat <dbl>, lon <dbl>,
## #   score <dbl>, gndr_M <dbl>, ethnic_cd_B <dbl>, ethnic_cd_H <dbl>,
## #   ethnic_cd_I <dbl>, ethnic_cd_M <dbl>, ethnic_cd_P <dbl>, ethnic_cd_W <dbl>,
## #   tst_bnch_X2B <dbl>, tst_bnch_X3B <dbl>, tst_bnch_G4 <dbl>,
## #   tst_bnch_G6 <dbl>, tst_bnch_G7 <dbl>, migrant_ed_fg_Y <dbl>,
## #   ind_ed_fg_Y <dbl>, sp_ed_fg_Y <dbl>, tag_ed_fg_Y <dbl>,
## #   econ_dsvntg_Y <dbl>, ayp_lep_B <dbl>, ayp_lep_E <dbl>, ayp_lep_F <dbl>,
## #   ayp_lep_M <dbl>, ayp_lep_N <dbl>, ayp_lep_S <dbl>, ayp_lep_W <dbl>,
## #   ayp_lep_X <dbl>, ayp_lep_Y <dbl>, stay_in_dist_Y <dbl>,
## #   stay_in_schl_Y <dbl>, dist_sped_Y <dbl>, trgt_assist_fg_Y <dbl>,
## #   ayp_dist_partic_Y <dbl>, ayp_schl_partic_Y <dbl>, ayp_dist_prfrm_Y <dbl>,
## #   ayp_schl_prfrm_Y <dbl>, rc_dist_partic_Y <dbl>, rc_schl_partic_Y <dbl>,
## #   rc_dist_prfrm_Y <dbl>, rc_schl_prfrm_Y <dbl>, lang_cd_E <dbl>,
## #   tst_atmpt_fg_Y <dbl>, grp_rpt_dist_partic_Y <dbl>,
## #   grp_rpt_schl_partic_Y <dbl>, grp_rpt_dist_prfrm_Y <dbl>,
## #   grp_rpt_schl_prfrm_Y <dbl>

31.2.1 Order matters

It’s important to realize that the order of the steps matters. In our recipe, we first declare ID variables as having a different role than predictors or outcomes. We then modify two variables, remove zero-variance predictors, and finally dummy code all categorical (nominal) variables. What happens if we instead dummy code and then remove zero-variance predictors?

rec <- recipe(score ~ ., train) %>% 
  step_dummy(all_nominal()) %>% 
  step_zv(all_predictors()) 

prep(rec)
## Error: Only one factor level in lang_cd

We end up with our original error. We don’t get this error if we remove zero variance predictors and then dummy code.

rec <- recipe(score ~ ., train) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal()) 

prep(rec)
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         38
## 
## Training data contained 2841 data points and 2841 incomplete rows. 
## 
## Operations:
## 
## Zero variance filter removed calc_admn_cd, lang_cd [trained]
## Dummy variables from gndr, ethnic_cd, tst_bnch, tst_dt, ... [trained]

The fact that order matters may occasionally require that you apply the same operation at multiple steps (e.g., a near zero variance filter could be applied before and after dummy-coding).
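
For instance, a version of our recipe that filters both before and after dummy coding might be sketched like this:

recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_nzv(all_predictors()) %>%   # filter low-variance columns before dummy coding
  step_dummy(all_nominal()) %>% 
  step_nzv(all_predictors())       # and again on the resulting dummy columns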

All of the above serves as a basic introduction to developing a recipe, and what follows goes into more detail on specific feature engineering pieces. For complete information on all possible recipe steps, please see the documentation.

31.3 Encoding categorical data

For many (but not all) modeling frameworks, categorical data must be transformed somehow. The most common strategy for this is dummy coding, which stats::model.matrix() will do for you automatically using the stats::contrasts() function. There are also other coding schemes you can use with base R, such as Helmert and polynomial coding (see ?contrasts and related functions). Dummy coding leaves one group out (the first level of the factor, by default) and creates new columns for all the other groups coded \(0\) or \(1\) depending on whether the original variable represented that value or not. For example:

f <- factor(c("red", "red", "blue", "green", "green", "green"))
contrasts(f)
##       green red
## blue      0   0
## green     1   0
## red       0   1

In the above, "blue" has been assigned as the reference category (note that factor levels are assigned in alphabetical order by default), and dummy variables have been created for "green" and "red". In a linear regression framework, "blue" would become the intercept.

We can recreate this same coding scheme with {recipes}, but we need to first put it in a data frame.

df <- data.frame(f, score = rnorm(length(f)))
df
##       f      score
## 1   red  1.0734291
## 2   red -0.2675359
## 3  blue  0.7512238
## 4 green  0.5436071
## 5 green  0.6940371
## 6 green -0.6446104
recipe(score ~ f, data = df) %>% 
  step_dummy(f) %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 6 x 3
##    score f_green f_red
##    <dbl>   <dbl> <dbl>
## 1  1.07        0     1
## 2 -0.268       0     1
## 3  0.751       0     0
## 4  0.544       1     0
## 5  0.694       1     0
## 6 -0.645       1     0

In the above, we’ve created the actual columns we need, while in the base example we only created the contrast matrix (although it’s relatively straightforward to then create the columns).
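
For example, base R’s model.matrix() will expand the factor into an intercept plus the dummy-coded columns:

# base R equivalent: expand f into an intercept plus dummy-coded columns
model.matrix(~ f)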

The {recipes} version is, admittedly, a fair amount of additional code, but as we saw in the previous section, {recipes} is capable of making a wide range of transformations in a systematic way.

31.3.1 Transformations beyond dummy coding

Although less used in inferential statistics, there are a number of additional transformations we can use to encode categorical data. The most straightforward is one-hot encoding. One-hot encoding is essentially equivalent to dummy coding except we create the variables for all levels in the categorical variable (i.e., we do not leave one out as a reference group). This generally makes them less useful in linear regression frameworks (unless the model intercept is dropped), but they can be highly useful in a number of other frameworks, such as tree-based methods (covered later in the book).

To use one-hot encoding, we pass the additional one_hot argument to step_dummy().

recipe(score ~ f, data = df) %>% 
  step_dummy(f, one_hot = TRUE) %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 6 x 4
##    score f_blue f_green f_red
##    <dbl>  <dbl>   <dbl> <dbl>
## 1  1.07       0       0     1
## 2 -0.268      0       0     1
## 3  0.751      1       0     0
## 4  0.544      0       1     0
## 5  0.694      0       1     0
## 6 -0.645      0       1     0

Another relatively common encoding scheme, particularly within natural language processing frameworks, is integer encoding, where each level is associated with a unique integer. For example

recipe(score ~ f, data = df) %>% 
  step_integer(f) %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 6 x 2
##       f  score
##   <dbl>  <dbl>
## 1     3  1.07 
## 2     3 -0.268
## 3     1  0.751
## 4     2  0.544
## 5     2  0.694
## 6     2 -0.645

Notice that the syntax is essentially equivalent to the previous dummy-coding example, but we’ve just swapped out step_dummy() for step_integer(). Integer encoding can be useful in natural language processing in particular because words can be encoded as integers, and then the algorithm can search for patterns in the numbers.

31.3.2 Handling new levels

One other very common problem with encoding categorical data is how to handle new, unseen levels. For example, let’s take a look at the recipe below:

rec <- recipe(score ~ f, data = df) %>% 
  step_dummy(f)

We will have no problem creating dummy variables within this recipe as long as the levels of \(f\) are within those contained in df$f (or, more mathematically, where \(f \in F\)). But what if, in a new sample, \(f =\) “purple” or \(f =\) “gray”? Let’s try and see what happens.

df2 <- data.frame(f = factor(c("blue", "green", "purple", "gray")),
                  score = rnorm(4))
rec %>% 
  prep() %>% 
  bake(new_data = df2)
## # A tibble: 4 x 3
##    score f_green f_red
##    <dbl>   <dbl> <dbl>
## 1  1.39        0     0
## 2  0.616       1     0
## 3 -2.35       NA    NA
## 4 -0.165      NA    NA

We end up propagating missing data, which is obviously less than ideal. Luckily, the solution is pretty straightforward. We just add a new step to our recipe to handle novel (or new) categories, lumping them all in their own level (labeled with the suffix _new).

rec <- recipe(score ~ f, data = df) %>% 
  step_novel(f) %>% 
  step_dummy(f)

rec %>% 
  prep() %>% 
  bake(new_data = df2)
## # A tibble: 4 x 4
##    score f_green f_red f_new
##    <dbl>   <dbl> <dbl> <dbl>
## 1  1.39        0     0     0
## 2  0.616       1     0     0
## 3 -2.35        0     0     1
## 4 -0.165       0     0     1

This is not perfect, because "purple" and "gray" may be highly different, and we’re modeling them as a single category. But at least we’re able to move forward with our model without introducing new missing data. As an aside, this is a small example of why having good training data is so important. If you don’t have all the levels of a categorical variable represented, you may end up essentially collapsing levels when there is meaningful variance that could be parsed out.

You can also use a similar approach with step_other() if you have a categorical variable with lots of levels (and a small-ish \(n\) by comparison). Using step_other(), you specify a threshold below which levels should be collapsed into a single “other” category. The threshold can be passed as a proportion or a frequency.
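
As a sketch (not part of our working recipe), collapsing rare levels of a variable like ethnic_cd before dummy coding might look like this, where the 5% threshold is an arbitrary choice:

recipe(score ~ ., train) %>% 
  step_other(ethnic_cd, threshold = 0.05) %>%  # levels below 5% get collapsed into "other"
  step_dummy(all_nominal())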

31.3.3 Final thoughts on encoding categorical data

There are, of course, many other ways you can encode categorical data. One important consideration is whether or not the variable is ordered (e.g., low, medium, high) in which case it may make sense to have a corresponding ordered numeric variable (e.g., \(0\), \(1\), \(2\)). Of course, the method of coding these ordered values will relate to assumptions of your modeling process. For example, in the previous example, we assumed that there is a linear, constant change across categories. In our experience, however, the combination of dummy coding (with potentially a one-hot alternative used), integer coding, or simply leaving the categorical variables as they are (for specific frameworks, like tree-based methods) is sufficient most (but not all) of the time. For a more complete discussion of encoding categorical data for predictive modeling frameworks, we recommend Chapter 5 of Kuhn & Johnson (2019).
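
For example, a small (hypothetical) illustration of treating an ordered factor as consecutive integers, under that linear, constant-change assumption:

ratings <- factor(c("low", "high", "medium", "low"),
                  levels = c("low", "medium", "high"),
                  ordered = TRUE)
as.integer(ratings) - 1
## [1] 0 2 1 0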

31.4 Dealing with low variance predictors

Occasionally you have (or can create) variables that are highly imbalanced. A common example might include a gender variable that takes on the values “male”, “female”, “non-binary”, “other”, and “refused to answer”. Once you dummy-code a variable like this, it is possible that one or more of the categories will be so infrequent that modeling that category becomes difficult. This is not to say that these categories are not important, particularly when considering how well your training dataset represents the real-world settings in which the model will be applied (and any demographic variable is going to be associated with issues of ethics). Ignoring this variation may lead to systematic biases in model predictions. However, you also regularly have to make compromises to get models to work and be useful. One of those compromises often includes (with many types of variables, not just demographics) dropping highly imbalanced predictors.

Let’s look back at our statewide testing data. We’ll bake the final recipe from the Creating a recipe section on the training data and look at the dummy variables that are created.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())

baked <- rec %>% 
  prep() %>% 
  bake(new_data = NULL)

Below is a table of just the categorical variables and the frequency of each value.

The relative frequency of many of these looks fine, but for some variables one category has very low frequency. For example, ayp_lep_M has 576 observations (from our random 2% sample) that were \(0\), and only 2 that were \(1\). The same is true for ayp_lep_S. We may therefore consider applying a near-zero variance filter to drop these columns. Let’s try this, and then we’ll talk a bit more about what the filter is actually doing.

rec_nzv <- rec %>% 
  step_nzv(all_predictors())

baked_rm_nzv <- rec_nzv %>% 
  prep() %>% 
  bake(new_data = NULL)

Let’s look at what columns are in baked that were removed from baked_rm_nzv.

removed_columns <- names(baked)[!(names(baked) %in% names(baked_rm_nzv))]
removed_columns
##  [1] "ethnic_cd_B"           "ethnic_cd_I"           "ethnic_cd_P"          
##  [4] "migrant_ed_fg_Y"       "ind_ed_fg_Y"           "ayp_lep_B"            
##  [7] "ayp_lep_M"             "ayp_lep_S"             "ayp_lep_W"            
## [10] "stay_in_dist_Y"        "stay_in_schl_Y"        "dist_sped_Y"          
## [13] "trgt_assist_fg_Y"      "ayp_dist_partic_Y"     "ayp_schl_partic_Y"    
## [16] "ayp_dist_prfrm_Y"      "ayp_schl_prfrm_Y"      "rc_dist_partic_Y"     
## [19] "rc_schl_partic_Y"      "rc_dist_prfrm_Y"       "rc_schl_prfrm_Y"      
## [22] "lang_cd_E"             "tst_atmpt_fg_Y"        "grp_rpt_dist_partic_Y"
## [25] "grp_rpt_schl_partic_Y" "grp_rpt_dist_prfrm_Y"  "grp_rpt_schl_prfrm_Y"

As you can see, the near-zero variance filter has been quite aggressive here, removing 27 columns. Looking back at our table of variables, we can see that, for example, there are 55 students coded Black out of 2841, and it could be reasonably argued that this column is worth keeping in the model.

So how is step_nzv working and how can we adjust it to be not quite so aggressive? Variables are flagged for being near-zero variance if they

  • have very few unique values, and
  • have a large ratio of the frequency of the most common value to the frequency of the second most common value

These criteria are implemented in step_nzv through the unique_cut and freq_cut arguments, respectively. unique_cut is estimated as the number of unique values divided by the total number of samples (length of the column) times 100 (i.e., it is a percent). freq_cut is estimated by dividing the most common level frequency by the second most common level frequency. The default for unique_cut is \(10\), while the default for freq_cut is \(95/5 = 19\). For a column to be “caught” by a near-zero variance filter, and removed from the training set, it must be below the specified unique_cut and above the specified freq_cut.

In the case of ethnic_cd_B, we see that there are two unique values, \(0\) and \(1\) (because it’s a dummy-coded variable). There are 2841 rows, so the unique_cut value is \((2 / 2841) \times 100 = 0.07\). The frequency ratio is \(2786/55 = 50.65\). It therefore meets both of the default criteria (below unique_cut and above freq_cut) and is removed.
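
We can verify this reasoning directly from the baked data (a quick check that roughly mirrors what step_nzv computes):

# percent unique: number of unique values over number of rows, times 100
length(unique(baked$ethnic_cd_B)) / nrow(baked) * 100

# frequency ratio: most common value over the second most common value
counts <- sort(table(baked$ethnic_cd_B), decreasing = TRUE)
counts[1] / counts[2]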

If you’re applying a near-zero variance filter on dummy variables, there will always be only 2 values, leading to a small unique_cut. This might encourage you to up the freq_cut to a higher value. Let’s try this approach

rec_nzv2 <- rec %>% 
  step_nzv(all_predictors(), 
           freq_cut = 99/1)

baked_rm_nzv2 <- rec_nzv2 %>% 
  prep() %>% 
  bake(new_data = NULL)

removed_columns2 <- names(baked)[!(names(baked) %in% names(baked_rm_nzv2))]
removed_columns2
##  [1] "ethnic_cd_P"           "ayp_lep_M"             "ayp_lep_S"            
##  [4] "ayp_lep_W"             "dist_sped_Y"           "ayp_dist_partic_Y"    
##  [7] "ayp_schl_partic_Y"     "rc_dist_partic_Y"      "rc_schl_partic_Y"     
## [10] "tst_atmpt_fg_Y"        "grp_rpt_dist_partic_Y" "grp_rpt_schl_partic_Y"
## [13] "grp_rpt_dist_prfrm_Y"

Removing near-zero variance dummy variables can be a bit tricky because they will essentially always meet the unique_cut criterion. But it can be achieved by adjusting the freq_cut argument, and this could even be considered part of your model tuning process. In this case, we’ve set it so variables will be removed if more than 99 of every 100 cases share the same value. This led to only 13 variables being flagged and removed. We could go even further, specifying, for example, that 499 out of every 500 cases must be the same for the variable to be flagged (see the sketch below). At some point, however, you’ll end up with variables that have such low variance that model estimation becomes difficult, which is the purpose of applying the near-zero variance filter in the first place.
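
For example, the stricter 499-out-of-500 version mentioned above would be specified like so:

rec %>% 
  step_nzv(all_predictors(), freq_cut = 499/1)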

31.5 Missing data

If we look closely at our statewide testing data, we will see that there is a considerable amount of missingness. In fact, every row of the data frame has at least one value that is missing. The amount to which missing data impacts your work varies by field, but in most fields you’re likely to run into situations where you have to handle missing data in some way. The purpose of this section is to discuss a few approaches using {recipes} to handle missingness for predictive modeling purposes. Note that this is not a comprehensive discussion on the topic (for which, we recommend Little and Rubin (2002)), but is instead an applied discussion of what you can do. As with many aspects of data analysis, generally, there is no single approach that will always work best, and it’s worth trying a few different approaches in your model development to see how different choices impact your model performance.

There are three basic ways of handling missing data:

  • Omit rows of the data frame that include missing values
  • Encode or Impute the missing data
  • Ignore the missing data and estimate from the available data

The last option is not always feasible and will depend on the modeling framework you’re working within. Some estimation procedures also lend themselves to efficient handling of missing data (for example, imputation via the posterior distribution with Bayesian estimation). In this section, we’ll mostly focus on the first two approaches. Additionally, we will only really be concerned with missingness on the predictor variables here, rather than the outcome. Generally, missing data in the predictors is a much more difficult problem than missingness in the outcome, because most models assume you have complete data across your predictors. This is not to say that missingness on your outcome is not challenging (it can be highly challenging, and can make model performance evaluation more difficult). However, without handling missing data on your predictors, you generally cannot even fit the model. So we’ll mostly focus there.

31.5.1 Omission

We can omit missing data with step_naomit(). This will remove any row that has any missing data. Let’s see how this impacts our data, working with our same recipe we finished up with in the Creating a recipe section. I’ve placed the recipe here again so we don’t have to go back to remind ourselves what we did previously.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  step_zv(all_predictors())

na_omit_data <- rec %>% 
  step_naomit(all_predictors()) %>% 
  step_dummy(all_nominal()) %>% 
  prep() %>% 
  bake(new_data = NULL)

nrow(na_omit_data)
## [1] 573

As can be seen above, when we omit any row with missing data we end up with only 573 rows out of the original 2841 rows in the training data (or approximately 20% of the original data). This level of data omission is highly likely to introduce systematic biases into your model prediction. Generally, step_naomit() should only be used when developing preliminary models, where you’re just trying to get code to run. When you get to the point where you’re actually trying to improve performance, you should consider alternative approaches.

31.5.2 Encoding and simple imputation

Encoding missing data is similar to imputation. In imputation, we replace the missing value with something we think could reasonably have been the real value, had it been observed. When we encode missing data, we instead create a value that represents the missingness itself and is included in the modeling process. For example, with categorical variables, we could replace the missingness with a “missing” level, which would then get its own dummy code (if we were using dummy coding to encode the categorical variables).

I mentioned in the Creating a recipe section that we were getting warnings but I was omitting them in the text. The reason for these warnings is that some of these columns have missing data. If we want to avoid this warning, we have to add an additional step to our recipe to encode the missing data in the categorical variables. This step is called step_unknown(), and it replaces missing values with "unknown". Let’s do this for all categorical variables and omit any rows that are missing on numeric columns.

na_encode_data <- rec %>% 
  step_unknown(all_nominal()) %>% 
  step_naomit(all_predictors()) %>% 
  step_dummy(all_nominal()) %>% 
  prep() %>% 
  bake(new_data = NULL)

nrow(na_encode_data)
## [1] 2788

Notice in the above that when I call step_naomit() I state that it should be applied to all_predictors() because I’ve already encoded the nominal predictors in the previous step. This approach allows us to capture 98% of the original data. And as a bonus, we’ve removed the warnings. (Note: we might also want to apply step_novel() for any future data that had levels outside of our training data - see Handling new levels).

Just a slight step up in complexity from omitting rows with missing data is imputing missing values with sample descriptive statistics, such as the mean or the median. Generally, I’ve found that median imputation works better than mean imputation, but that could be related to the types of data I work with most frequently. Let’s switch datasets so we can see what’s happening more directly.

Let’s look at the airquality dataset, which ships with R.

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

As we can see, Solar.R is missing for observations 5 and 6. Let’s compute the sample mean and median for this column.

mean(airquality$Solar.R, na.rm = TRUE)
## [1] 185.9315
median(airquality$Solar.R, na.rm = TRUE)
## [1] 205

If we use mean or median imputation, we just replace the missing values with these sample statistics. Let’s do this in a new recipe, assuming we’ll be fitting a model where Ozone is the outcome, predicted by all other variables in the dataset.

recipe(Ozone ~ ., data = airquality) %>% 
  step_meanimpute(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 153 x 6
##    Solar.R  Wind  Temp Month   Day Ozone
##      <int> <dbl> <int> <int> <int> <int>
##  1     190   7.4    67     5     1    41
##  2     118   8      72     5     2    36
##  3     149  12.6    74     5     3    12
##  4     313  11.5    62     5     4    18
##  5     186  14.3    56     5     5    NA
##  6     186  14.9    66     5     6    28
##  7     299   8.6    65     5     7    23
##  8      99  13.8    59     5     8    19
##  9      19  20.1    61     5     9     8
## 10     194   8.6    69     5    10    NA
## # … with 143 more rows

As we can see, the value \(186\) has been imputed for rows 5 and 6, which is the integer version of the sample mean (an integer was imputed because the column was already an integer, and not a double).

Let’s try the same thing with median imputation

recipe(Ozone ~ ., data = airquality) %>% 
  step_medianimpute(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 153 x 6
##    Solar.R  Wind  Temp Month   Day Ozone
##      <int> <dbl> <int> <int> <int> <int>
##  1     190   7.4    67     5     1    41
##  2     118   8      72     5     2    36
##  3     149  12.6    74     5     3    12
##  4     313  11.5    62     5     4    18
##  5     205  14.3    56     5     5    NA
##  6     205  14.9    66     5     6    28
##  7     299   8.6    65     5     7    23
##  8      99  13.8    59     5     8    19
##  9      19  20.1    61     5     9     8
## 10     194   8.6    69     5    10    NA
## # … with 143 more rows

And as we would expect, the missingness has now been replaced with values of \(205\).

Sometimes you have time series data, or there is a date variable in the dataset that accounts for a meaningful proportion of the variance. In these cases, you might consider step_rollimpute(), which provides a conditional median imputation based on time, and you can set the size of the window from which to calculate the median. In still other cases it may make sense to just impute with the lowest observed value (i.e., assume a very small amount of the predictor), which can be accomplished with step_lowerimpute().
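
As a rough sketch, a rolling-window imputation for Solar.R in the airquality data (whose rows are already in time order) might look like the code below; the window size is an arbitrary choice, and we convert the column to a double first since it is stored as an integer.

recipe(Ozone ~ ., data = airquality) %>% 
  step_mutate(Solar.R = as.numeric(Solar.R)) %>%  # convert the integer column to a double
  step_rollimpute(Solar.R, window = 5) %>%        # impute with the median of a 5-row window
  prep() %>% 
  bake(new_data = NULL)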

These simple imputation techniques are fine to use when developing models. However, it’s an area that may be worth returning to as you start to refine your model to see if you can improve performance.

31.5.3 Modeling the missingness

Another alternative for imputation is to fit a statistical model with the column you want to impute modeled as the outcome, with all other columns (minus the actual outcome) predicting it. We then use that model for the imputation. Let’s first consider a linear regression model. We’ll fit the same model we specified in our recipe, using the airquality data.

m <- lm(Solar.R ~ ., data = airquality[ ,-1])
summary(m)
## 
## Call:
## lm(formula = Solar.R ~ ., data = airquality[, -1])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -182.945  -67.348    5.295   73.781  170.068 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.5960    84.4424  -0.173 0.863016    
## Wind          2.1661     2.2633   0.957 0.340171    
## Temp          3.7023     0.9276   3.991 0.000105 ***
## Month       -13.2640     5.4525  -2.433 0.016242 *  
## Day          -1.0631     0.8125  -1.308 0.192875    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 85.14 on 141 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.131,  Adjusted R-squared:  0.1063 
## F-statistic: 5.313 on 4 and 141 DF,  p-value: 0.0005145

Notice that I’ve dropped the first column here, which is Ozone, our actual outcome. The model above has been fit using the equivalent of step_naomit(), otherwise known as listwise deletion, where any row with any missing data is removed.

We can now use the coefficients from this model to impute the missing values in Solar.R. For example, row 6 in the dataset had a missing value on Solar.R and the following values for all other variables:

row6 <- data.frame(Wind = 14.9, 
                   Temp = 66, 
                   Month = 5, 
                   Day = 6)

Using our model, we would predict the following score for this missing value

predict(m, newdata = row6)
##        1 
## 189.3325

Let’s try this using {recipes}.

recipe(Ozone ~ ., data = airquality) %>% 
  step_impute_linear(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 153 x 6
##    Solar.R  Wind  Temp Month   Day Ozone
##      <int> <dbl> <int> <int> <int> <int>
##  1     190   7.4    67     5     1    41
##  2     118   8      72     5     2    36
##  3     149  12.6    74     5     3    12
##  4     313  11.5    62     5     4    18
##  5     152  14.3    56     5     5    NA
##  6     189  14.9    66     5     6    28
##  7     299   8.6    65     5     7    23
##  8      99  13.8    59     5     8    19
##  9      19  20.1    61     5     9     8
## 10     194   8.6    69     5    10    NA
## # … with 143 more rows

And we see two important things here. First, row 6 for Solar.R is indeed as we expected it to be (albeit, in integer form). Second, the imputed values for rows 5 and 6 are now different, which is the first time we’ve seen this via imputation.

The same basic approach can be used for essentially any statistical model. The {recipes} package has currently implemented linear imputation (as above), \(k\)-nearest neighbor imputation, and bagged imputation (via bagged trees). Let’s see how rows 5 and 6 differ with these approaches.

# k-nearest neighbor imputation
recipe(Ozone ~ ., data = airquality) %>% 
  step_knnimpute(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 153 x 6
##    Solar.R  Wind  Temp Month   Day Ozone
##      <int> <dbl> <int> <int> <int> <int>
##  1     190   7.4    67     5     1    41
##  2     118   8      72     5     2    36
##  3     149  12.6    74     5     3    12
##  4     313  11.5    62     5     4    18
##  5     159  14.3    56     5     5    NA
##  6     220  14.9    66     5     6    28
##  7     299   8.6    65     5     7    23
##  8      99  13.8    59     5     8    19
##  9      19  20.1    61     5     9     8
## 10     194   8.6    69     5    10    NA
## # … with 143 more rows
# bagged imputation
recipe(Ozone ~ ., data = airquality) %>% 
  step_bagimpute(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 153 x 6
##    Solar.R  Wind  Temp Month   Day Ozone
##      <int> <dbl> <int> <int> <int> <int>
##  1     190   7.4    67     5     1    41
##  2     118   8      72     5     2    36
##  3     149  12.6    74     5     3    12
##  4     313  11.5    62     5     4    18
##  5      99  14.3    56     5     5    NA
##  6     252  14.9    66     5     6    28
##  7     299   8.6    65     5     7    23
##  8      99  13.8    59     5     8    19
##  9      19  20.1    61     5     9     8
## 10     194   8.6    69     5    10    NA
## # … with 143 more rows

These models are quite a bit more flexible than linear regression, and can potentially overfit. You can, however, control some of the parameters to the models through additional arguments (e.g., \(k\) for \(knn\), which defaults to 5). The benefit of these models is that they may provide better estimates of what the imputed value would have been, were it not missing, which may then improve model performance. The downside is that they are quite a bit more computationally intensive. Generally, you use recipes within processes like \(k\)-fold cross-validation, with the recipe being applied to each fold. In this case, a computationally expensive approach may significantly bog down hyperparameter tuning.
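
For example, the number of neighbors used by the \(k\)-nearest neighbor imputation above can be passed directly to the step (a sketch; 3 is an arbitrary choice):

recipe(Ozone ~ ., data = airquality) %>% 
  step_knnimpute(all_predictors(), neighbors = 3)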

31.5.4 A few words of caution

Missing data is a highly complex topic. This section was meant to provide a basic overview of some of the options you can choose from when building a predictive model. None of these approaches, however, will “fix” data that are missing not at random (MNAR). Unfortunately, it is usually impossible to know if your data are MNAR, and we therefore assume that data are missing at random (MAR), or missing at random conditional on the observed data. For example, if boys were more likely to have missing data on the outcome than girls, we could account for this by including a gender variable in the model, and the resulting data would be MAR.

If you have significant missing data, this section is surely incomplete. We recommended Little and Rubin (2002) previously, and there are a number of other good resources, including a chapter in Kuhn and Johnson (2019).

31.6 Transformations

In standard inferential statistics, we are often concerned with the distribution of the outcome. Linear regression, for example, assumes the residuals (i.e., the outcome, conditional on the model) are at least reasonably normally distributed. If this is not the case, the standard errors (in particular) can be misrepresented. We therefore generally inspect the outcome before modeling it and, if it is not approximately normally distributed, we either transform it to make it more closely approximate a normal distribution, or we use an analysis technique that does not assume normality.

In predictive modeling, transformations of the predictors or the outcome(s) (or both) can sometimes help improve model performance. For example, let’s quickly simulate some data.

set.seed(3)

# parameters
alpha <- 10
b1 <- 5

# simulate predictor variable
x <- 1:100
log_x <- log(x)

# residual error (SD = 0.8)
e <- rnorm(length(x), sd = 0.8)

y <- alpha + b1*log_x + e

sim <- data.frame(x, y)

As you can see from the above, we have simulated the data according to \(\log x\), but our data frame only has \(x\). This is a common situation where we don’t know the true functional form. But of course, if we fit a linear regression model to these data, we’ll end up with high bias, particularly in the lower tail (and issues with heteroscedasticity).

ggplot(sim, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE)

But all we have to do, in this case, is apply a log transformation to the \(x\) variable and our linear model fits great.

sim %>% 
  mutate(log_x = log(x)) %>% 
  head()
##   x         y     log_x
## 1 1  9.230453 0.0000000
## 2 2 13.231715 0.6931472
## 3 3 15.700092 1.0986123
## 4 4 16.009766 1.3862944
## 5 5 18.203816 1.6094379
## 6 6 18.982897 1.7917595
sim %>% 
  mutate(log_x = log(x)) %>% 
  ggplot(aes(log_x, y)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE)

Note that the model is linear in the transformed units, but curvilinear on the raw scale.

sim %>% 
  mutate(log_x = log(x)) %>% 
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              formula = y ~ log(x))

So in this case, a log transformation to the x variable works perfectly (as we would expect, given that we simulated the data to be this way). But how do we know how to transform variables?

31.6.1 Box-Cox and similar transformations

A more general formula for transforming variables is given by the Box-Cox transformation, defined by

\[ \begin{equation} x^* = \begin{cases} \frac{x^\lambda-1}{\lambda}, & \text{if}\ \lambda \neq 0 \\ \log\left(x\right), & \text{if}\ \lambda = 0 \end{cases} \end{equation} \] where \(x\) represents the variable in its raw units, and \(x^*\) represents the transformed variable. The Box-Cox transformation is a power transformation, where the intent is to estimate \(\lambda\). Note that if \(\lambda\) is estimated as zero, the power transformation is the same as a log transformation; otherwise the top portion of the equation is used. Fortunately, specific values of \(\lambda\) map to common transformations.

  • \(\lambda = 1\): No transformation
  • \(\lambda = 0.5\): square root transformation
  • \(\lambda = 0\): log transformation
  • \(\lambda = -1\): inverse
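
To make this mapping concrete, here is a small helper function (not part of {recipes}) that implements the formula above:

box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

box_cox(1:5, lambda = 0)    # identical to log(1:5)
box_cox(1:5, lambda = 0.5)  # a shifted/scaled square root
box_cox(1:5, lambda = 1)    # x - 1, i.e., effectively no transformation
box_cox(1:5, lambda = -1)   # 1 - 1/x, a shifted/scaled inverse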

Given the above, we would expect that \(\lambda\) would be estimated close to zero with our simulated data. Let’s try using {recipes}. To access the actual \(\lambda\) value, we’ll need to take a brief foray into tidying recipes.

31.6.1.1 Tidying recipes

Let’s first specify the recipe with a Box-Cox transformation to our \(x\) variable.

rec <- recipe(y ~ x, data = sim) %>% 
  step_BoxCox(all_predictors())

Now we can tidy the recipe.

tidy(rec)
## # A tibble: 1 x 6
##   number operation type   trained skip  id          
##    <int> <chr>     <chr>  <lgl>   <lgl> <chr>       
## 1      1 step      BoxCox FALSE   FALSE BoxCox_dj7Ww

In this case, our recipe is incredibly simple. We have one step, which is a Box-Cox transformation. Let’s make the recipe a bit more complicated just for completeness.

rec <- recipe(y ~ x, data = sim) %>% 
  step_impute_linear(all_predictors()) %>% 
  step_nzv(all_predictors()) %>% 
  step_BoxCox(all_numeric(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

Most of these steps won’t do anything in this case, but let’s look at the tidied recipe now.

tidy(rec)
## # A tibble: 4 x 6
##   number operation type          trained skip  id                 
##    <int> <chr>     <chr>         <lgl>   <lgl> <chr>              
## 1      1 step      impute_linear FALSE   FALSE impute_linear_z6myS
## 2      2 step      nzv           FALSE   FALSE nzv_NxJt6          
## 3      3 step      BoxCox        FALSE   FALSE BoxCox_8d2bJ       
## 4      4 step      dummy         FALSE   FALSE dummy_J8I0L

Now we have four steps. We can look at any one step by declaring the step number. Let’s look at the linear imputation

tidy(rec, n = 1)
## # A tibble: 1 x 3
##   terms            model id                 
##   <chr>            <lgl> <chr>              
## 1 all_predictors() NA    impute_linear_z6myS

Notice there’s nothing there, because at this point the recipe is still just a blueprint. We have to prep() the recipe if we want it to actually do any work. Let’s prep the recipe and try again.

lm_imputation <- rec %>% 
  prep() %>% 
  tidy(n = 1)

lm_imputation
## # A tibble: 1 x 3
##   terms model        id                 
##   <chr> <named list> <chr>              
## 1 x     <lm>         impute_linear_z6myS

And now we can see a linear model has been fit. We can even access the model itself.

lm_imputation$model
## $x
## 
## Call:
## NULL
## 
## Coefficients:
## (Intercept)  
##        50.5

What we get is actually a list of models, one for each predictor. But in this case there’s only one predictor, so the list is only of length 1.

31.6.1.2 Estimating \(\lambda\)

We can do the same thing to find \(\lambda\) by tidying the Box-Cox step

rec %>% 
  prep() %>% 
  tidy(n = 3)
## # A tibble: 1 x 3
##   terms value id          
##   <chr> <dbl> <chr>       
## 1 x     0.717 BoxCox_8d2bJ

And without any further work we can see that we estimated \(\lambda = 0.72\), which is roughly midway between a square-root transformation and no transformation. Why did it not estimate a log transformation as most appropriate? Because the log relation in our simulated data is between \(x\) and \(y\), while the Box-Cox transformation is an unsupervised approach: it only looks at \(x\) itself, attempting to make each variable more closely approximate a univariate normal distribution. As we’ll see in the next section, there are other methods that can be used to help with issues of non-linearity.

For completeness, let’s see if the transformation helped us. We’ll use \(\lambda = 0.72\) to manually transform \(x\), then plot the result.

# transform x
sim <- sim %>% 
  mutate(x_bc = ((x^0.72) - 1) / 0.72)

# fit the model using the transformed data
m <- lm(y ~ x_bc, sim)

# add the model predictions to the data frame
sim <- sim %>% 
  mutate(pred = predict(m))

# plot the model fit using raw data on the x-axis
ggplot(sim, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = pred))

As we can see, it’s better than the raw data, but still insufficient.

31.6.2 An applied example

Let’s look at an applied example. We’ll use the violence data (see the full data dictionary here) and see if we can predict the neighborhoods where the number of murders is greater than zero, using the percentage of people living in poverty and the percentage of people living in dense housing units (more than one person per room) as predictors. Let’s start with a basic plot.

library(sds)
violence <- get_data("violence")

violence <- violence %>% 
  mutate(murder = ifelse(murders > 0, "Yes", "No"))

ggplot(violence, aes(pctPoverty, pctPopDenseHous)) +
  geom_point(aes(color = murder),
             alpha = 0.5,
             stroke = 0)

As you can see, it’s pretty difficult to see much separation here. Let’s look at the univariate views of each predictor.

ggplot(violence, aes(pctPoverty)) +
  geom_histogram()

ggplot(violence, aes(pctPopDenseHous)) +
  geom_histogram()

Both predictors are quite skewed. What do they look like after transformation?

murder_rec <- recipe(murder ~ ., violence) %>% 
  step_BoxCox(all_numeric(), -all_outcomes()) 

transformed_murder <- murder_rec %>% 
  prep() %>% 
  bake(new_data = NULL) 

ggplot(transformed_murder, aes(pctPoverty)) +
  geom_histogram()

ggplot(transformed_murder, aes(pctPopDenseHous)) +
  geom_histogram()

Each of these looks considerably better. What about the bivariate view?

ggplot(transformed_murder, aes(pctPoverty, pctPopDenseHous)) +
  geom_point(aes(color = murder),
             alpha = 0.5,
             stroke = 0)

We can much more clearly see the separation here. We could almost draw a diagonal line in the data separating the classes, as below

There’s of course still some misclassification going on here, and that line was drawn by just eye-balling it, but even by hand we can do this much more easily after the transformation.

What lambda values were estimated for these variables? Let’s check.

murder_rec %>% 
  prep() %>% 
  tidy(1) %>% 
  filter(terms %in% c("pctPoverty", "pctPopDenseHous"))
## # A tibble: 2 x 3
##   terms             value id          
##   <chr>             <dbl> <chr>       
## 1 pctPoverty       0.177  BoxCox_syfpc
## 2 pctPopDenseHous -0.0856 BoxCox_syfpc

Both are fairly close to zero, implying they are similar to log transformations.
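That makes sense because, as \(\lambda\) approaches zero, the Box-Cox transformation approaches \(\log(x)\). A quick numerical sketch (the values and \(\lambda\) here are arbitrary, chosen only for illustration):

# as lambda approaches zero, (x^lambda - 1) / lambda approaches log(x)
vals <- c(1, 10, 100)
lambda <- 0.01
cbind(log = log(vals), box_cox = (vals^lambda - 1) / lambda)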

31.7 Nonlinearity

Very regularly, predictor variables will have a nonlinear relation with the outcome. In the previous section, we simulated data to follow a log-linear trend, which is linear on the log scale and curvilinear on the raw scale. As a reminder, the data looked like this

ggplot(sim, aes(x, y)) +
  geom_point()

If we just looked at a plot of these data, we’d probably think to ourselves “that looks a whole lot like a log function!”. But let’s pretend for a moment that we’re not quite as brilliant as we actually are and we, say, didn’t plot the data first (always plot the data first!). Or maybe we’re just not all that familiar with the log function, and we didn’t realize that a simple transformation could cure all our woes. How would we go about modeling this? Clearly a linear relation would be insufficient.

There are many different options for modeling curvilinear relations, but we’ll focus on two: basis expansion via polynomial transformations and natural splines.

Basis expansion is defined, mathematically, as

\[ f(X) = \sum_{m = 1}^M\beta_mh_m(X) \] where \(h_m\) is the \(m\)th transformation of \(X\). Once the transformations are defined, multiple coefficients (\(\beta\)) are estimated to represent each transformation, which sum to form the non-linear trend. However, the model is itself still linear. We illustrate below.

31.7.1 Polynomial transformations

Polynomial transformations include the variable in its raw units, plus additional (basis expansion) units raised to a given power. For example, a cubic trend could be fit by transforming \(X\) into \(X\), \(X^2\), and \(X^3\), and estimating coefficients for each. Let’s do this manually with the sim data from above.

# remove variables from previous example
# keep only `x` and `y`
sim <- sim %>% 
  select(x, y)

sim <- sim %>% 
  mutate(x2 = x^2,
         x3 = x^3) %>% 
  select(starts_with("x"), y)

head(sim)
##   x x2  x3         y
## 1 1  1   1  9.230453
## 2 2  4   8 13.231715
## 3 3  9  27 15.700092
## 4 4 16  64 16.009766
## 5 5 25 125 18.203816
## 6 6 36 216 18.982897

Now let’s fit a model to the data using the polynomial expansion, and plot the result.

poly_m <- lm(y ~ ., data = sim)
sim <- sim %>% 
  mutate(poly_pred = predict(poly_m))
head(sim)
##   x x2  x3         y poly_pred
## 1 1  1   1  9.230453  14.99051
## 2 2  4   8 13.231715  15.63371
## 3 3  9  27 15.700092  16.25849
## 4 4 16  64 16.009766  16.86514
## 5 5 25 125 18.203816  17.45393
## 6 6 36 216 18.982897  18.02516
ggplot(sim, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = poly_pred))

Not bad!

Now let’s replicate it using {recipes}.

sim <- sim %>% 
  select(x, y)

poly_rec <- recipe(y ~ ., sim) %>% 
  step_poly(all_predictors(), degree = 3)

poly_d <- poly_rec %>% 
  prep() %>% 
  bake(new_data = NULL)

poly_d
## # A tibble: 100 x 4
##        y x_poly_1 x_poly_2 x_poly_3
##    <dbl>    <dbl>    <dbl>    <dbl>
##  1  9.23   -0.171    0.217  -0.249 
##  2 13.2    -0.168    0.204  -0.219 
##  3 15.7    -0.165    0.191  -0.190 
##  4 16.0    -0.161    0.178  -0.163 
##  5 18.2    -0.158    0.166  -0.137 
##  6 19.0    -0.154    0.154  -0.113 
##  7 19.8    -0.151    0.142  -0.0904
##  8 21.3    -0.147    0.131  -0.0690
##  9 20.0    -0.144    0.119  -0.0489
## 10 22.5    -0.140    0.108  -0.0302
## # … with 90 more rows

Notice that these values look quite a bit different from the ones we had before because these are orthogonal polynomials. Orthogonal polynomials have some advantages over raw polynomials, including that, in a standard linear regression modeling framework, each higher-order term can be directly evaluated for its contribution to the model. This is because raw (non-orthogonal) polynomial terms are highly correlated with one another, whereas orthogonal terms are not. However, the fitted values remain the same, as we can see below.
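As a quick aside, we can check the correlation structure directly: the raw polynomial terms are nearly collinear, while the orthogonal terms from step_poly() are essentially uncorrelated (using the sim and poly_d objects from above):

# raw terms: x and x^2 are very highly correlated
cor(sim$x, sim$x^2)

# orthogonal terms: correlations are essentially zero
cor(poly_d[, c("x_poly_1", "x_poly_2", "x_poly_3")])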

poly_m2 <- lm(y ~ ., data = poly_d)
sim <- sim %>% 
  mutate(poly_pred2 = predict(poly_m2))

ggplot(sim, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = poly_pred2))

But let’s be even more precise and verify that they are, indeed, identical in their fitted values.

all(
  round(predict(poly_m), 7) == round(predict(poly_m2), 7)
)
## [1] TRUE

This shows that, to seven decimal places, the results are identical (note, the only reason I had to round at all was because of floating point hell).
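As a side note, an alternative to rounding manually is base R’s all.equal(), which compares numeric vectors up to a small tolerance:

# TRUE if the fitted values match within a small numerical tolerance
all.equal(predict(poly_m), predict(poly_m2))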

31.7.2 Splines

When thinking about splines, we find it most helpful to first think about discontinuous splines. Splines divide the “x” region of the predictor into bins and fit a model within each bin. Let’s first look at a very simple model, where we divide our sim dataset into lower and upper portions.

# get only x and y back
sim <- sim %>% 
  select(x, y)

# Fit a model to the lower and upper parts
lower_part <- lm(y ~ x, filter(sim, x <= 50))
upper_part <- lm(y ~ x, filter(sim, x > 50))

# make predictions from each part
lower_pred <- predict(lower_part, newdata = data.frame(x = 0:50))
upper_pred <- predict(upper_part, newdata = data.frame(x = 51:100))

# graph the results
ggplot(sim, aes(x, y)) +
  annotate("rect", 
           xmin = -Inf, ymin = -Inf, ymax = Inf, xmax = 50,
           fill = "#7EC1E7", alpha = 0.3) +
  geom_point() +
  geom_line(data = data.frame(x = 0:50, y = lower_pred)) +
  geom_line(data = data.frame(x = 51:100, y = upper_pred))

As you can see, this gets at the same non-linearity, but with two models. We can better approximate the curve by increasing the number of “bins” along the x-axis. Let’s use 10 bins instead.

ten_bins <- sim %>% 
  mutate(tens = as.integer(x/10)) %>% 
  group_by(tens) %>% 
  nest() %>% 
  mutate(m = map(data, ~lm(y ~ x, .x)),
         pred_frame = map2(data, m, ~data.frame(x = .x$x, y = predict(.y))))

ggplot(sim, aes(x, y)) +
  geom_vline(xintercept = seq(10, 90, 10),
             color = "gray30") +
  geom_point() +
  map(ten_bins$pred_frame, ~geom_line(data = .x))

This approximates the underlying curve quite well.

We can try again by fitting polynomial models within each bin to see if that helps. Note we’ve placed the point where x == 100 into the bin for 9, because there’s only one point in that bin and the model can’t be fit to a single point.

ten_bins <- sim %>% 
  mutate(tens = as.integer(x / 10),
         tens = ifelse(tens == 10, 9, tens)) %>% 
  group_by(tens) %>% 
  nest() %>% 
  mutate(m = map(data, ~lm(y ~ poly(x, 3), .x)),
         pred_frame = map2(data, m, ~data.frame(x = .x$x, 
                                                y = predict(.y))))

ggplot(sim, aes(x, y)) +
  geom_vline(xintercept = seq(10, 90, 10),
             color = "#7EC1E7") +
  geom_point() +
  map(ten_bins$pred_frame, ~geom_line(data = .x))

But now we have 10 different models! Wouldn’t it be better if we had a single curve? YES! Of course. And that’s exactly what a spline is. It forces each of these curves to connect in a smooth way. The points defining the bins are referred to as “knots”.

Let’s fit a spline to these data using 8 degrees of freedom; ns() then places the interior knots for us at quantiles of \(x\), with a boundary knot at each end.

library(splines)
ns_mod <- lm(y ~ ns(x, 8), sim)

ggplot(sim, aes(x, y)) +
  geom_point() +
  geom_line(data = data.frame(x = 1:100, y = predict(ns_mod)))

And now we have our single curve! The purpose of our discussion here is to build conceptual understandings of splines, and we will therefore not go into the mathematics behind the constraints leading to a single curve. However, for those interested, we recommend Hastie, Tibshirani, and Friedman.

One of the drawbacks of polynomial regression is that it can be wild in the tails. Natural splines avoid this problem by constraining the tails of the curve to be linear. Sometimes, however, this property may not be desirable, and that’s when we instead use a B-spline.

bs_mod <- lm(y ~ bs(x, 8), sim)

ggplot(sim, aes(x, y)) +
  geom_point() +
  geom_line(data = data.frame(x = 1:100, y = predict(bs_mod)))

Let’s look at the true data-generating curve with each of the splines, omitting the points for clarity.

ns_d <- data.frame(x = 1:100, 
                   y = predict(ns_mod), 
                   method = "Natural Spline")
bs_d <- data.frame(x = 1:100, 
                   y = predict(bs_mod), 
                   method = "B-Spline")

ggplot(rbind(ns_d, bs_d), aes(x, y)) +
  geom_smooth(data = sim,
              method = "lm",
              formula = y ~ log(x),
              se = FALSE, 
              color = "#5E8192") +
  geom_line(aes(color = method)) +
  facet_wrap(~method)

Both are quite close, but you can see the natural spline does not dip quite as close to the true value near zero as the B-spline. The B-spline is also a tiny bit more wiggly in the upper tail (because, unlike the natural spline, it is not constrained to be linear there). The number of knots is also likely a bit too high here and was chosen fairly arbitrarily.

31.7.2.1 Basis expansion with {recipes}

Basis expansion via recipes proceeds in much the same way as we’ve seen before. We just specify a step identifying the type of basis expansion we want to conduct, along with any additional arguments, such as the degrees of freedom (which determines the number of knots).

Let’s try building a dataset using basis expansion with a B-Spline

bs_d <- recipe(y ~ x, data = sim) %>% 
  step_bs(x, deg_free = 8) %>% 
  prep() %>% 
  bake(new_data = NULL)

bs_d
## # A tibble: 100 x 9
##        y x_bs_1  x_bs_2    x_bs_3 x_bs_4 x_bs_5 x_bs_6 x_bs_7 x_bs_8
##    <dbl>  <dbl>   <dbl>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1  9.23  0     0       0              0      0      0      0      0
##  2 13.2   0.166 0.00531 0.0000371      0      0      0      0      0
##  3 15.7   0.301 0.0204  0.000297       0      0      0      0      0
##  4 16.0   0.407 0.0441  0.00100        0      0      0      0      0
##  5 18.2   0.488 0.0751  0.00237        0      0      0      0      0
##  6 19.0   0.545 0.112   0.00464        0      0      0      0      0
##  7 19.8   0.580 0.154   0.00801        0      0      0      0      0
##  8 21.3   0.596 0.200   0.0127         0      0      0      0      0
##  9 20.0   0.596 0.248   0.0190         0      0      0      0      0
## 10 22.5   0.582 0.298   0.0270         0      0      0      0      0
## # … with 90 more rows

And as you can see, we get similar output to polynomial regression, but we’re just using a spline for our basis expansion now instead of polynomials. Note that the degrees of freedom can also be set to tune() and tuned within tune::tune_grid(), as sketched below. In other words, the amount of basis expansion (degree of “wiggliness”) can be treated as a hyperparameter to be tuned.
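Here’s a minimal sketch of what that might look like. This is just a blueprint (the deg_free range below is arbitrary); in practice it would be combined with a model specification and a resampling object inside tune::tune_grid().

library(tune)
library(dials)

# flag the spline degrees of freedom for tuning
bs_tune_rec <- recipe(y ~ x, data = sim) %>% 
  step_bs(x, deg_free = tune())

# a candidate grid of degrees of freedom to evaluate
grid_regular(deg_free(range = c(3L, 10L)), levels = 4)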

Once the basis expansion is conducted, we can model it just like any other linear model.

bs_mod_2 <- lm(y ~ x_bs_1 + x_bs_2 + x_bs_3 +
                     x_bs_4 + x_bs_5 + x_bs_6 +
                     x_bs_7 + x_bs_8,
                 data = bs_d)

And this will give us the same results we got before.

all(
  round(predict(bs_mod), 7) == round(predict(bs_mod_2), 7)
)
## [1] TRUE

To use polynomial transformations or natural splines instead, we would just use step_poly() or step_ns(), respectively.
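For example, a natural spline version of the basis expansion above might look like this (same data, same degrees of freedom):

ns_basis_d <- recipe(y ~ x, data = sim) %>% 
  step_ns(x, deg_free = 8) %>% 
  prep() %>% 
  bake(new_data = NULL)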

One rather important note on tuning splines: you will probably want to avoid using code like

recipe(murders ~ ., data = violence) %>% 
  step_ns(all_predictors(), deg_free = tune())
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor        147
## 
## Operations:
## 
## Natural Splines on all_predictors()

because this will constrain the smooth to be the same for all predictors. Instead, use exploratory analyses and plots to figure out which variables should be fit with splines, and then tune them separately. For example

recipe(murders ~ ., data = violence) %>% 
  step_ns(pctPoverty, deg_free = tune()) %>% 
  step_ns(houseVacant, deg_free = tune())
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor        147
## 
## Operations:
## 
## Natural Splines on pctPoverty
## Natural Splines on houseVacant

Of course, the more splines you are tuning, the more computationally expensive it will become, so you’ll need to balance this with your expected return in model improvement (i.e., you likely can’t tune every predictor in your dataset).

31.8 Interactions

An interaction occurs when the slope, or relation, between one variable and the outcome depends upon a second variable. This is also often referred to as moderation or a moderating variable. For example, consider data that look like this:

benchmarks <- get_data("benchmarks")

ggplot(benchmarks, aes(rdg_fall, rdg_spr)) +
  geom_point(color = "gray70") +
  geom_smooth(aes(color = ell),
                  method = "lm")

where ell stands for “English language learner”. The plot above shows the relation between students’ scores on a reading assessment administered in the fall and the spring, by ell status. Notice that the slope for Non-ELL students is markedly steeper than for the other two groups. ell status thus moderates the relation between the fall and spring assessment scores. If we’re fitting a model within a linear regression framework, we’ll want to make sure that we model the slopes separately for each of these groups. Otherwise, we would be leaving out important structure in the data, and our model performance would suffer. Interactions are powerful and important to consider, particularly if there is strong theory (or empirical evidence) suggesting that an interaction exists. However, they are not needed in many frameworks and can be derived from the data directly (e.g., tree-based methods, where the degree of possible interaction is determined by the depth of the tree). In this section, we therefore focus on interactions within a linear regression framework.
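In regression terms, an interaction is just a product term. For a continuous predictor \(x\) and a (dummy-coded) moderator \(g\), the model is

\[
\hat{y} = \beta_0 + \beta_1 x + \beta_2 g + \beta_3 (x \times g),
\]

where \(\beta_3\) captures how the slope of \(x\) differs when \(g = 1\) versus \(g = 0\); if \(\beta_3 = 0\), the slopes are parallel and there is no interaction.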

31.8.1 Creating interactions “by hand”

Even with base R, modeling interactions is pretty straightforward. In the above, we would just specify something like

lm(rdg_spr ~ rdg_fall*ell, data = benchmarks)

or equivalently,

lm(rdg_spr ~ rdg_fall + ell + rdg_fall:ell, data = benchmarks)

However, this is doing some work under the hood for you which might go unnoticed. First, it’s dummy-coding ell, then it’s creating, in this case, two new variables that are equal to \(ELL_{Monitor} \times Rdg_{fall}\) and \(ELL_{Non-ELL} \times Rdg_{fall}\) (with \(ELL_{Active}\) set as the reference group by default).

Let’s try doing this manually. First we need to dummy code ell. To make things a bit more clear, I’ll only select the variables here that we’re using in our modeling.

dummies <- benchmarks %>% 
  select(rdg_fall, ell, rdg_spr) %>% 
  mutate(ell_monitor = ifelse(ell == "Monitor", 1, 0),
         ell_non = ifelse(ell == "Non-ELL", 1, 0))
dummies
## # A tibble: 174 x 5
##    rdg_fall ell     rdg_spr ell_monitor ell_non
##       <dbl> <chr>     <dbl>       <dbl>   <dbl>
##  1      181 Active      194           0       0
##  2      166 Non-ELL     159           0       1
##  3      216 Non-ELL     198           0       1
##  4      203 Non-ELL     204           0       1
##  5      198 Active      173           0       0
##  6      188 Active      173           0       0
##  7      202 Monitor     200           1       0
##  8      182 Active      206           0       0
##  9      194 Non-ELL     191           0       1
## 10      170 Active      185           0       0
## # … with 164 more rows

Next, we’ll multiply each of these dummy variables by rdg_fall.

interactions <- dummies %>% 
  mutate(fall_monitor = rdg_fall * ell_monitor,
         fall_non = rdg_fall * ell_non)
interactions
## # A tibble: 174 x 7
##    rdg_fall ell     rdg_spr ell_monitor ell_non fall_monitor fall_non
##       <dbl> <chr>     <dbl>       <dbl>   <dbl>        <dbl>    <dbl>
##  1      181 Active      194           0       0            0        0
##  2      166 Non-ELL     159           0       1            0      166
##  3      216 Non-ELL     198           0       1            0      216
##  4      203 Non-ELL     204           0       1            0      203
##  5      198 Active      173           0       0            0        0
##  6      188 Active      173           0       0            0        0
##  7      202 Monitor     200           1       0          202        0
##  8      182 Active      206           0       0            0        0
##  9      194 Non-ELL     191           0       1            0      194
## 10      170 Active      185           0       0            0        0
## # … with 164 more rows

As would be expected, these values are zero for observations that are not in the corresponding group, and are otherwise equal to rdg_fall. If we enter all of these variables into our model, the intercept and the rdg_fall slope will correspond to the active (reference) group. The ell_monitor and ell_non terms then estimate how the intercepts for the monitor and non-ELL groups differ from the active group, and the fall_monitor and fall_non interaction terms estimate how their slopes differ. Let’s try.

m_byhand <- lm(rdg_spr ~ rdg_fall + ell_monitor + ell_non + 
                         fall_monitor + fall_non,
               data = interactions)
summary(m_byhand)
## 
## Call:
## lm(formula = rdg_spr ~ rdg_fall + ell_monitor + ell_non + fall_monitor + 
##     fall_non, data = interactions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.812  -7.307  -0.100   8.616  26.693 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  180.29666   32.95228   5.471  1.6e-07 ***
## rdg_fall       0.03939    0.18459   0.213   0.8313    
## ell_monitor   -0.47286   53.18886  -0.009   0.9929    
## ell_non      -64.12123   36.80395  -1.742   0.0833 .  
## fall_monitor   0.06262    0.28669   0.218   0.8274    
## fall_non       0.40801    0.20346   2.005   0.0465 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 168 degrees of freedom
## Multiple R-squared:  0.2652, Adjusted R-squared:  0.2433 
## F-statistic: 12.13 on 5 and 168 DF,  p-value: 4.874e-10
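To make the interpretation concrete, each group’s slope is the reference slope plus its interaction term; for example, the Non-ELL slope works out to roughly 0.04 + 0.41 = 0.45, consistent with the steeper Non-ELL line in the plot above. A quick check:

# implied fall-to-spring slopes for each group
coef(m_byhand)["rdg_fall"]                                   # Active (reference)
coef(m_byhand)["rdg_fall"] + coef(m_byhand)["fall_monitor"]  # Monitor
coef(m_byhand)["rdg_fall"] + coef(m_byhand)["fall_non"]      # Non-ELL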

And we can verify that this does, indeed, get us the same results that we would have obtained with the shortcut syntax with base R.

summary(
  lm(rdg_spr ~ rdg_fall*ell, data = benchmarks)
)
## 
## Call:
## lm(formula = rdg_spr ~ rdg_fall * ell, data = benchmarks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.812  -7.307  -0.100   8.616  26.693 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         180.29666   32.95228   5.471  1.6e-07 ***
## rdg_fall              0.03939    0.18459   0.213   0.8313    
## ellMonitor           -0.47286   53.18886  -0.009   0.9929    
## ellNon-ELL          -64.12123   36.80395  -1.742   0.0833 .  
## rdg_fall:ellMonitor   0.06262    0.28669   0.218   0.8274    
## rdg_fall:ellNon-ELL   0.40801    0.20346   2.005   0.0465 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 168 degrees of freedom
## Multiple R-squared:  0.2652, Adjusted R-squared:  0.2433 
## F-statistic: 12.13 on 5 and 168 DF,  p-value: 4.874e-10

31.8.2 Creating interactions with {recipes}

Specifying interactions in a recipe is similar to other steps, with one small exception: the tilde, ~, is required. Note that, once again, order matters, and you should always do your dummy coding before specifying interactions involving categorical variables. In the example above, this would correspond to:

recipe(rdg_spr ~ rdg_fall + ell, benchmarks) %>% 
  step_dummy(all_nominal()) %>% 
  step_interact(~rdg_fall:starts_with("ell")) %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 174 x 6
##    rdg_fall rdg_spr ell_Monitor ell_Non.ELL rdg_fall_x_ell_Mo… rdg_fall_x_ell_N…
##       <dbl>   <dbl>       <dbl>       <dbl>              <dbl>             <dbl>
##  1      181     194           0           0                  0                 0
##  2      166     159           0           1                  0               166
##  3      216     198           0           1                  0               216
##  4      203     204           0           1                  0               203
##  5      198     173           0           0                  0                 0
##  6      188     173           0           0                  0                 0
##  7      202     200           1           0                202                 0
##  8      182     206           0           0                  0                 0
##  9      194     191           0           1                  0               194
## 10      170     185           0           0                  0                 0
## # … with 164 more rows

There are a few things to mention here. First, the data look identical (minus the column names) to what we created by hand, so that’s good news. Notice, however, that we used starts_with() here, rather than just specifying something like step_interact(~rdg_fall:ell). This is because the column names change after dummy coding (as you can see above). By specifying starts_with(), we ensure that we get all the interactions we need. If we forget this, we end up with the wrong output, along with a warning.

recipe(rdg_spr ~ rdg_fall + ell, benchmarks) %>% 
  step_dummy(all_nominal()) %>% 
  step_interact(~rdg_fall:ell) %>% 
  prep() %>% 
  bake(new_data = NULL)
## Warning: Interaction specification failed for: ~rdg_fall:ell. No interactions
## will be created.
## # A tibble: 174 x 4
##    rdg_fall rdg_spr ell_Monitor ell_Non.ELL
##       <dbl>   <dbl>       <dbl>       <dbl>
##  1      181     194           0           0
##  2      166     159           0           1
##  3      216     198           0           1
##  4      203     204           0           1
##  5      198     173           0           0
##  6      188     173           0           0
##  7      202     200           1           0
##  8      182     206           0           0
##  9      194     191           0           1
## 10      170     185           0           0
## # … with 164 more rows

This warning occurs because R is trying to find a column called ell to multiply with rdg_fall, but no such variable exists after dummy coding.

31.9 PCA

For some sets of models (such as linear regression), highly correlated variables may cause model instability or, perhaps more importantly, lead to wider confidence intervals around model predictions (i.e., how confident we are about our prediction). In these cases, it may help to collapse these variables, while still accounting for the majority of the variance. This is the basic idea behind Principal Components Analysis (PCA). You use PCA to collapse sets of variables into principal components, reducing the dimensionality of your data while maintaining \(X\) percent of the original variation in the data. Reducing the dimensionality generally has an associated cost of higher model bias. However, it will also nearly always reduce model variability. The proportion of the original variability to maintain can thus be considered a tuning parameter, balancing bias with variance.
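As a minimal sketch of the idea, consider two highly correlated (simulated) predictors: a single principal component captures nearly all of their shared variance. This uses base R’s prcomp() rather than {recipes}, purely for illustration.

# two correlated predictors collapse (mostly) onto one principal component
set.seed(123)
d <- data.frame(x1 = rnorm(100))
d$x2 <- 0.9 * d$x1 + rnorm(100, sd = 0.3)

pca <- prcomp(d, center = TRUE, scale. = TRUE)
summary(pca)  # PC1 accounts for the large majority of the total variance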

Probably my favorite explanation of PCA comes from a thread on CrossValidated about how to make sense of PCA. The original poster asked how you would explain PCA to a layman and why it’s needed. The entire thread is worth reading through, and there’s a particularly nice example from JD Long comparing the first principal component to the line of best fit in linear regression.

For our purposes, we’re primarily interested in what proportion of the total variability in the independent variables we should maintain. That is, can we improve our model performance by collapsing groups of (correlated) columns into a smaller set of principal components?

31.9.1 PCA with {recipes}

Let’s move back to our full dataset. Our final recipe looked like this.

rec <- recipe(score ~ ., train) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_mutate(lang_cd = factor(ifelse(is.na(lang_cd), "E", lang_cd)),
              tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  step_zv(all_predictors()) %>% 
  step_dummy(all_nominal())

To conduct PCA we need to make sure that

  1. All variables are numeric
  2. All variables are on the same scale
  3. There is no missing data

Our recipe above ensures all variables are numeric, but it doesn’t handle missing data, and there is no standardization of variables. Let’s update this recipe to make sure it’s ready for PCA.

rec <- recipe(score ~ ., train) %>% 
  step_mutate(tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_zv(all_predictors()) %>% 
  step_unknown(all_nominal()) %>%
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>%
  step_normalize(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% 
  step_dummy(all_nominal(), -has_role("id vars"))

The above is not a whole lot more complex than our original recipe. It just assigns an "unknown" level to the nominal variables and imputes the remaining numeric variables with the median of that variable. It then normalizes all variables and dummy codes the nominal variables.

This recipe gives us a data frame with many columns. Specifically:

rec %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  ncol()
## [1] 80

Let’s create a recipe that retains 80% of the total variability with a smaller set of principal components.

rec_80 <- rec %>% 
  step_pca(all_numeric(), -all_outcomes(), -has_role("id vars"), 
           threshold = .80)

Notice that we just have to specify the threshold of variance we want to maintain. How many columns do we have now?

rec_80 %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  ncol()
## [1] 10

We’ve dramatically reduced the dimensionality in the data, while still retaining 80% of the total variability.

An alternative method is to specify the number of components we want. For example, let’s extract only the first five components.

rec_5 <- rec %>% 
  step_pca(all_numeric(), -all_outcomes(), -has_role("id vars"), 
           num_comp = 5)

I generally prefer the former because, if we’re thinking about PCA through a predictive modeling framework, we’re probably less concerned with the number of components and more concerned with the variation they represent.

So how do we know how much of the variability we should retain? This is a difficult question, but plots of the variance accounted for by each principal component can sometimes help. Let’s look at the five components we just extracted. If we prep the recipe, you’ll see there’s some extra information we can access.

prepped_pca <- prep(rec_5)
names(prepped_pca)
## [1] "var_info"       "term_info"      "steps"          "template"      
## [5] "retained"       "tr_info"        "orig_lvls"      "last_term_info"

If we access the steps we can get additional information. The PCA step is the seventh, and it has the following elements stored in the list.

names(prepped_pca$steps[[7]])
##  [1] "terms"     "role"      "trained"   "num_comp"  "threshold" "options"  
##  [7] "res"       "prefix"    "skip"      "id"

Let’s look into res.

names(prepped_pca$steps[[7]]$res)
## [1] "sdev"     "rotation" "center"   "scale"

And now we get what we’ve been (perhaps unknowingly to you, dear reader) looking for, the sdev object, which lists the standard deviation of each principal component. We can look at how much variance each component accounts for, as follows:

vars <- prepped_pca$steps[[7]]$res$sdev^2
pvar <- vars / sum(vars)
pvar
##  [1] 7.503092e-01 5.138784e-02 5.055445e-02 4.135567e-02 2.388100e-02
##  [6] 1.154347e-02 8.222078e-03 8.051600e-03 7.564274e-03 7.346244e-03
## [11] 5.926695e-03 5.187219e-03 4.420626e-03 3.803443e-03 3.418581e-03
## [16] 2.536214e-03 2.344631e-03 1.533003e-03 1.516079e-03 1.397730e-03
## [21] 1.170231e-03 1.071354e-03 1.001754e-03 8.622161e-04 7.596821e-04
## [26] 7.042076e-04 4.803264e-04 3.837786e-04 3.359652e-04 3.219790e-04
## [31] 1.786945e-04 1.426193e-04 7.831676e-05 6.866070e-05 3.886009e-05
## [36] 3.175614e-05 2.647517e-05 2.034393e-05 1.580665e-05 6.911480e-06
## [41] 1.361666e-30 5.047925e-31 1.214792e-32 9.199957e-33 6.414887e-33
## [46] 6.414887e-33 6.414887e-33 6.414887e-33 6.414887e-33 6.414887e-33
## [51] 6.414887e-33 6.414887e-33 6.414887e-33 6.414887e-33 6.414887e-33
## [56] 6.414887e-33 6.414887e-33 6.414887e-33 6.414887e-33 6.414887e-33
## [61] 6.414887e-33 6.414887e-33 6.414887e-33 6.414887e-33 6.414887e-33
## [66] 6.414887e-33 6.414887e-33 6.414887e-33 6.414887e-33 6.414887e-33
## [71] 6.414887e-33 6.414887e-33

That’s difficult to read. Let’s plot it instead

pcs <- data.frame(pc = paste0("PC", 1:length(pvar)),
                  percent_var = pvar) %>% 
  mutate(pc = reorder(factor(pc), percent_var))

ggplot(pcs, aes(pc, percent_var)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0))

A few things are of note here. First, it appears the majority of the variance is accounted for by our first few components, which is not altogether surprising given that we already knew only a few components were needed to account for over 80% of the total variability. However, you might also wonder why there are so many freaking components shown?! I thought we only asked for five!? That is true. But the model actually estimates all the components, and then just pulls out however many we ask for, based on either the threshold or the num_comp arguments. For example, let’s bake the five-component recipe on the full training data and print the resulting variables.

rec_5 %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  names()
##  [1] "id"                  "attnd_dist_inst_id"  "attnd_schl_inst_id" 
##  [4] "tst_dt"              "partic_dist_inst_id" "partic_schl_inst_id"
##  [7] "ncessch"             "score"               "PC1"                
## [10] "PC2"                 "PC3"                 "PC4"                
## [13] "PC5"

And as expected, we get only the five components we requested.

PCA reduces the dimensionality in the data, which can help guard against overfitting. However, it also may introduce a small amount of bias. Plots like the above can help determine how many components we should retain because we can see that, for example, after about 20 components or so we’re getting almost nothing in return. If you are using PCA and your model still seems to be overfitting, try reducing the dimensionality more (extracting fewer components). On the other hand, if you do not appear to be in danger of overfitting, you might try extracting more components to help reduce bias.
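Another way to think about how much to retain is to look at the cumulative proportion of variance directly, which is essentially what step_pca() does with the threshold argument. A quick sketch using the pvar object from above (with these data, nine components are needed to reach 95%, matching what we’ll see per fold in the next section):

# cumulative proportion of variance accounted for
cum_pvar <- cumsum(pvar)
head(cum_pvar, 10)

# smallest number of components reaching a 95% threshold
which(cum_pvar >= 0.95)[1]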

31.10 Wrapping up

Feature engineering is a gigantic topic. The purpose of this chapter was to provide a basic introduction. We illustrated creating a recipe or blueprint for your feature engineering, which can be applied iteratively to slices of the data (as in a \(k\)-fold cross-validation approach) or to new data with the same structure. This process helps guard against data leakage, where information from the test set leaks into the training set. Let’s look at a quick example of this by using PCA to retain 95% of the total variation in the data. We’ll use \(k\)-fold cross-validation and see how many components we retain in each fold. First, we’ll create the recipe, which is essentially equivalent to the one we built in the previous section.

rec <- recipe(score ~ ., train) %>% 
  step_mutate(tst_dt = lubridate::mdy_hms(tst_dt)) %>% 
  update_role(contains("id"), ncessch, new_role = "id vars") %>% 
  step_zv(all_predictors()) %>% 
  step_unknown(all_nominal()) %>%
  step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>%
  step_normalize(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% 
  step_dummy(all_nominal(), -has_role("id vars")) %>% 
  step_pca(all_numeric(), -all_outcomes(), -has_role("id vars"), 
           threshold = .95)

Now let’s create our \(k\)-fold cross-validation object.

set.seed(42)
cv <- vfold_cv(train)
cv
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits             id    
##    <list>             <chr> 
##  1 <split [2.6K/285]> Fold01
##  2 <split [2.6K/284]> Fold02
##  3 <split [2.6K/284]> Fold03
##  4 <split [2.6K/284]> Fold04
##  5 <split [2.6K/284]> Fold05
##  6 <split [2.6K/284]> Fold06
##  7 <split [2.6K/284]> Fold07
##  8 <split [2.6K/284]> Fold08
##  9 <split [2.6K/284]> Fold09
## 10 <split [2.6K/284]> Fold10

Now we’ll loop through each split and pull out the held-out (assessment) data for each fold.

cv %>% 
  mutate(assessment_data = map(splits, assessment))
## #  10-fold cross-validation 
## # A tibble: 10 x 3
##    splits             id     assessment_data    
##    <list>             <chr>  <list>             
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]>
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]>
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]>
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]>
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]>
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]>
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]>
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]>
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]>
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]>

And finally, we’ll apply the recipe to each dataset.

cv %>% 
  mutate(assessment_data = map(splits, assessment),
         baked_data = map(assessment_data, ~prep(rec) %>%  bake(.x)))
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits             id     assessment_data     baked_data         
##    <list>             <chr>  <list>              <list>             
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]> <tibble [285 × 17]>
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]> <tibble [284 × 17]>
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]> <tibble [284 × 17]>
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]> <tibble [284 × 17]>
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]> <tibble [284 × 17]>
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]> <tibble [284 × 17]>
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]> <tibble [284 × 17]>
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]> <tibble [284 × 17]>
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]> <tibble [284 × 17]>
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]> <tibble [284 × 17]>

Let’s see how many principal components are in each data fold.

cv %>% 
  mutate(assessment_data = map(splits, assessment),
         baked_data = map(assessment_data, ~prep(rec) %>%  bake(.x)),
         n_components = map_int(baked_data, ~sum(grepl("^PC", names(.x)))))
## #  10-fold cross-validation 
## # A tibble: 10 x 5
##    splits             id     assessment_data     baked_data         n_components
##    <list>             <chr>  <list>              <list>                    <int>
##  1 <split [2.6K/285]> Fold01 <tibble [285 × 39]> <tibble [285 × 17…            9
##  2 <split [2.6K/284]> Fold02 <tibble [284 × 39]> <tibble [284 × 17…            9
##  3 <split [2.6K/284]> Fold03 <tibble [284 × 39]> <tibble [284 × 17…            9
##  4 <split [2.6K/284]> Fold04 <tibble [284 × 39]> <tibble [284 × 17…            9
##  5 <split [2.6K/284]> Fold05 <tibble [284 × 39]> <tibble [284 × 17…            9
##  6 <split [2.6K/284]> Fold06 <tibble [284 × 39]> <tibble [284 × 17…            9
##  7 <split [2.6K/284]> Fold07 <tibble [284 × 39]> <tibble [284 × 17…            9
##  8 <split [2.6K/284]> Fold08 <tibble [284 × 39]> <tibble [284 × 17…            9
##  9 <split [2.6K/284]> Fold09 <tibble [284 × 39]> <tibble [284 × 17…            9
## 10 <split [2.6K/284]> Fold10 <tibble [284 × 39]> <tibble [284 × 17…            9

In this case, the number of components is the same across all folds, but the representation of those components is likely slightly different. Why? Because the recipe has been applied to each fold, independent of the other folds.

Feature engineering can also regularly feel overwhelming. There are so many things you could try to potentially improve model performance, and this chapter just scratched the surface. The sheer number of decisions is why many believe that feature engineering is where much of the “art” of machine learning lies. Strong domain knowledge will absolutely help, as will practice. For those looking to go further with feature engineering, we suggest Kuhn and Johnson, who wrote an entire book on the topic. It provides practical advice in an accessible way without skimping on important details.