Apply feature engineering

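def apply_feature_engineering(df: pd.DataFrame, encoders: Dict[str, LabelEncoder]) -> pd.DataFrame:
    # The result of time_component_encoding is indexed with [0] to get the encoded dataframe.
    encoded: pd.DataFrame = df.pipe(apply_label_encoders, encoders).pipe(time_component_encoding)[0]
    # Impute NaNs in the label-encoded columns, then drop the datetime columns.
    return encoded.pipe(fill_nan_with_average, list(encoders.keys())).pipe(remove_datetime_columns)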

# ... snipped ...

        df = apply_feature_engineering(sanitized_df, model_info.encoders)

Most of this is the same as the feature_engineering part. The difference is that here you must make sure the dataframe is properly sanitized.

Sanitize the result properly

Consider the following training dataframe, where there is only 1 column:

column
foo
bar
baz

Then after the column has been label encoded, the result becomes:

column
0
1
2

Here foo is encoded to 0, bar to 1, and baz to 2.

However, if the prediction dataframe contains a value that was never seen during training, the encoder has no label to assign to it and produces NaN instead. For example, consider the following dataframe:

column
bar
quax
bar
foo

After we apply our encoder, the result is:

column
1
nan
1
0
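The behavior above can be reproduced with a minimal sketch in plain pandas. This is only an illustration, not the exodusutils encoder: the mapping is built with pd.factorize (which numbers values in order of first appearance, matching the example), and Series.map stands in for the encoding step. The variable names are hypothetical.

```python
import pandas as pd

# Training column from the example above.
train = pd.Series(["foo", "bar", "baz"], name="column")

# Stand-in for a fitted encoder: number each value in order of first appearance.
_, uniques = pd.factorize(train)
mapping = {value: code for code, value in enumerate(uniques)}

# Prediction column containing "quax", which was never seen during training.
prediction = pd.Series(["bar", "quax", "bar", "foo"], name="column")

# Series.map leaves NaN wherever a value has no entry in the mapping.
encoded = prediction.map(mapping)
print(encoded)
```

Because NaN is a floating-point value, the encoded column is stored as floats rather than integers once an unseen value appears.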

If your machine learning algorithm cannot handle NaN, it will be unable to make predictions on the dataframe once the feature engineering encoders have been applied.

In situations like this, a common approach is to impute the missing values with a designated value. In exodusutils, the fill_nan_with_mode method does just that: it takes the most frequent label in an encoded column and assigns it to the invalid cells.

In our example, the final result after we've applied the fill_nan_with_mode method will be:

column
1
1
1
0
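As a rough sketch of mode imputation, the helper below fills NaN cells with each column's most frequent value using plain pandas. It is an assumption-laden illustration of the idea and not the exodusutils implementation of fill_nan_with_mode; the name fill_with_mode_sketch and its signature are hypothetical.

```python
from typing import List

import pandas as pd


def fill_with_mode_sketch(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """Hypothetical helper: replace NaN in each listed column with that column's mode."""
    result = df.copy()  # leave the caller's dataframe untouched
    for col in columns:
        # mode() ignores NaN by default and returns the modes sorted ascending.
        modes = result[col].mode()
        if not modes.empty:
            result[col] = result[col].fillna(modes.iloc[0])
    return result


# Encoded prediction column from the example, with NaN for the unseen value.
encoded_df = pd.DataFrame({"column": [1.0, float("nan"), 1.0, 0.0]})
filled = fill_with_mode_sketch(encoded_df, ["column"])
print(filled)
```

When several labels tie for most frequent, this sketch picks the smallest one. That tie-breaking rule is a choice made here for determinism and may differ from what the library does.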