### Apply feature engineering
```python
def apply_feature_engineering(df: pd.DataFrame, encoders: Dict[str, LabelEncoder]) -> pd.DataFrame:
    # Apply the stored label encoders and the time component encoding; keep the resulting dataframe.
    encoded: pd.DataFrame = (df.pipe(apply_label_encoders, encoders).pipe(time_component_encoding))[0]
    # Fill NaN values in the encoded columns and drop the original datetime columns.
    return encoded.pipe(fill_nan_with_average, list(encoders.keys())).pipe(remove_datetime_columns)

# ... snipped ...

df = apply_feature_engineering(sanitized_df, model_info.encoders)
```
Most of what you are doing here is exactly the same as in the `feature_engineering` part, except that here you need to make sure you sanitize your dataframe properly.
### Sanitize the result properly
Consider the following training dataframe, where there is only one column:

| column |
|--------|
| foo    |
| bar    |
| baz    |
Then after the column has been label encoded, the result becomes:

| column |
|--------|
| 0      |
| 1      |
| 2      |
Where `foo` gets encoded to `0`, `bar` becomes `1`, and `baz` becomes `2`.
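For illustration, here is a minimal sketch of that encoding step in plain pandas. This is not the `exodusutils` implementation; it simply assumes labels are numbered in the order they first appear, which reproduces the mapping above.

```python
import pandas as pd

# Training data with a single categorical column.
train_df = pd.DataFrame({"column": ["foo", "bar", "baz"]})

# Build a label -> integer mapping in order of first appearance.
mapping = {label: code for code, label in enumerate(train_df["column"].unique())}
# mapping == {"foo": 0, "bar": 1, "baz": 2}

# Encode the column by replacing each label with its integer code.
encoded_train = train_df["column"].map(mapping)
print(encoded_train.tolist())  # [0, 1, 2]
```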
However, if the prediction dataframe contains a value never seen during training, the encoder will not be able to deduce which label it should encode the value to, and will return `NaN`. For example, consider the dataframe below:
| column |
|--------|
| bar    |
| quax   |
| bar    |
| foo    |
After we apply our encoder, the result is:
| column |
|--------|
| 1      |
| NaN    |
| 1      |
| 0      |
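Again as a rough sketch (a plain pandas mapping rather than the real encoder), applying the mapping learned from the training data to a prediction dataframe containing the unseen value `quax` leaves a `NaN` behind:

```python
import pandas as pd

# Label -> integer mapping learned from the training data above.
mapping = {"foo": 0, "bar": 1, "baz": 2}

# The prediction data contains "quax", which was never seen during training.
predict_df = pd.DataFrame({"column": ["bar", "quax", "bar", "foo"]})

# Values missing from the mapping come back as NaN (and the column becomes float).
encoded = predict_df["column"].map(mapping)
print(encoded.tolist())  # [1.0, nan, 1.0, 0.0]
```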
If your machine learning algorithm cannot handle `NaN` properly, then after you've applied the feature engineering encoders there is no way for the algorithm to perform prediction.
In situations like this, a common method is to impute the missing values with a designated special value. In `exodusutils`, the method `fill_nan_with_mode` does just that: we extract the most frequent label for an encoded column, and force the invalid cells to that most frequent label.
In our example, the final result after we've applied the `fill_nan_with_mode` method will be:

| column |
|--------|
| 1      |
| 1      |
| 1      |
| 0      |
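A minimal sketch of such a mode-based fill, assuming it simply takes the most frequent valid value in the column (an approximation for illustration, not the actual `fill_nan_with_mode` source):

```python
import pandas as pd

# Encoded prediction column, with a NaN left behind by the unseen value.
encoded = pd.Series([1, None, 1, 0], dtype="float64", name="column")

# Most frequent valid label in the column (the mode).
most_frequent = encoded.mode().iloc[0]

# Force every invalid cell to that label and restore the integer dtype.
filled = encoded.fillna(most_frequent).astype("int64")
print(filled.tolist())  # [1, 1, 1, 0]
```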