Feature engineering
After we've ensured that there's nothing unprocessable in our input, we can move on to the feature engineering step, where we manipulate the input data to generate more suitable training input.
# then, do feature engineering
# TODO decide what feature engineering steps you need by modifying `feature_engineering` method
training_df, holdout_df, encoders, components = feature_engineering(
    sanitized_df, sanitized_holdout
)
If you right-click the feature_engineering symbol and select Go to definition, you should see that this method is also defined in model_algorithm.py.
def feature_engineering(training_df: pd.DataFrame, holdout_df: Optional[pd.DataFrame]):
    """
    Does feature engineering for the dataframes.

    Here you should use the helper methods in `exodusutils` package, i.e. `time_component_encoding`, `one_hot_encoding` and `label_encoding`.

    Parameters
    ----------
    training_df : pd.DataFrame
        The training dataframe
    holdout_df : Optional[pd.DataFrame]
        The holdout dataframe

    Returns
    -------
    tuple[DataFrame, Optional[DataFrame], Dict[str, LabelEncoder], List[str]]
        The modified training df and holdout df, the encoders for each categorical column, and the time component columns.
    """
    training_df, holdout_df, encoders = label_encoding(training_df, holdout_df)
    training_df, components = time_component_encoding(training_df)
    # datetime columns are being removed because the necessary features derived from them
    # have been generated in the previous step
    # However, it is up to the developer to decide whether or not this step is necessary
    training_df = remove_datetime_columns(training_df)
    if holdout_df is not None:
        holdout_df, _ = time_component_encoding(holdout_df)
        holdout_df = remove_datetime_columns(holdout_df)
    return training_df, holdout_df, encoders, components
The exact procedure for feature engineering is entirely up to you, and the code shown here is really just an example. Let's break it down step by step.
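To get a feel for what these steps do to a dataframe, here is a minimal sketch of the same flow written with plain pandas. It is not the exodusutils implementation: the functions sketch_label_encoding and sketch_time_components, the generated column names, and the sample data are all hypothetical stand-ins, and the real helpers may behave differently (for example, they return LabelEncoder objects rather than plain dictionaries).

```python
from typing import Dict, List, Tuple

import pandas as pd


def sketch_label_encoding(
    df: pd.DataFrame, categorical_columns: List[str]
) -> Tuple[pd.DataFrame, Dict[str, Dict[str, int]]]:
    """Hypothetical stand-in: replace each category with an integer code."""
    df = df.copy()
    mappings: Dict[str, Dict[str, int]] = {}
    for col in categorical_columns:
        # Sort the distinct values so the codes are deterministic.
        categories = sorted(df[col].dropna().unique())
        mappings[col] = {value: code for code, value in enumerate(categories)}
        df[col] = df[col].map(mappings[col])
    return df, mappings


def sketch_time_components(
    df: pd.DataFrame, datetime_columns: List[str]
) -> Tuple[pd.DataFrame, List[str]]:
    """Hypothetical stand-in: derive numeric columns from each datetime column."""
    df = df.copy()
    components: List[str] = []
    for col in datetime_columns:
        for part in ("year", "month", "day", "dayofweek"):
            name = f"{col}_{part}"  # assumed naming scheme, for illustration only
            df[name] = getattr(df[col].dt, part)
            components.append(name)
    return df, components


# Made-up sample data with one categorical and one datetime column.
df = pd.DataFrame(
    {
        "city": ["Taipei", "Osaka", "Taipei"],
        "sold_at": pd.to_datetime(["2021-01-04", "2021-02-15", "2021-03-01"]),
        "amount": [10.0, 12.5, 9.0],
    }
)

encoded, mappings = sketch_label_encoding(df, ["city"])
encoded, components = sketch_time_components(encoded, ["sold_at"])
# Drop the raw datetime column once its components have been extracted.
encoded = encoded.drop(columns=["sold_at"])
```

After this runs, encoded contains only numeric columns: the integer-coded city column, the untouched amount column, and the four derived sold_at_* columns listed in components. Keeping the mappings around mirrors why feature_engineering returns its encoders: the same encoding has to be applied again to the holdout data and at prediction time.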