Introduction

This package is the template for creating a new Exodus IID model algorithm. Your goal as a developer is to implement your own model algorithm using this package.

What is a model algorithm?

A model algorithm in the Exodus project is a service that can train a machine learning model, persist the trained model, fetch the persisted model from MongoDB, and finally predict using the persisted model.

How do I read this documentation?

cd books; python -m http.server

Then visit http://localhost:8000 in your browser.

Dependencies

  • Python 3.8+
  • Poetry
    • A packaging and dependency management tool for Python.
    • Install with:
      curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
      
  • Poethepoet
    • A task runner that works well with poetry.
    • Install with:
      pip install poethepoet
      
  • Python TOML library
    • The TOML library for Python.
    • Install with:
      pip install toml
      

Set up the model template

The model template comes with a fully functional model that can train and predict, albeit with rather poor predictive capability. Setting up the model template takes two steps:

  1. Renaming the model template package
  2. Installing dependency libraries

Renaming the package

When you first get the package, it is named exodus_model_template, which is almost certainly not the name of the model algorithm you are going to build. To rename the entire package, run the renaming script:

python scripts/rename.py [NAME_OF_THE_MODEL]

Installing dependency libraries

To install the dependency libraries, enter the following command (assuming you've already installed the packages specified in the Dependencies section):

poetry install --no-root

This command will install all the dependency libraries for you.

The libraries

The model algorithm template depends on several libraries. The ones used most extensively are:

  • pandas: the de-facto go-to data analysis package for Python. We are using v1.4.0.
    • Note that pandas artifacts such as DataFrame are not backward compatible. Think it through if you want to bump its version.
  • numpy: Python's numerical computing library. We are using v1.21.0.
  • scikit-learn: the Swiss Army knife of machine learning libraries. Currently only the LabelEncoder class is used. We are using v1.0.2.
    • Note that scikit-learn artifacts such as LabelEncoder are not backward compatible. Think it through if you want to bump its version.
  • pymongo: the MongoDB Python client library.
  • jinja2: a templating tool for Python. Used for generating migration scripts.
  • fastapi: the web framework we chose to run the model algorithm service on. We are using v0.14.1.
  • pydantic: the data validation library for Python classes. We are using v1.9.0.
  • exodusutils: the utility functions for Exodus. The version of this package is subject to change.

Run the model template

As mentioned previously, the model algorithm template contains a fully functional model algorithm that can be used to train and predict. To see it in action, we have defined several commands.

The commands

For the list of available commands, simply enter:

poe --help

Run the model algorithm service

poe start

# Or if you want to specify a port:
#poe start --port [PORT]

This command will start a MongoDB container and a model algorithm container. By default the model algorithm container can be accessed at http://localhost:5566/, but you can pass a different port with the --port option.

❯ poe start
Poe => PORT=${port} docker-compose up --build -d

# ... snipped ...

Successfully built a60ab63716b1
Successfully tagged exodus_model_template_exodus_model_template:latest
Creating exodus_model_template_exodus_model_template_1 ... done
Creating exodus_model_template_web_app_1               ... done
Creating exodus_model_template_mongo_1                 ... done

You can send a request to the model algorithm container to see if it is working:

❯ curl "http://localhost:5566/info"
{"name": "exodus_model_template", "description": "description"}
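The same check can be scripted. The sketch below uses only the Python standard library. It assumes the service is reachable at the default http://localhost:5566 and that /info returns a JSON object with the name and description keys seen in the output above. The helper names parse_info and fetch_info are illustrative and not part of the template.

```python
import json
from typing import Dict
from urllib.request import urlopen


def parse_info(body: bytes) -> Dict[str, str]:
    """Decode the JSON body returned by the /info endpoint."""
    info = json.loads(body.decode("utf-8"))
    # Assumption: these two keys are the ones shown in the curl output.
    missing = {"name", "description"} - set(info)
    if missing:
        raise ValueError(f"missing keys in /info response: {sorted(missing)}")
    return info


def fetch_info(base_url: str = "http://localhost:5566") -> Dict[str, str]:
    """Send GET /info to a running model algorithm service."""
    with urlopen(f"{base_url}/info", timeout=5) as response:
        return parse_info(response.read())


# Usage (requires the service to be running, e.g. after `poe start`):
# print(fetch_info())
```

Keeping the parsing separate from the network call makes the check easy to reuse in a smoke test without a running container.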

Using the webapp to test out the model algorithm

If you go to http://localhost:3000, you will see a web page that allows you to test out the model algorithm. Select a CSV file, upload it and select the proper feature types, and then click Train Model. For predicting with a model, type in the model id, upload a CSV file for prediction, and then press Predict.

Watching the model algorithm service

poe watch

If you do this while the service is down, the command will have no effect.

Stopping the model algorithm service

poe stop
❯ poe stop
Poe => PORT=$(python scripts/stop.py);
  if test $? -eq 0; then PORT=$PORT docker-compose down; else echo "No container running, not stopping anything"; fi
Stopping exodus_model_template_mongo_1                 ... done
Stopping exodus_model_template_exodus_model_template_1 ... done
Removing exodus_model_template_mongo_1                 ... done
Removing exodus_model_template_exodus_model_template_1 ... done
Removing network exodus_model_template_default

If you do this while the service is down, the command will have no effect.

How to create your own model algorithm

This section will walk you through the process of creating a model algorithm from the ground up. More accurately, it will describe the following processes:

  1. Implementing a model algorithm
    1. How to manage dependencies
    2. How to set up the editor
    3. How to train a model
    4. How to persist the model in MongoDB
    5. How to predict using a previously trained model
  2. Testing the model algorithm
    1. Using the landing page
    2. Using command line tools
    3. Debugging the model algorithm
    4. Running unit tests to ensure the correctness of some basic operations
  3. Publishing your model algorithm
  4. Modifying the model algorithm
    1. Migrating the ModelInfo
    2. Bumping the version

Implementing the model algorithm service

This section will walk you through implementing the model algorithm service.

The model algorithm template already contains a working example. For the rest of the documentation, we will be using that to demonstrate the process.

In theory, the only files you should need to modify are these two:

  • model_algorithm.py, which contains the actual model training, persisting and predicting.
  • model_info.py, which defines the schema of the model you are persisting into MongoDB.

Adding or removing dependency libraries

Before you start creating your own model, you need to install some machine learning libraries that contain the algorithms you will be using. poetry, the package manager we are using for this project, actually comes with a neat virtual environment right out of the box, so you do not need to worry about messing up your host environment.

Take our model algorithm template for example. We decided to use Linear Regression and Logistic Regression as our machine learning algorithms (there are two of them because Linear Regression only handles regression problems, while Logistic Regression only handles classification problems). Both algorithms are part of the scikit-learn package, so that is what we installed for this example.

Adding a dependency library

To install the scikit-learn package and add it to the dependency libraries, simply do the following:

poetry add scikit-learn

The above command does not specify a version for the package, so poetry will just use the newest one that is compatible with all the other packages we've listed as dependencies. If you want to use the latest version:

poetry add scikit-learn@latest

Or if you want to stick to a specific version:

poetry add scikit-learn@1.0.2

Removing a dependency library

Suppose you decide to use something other than those machine learning algorithms. Even then, scikit-learn still provides the LabelEncoder class, on which one of our feature engineering methods is based (you will learn more about this later!), so removing the library from your dependencies does not make sense for this template.

The command below is therefore only an example and should not actually be run, but this is how you would remove scikit-learn:

poetry remove scikit-learn

Setting up the editor

In this section we will talk about how to set up an editor, so that it is easier to write your own model algorithm service.

Vim

If you are a hardcore coder, chances are you already know how to use Vim.

Packages to use

NeoVim

NeoVim provides a much better editing experience than vanilla Vim.

Conquer of Completion

CoC (coc.nvim) lets you jump to definitions, autocomplete code, and more. It uses language servers under the hood.

VSCode

VSCode is a modern editor that is extremely expandable.

Basic usage

To open a VSCode session, navigate to the model algorithm template directory, and run the following command:

code .

This starts a VSCode session.

If you are not familiar with coding, going through the official VSCode Python tutorial docs is a good idea.

VSCode can recognize poetry environments, so make sure you select the appropriate Python interpreter (i.e. the one with your model algorithm's name in it).

Basic hotkeys

  • To expand / collapse the sidebar, press ctrl+b.
  • To open a file, you can press ctrl+p.
  • To type in a VSCode command, press ctrl+shift+p.
  • To start a terminal session, press ctrl+shift+` .

Extensions

  • Python
    • The official extension for Python language support.
    • Once you've installed it, remember to also activate Pylance, the language server extension that provides type checking for Python.
  • Python Indent
    • An extension that corrects Python indentation as you type, which the official extension does not always handle well.
  • autoDocstring
    • Makes writing comments for your algorithm that much easier.

Screenshot

If everything is wired up correctly, you should be able to see code definitions when you hover the mouse over an imported method or class, and there should be no unknown imports.

Train a model

In order to train a machine learning model and obtain its performance metrics, we will be doing the following procedure:

  1. Sanitize the training input
  2. Perform feature engineering
  3. Extract labels for categorical problems
  4. Calculate cross validation scores
  5. Train the machine learning model
  6. Calculate holdout score
  7. Save the model
  8. Return the result

All of the code is in the files model_algorithm.py and model_info.py.

To start things off, let's modify the description of your model algorithm:

class ModelAlgorithm:
    def __init__(self, host: Optional[str] = None, port: Optional[str] = None) -> None:
# ... snipped ...
        self.description: str = "description" # TODO Change me!

After you've figured out the description of your model algorithm, you can start implementing the train_iid method, the entrypoint for training an IID model.

    def train_iid(self, request: TrainIIDReqBody) -> TrainRespBody:

If you start a new line underneath the comments and type request., you should see a lot of autocomplete suggestions.

The class TrainIIDReqBody is defined in the exodusutils library, and contains a myriad of helper methods and variables. If we right click on the TrainIIDReqBody symbol, and select Go to definition, we will be able to see what's in it:

class TrainIIDReqBody(TrainReqBodyBase):
# ... snipped ...
    validation_data: Optional[bytes] = Field(
        default=None, description="The validation data as a sequence of bytes. Optional"
    )
    holdout_data: Optional[bytes] = Field(
        default=None, description="The holdout data as a sequence of bytes. Optional"
    )
    fold_assignment_column_name: Optional[str] = Field(
        default=None,
        description="The name of the fold assignment column. If not provided, Exodus will cut cross validation folds in a modulo fashion. If this field is defined, it is required to be a valid column in the input dataframe. This column is not included in `feature_types`, and will be discarded during training.",
    )

There seem to be only three fields; however, the TrainIIDReqBody class actually inherits from the TrainReqBodyBase class:

class TrainReqBodyBase(BaseModel):
# ... snipped ...
    training_data: bytes = Field(description="The training data as a sequence of bytes")
    feature_types: List[Column] = Field(
        default=[], description="The features present in training data"
    )
    target: Column = Field(
        description="The target column. Must be present in feature_types"
    )
    folds: int = Field(
        default=5, ge=2, le=10, description="Number of folds for this experiment"
    )

We will be using class TrainIIDReqBody's methods and fields in the upcoming sections.

Sanitizing input

It could be the case that the input data contains unprocessable characters for the machine learning algorithm. For example, the model algorithm template's algorithm cannot process datasets with NaN or empty strings. To fix that, we start out the training process by sanitizing the dataset:

        # TODO: The implementer of algorithms should decide if this step is necessary, or if an alternative method should be used instead
        # First sanitize the dataframe
        sanitized_df = remove_invalid_values(request.get_training_df())
        raw_holdout_df = request.get_holdout_data()
        if raw_holdout_df is not None:
            sanitized_holdout = remove_invalid_values(raw_holdout_df)
        else:
            sanitized_holdout = None

Notice that we are not parsing the request ourselves here; the helper methods get_training_df and get_holdout_data already do that for us.

Since our model algorithm cannot handle invalid values, we can invoke the remove_invalid_values function from the exodusutils package to do that for us.

If the machine learning algorithm you chose has different restrictions, you should implement them here.

Note that holdout_data is an optional field for training, and we do not want to sanitize a nonexistent dataset. Make sure you sanitize it only when you are sure it is not None.
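To make the idea concrete, here is a minimal standard-library sketch of this kind of sanitizing. It is not the exodusutils implementation: rows are modeled as plain dictionaries instead of a DataFrame, and dropping the whole row is only one plausible way of removing invalid values. The names is_invalid, drop_invalid_rows and sanitize_optional are hypothetical.

```python
import math
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def is_invalid(value: Any) -> bool:
    """True for None, NaN, and empty or whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def drop_invalid_rows(rows: List[Row]) -> List[Row]:
    """Keep only the rows in which every value is valid.

    Assumption: invalid values are handled by dropping the whole row.
    """
    return [row for row in rows if not any(is_invalid(v) for v in row.values())]


def sanitize_optional(rows: Optional[List[Row]]) -> Optional[List[Row]]:
    """Sanitize a dataset that may be absent, mirroring the holdout check."""
    return None if rows is None else drop_invalid_rows(rows)
```

If your algorithm tolerates missing values, a different rule (for example imputing instead of dropping) would replace drop_invalid_rows while the None check for the holdout data stays the same.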

Feature engineering

After we've ensured that there's nothing unprocessable in our input, we can move on to the feature engineering step, where we manipulate the input data to generate more suitable training input.

        # then, do feature engineering
        # TODO decide what feature engineering steps you need by modifying `feature_engineering` method
        training_df, holdout_df, encoders, components = feature_engineering(
            sanitized_df, sanitized_holdout
        )

If you right click on the feature_engineering symbol and select Go to definition, you should be able to see that this is a method defined in model_algorithm.py as well.

def feature_engineering(training_df: pd.DataFrame, holdout_df: Optional[pd.DataFrame]):
    """
    Does feature engineering for the dataframes.

    Here you should use the helper methods in `exodusutils` package, i.e. `time_component_encoding`, `one_hot_encoding` and `label_encoding`.

    Parameters
    ----------
    training_df : pd.DataFrame
        The training dataframe
    holdout_df : Optional[pd.DataFrame]
        The holdout dataframe

    Returns
    -------
    tuple[DataFrame, Optional[DataFrame], Dict[str, LabelEncoder], List[str]]
        The modified training df and holdout df, the encoders for each categorical column, and the time component columns.
    """
    training_df, holdout_df, encoders = label_encoding(training_df, holdout_df)
    training_df, components = time_component_encoding(training_df)
    # datetime columns are being removed because the necessary features derived from them
    # have been generated in the previous step
    # However, it is up to the developer to decide whether or not this step is necessary
    training_df = remove_datetime_columns(training_df)
    if holdout_df is not None:
        holdout_df, _ = time_component_encoding(holdout_df)
        holdout_df = remove_datetime_columns(holdout_df)
    return training_df, holdout_df, encoders, components

The exact procedure for feature engineering is entirely up to you; the code shown here is just an example. Let's break it down step by step.

Label encoding

    training_df, holdout_df, encoders = label_encoding(training_df, holdout_df)

Label encoding is the procedure that turns a categorical column into a numeric one, with the numbers being the label for a categorical value. Here we have to make sure the generated columns are consistent between the training dataframe and the holdout dataframe, hence we pass both into the method. It handles the case when the holdout dataframe is actually None.
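The consistency requirement can be illustrated with a small pure-Python sketch. The template itself relies on scikit-learn's LabelEncoder; the version below is only a stand-in that builds one shared mapping for a single column. It assumes that consistency is achieved by collecting the categories from both datasets, which the source does not confirm. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def fit_label_mapping(train: List[str], holdout: Optional[List[str]]) -> Dict[str, int]:
    """Assign an integer code to every category seen in either dataset."""
    # Assumption: categories are gathered from both datasets.
    categories = set(train) | set(holdout or [])
    # Sorting makes the codes deterministic across runs.
    return {category: code for code, category in enumerate(sorted(categories))}


def encode_column(
    train: List[str], holdout: Optional[List[str]]
) -> Tuple[List[int], Optional[List[int]], Dict[str, int]]:
    """Encode both columns with one shared mapping; holdout may be None."""
    mapping = fit_label_mapping(train, holdout)
    encoded_train = [mapping[value] for value in train]
    encoded_holdout = None if holdout is None else [mapping[value] for value in holdout]
    return encoded_train, encoded_holdout, mapping
```

Because both columns share one mapping, the same category always receives the same code, and the mapping can be kept around to encode prediction input later.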

Time component encoding

    training_df, components = time_component_encoding(training_df)

This encodes the datetime columns in the dataframe to different numeric columns.

For example, suppose we have the following dataframe:

ds
2022/01/01
2022/03/12
2022/05/23
2022/07/04

After we perform time component encoding, the result will be:

ds          YEAR_ds  QUARTER_ds  MONTH_ds  WEEK_ds  WEEKDAY_ds  DAY_ds  HOUR_ds
2022/01/01  2022     1           1         52       6           1       0
2022/03/12  2022     1           3         10       6           12      0
2022/05/23  2022     2           5         21       1           23      0
2022/07/04  2022     2           7         27       1           4       0
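A rough standard-library sketch of how such components can be derived from a single date string follows. It is not the exodusutils implementation. It assumes ISO week numbers and ISO weekdays (Monday is 1) and standard calendar quarters; the conventions used by exodusutils may differ. The function name time_components is hypothetical.

```python
from datetime import datetime
from typing import Dict


def time_components(value: str, column: str = "ds") -> Dict[str, int]:
    """Derive numeric calendar components from a YYYY/MM/DD date string."""
    moment = datetime.strptime(value, "%Y/%m/%d")
    iso = moment.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return {
        f"YEAR_{column}": moment.year,
        # Assumption: standard calendar quarters (Jan-Mar is 1, and so on).
        f"QUARTER_{column}": (moment.month - 1) // 3 + 1,
        f"MONTH_{column}": moment.month,
        f"WEEK_{column}": iso[1],
        f"WEEKDAY_{column}": iso[2],  # Monday is 1, Sunday is 7
        f"DAY_{column}": moment.day,
        f"HOUR_{column}": moment.hour,
    }
```

For 2022/03/12 this yields year 2022, quarter 1, month 3, week 10, weekday 6, day 12 and hour 0, matching the second row of the example above.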

Remove datetime columns

Once we have done time component encoding, for our algorithm the datetime column is no longer needed.

    training_df = remove_datetime_columns(training_df)

Handle holdout dataframe

If the holdout dataframe exists, we perform the same feature engineering on it as the training dataframe.

    if holdout_df is not None:
        holdout_df, _ = time_component_encoding(holdout_df)
        holdout_df = remove_datetime_columns(holdout_df)

One-hot encoding

One-hot encoding is another encoding technique to transform categorical columns. More precisely, it encodes a categorical column into multiple numeric columns, each representing one value in the categorical column.

This encoding can be invoked by doing:

    training_df, holdout_df, encoded_columns = one_hot_encoding(training_df, holdout_df)

To transform categorical columns, choose either one-hot encoding or label encoding, not both.
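For comparison with label encoding, here is a tiny pure-Python sketch of what one-hot encoding does to a single column. The generated column names (column name, underscore, category) are an assumption made for illustration and may not match what one_hot_encoding in exodusutils produces.

```python
from typing import Dict, List


def one_hot_encode(values: List[str], column: str) -> Dict[str, List[int]]:
    """Expand one categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {
        # Assumption: new columns are named "<column>_<category>".
        f"{column}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }
```

A column with k distinct categories becomes k numeric columns, so one-hot encoding is usually better suited to columns with few categories, whereas label encoding keeps a single column regardless of k.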

Extract unique labels

In order to calculate cross validation scores properly for classification problems, we need to find the unique categorical values in the target column. This is done in the code snippet below:

        # the unique labels in the target column;
        # used for calculating scores
        labels = pd.unique(training_df[request.target.name])

Calculate cross validation scores

Next, we calculate the cross validation scores for the model. The procedure is as follows:

  1. Cut the training dataframe into several sub-frames
  2. For each fold, train a machine learning model on that fold's training sub-frame
  3. For each of those trained models, measure its performance by comparing its predictions against that fold's held-out test sub-frame

The corresponding code is the following snippet:

        # Cut the training dataframe into cross-validation frames
        cv_frames = CVFrames.iid(
            training_df, request.folds, request.get_validation_data()
        )

        # Use the cross-validation frames to calculate the cv scores
        folds = [do_cv_fold(cv_frame, request.target, labels) for cv_frame in cv_frames.frames]
        cv_scores = CVScores(folds=folds)
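How the sub-frames are cut can be sketched without the library. The fold_assignment_column_name description quoted earlier says that, when no fold assignment column is given, Exodus cuts folds in a modulo fashion. The sketch below shows one reading of that rule on row indices. It is not the CVFrames implementation, ignores validation data, and the function name modulo_folds is hypothetical.

```python
from typing import List, Tuple


def modulo_folds(n_rows: int, folds: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) for each fold.

    Assumption: row i belongs to fold i % folds, and each fold in turn
    serves as the test set while the remaining rows form the training set.
    """
    # The request schema shown earlier restricts folds to the range 2..10.
    if not 2 <= folds <= 10:
        raise ValueError("folds must be between 2 and 10")
    splits = []
    for fold in range(folds):
        test = [i for i in range(n_rows) if i % folds == fold]
        train = [i for i in range(n_rows) if i % folds != fold]
        splits.append((train, test))
    return splits
```

With 6 rows and 3 folds, the first split trains on rows 1, 2, 4, 5 and tests on rows 0 and 3; every row is used for testing exactly once across all folds.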

We have provided several helper functions to make this a lot easier for you: the sub-frame splitting and the score calculations are all done by helpers in the exodusutils package! The only place you need to modify is the do_cv_fold function:

def do_cv_fold(train_frames: TrainFrames, target: Column, labels: np.ndarray):
    """
    Does a cross validation fold. Includes training a cross validation model and calculating the score for the model.

    Parameters
    ----------
    train_frames : TrainFrames
        The frames for this cross validation fold.
    target : Column
        The target column.
    labels : np.ndarray
        The unique classes that are involved in the classification problem.

    Returns
    -------
    scores
        The scores for this cross validation fold.
    """
    cv_model = train_model(train_frames, target)
    if train_frames.test is None:
        raise ExodusBadRequest(detail=f"Cannot create test frame for this fold")
    return calculate_score(cv_model, train_frames.test, target, labels)

This is a rather simple function, but it calls two other functions: train_model and calculate_score. The former corresponds to the model training part, the latter to the score calculation part. We will go over them in the following sections.

Train the machine learning model

Let's take a look at the train_model function, where we actually train a machine learning model:

def train_model(train_frames: TrainFrames, target: Column):
    """
    Trains a model.

    Parameters
    ----------
    train_frames : TrainFrames
        The frames required during training. Includes the training dataframe, the validation dataframe, and the testing dataframe.
    target : Column
        The target column, tells you the name and the type of the column.

    Returns
    -------
    model
        The trained model.
    """
    train = train_frames.train
    features = train.loc[:, train.columns != target.name]
    targets = train[target.name]
    if target.data_type == DataType.double:
        model = LinearRegression()
    else:
        model = LogisticRegression()
    model.fit(X=features, y=targets)
    return model

Here, we extract the sub-frame that we will train the model with:

    train = train_frames.train

And then split the sub-frame into one dataframe with features to train with, and another one with only the target values to fit the machine learning model:

    features = train.loc[:, train.columns != target.name]
    targets = train[target.name]

If we are dealing with a regression problem, where the target column consists of numeric values, we use the LinearRegression algorithm. Otherwise we are dealing with a classification problem, and we use the LogisticRegression algorithm.

    if target.data_type == DataType.double:
        model = LinearRegression()
    else:
        model = LogisticRegression()

After we decide which algorithm to use, we can fit the model, and then return it:

    model.fit(X=features, y=targets)
    return model

Implementing your own thing

Most of the code here should be changed by you. In particular these are the things you need to consider:

  • Perhaps your machine learning algorithm is able to handle both classification and regression problems
  • Your model might need to filter out some more features
  • Instead of fit, your model might need to do something else
  • Your model might require a validation dataframe, which can be retrieved via train_frames.validation

Either way, the content of this function should only serve as an example. It is up to you to decide what the machine learning model should actually be.

Calculate a single set of scores

After we've trained a model, we want to be able to measure its performance. In our template this is done in the calculate_score function:

def calculate_score(
    model: Union[LinearRegression, LogisticRegression], df: pd.DataFrame, target: Column, labels: np.ndarray
) -> Scores:
    """
    Calculates the scores for a frame and a model.

    The type of the model here should be modified.

    Parameters
    ----------
    model : Unknown
        The model to score with.
    df : pd.DataFrame
        The frame to score with.
    target : Column
        The target column.
    labels: np.ndarray
        The unique classes that were present in the training dataframe in a classification problem

    Returns
    -------
    Scores
        The scores for the model and the frame.
    """
    features = df.loc[:, df.columns != target.name]
    predicted = model.predict(X=features)
    actual = np.array(df[target.name].values)
    if target.data_type == DataType.double:
        # model type is LinearRegression here
        if not isinstance(model, LinearRegression):
            raise RuntimeError(
                f"Got {model.__class__} when it should be LinearRegression"
            )
        return RegressionScores.get_scores(predicted, actual)
    else:
        # model type is LogisticRegression here
        if not isinstance(model, LogisticRegression):
            raise RuntimeError(
                f"Got {model.__class__} when it should be LogisticRegression"
            )
        pred_proba = model.predict_proba(features)
        return ClassificationScores.get_scores(predicted, pred_proba, actual, labels)

Let's go through this step by step:

  1. We extract the features from the given dataframe:
        features = df.loc[:, df.columns != target.name]
    
  2. We predict using the trained model and the extracted features:
        predicted = model.predict(X=features)
    
  3. We extract the actual values from the given dataframe:
        actual = np.array(df[target.name].values)
    
  4. If it is a regression problem, use the RegressionScores helper class to calculate the scores:
            return RegressionScores.get_scores(predicted, actual)
    
  5. Otherwise we use the ClassificationScores helper class to calculate:
            pred_proba = model.predict_proba(features)
            return ClassificationScores.get_scores(predicted, pred_proba, actual, labels)
    

The RegressionScores and ClassificationScores helper classes will generate scores for you via the get_scores method. All you need to do is provide the arguments.

Implementing your own thing

Just like the train_model function, a lot of this depends on which model you chose to use. Maybe your model has neither a predict method nor a predict_proba method, or maybe it needs more information than just the features to predict. Either way, it is up to you to decide what should be in this function. Once you have figured out how to predict using the trained model, the helper classes will automatically generate the scores.

Save the model

Let's take a look at the following code snippet in the train_iid method:

        # Save the model info into MongoDB
        model_info_id = ModelInfo.save(
            name=self.name,
            target=request.target,
            model=model,
            encoders=encoders,
            time_components=components,
            feature_types=request.feature_types,
            mongo=self.mongo,
            gridfs=self.gridfs,
            collection=self.collection,
        )

And in model_info.py, the save method for ModelInfo is defined as:

    @classmethod
    def save(
        cls,
        name: str,
        target: Column,
        model: Union[LinearRegression, LogisticRegression],
        encoders: Dict[str, LabelEncoder],
        time_components: List[str],
        feature_types: List[Column],
        mongo: MongoInstance,
        gridfs: GridFS,
        collection: Collection,
    ) -> str:

        model_id = save_sklearn_stuff(model, gridfs)
        saved_encoders = {k: save_sklearn_stuff(v, gridfs) for k, v in encoders.items()}

        model_info = cls(
            name=name,
            target=target,
            model_id=model_id,
            encoder_ids=saved_encoders,
            time_components=time_components,
            feature_types=feature_types,
        )

        doc = model_info.dict()
        for field in model_info.encrypted_fields:
            if doc.get(field) is None:
                raise ExodusError(f'field = "{field}" is not a valid key for ModelInfo')
            doc[field] = mongo.encrypt(doc[field])
        return str(collection.insert_one(doc).inserted_id)

The save method returns a stringified ObjectId, which you should pass back to the user so that they can identify the model that was just trained.

The save method can be split into 3 parts:

  1. Saving scikit-learn stuff
  2. Creating a ModelInfo instance
  3. Encrypting the ModelInfo instance and storing it into MongoDB

Saving scikit-learn stuff

The maximum size of a single MongoDB document is 16 MB. Unfortunately, it is not unlikely that your machine learning model will exceed that limit. To handle this, we store the machine learning model and the label encoders in GridFS, a component of MongoDB that is designed specifically to handle this scenario.

def save_sklearn_stuff(stuff: Any, gridfs: GridFS) -> str:
    return str(gridfs.put(pickle.dumps(stuff)))

We use this function to store scikit-learn artifacts, namely the machine learning model and the label encoders. The function is actually very simple:

  1. Dumps the artifact into a byte sequence using pickle, Python's built-in serializer
  2. Stores the byte sequence in GridFS and retrieves an ObjectId
  3. Converts that ObjectId into a string, because an ObjectId does not play well with built-in Python classes
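The round trip can be tried out without a database. The sketch below replaces GridFS with a hypothetical in-memory store (InMemoryFileStore) so that it runs on its own; save_artifact mirrors the three steps above, and load_artifact is an assumed counterpart that the source does not show.

```python
import pickle
import uuid
from typing import Any, Dict


class InMemoryFileStore:
    """Hypothetical stand-in for GridFS: maps generated ids to raw bytes."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        file_id = uuid.uuid4().hex
        self._files[file_id] = data
        return file_id

    def get(self, file_id: str) -> bytes:
        return self._files[file_id]


def save_artifact(artifact: Any, store: InMemoryFileStore) -> str:
    """Pickle the artifact, store the bytes, and return the id as a string."""
    return str(store.put(pickle.dumps(artifact)))


def load_artifact(file_id: str, store: InMemoryFileStore) -> Any:
    """Fetch the stored bytes and unpickle them (assumed counterpart)."""
    return pickle.loads(store.get(file_id))
```

With the real GridFS client, the stored string would need to be converted back into an ObjectId, and the returned file object read into bytes, before unpickling. Only unpickle data you trust, since pickle can execute arbitrary code while loading.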

Notice there is a FIXME beneath the comments. It warns you that a pickled artifact is tied to the libraries, and the library versions, that produced it. This makes backward compatibility virtually impossible to guarantee if you ever upgrade a package that you've installed. Luckily that is not something you should need to do often, if at all. Still, if you find yourself needing to upgrade a package, and you've pickled something generated by that package, chances are you are better off creating a new model that makes use of the newer package.

Store the necessary information

Let's take a look at the declaration of the ModelInfo class:

class ModelInfo(BaseModel):
    name: str
    target: Column
    model_id: str
    encoder_ids: Dict[str, str]
    time_components: List[str]
    feature_types: List[Column]

    encrypted_fields: List[str] = Field(
        default=["target", "encoder_ids", "time_components", "feature_types"],
        exclude=True,
    )

    model: Union[LinearRegression, LogisticRegression] = Field(
        default=None, exclude=True
    )
    encoders: Dict[str, LabelEncoder] = Field(default=None, exclude=True)

The only ones that are required are name, target, model_id, encoder_ids, time_components, and feature_types. As for the rest:

  • encrypted_fields describes what you need to encrypt if encrypted is set to true in config.ini. See Appendix C for more info.
  • model is the actual machine learning model that you just trained. During prediction, you should be using this.
  • encoders holds the actual label encoders. During prediction, you should be using these.

Let's get back to the save method:

        model_info = cls(
            name=name,
            target=target,
            model_id=model_id,
            encoder_ids=saved_encoders,
            time_components=time_components,
            feature_types=feature_types,
        )

Notice how we only fill in the required fields and leave the rest at their defaults.

What should I include in ModelInfo?

Rule of thumb is to include anything you will need for prediction. That could include the following:

  • A portion of the training dataframe
  • Some more feature engineering artifacts
  • Columns that require additional processing

What should I implement?

The developer of the model algorithm should implement the following:

  1. A way to dump the things into an object that can be downloaded (see dump_to_base64 method)
  2. The delete method: when you delete a ModelInfo, you should also remove the relevant machine learning model and feature engineering artifacts
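As a starting point for the first item, here is a minimal sketch of turning an arbitrary payload into downloadable text. The source names a dump_to_base64 method but does not show its signature or expected format, so the free function below, its argument, and the load_from_base64 counterpart are all assumptions.

```python
import base64
import pickle
from typing import Any


def dump_to_base64(payload: Any) -> str:
    """Serialize a payload with pickle and return it as ASCII base64 text."""
    return base64.b64encode(pickle.dumps(payload)).decode("ascii")


def load_from_base64(text: str) -> Any:
    """Reverse dump_to_base64 (hypothetical counterpart)."""
    return pickle.loads(base64.b64decode(text))
```

In the template, the payload would presumably bundle the model, the encoders, and the remaining ModelInfo fields, so that the downloaded text is enough to rebuild the model elsewhere.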

Encrypting & saving the ModelInfo

The MongoInstance class contains a method that encrypts values for you automatically, so encryption becomes almost trivial:

        doc = model_info.dict()
        for field in model_info.encrypted_fields:
            if doc.get(field) is None:
                raise ExodusError(f'field = "{field}" is not a valid key for ModelInfo')
            doc[field] = mongo.encrypt(doc[field])

You turn the ModelInfo into a dictionary (this is the document we will be storing into MongoDB), and replace the fields you want to encrypt with the encrypted value.

If there's a key in the encrypted_fields variable that does not exist in the fields of ModelInfo, an exception will be raised.
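The loop above can be tried without a MongoDB instance. The following is a minimal sketch under stated assumptions: `fake_encrypt` is a hypothetical placeholder for `mongo.encrypt` (it only wraps the value in a tag and is not real encryption), `encrypt_fields` is an illustrative helper name, and `ValueError` stands in for `ExodusError`.

```python
from typing import Any, Dict, List


def fake_encrypt(value: Any) -> str:
    # Hypothetical stand-in for mongo.encrypt; this is NOT real encryption.
    return f"<encrypted:{value!r}>"


def encrypt_fields(doc: Dict[str, Any], encrypted_fields: List[str]) -> Dict[str, Any]:
    # Work on a shallow copy so the caller's dictionary is left untouched.
    result = dict(doc)
    for field in encrypted_fields:
        if result.get(field) is None:
            # Mirrors the check in the loop above (which raises ExodusError).
            raise ValueError(f'field = "{field}" is not a valid key for ModelInfo')
        result[field] = fake_encrypt(result[field])
    return result


doc = {"name": "my_model", "model_id": "abc123"}
encrypted = encrypt_fields(doc, ["model_id"])
print(encrypted["name"])      # left as-is
print(encrypted["model_id"])  # replaced by the placeholder ciphertext
```

Note that, as in the loop above, a field that exists but holds None is rejected in the same way as a field that does not exist at all.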

To save the model, we insert it into the MongoDB collection corresponding to the model algorithm, and return the stringified ObjectId.

        return str(collection.insert_one(doc).inserted_id)

Return the results

After we are done training and saving the model, the last step would be returning the results to the user.

        # Create the `Attributes` instance
        attributes = Attributes(cv_scores=cv_scores, holdout_scores=holdout_scores)

        # Create the response, then return it
        response = TrainRespBody(
            name=self.name,
            attributes=attributes,
            model_id=model_info_id,
            hyperparameters=model.get_params(),
        )

        return response

Here cv_scores and holdout_scores are both calculated earlier during the training process.

The model hyperparameters

Notice there's a field for hyperparameters in TrainRespBody. This is to record the machine learning algorithm's hyperparameters, and should be a Dict. Most of the machine learning algorithms provide a function to expose the set of hyperparameters, but if it is not the case you should look up the possible settings for the machine learning algorithm, and then put them in this field.
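If the algorithm you use does not expose its settings through something like get_params, one option is to collect them yourself. The sketch below assumes a hypothetical MyModel class that keeps its settings as plain attributes; neither MyModel nor collect_hyperparameters is part of exodusutils.

```python
from typing import Any, Dict


class MyModel:
    # Hypothetical model that has no get_params method.
    def __init__(self, learning_rate: float = 0.1, max_depth: int = 3) -> None:
        self.learning_rate = learning_rate
        self.max_depth = max_depth


def collect_hyperparameters(model: Any) -> Dict[str, Any]:
    # Prefer the model's own accessor when it exists (scikit-learn style).
    if hasattr(model, "get_params"):
        return dict(model.get_params())
    # Otherwise fall back to the public instance attributes.
    return {k: v for k, v in vars(model).items() if not k.startswith("_")}


hyperparameters = collect_hyperparameters(MyModel(learning_rate=0.05))
print(hyperparameters)
```

The resulting dictionary can then be passed as the hyperparameters argument of TrainRespBody.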

Piecing it together

Here's the train_iid method in its entirety (with only section markers for comments):

    def train_iid(self, request: TrainIIDReqBody) -> TrainRespBody:

        # SANITIZING #

        sanitized_df = remove_invalid_values(request.get_training_df())
        raw_holdout_df = request.get_holdout_data()
        if raw_holdout_df is not None:
            sanitized_holdout = remove_invalid_values(raw_holdout_df)
        else:
            sanitized_holdout = None

        # FEATURE ENGINEERING #

        training_df, holdout_df, encoders, components = feature_engineering(
            sanitized_df, sanitized_holdout
        )

        labels = pd.unique(training_df[request.target.name])

        # CROSS VALIDATION #

        cv_frames = CVFrames.iid(
            training_df, request.folds, request.get_validation_data()
        )

        folds = [
            do_cv_fold(cv_frame, request.target, labels)
            for cv_frame in cv_frames.frames
        ]
        cv_scores = CVScores(folds=folds)

        # TRAIN THE MODEL #

        train_frames = TrainFrames(train=training_df, validation=request.get_validation_data())
        model = train_model(train_frames, request.target)

        # HOLDOUT #

        if holdout_df is not None:
            holdout_scores = calculate_score(model, holdout_df, request.target, labels)
        else:
            holdout_scores = None

        # SAVING #

        model_info_id = ModelInfo.save(
            name=self.name,
            target=request.target,
            model=model,
            encoders=encoders,
            time_components=components,
            feature_types=request.feature_types,
            mongo=self.mongo,
            gridfs=self.gridfs,
            collection=self.collection,
        )

        # PREPARING RESPONSE #

        attributes = Attributes(cv_scores=cv_scores, holdout_scores=holdout_scores)

        response = TrainRespBody(
            name=self.name,
            attributes=attributes,
            model_id=model_info_id,
            hyperparameters=model.get_params(),
        )

        return response

Several things to note here:

  1. For the final machine learning model, we do not need to calculate its score. Therefore we build a TrainFrames that holds only the training frame and the validation frame, with no testing set (the iid_without_test class method described in Appendix A exists for the same purpose).
  2. Remember how do_cv_fold makes use of the train_model and calculate_score methods? After we have calculated the cross validation scores, we can reuse those methods to train the final machine learning model and to calculate the holdout scores.

We have now successfully trained our machine learning model, persisted it, and obtained a set of scores to evaluate it. In the following section we will see how to predict with the persisted model.

Predicting with a trained model

Once we have successfully trained and stored a model, next we will need to predict something from this model. The general workflow is as follows:

  1. Load the model from MongoDB
  2. Sanitize the prediction input
  3. Apply the feature engineering steps
  4. Predict
  5. Format the predicted results

Load a model

In order to properly predict with a previously trained model, we need to first load the saved model from MongoDB.

Basically, you want to do the inverse of what you did during training: if you stored some LabelEncoders into GridFS, here you need to extract them, and so on.

    @classmethod
    def load(cls, model_id: str, collection: Collection, gridfs: GridFS):
        obj = collection.find_one(filter=identify(model_id))
        if not obj:
            raise ExodusNotFound(f"No model found with id = {model_id}")
        model_info = cls.parse_obj(obj)
        model_info.encoders = {
            k: load_sklearn_stuff(v, gridfs) for k, v in model_info.encoder_ids.items()
        }
        model_info.model = load_sklearn_stuff(model_info.model_id, gridfs)

        return model_info

Compare the above with the code snippet where we store the machine learning model and encoders into GridFS:

        model_id = save_sklearn_stuff(model, self.gridfs)
        saved_encoders = {k: save_sklearn_stuff(v, self.gridfs) for k, v in encoders.items()}

Remember we had fields model and encoders in our ModelInfo? During load, we actually store those in the resulting ModelInfo, so we can guarantee that after loading a ModelInfo, it is safe to access the model, which is the machine learning model, and the encoders, which are the label encoders.
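The save/load symmetry can be illustrated without MongoDB. The sketch below is only an analogy: a plain dictionary stands in for GridFS, and save_object / load_object are hypothetical helpers that use pickle. How save_sklearn_stuff and load_sklearn_stuff actually serialize objects is not stated here, so treat the pickle choice as an assumption.

```python
import pickle
import uuid
from typing import Any, Dict

# Hypothetical in-memory stand-in for GridFS: object id -> serialized bytes.
storage: Dict[str, bytes] = {}


def save_object(obj: Any) -> str:
    # Store the serialized object and hand back the id it was stored under.
    object_id = uuid.uuid4().hex
    storage[object_id] = pickle.dumps(obj)
    return object_id


def load_object(object_id: str) -> Any:
    # Inverse of save_object: look up the bytes and deserialize them.
    return pickle.loads(storage[object_id])


# Saving: keep only the ids, as ModelInfo does with encoder_ids.
encoders = {"color": {"red": 0, "blue": 1}}
encoder_ids = {k: save_object(v) for k, v in encoders.items()}

# Loading: turn every id back into the object it refers to.
restored = {k: load_object(v) for k, v in encoder_ids.items()}
print(restored == encoders)
```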

Sanitizing input

        original_df = request.get_prediction_df(model_info.feature_types)
        sanitized_df = remove_invalid_values(original_df)

This should be self-explanatory. Basically, here your code should be doing what you did to the training dataframe.

Notice how we have access to the prediction dataframe via get_prediction_df method from the PredictReqBody class.

Apply feature engineering

def apply_feature_engineering(df: pd.DataFrame, encoders: Dict[str, LabelEncoder]) -> pd.DataFrame:
    encoded: pd.DataFrame = (df.pipe(apply_label_encoders, encoders).pipe(time_component_encoding))[0]
    return encoded.pipe(fill_nan_with_average, list(encoders.keys())).pipe(remove_datetime_columns)

# ... snipped ...

        df = apply_feature_engineering(sanitized_df, model_info.encoders)

Most of the things you are doing here are exactly the same as the feature_engineering part, except here you need to make sure you sanitize your dataframe properly.

Sanitize the result properly

Consider the following training dataframe, where there is only 1 column:

column
foo
bar
baz

Then after the column has been label encoded, the result becomes:

column
0
1
2

Where foo gets encoded to 0, bar becomes 1, and baz becomes 2.

However, if the prediction dataframe contains a value never seen during training, the encoder will not be able to deduce which label it should encode the value to, and will return a NaN. For example, consider the below dataframe:

column
bar
quax
bar
foo

After we apply our encoder, the result is:

column
1
nan
1
0

If your machine learning algorithm cannot handle NaN properly, then after you've applied the feature engineering encoders there is no way for the algorithm to perform prediction.

In situations like this, a common method is to impute the missing values with a designated special value. In exodusutils, the fill_nan_with_mode method does just that: it extracts the most frequent label for an encoded column, and sets the invalid cells to that most frequent label.

In our example, the final result after we've applied the fill_nan_with_mode method will be:

column
1
1
1
0
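The tables above can be reproduced with a small pure-Python sketch. It is not the exodusutils implementation: fit_labels, encode and fill_with_mode are hypothetical helpers, None stands in for NaN, and codes are assigned in order of first appearance so that they match the tables (a real encoder may order labels differently).

```python
from collections import Counter
from typing import Dict, List, Optional


def fit_labels(values: List[str]) -> Dict[str, int]:
    # Assign codes in order of first appearance, to match the tables above.
    return {label: code for code, label in enumerate(dict.fromkeys(values))}


def encode(values: List[str], mapping: Dict[str, int]) -> List[Optional[int]]:
    # Values never seen during training become None (our stand-in for NaN).
    return [mapping.get(v) for v in values]


def fill_with_mode(codes: List[Optional[int]]) -> List[int]:
    # Replace every missing cell with the most frequent known code.
    # Assumes the column has at least one known value.
    known = [c for c in codes if c is not None]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if c is None else c for c in codes]


mapping = fit_labels(["foo", "bar", "baz"])
encoded = encode(["bar", "quax", "bar", "foo"], mapping)
filled = fill_with_mode(encoded)
print(encoded)
print(filled)
```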

Predict

After we've sanitized the input and applied feature engineering, it is finally time to predict with our model. Note that we already predicted with a trained model during cross validation, so here we are more or less repeating what we did back then.

        features = df.loc[:, df.columns != model_info.target.name]
        predicted = model_info.model.predict(features)

Here we extract the features we are using, then predict using the selected features.

Format the results

After we have our raw prediction results, we need to piece it back into a JSON string.

        if model_info.target.data_type != DataType.double:
            predicted = model_info.encoders[model_info.target.name].inverse_transform(predicted)

        sanitized_df[PREDICTION_COLNAME] = predicted

        if model_info.target.data_type != DataType.double:
            classes = list(model_info.encoders[model_info.target.name].classes_)
            predict_proba = pd.DataFrame(model_info.model.predict_proba(features), columns=classes)
            sanitized_df = pd.concat([sanitized_df, predict_proba], axis=1).reset_index(drop=True)
        else:
            classes = list()

        columns_to_drop = [
            c
            for c in list(sanitized_df.columns)
            if c not in request.keep_columns and c != PREDICTION_COLNAME and c not in classes
        ]

        results: List[Dict[str, Any]] = json.loads(
            str(sanitized_df.drop(columns=columns_to_drop).to_json(orient="records"))
        )

The return value of the predict method contains only one field, of type List[Dict[str, Any]], where each Dict represents the prediction results for a single row of the prediction input dataframe. That being said, it is easier to manipulate a dataframe than to build each Dict by hand, so let's see how we can achieve that.

Turn our predicted value back to strings

Remember that we did label_encoding during feature engineering? This will turn our target column into a column with numeric class labels, and we need to turn those back into actual strings.

To do that, we make use of the LabelEncoder.inverse_transform method:

        if model_info.target.data_type != DataType.double:
            predicted = model_info.encoders[model_info.target.name].inverse_transform(predicted)

If your model didn't do label_encoding during feature engineering, you can omit this step.

Attach the predicted values

Then we attach our predicted value to the input dataframe.

        sanitized_df[PREDICTION_COLNAME] = predicted

If we are dealing with a classification problem, each prediction should come with the prediction probabilities of all the possible classes. To do that, we create a new dataframe containing the prediction probabilities, and then concatenate the new dataframe to our result dataframe.

Note that we have to use the strings instead of the model's classes, which are just a bunch of numbers.

        if model_info.target.data_type != DataType.double:
            classes = list(model_info.encoders[model_info.target.name].classes_)
            predict_proba = pd.DataFrame(model_info.model.predict_proba(features), columns=classes)
            sanitized_df = pd.concat([sanitized_df, predict_proba], axis=1).reset_index(drop=True)
        else:
            classes = list()

We also need to keep track of the classes we've added to the result dataframe.
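To see why the probability columns are named after the encoder's string classes, here is a small sketch that uses plain dictionaries instead of a dataframe. The rows, class names and probabilities are made-up illustration data, and attach_probabilities is a hypothetical helper.

```python
from typing import Any, Dict, List

# Made-up inputs: two result rows, the encoder's class names (strings),
# and one probability per class for each row.
rows: List[Dict[str, Any]] = [
    {"id": 1, "prediction": "dog"},
    {"id": 2, "prediction": "cat"},
]
classes: List[str] = ["cat", "dog"]
probabilities: List[List[float]] = [[0.2, 0.8], [0.9, 0.1]]


def attach_probabilities(
    rows: List[Dict[str, Any]],
    classes: List[str],
    probabilities: List[List[float]],
) -> List[Dict[str, Any]]:
    # Pair each row with its probability vector, labelled by class name.
    return [
        {**row, **dict(zip(classes, probs))}
        for row, probs in zip(rows, probabilities)
    ]


with_probabilities = attach_probabilities(rows, classes, probabilities)
print(with_probabilities[0])
```

Had the numeric model classes been used as column names, the keys would have been 0 and 1, which tell the user nothing about which class they refer to.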

Drop columns we don't need

Then we need to filter out the columns we don't need for the result.

        columns_to_drop = [
            c
            for c in list(sanitized_df.columns)
            if c not in request.keep_columns and c != PREDICTION_COLNAME and c not in classes
        ]

We only keep the columns that are explicitly specified by the user (via the keep_columns field in the request), the prediction result column (the PREDICTION_COLNAME variable, which is an alias for the string "prediction"), and the classes for the target column if it's a classification problem.
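The filtering rule can be checked in isolation with a short sketch. The column names below are made up, and the function simply wraps the list comprehension shown above.

```python
from typing import List

PREDICTION_COLNAME = "prediction"  # the literal this constant is an alias for


def columns_to_drop(
    columns: List[str], keep_columns: List[str], classes: List[str]
) -> List[str]:
    # Drop everything that is neither requested by the user, nor the
    # prediction column, nor one of the per-class probability columns.
    return [
        c
        for c in columns
        if c not in keep_columns and c != PREDICTION_COLNAME and c not in classes
    ]


columns = ["id", "age", "city", "prediction", "cat", "dog"]
dropped = columns_to_drop(columns, keep_columns=["id"], classes=["cat", "dog"])
print(dropped)
```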

Piecing it together

Here's the predict method in its entirety (with only section markers for comments):

    def predict(self, request: PredictReqBody) -> PredictRespBody:

        # LOADING #

        model_info = ModelInfo.load(request.model_id, self.collection, self.gridfs)

        # SANITIZING #

        original_df = request.get_prediction_df(model_info.feature_types)
        sanitized_df = remove_invalid_values(original_df)

        # APPLYING FEATURE ENGINEERING #

        df = apply_feature_engineering(sanitized_df, model_info.encoders)

        # PREDICTING #

        features = df.loc[:, df.columns != model_info.target.name]
        predicted = model_info.model.predict(features)

        # FORMATTING RESULTS #

        if model_info.target.data_type != DataType.double:
            predicted = model_info.encoders[model_info.target.name].inverse_transform(predicted)

        sanitized_df[PREDICTION_COLNAME] = predicted

        if model_info.target.data_type != DataType.double:
            if not isinstance(model_info.model, LogisticRegression):
                raise RuntimeError(
                    f"Got {model_info.model.__class__} when it should be LogisticRegression"
                )
            classes = list(model_info.encoders[model_info.target.name].classes_)
            predict_proba = pd.DataFrame(
                model_info.model.predict_proba(features), columns=classes
            )
            sanitized_df = pd.concat([sanitized_df, predict_proba], axis=1).reset_index(
                drop=True
            )
        else:
            classes = list()

        columns_to_drop = [
            c
            for c in list(sanitized_df.columns)
            if c not in request.keep_columns
            and c != PREDICTION_COLNAME
            and c not in classes
        ]

        # PREPARING RESPONSE #

        results: List[Dict[str, Any]] = json.loads(
            str(sanitized_df.drop(columns=columns_to_drop).to_json(orient="records"))
        )

        response = PredictRespBody(prediction=results)
        return response

Testing the model

There are several ways to verify that your model algorithm is able to run properly:

  • Running unit tests
    • Basic tests to ensure basic functionalities work. If any of the unit tests fails, something is wrong with your code.
    • You are encouraged to create new unit tests, since they are much simpler to use than setting up your own script.
  • Using the landing page
    • If you just want to test whether your code can work for a piece of data, use your browser and navigate to the main page of the model algorithm.
    • You can run experiments through the landing page, and then inspect the output of the model algorithm service by the poe watch command.
  • Using the script / writing your own scripts
    • Useful if you don't have access to a browser.
    • Useful if you need to set things up in a rather particular way.

Running unit tests

Setup

Before you run the tests, you need to fire up the MongoDB container. This can be done via the below command:

PORT=1234 docker-compose up -d mongo

The PORT environment variable will not actually be used; docker-compose just does not realize that this variable is unnecessary when you are only starting the MongoDB service.

Your MongoDB service will be running, and can be reached at port 27018.

Make sure you have a MongoDB service up and running!

Otherwise all the unit tests will fail due to model algorithm timing out!

Run the test

Via command line

poe test

This will run all the unit tests within tests/ directory, which are the files that contain a test_ prefix in their names.

Via VSCode

  1. Press ctrl+shift+p, or go to Help > Show All Commands
  2. Type in Python: Configure Tests, select that and then press enter
  3. Select pytest
  4. Select tests
  5. Go to the testing panel

If no tests were run before, the green ticks will not be there.

To run the tests, right click on the exodus_model_template symbol, and select Debug Test.

Write your own test

Most likely you want to make sure your model algorithm can run with some dataset you already have. To test whether your model algorithm can train and predict upon that dataset:

  1. Create a directory in tests/datasets. Let's say tests/datasets/foo/
  2. Move your training CSV file into tests/datasets/foo/, and rename it to tests/datasets/foo/train.csv
  3. Move your prediction CSV file into tests/datasets/foo/, and rename it to tests/datasets/foo/prediction.csv
  4. If you have a holdout dataset, move that file into tests/datasets/foo/, and rename it to tests/datasets/foo/holdout.csv
  5. Create a new file called tests/datasets/foo/meta.json, which should contain the following fields:
    • "target_column_name": the name of the target column
    • "features": a dict from the column names to the column types. The types should all be either one of "string" or "double"
  6. Write the test method by extending the tests/test_ml.py file:
    def test_foo():
        train_predict_delete("./tests/datasets/foo")
    
    It really is as simple as that!
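As an illustration of step 5, the sketch below writes a meta.json with the two fields described above. The column names are made up, write_meta is a hypothetical helper, and whether the target column must also be listed under "features" is an assumption you should verify against an existing dataset in tests/datasets.

```python
import json
from pathlib import Path

# Made-up dataset description: two numeric features and a string target.
meta = {
    "target_column_name": "species",
    "features": {
        "petal_length": "double",
        "petal_width": "double",
        "species": "string",  # assumption: the target is listed here as well
    },
}


def write_meta(directory: str) -> Path:
    # Create the dataset directory if needed and write meta.json into it.
    path = Path(directory) / "meta.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(meta, indent=2))
    return path
```

Calling write_meta("tests/datasets/foo") would then produce tests/datasets/foo/meta.json.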

Debugging the model algorithm

Let's say your unit test failed, and you want to figure out why. The common way to do this would be to add a lot of print() and hope that uncovers the issue, but actually there's a much more efficient way to debug: using a debugger.

How to debug with a debugger

  1. You set a breakpoint in your code. The breakpoint signals the debugger to pause execution when it reaches that line, so that you can examine the entire execution context
  2. You use the debugger console to print various pieces of information
  3. You can jump between stack frames (i.e. to the method that is calling the method you're in right now)
  4. You can step through the execution, so that instead of running everything and failing immediately you can actually see where the program failed

VSCode

Setting breakpoint

It's just clicking the empty margin on the left of the line number:

Here I set a breakpoint right before the train_iid method returns.

Executing the test

You can run any specific test by expanding the drop down menu in the testing panel, right-clicking the target test method, and selecting Debug Test.

Once it starts running, the program will pause at the line of your breakpoint.

Toggle the debug console, or press ctrl+shift+y.

In the debug console, type in whatever you want to see. For example, here I typed in request.attributes.

Once you are done, either press the Continue button on the top of the screen (or press <F5>), or step through the execution, or do whatever you want. When the program finishes running, you can see the test result in the debug console.

Non VSCode

I'd say using VSCode is the better choice here, but if you insist you can insert pdb breakpoints in your code (here I will insert a breakpoint right before train_iid returns):

        response = TrainRespBody(
            name=self.name,
            attributes=attributes,
            model_id=model_info_id,
            hyperparameters=model.get_params(),
        )

        # THIS LINE IS THE BREAKPOINT
        import pdb; pdb.set_trace()

        return response

For more information, visit the pdb documentation page.

Debugging the model algorithm with actual requests

Right now we haven't figured out a way to debug FastAPI applications in a container, so to debug the application with a debugger you need to do the following:

  1. Start MongoDB service
  2. In app.py, modify the declaration of model_algorithm variable to the following:
    model_algorithm = ModelAlgorithm(host="localhost", port="27018")
    
  3. Run the FastAPI application
    • In VSCode: insert the following code snippet to .vscode/launch.json (can be opened via Show All Commands > Open 'launch.json')
      {
          "name": "Python: FastAPI",
          "type": "python",
          "request": "launch",
          "module": "uvicorn",
          "args": [
              "exodus_model_template.app:app"
          ],
          "jinja": true,
          "justMyCode": true
      }
      
    • In command line: run the app via the following command:
      poetry run python -m pdb exodus_model_template/app.py
      

Debug the deployed, containerized model algorithm

TODO

Using the landing page

See Run the model template.

What if I don't want to use port 3000 for the webapp?

In docker-compose.yml:

services:
# ... snipped ...
  web_app:
    ports:
      - "3000:3000" # Change this to "{SOME_PORT}:3000"

Then in exodus_model_template/app.py:

# ... snipped ...
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"], # Change this to "http://localhost:{SOME_PORT}"

Can I use the webapp to access a model algorithm that's not running on localhost?

This is currently not supported.

Publishing your model

Once you are done implementing the model, and all unit tests have passed, it is time to publish your work.

Before you do it, make sure you have access to MoBagel's Docker harbor.

Then run the command:

poe save

This builds the model algorithm into a docker image, pushes that image to MoBagel's Docker harbor, then saves it to a tarball.

To incorporate it into the Exodus structure, simply copy the tarball to the main repo's images/ directory, or set the main repo's configuration so that it pulls images from the MoBagel harbor.

Modifying the model algorithm

When you're modifying your model algorithm, there are some scenarios where you might need to pay more attention:

  • When you have modified the ModelInfo of a model algorithm that's been published already
  • When you plan to publish a model algorithm that's been published in the past

Migrating the ModelInfo

If you modify the ModelInfo of a model algorithm that's been published before, it is then your responsibility to make sure it is consistent between different versions. To ensure this, you need to write a migration script.

Creating a migration script

poetry run python scripts/make_migration_script.py [SCRIPT NAME]

Let's say the name is foo. This is what would happen:

❯ poetry run python scripts/make_migration_script.py foo
Created: exodus_model_template/migrations/Migration_1646728080972_foo.py
Updated: exodus_model_template/migrations/all_migrations.py

A migration script called exodus_model_template/migrations/Migration_1646728080972_foo.py has been generated.

Implementing the migration script

Suppose your old ModelInfo is defined like this:

class ModelInfo(BaseModel):
    foo: str
    bar: int

And you want to change it to this:

class ModelInfo(BaseModel):
    foo: str
    bar: int
    baz: str

where baz should be foo concatenated with a stringified bar.

In the migration script, there are mainly two methods: up and down:

class Migration(BaseMigration):
    def __init__(self, timestamp: int = 1646728080972, name: str = "foo") -> None:
        super().__init__(timestamp, name)

    def up(self, collection: Collection) -> None:
        # TODO implement this method
        pass

    def down(self, collection: Collection) -> None:
        # TODO implement this method
        pass

For up, you want to find all the ModelInfo instances that don't have a field called baz. An example implementation might be this:

    def up(self, collection: Collection) -> None:
        for doc in collection.find({"baz": {"$exists": False}}):
            collection.update_one(filter={"_id": doc["_id"]}, update={"$set": {"baz": doc["foo"] + str(doc["bar"])}})

As for down, you do the exact opposite and unset the baz field:

    def down(self, collection: Collection) -> None:
        for doc in collection.find({"baz": {"$exists": True}}):
            collection.update_one(filter={"_id": doc["_id"]}, update={"$unset": {"baz": ""}})
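The effect of up and down can be simulated on plain dictionaries, which is handy for reasoning about a migration before touching a real collection. This sketch does not use pymongo: a list of dicts stands in for the collection, and the membership checks play the role of the $exists filters.

```python
from typing import Any, Dict, List


def up(docs: List[Dict[str, Any]]) -> None:
    # Add baz only to documents that do not have it yet ($exists: False).
    for doc in docs:
        if "baz" not in doc:
            doc["baz"] = doc["foo"] + str(doc["bar"])


def down(docs: List[Dict[str, Any]]) -> None:
    # Remove baz from every document that has it ($exists: True).
    for doc in docs:
        doc.pop("baz", None)


docs = [{"foo": "a", "bar": 1}, {"foo": "b", "bar": 2, "baz": "custom"}]

up(docs)
after_up = [dict(d) for d in docs]  # snapshot for inspection

up(docs)  # running up a second time changes nothing
after_second_up = [dict(d) for d in docs]

down(docs)
print(after_up)
print(docs)
```

Because up skips documents that already have baz, running it more than once is harmless, which is a useful property for a migration to have.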

Invoking the migration scripts

Exodus main is in charge of running the migrations. However, if you want to test your script manually, do this:

curl -X POST "http://{MODEL ALGORITHM IP}/migrate" -d '{"action":"up"}' -H "Content-Type: application/json"

Change "up" to "down" if that's what you want to test.

Bumping the version

When you're done modifying your model algorithm, you need to update the version before saving. Say you are now at version 1.2.3, then you need to do this before saving it to harbor:

poetry version 1.2.4

This bumps the version to 1.2.4.

Appendix A: Helper functions and classes in exodusutils

This section includes the descriptions for the helper classes and methods defined in exodusutils.

Enums

TimeUnit

The time unit that Exodus accepts. Has multiple helper functions:

  • to_seasonality: how many time units there are in a recurring pattern. For example, the seasonality of time unit Month would be 12, since there are 12 months in a year.
  • to_resample_rule: returns a string literal, which will then be used by pandas to resample the dataset.
  • to_time_delta_unit: returns a string literal, which will then be used by pandas as time delta unit.
  • format_datetime: formats a datetime to str according to this TimeUnit.
  • to_prediction_format: returns the format required for prediction for this TimeUnit.

DataType

The data type for a column in a dataframe. Could be either one of the following:

  • double: the column is a numeric column.
  • timestamp: the column is a datetime column.
  • string: the column is a categorical column.
  • id: the column is for IDs. Not really used.

It also comes with 2 helper methods:

  • to_type: returns the Python type for this DataType. For instance, DataType.double.to_type() == float.
  • from_pandas: turns the pandas type or array into a DataType instance. For example, DataType.from_pandas(np.array([1,2,3,4,5])) == DataType.double

Constants

DATETIME_COMPONENTS_PREFIX

The prefixes for datetime component columns. For example, if a column named foo is a datetime column, then after time_component_encoding, we get the following new columns:

  • YEAR_foo
  • QUARTER_foo
  • MONTH_foo
  • WEEK_foo
  • WEEKDAY_foo
  • DAY_foo
  • HOUR_foo
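The naming scheme can be sketched as follows. The PREFIXES list is copied from the column names above; the actual form of the DATETIME_COMPONENTS_PREFIX constant in exodusutils may differ, and component_column_names is a hypothetical helper.

```python
from typing import List

# Prefixes as they appear in the column names listed above (assumed form).
PREFIXES = ["YEAR", "QUARTER", "MONTH", "WEEK", "WEEKDAY", "DAY", "HOUR"]


def component_column_names(column: str) -> List[str]:
    # One new column per datetime component, named PREFIX_column.
    return [f"{prefix}_{column}" for prefix in PREFIXES]


names = component_column_names("foo")
print(names)
```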

PREDICTION_COLNAME

The name of the prediction column. Essentially the string literal "prediction".

Schemas

Column

Represents a column in the dataframe. Contains the name and the DataType.

Attribute

Represents a pair of a name (a str) and a value (a float). Used to represent a model algorithm's attribute.

Requests

TrainIIDReqBody

Fields:

  • training_data: the raw bytes representing the training data. Is a CSV file as a string. Should not be accessed directly.
  • feature_types: the columns in the training data.
  • target: the target Column.
  • folds: the number of folds during cross validation.
  • validation_data: the optional raw bytes representing the validation data. Should not be accessed directly.
  • holdout_data: the optional raw bytes representing the holdout data. Should not be accessed directly.
  • fold_assignment_column: an optional field naming a column that holds the fold each row belongs to. If empty, exodusutils cuts the cross validation folds by taking each row index modulo folds.

Methods:

  • get_feature_names: returns the names of the columns.
  • get_training_df: returns the parsed pandas DataFrame generated from the bytes in training_data. Note that the data types of the columns will correspond to the ones specified in feature_types.
  • get_validation_data: returns the parsed pandas DataFrame generated from the bytes in validation_data, or None if there's no validation data.
  • get_holdout_data: returns the parsed pandas DataFrame generated from the bytes in holdout_data, or None if there's no holdout data.

PredictReqBody

Fields:

  • model_id: the ID of the model you want to predict with.
  • data: the raw bytes representing the prediction data. Is a JSON string. Should not be accessed directly.
  • threshold: the threshold for classification predictions.
  • keep_columns: the columns to keep in the prediction results.

Methods:

  • get_prediction_df: takes in a list of Columns, and returns a pandas DataFrame representing the prediction data. The list of Column should be stored into MongoDB once training is completed.

Scores

There are 2 types of scores: RegressionScores and ClassificationScores. Instead of constructing them directly, you should use the class methods of these two classes:

  • For RegressionScores, create it via RegressionScores.get_scores
  • For ClassificationScores, create it via ClassificationScores.get_scores

RegressionScores

Contains the following metrics:

  • mse (mean square error)
  • rmse (root mean square error)
  • rmsle (root mean square logarithmic error)
  • mae (mean absolute error)
  • r2 (r squared)
  • deviance (deviance, but here it's the same as r2)
  • mape (mean absolute percentage error)
  • wmape (weighted mean absolute percentage error)
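For reference, several of these metrics can be computed by hand as shown below. This sketch uses the common textbook definitions and is not the exodusutils implementation; RegressionScores.get_scores may differ in details such as how zero values are handled.

```python
import math
from typing import Dict, List


def regression_metrics(actual: List[float], predicted: List[float]) -> Dict[str, float]:
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE averages per-row relative errors, so every actual value must be non-zero.
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    # WMAPE divides the total absolute error by the total absolute actual value.
    wmape = sum(abs(e) for e in errors) / sum(abs(a) for a in actual)
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": mae,
        "mape": mape,
        "wmape": wmape,
    }


scores = regression_metrics(actual=[1.0, 4.0], predicted=[2.0, 2.0])
print(scores)
```

In this example the two errors are -1 and 2, so mape weighs the small row heavily (1/1 and 2/4), while wmape pools them (3/5).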

ClassificationScores

Could be either BinomialScores or MultinomialScores.

BinomialScores

Contains the following metrics:

  • logloss
  • mean_per_class_error (mean per class error)
  • misclassification
  • auc (area under curve)
  • lift_top_group

MultinomialScores

Contains the following metrics:

  • logloss
  • mean_per_class_error (mean per class error)

Exceptions

Instead of throwing ValueError or HTTPException, you can use the following:

If the input is malformed:

  • ExodusBadRequest
  • ExodusForbidden
  • ExodusMethodNotAllowed
  • ExodusNotFound

If the model algorithm is supposed to fail, throw ExodusError instead of other exceptions.

Frames

See Appendix B for more information.

TrainFrames

Contains the following fields:

  • train: the training set
  • validation: the validation set
  • test: the testing set

Use the iid_without_test class method to create a TrainFrames instance without the testing set.

CVFrames

Contains a list of TrainFrames.

Use the iid class method to create an instance of CVFrames.

Feature engineering

  • one_hot_encoding
  • label_encoding
  • time_component_encoding

See previous chapters for explanation.

Frame manipulation

Contains methods that apply feature engineering artifacts to the prediction dataframe.

  • apply_label_encoders
  • apply_one_hot_encoding

Also contains these helper methods:

  • to_numeric_features_df: extracts the numeric features from the dataframe, and returns them as a new dataframe.
  • remove_invalid_values: removes the invalid values from the dataframe.
  • remove_datetime_columns: removes all columns of dtype datetime64.
  • fill_nan_with_mode: fills the NaN cells in each numeric column with the most frequent value (aka mode) of that column.

Miscellaneous

  • closest_string: given a list of strings and one reference string, find out which string is the closest to the reference in the list.
  • remove_invalid_targets: returns a new dataframe that does not have NaN in target column. Does not modify the original dataframe.
  • format_datetime_column_with_unit: turns date column into str based on given time_unit. Will modify the original df.
  • are_valid_values: whether there exists a non-np.nan value in values. Used to test if a fold is useless - if all actual values are np.nan, then the fold is not usable.

How to update exodusutils

poetry add exodusutils@latest    # Or to a specific version

If this command does not update the version of exodusutils, it's probably because the poetry cache still holds your current version of exodusutils. Normally this can be solved by waiting for 15 minutes or so.

Appendix B: How cross validation scores are calculated

To create the frames for cross validation, exodusutils library provides a helper class called CVFrames:

frames: List[TrainFrames] = CVFrames.iid(df, nfolds, validation_df).frames

Here, df is the training dataframe, nfolds is the number of folds you want for the cross validation, and validation_df represents the validation dataframe, which is optional.

In the return type, a TrainFrames is essentially a 3-tuple of pandas DataFrames. The first is the dataframe used to train the cross validation model, the second is the validation set used during training (this is the validation_df argument you passed to CVFrames.iid), and the third is the testing set used to calculate the cross validation score.

Note that a cross validation fold is invalid if it does not have any rows in its testing set, because without them it is not possible to calculate the fold's scores. This can happen if there are not enough rows in the dataframe. In that case, the program raises an exception telling you that your data is invalid.

How does CVFrames cut the cross validation frames?

CVFrames cuts the cross validation frames as follows:

  1. We know each of the rows in the dataframe has an index attached to it.
  2. We split the indices into nfolds sets by calculating row_index % nfolds to decide which fold each row goes to. For example, with 20 rows in the dataframe and nfolds set to 5, the indices are split like this:
    fold_indices = [
      [0, 5, 10, 15],
      [1, 6, 11, 16],
      [2, 7, 12, 17],
      [3, 8, 13, 18],
      [4, 9, 14, 19]
    ]
    
  3. We now have fold_indices, a list of 5 sublists, each holding a set of indices. For the first cross validation fold, we take the rows in the first sublist as the testing set; the rest will be in either the training set or the validation set.
  4. Repeat this process for the rest of the folds.

Appendix C: How to change MongoDB settings

Changing MongoDB service port

By default, the MongoDB service is reachable via port 27018. If you want to change that to something else, say 9487, modify the docker-compose.yml:

services:
  mongo:
    restart: always
    build:
      context: ./mongo
      dockerfile: Dockerfile
    ports:
      - "27018:27017" # Change this to "9487:27017"

If you want to run tests using your own port, modify tests/__init__.py and change the port from 27018 to 9487.

Changing other MongoDB configurations

All of the configuration values are in config.ini.

Encryption

For encryption to work, two settings are required:

  • The encrypted field in config.ini must be true
  • The encryption_key field in config.ini must be a 96-byte string

The encryption key will be stored in MongoDB, in a collection named __keyVault in the database specified in config.ini. The collection should contain at most one document.
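
Before starting the service, it can help to sanity-check these two settings. The sketch below does this with the standard library's configparser. The section name "mongodb", the function name check_encryption_settings, and the interpretation of "96 bytes" as the UTF-8 encoded length of the string are all assumptions; adjust them to match your actual config.ini.

```python
import configparser


def check_encryption_settings(config_text: str, section: str = "mongodb") -> bool:
    """Hypothetical check: return True if encryption is on and the key is 96 bytes.

    Returns False when encryption is switched off, and raises ValueError when
    encryption is on but the key has the wrong length.
    """
    # interpolation=None so that a '%' inside the key is read literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if not parser.getboolean(section, "encrypted", fallback=False):
        return False
    key = parser.get(section, "encryption_key", fallback="")
    key_length = len(key.encode("utf-8"))  # assumption: length is measured in UTF-8 bytes
    if key_length != 96:
        raise ValueError(f"encryption_key must be 96 bytes long, got {key_length}")
    return True


# Placeholder key for demonstration only; use a randomly generated key in practice.
sample_config = "[mongodb]\nencrypted = true\nencryption_key = " + "a" * 96 + "\n"
settings_ok = check_encryption_settings(sample_config)
```

To check a real file, read its contents and pass them in, for example check_encryption_settings(open("config.ini").read()).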

Appendix D: How to view and use the API docs

Visit http://{MODEL ALGORITHM SERVICE IP}/docs in your browser.

Sending requests to the endpoints

You can expand any of the endpoints and try sending requests to them.

Inspecting the expected schema of an endpoint

If you want to see the required schema for a certain endpoint, expand it and click Schema.