Training an agent

Here, we first explain, step by step, the script that prepares and starts the training. Thereafter, we explain the training procedure itself.

Training Script

  1. Create the path for storing the model and all related info like performance metrics and model parameters.

    1. The path is constructed in config/hypers.py and reflects the choice of the environment and some specific hyperparameters

  2. Create the MuJoCo Gym environment

    1. We create multiple parallel instances of the environment specified by the environment id (string) in common/config.py. The number of parallel environments is specified in common/hypers.py.

    2. Each environment is wrapped by a Monitor, a gym wrapper we built to monitor joint kinematics, kinetics as well as performance metrics

    3. The monitored environments are then combined into a SubprocVecEnv from SB3, which runs each environment in its own process.

    4. Finally, a VecNormalize wrapper from SB3 is used to maintain a running mean and running standard deviation of the observations (dimension-wise) and the return for normalization.

    5. In summary, we get VecNormalize(SubprocVecEnv(MonitorWrapper(MimicEnv), n_envs))

  3. Use hyperparameters to create schedules and prepare the configuration dictionary for the network architectures

  4. Create the model: a PPO agent with the specified hyperparameters and configurations

    1. TensorBoard (TB) is also launched in this step. The TB logs go into a dedicated subfolder (tb_logs/) in the model folder. SB3 logs multiple PPO-specific metrics such as losses and entropy. In addition, we log several custom metrics to TB, which are specified in common/callback.py.

  5. Logging: start W&B if necessary and print information about the training procedure and the model to the console.

  6. Save initial model. The model checkpoint at this timestep is called ‘init’.

  7. Start the training. This single command (starting with model.learn(...)) starts and executes the whole training procedure as described below.

  8. After training ends, save the final model checkpoint named ‘final’.

  9. Close the environment, which is important to avoid multiple problems.

    Warning

    Always close the environment before ending a script in which a gym environment was instantiated.

  10. At this point, we used to evaluate the model and record videos of the trained agent. After the migration from SB2 to SB3, however, video recording on a remote server broke. It should still work when training on a laptop or PC connected to a display. The performance evaluation in eval.py might also no longer work. Use with caution.

    Warning

    There are two different evaluation procedures!

    • One is performed during training every N steps. This evaluation procedure is defined in common/callback.py. Here, we evaluate the walking performance of the agent and log it to TB.

    • The other used to be performed at the end of the training to evaluate the final performance of the agent. It is defined in eval.py and is currently broken. To evaluate your agent after the training, please use or extend run.py.
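The running normalization from step 2.4 of the training script can be illustrated with a small, dependency-free sketch. It is not SB3's code: the class below is a hypothetical stand-in that keeps a per-dimension running mean and variance, updates them from batches of observations (one row per parallel environment), and standardizes and clips a new observation. The prior count of 1e-4 and the clipping range of 10 are assumptions chosen for illustration.

```python
import math


class RunningMeanStd:
    """Per-dimension running mean and variance, updated from batches."""

    def __init__(self, dim, epsilon=1e-4):
        self.mean = [0.0] * dim
        self.var = [1.0] * dim
        self.count = epsilon  # small prior count avoids division by zero

    def update(self, batch):
        """Merge a batch of observations (one row per environment)."""
        n = len(batch)
        total = self.count + n
        for d in range(len(self.mean)):
            column = [row[d] for row in batch]
            b_mean = sum(column) / n
            b_var = sum((x - b_mean) ** 2 for x in column) / n
            delta = b_mean - self.mean[d]
            # Parallel-variance formula for combining two sample sets.
            m2 = (self.var[d] * self.count + b_var * n
                  + delta ** 2 * self.count * n / total)
            self.mean[d] += delta * n / total
            self.var[d] = m2 / total
        self.count = total

    def normalize(self, obs, clip=10.0, eps=1e-8):
        """Standardize one observation and clip it to [-clip, clip]."""
        out = []
        for x, m, v in zip(obs, self.mean, self.var):
            z = (x - m) / math.sqrt(v + eps)
            out.append(max(-clip, min(clip, z)))
        return out


stats = RunningMeanStd(dim=2)
stats.update([[1.0, 10.0], [3.0, 30.0]])  # observations from two parallel envs
stats.update([[5.0, 50.0], [7.0, 70.0]])
normalized = stats.normalize([4.0, 40.0])  # close to [0.0, 0.0]
```

Because the statistics are part of the wrapper's state, a saved policy is only meaningful together with the statistics it was trained with; this is one reason to store them next to the model checkpoints.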

Training Procedure

The training procedure is almost fully managed internally by Stable Baselines 3 (SB3). Our only touchpoint with it is the callback implemented in common/callback.py. The callback allows us to execute custom code before the training (_on_training_start()), after the training (_on_training_end()), and on every step taken in the environment (_on_step()). The first two methods are used to set up and close the SummaryWriter that logs custom metrics to TensorBoard. The most interesting interactions with the agent being trained happen in the last method. In the following, we first outline the overall training procedure as implemented in SB3 and then explain our interactions through the callback.
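The order in which the three hooks are called can be sketched without any dependencies. The classes below are hypothetical stand-ins: RecordingWriter replaces the real SummaryWriter, TrainingCallbackSketch does not subclass SB3's callback base class, and run_training only imitates the way a training loop drives the hooks. The tag name is made up for illustration.

```python
class RecordingWriter:
    """Hypothetical stand-in for a TensorBoard SummaryWriter."""

    def __init__(self):
        self.scalars = []
        self.closed = False

    def add_scalar(self, tag, value, step):
        self.scalars.append((tag, value, step))

    def close(self):
        self.closed = True


class TrainingCallbackSketch:
    """Mirrors the three hooks described above (illustration only)."""

    def __init__(self):
        self.writer = None
        self.num_timesteps = 0

    def _on_training_start(self):
        # Called once before the first environment step.
        self.writer = RecordingWriter()

    def _on_step(self):
        # Called after every environment step; returning False stops training.
        self.writer.add_scalar("train/step_marker", 1.0, self.num_timesteps)
        return True

    def _on_training_end(self):
        # Called once after the last environment step.
        self.writer.close()


def run_training(callback, total_steps):
    """Toy driver that calls the hooks in the order a training loop would."""
    callback._on_training_start()
    for step in range(1, total_steps + 1):
        callback.num_timesteps = step
        if not callback._on_step():
            break
    callback._on_training_end()


callback = TrainingCallbackSketch()
run_training(callback, total_steps=3)
```

After the run, the writer holds one entry per step and has been closed, which is the lifecycle the first two hooks are responsible for.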

SB3 Training Loop

  1. Get the state observations s of the n parallel environments.

  2. Pass the observations through the value function network and the policy network to estimate the state values and get an action a for each of the parallel environments.

  3. Clip the actions to a specified range.

  4. Execute the actions in all parallel environments and get the rewards r and the new observations s’.

  5. Store the experience tuples (s,a,r,s’) in a batch.

  6. Continue steps above until the batch is full, i.e. enough experiences are collected to update the policy.

    Note

    Collecting enough experiences with the same policy until the batch is full is often referred to as a policy rollout.

  7. Pause the experience collection to update the policy. Sample minibatches of experiences from the batch and perform a policy update on each of them. Repeat this for noptepochs epochs.

  8. Repeat the steps above with the updated policy until the specified number of steps (mio_samples) has been collected, at which point the training stops.
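The control flow of the steps above can be sketched in plain Python. This is a toy illustration, not SB3's implementation: the policy is a random-number placeholder, the environment dynamics are invented, the value function is omitted, and the "update" only counts minibatches instead of computing a PPO loss. All function names and numbers are assumptions.

```python
import random


def collect_rollout(observations, policy, step_env, n_steps, low=-1.0, high=1.0):
    """Steps 1-6: gather n_steps transitions from every parallel environment."""
    batch = []
    for _ in range(n_steps):
        next_observations = []
        for env_id, s in enumerate(observations):
            a = max(low, min(high, policy(s)))  # step 3: clip the action
            r, s_next = step_env(env_id, s, a)  # step 4: act in the environment
            batch.append((s, a, r, s_next))     # step 5: store the experience
            next_observations.append(s_next)
        observations = next_observations
    return batch, observations


def iterate_minibatches(batch, minibatch_size, n_epochs, rng):
    """Step 7: yield shuffled minibatches, passing over the batch n_epochs times."""
    for _ in range(n_epochs):
        indices = list(range(len(batch)))
        rng.shuffle(indices)
        for start in range(0, len(indices), minibatch_size):
            yield [batch[i] for i in indices[start:start + minibatch_size]]


rng = random.Random(0)


def toy_policy(s):
    return rng.gauss(0.0, 2.0)  # placeholder for the policy network


def toy_step(env_id, s, a):
    return -abs(a), s + a  # invented reward and dynamics


observations = [0.0] * 4  # four parallel environments
total_steps, collected, updates = 64, 0, 0
while collected < total_steps:  # step 8: repeat until enough samples
    batch, observations = collect_rollout(observations, toy_policy, toy_step, n_steps=8)
    collected += len(batch)
    for minibatch in iterate_minibatches(batch, minibatch_size=16, n_epochs=3, rng=rng):
        updates += 1  # a real implementation would take a gradient step here
```

With four environments and eight steps per rollout, each batch holds 32 experiences, so two rollouts reach the 64-step budget and each rollout is followed by 3 epochs of 2 minibatches.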

Our Interactions with the Training via the Callback: Training Evaluation

_on_step() is used to log training performance to TB and Weights & Biases (W&B) every 100 steps (self.skipped_steps). In addition, it performs a more thorough evaluation of the model's walking performance every 400k steps (EVAL_INTERVAL). This evaluation is described in more detail below.
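One way to implement such interval checks is sketched below. It is an assumption, not the code from common/callback.py: the text does not say how the steps are counted. The sketch accounts for the fact that SB3's num_timesteps counter grows by the number of parallel environments on every callback call, so a plain modulo test could skip exact multiples of the interval. The helper class IntervalTrigger is hypothetical.

```python
class IntervalTrigger:
    """Fires once each time the step counter crosses a multiple of `interval`."""

    def __init__(self, interval):
        self.interval = interval
        self.last_bucket = 0

    def __call__(self, num_timesteps):
        bucket = num_timesteps // self.interval
        if bucket > self.last_bucket:
            self.last_bucket = bucket
            return True
        return False


log_trigger = IntervalTrigger(100)        # logging interval from the text
eval_trigger = IntervalTrigger(400_000)   # evaluation interval from the text

n_envs = 16  # assumed number of parallel environments
log_count = 0
eval_count = 0
for num_timesteps in range(n_envs, 1_200_000 + 1, n_envs):
    if log_trigger(num_timesteps):
        log_count += 1   # here the callback would write training metrics
    if eval_trigger(num_timesteps):
        eval_count += 1  # here the callback would start the evaluation
```

Over 1.2 million simulated steps, the logging trigger fires once per 100-step interval and the evaluation trigger fires three times, even though the counter advances in jumps of 16.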

To evaluate the walking performance, every 400k steps, we

  • pause the training and save the current model

  • load the current model with the corresponding environment in a new thread

  • evaluate the model for 20 episodes and

  • record multiple metrics like the walked distance (before falling or episode end), average speed, average episode duration before falling etc. These metrics are then all uploaded to TB and W&B in the _det_eval section.
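Aggregating the per-episode results into the logged metrics could look like the following sketch. The record fields, units, and metric names are assumptions made for illustration; only the _det_eval section name comes from the text. The example uses two episodes instead of 20 to keep the numbers easy to follow.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EpisodeResult:
    distance_m: float   # walked distance before falling or episode end (assumed unit)
    duration_s: float   # episode duration (assumed unit)
    fell: bool          # whether the episode ended with a fall


def summarize(episodes, prefix="_det_eval"):
    """Average per-episode results into scalars keyed by their logging tag."""
    return {
        f"{prefix}/mean_distance": mean(e.distance_m for e in episodes),
        f"{prefix}/mean_speed": mean(e.distance_m / e.duration_s for e in episodes),
        f"{prefix}/mean_duration": mean(e.duration_s for e in episodes),
        f"{prefix}/fall_rate": mean(1.0 if e.fell else 0.0 for e in episodes),
    }


episodes = [
    EpisodeResult(distance_m=10.0, duration_s=10.0, fell=False),
    EpisodeResult(distance_m=4.0, duration_s=5.0, fell=True),
]
summary = summarize(episodes)
```

Each entry of the resulting dictionary can then be written as one scalar to TB and W&B under its tag.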