docs/Python-On-Off-Policy-Trainer-Documentation.md
<a name="mlagents.trainers.trainer.on_policy_trainer"></a>
<a name="mlagents.trainers.trainer.on_policy_trainer.OnPolicyTrainer"></a>
class OnPolicyTrainer(RLTrainer)
Base class for on-policy training; the PPOTrainer, an implementation of the PPO algorithm, is built on this class.
<a name="mlagents.trainers.trainer.on_policy_trainer.OnPolicyTrainer.__init__"></a>
| __init__(behavior_name: str, reward_buff_cap: int, trainer_settings: TrainerSettings, training: bool, load: bool, seed: int, artifact_path: str)
Responsible for collecting experiences and training an on-policy model.
Arguments:
- behavior_name: The name of the behavior associated with trainer config.
- reward_buff_cap: Max reward history to track in the reward buffer.
- trainer_settings: The parameters for the trainer.
- training: Whether the trainer is set for training.
- load: Whether the model should be loaded.
- seed: The seed the model will be initialized with.
- artifact_path: The directory within which to store artifacts from this trainer.

<a name="mlagents.trainers.trainer.on_policy_trainer.OnPolicyTrainer.add_policy"></a>
| add_policy(parsed_behavior_id: BehaviorIdentifiers, policy: Policy) -> None
Adds policy to trainer.
Arguments:
- parsed_behavior_id: Behavior identifiers that the policy should belong to.
- policy: Policy to associate with name_behavior_id.

<a name="mlagents.trainers.trainer.off_policy_trainer"></a>
<a name="mlagents.trainers.trainer.off_policy_trainer.OffPolicyTrainer"></a>
class OffPolicyTrainer(RLTrainer)
Base class for off-policy training; the SACTrainer, an implementation of the SAC algorithm with support for discrete actions and recurrent networks, is built on this class.
<a name="mlagents.trainers.trainer.off_policy_trainer.OffPolicyTrainer.__init__"></a>
| __init__(behavior_name: str, reward_buff_cap: int, trainer_settings: TrainerSettings, training: bool, load: bool, seed: int, artifact_path: str)
Responsible for collecting experiences and training an off-policy model.
Arguments:
- behavior_name: The name of the behavior associated with trainer config.
- reward_buff_cap: Max reward history to track in the reward buffer.
- trainer_settings: The parameters for the trainer.
- training: Whether the trainer is set for training.
- load: Whether the model should be loaded.
- seed: The seed the model will be initialized with.
- artifact_path: The directory within which to store artifacts from this trainer.

<a name="mlagents.trainers.trainer.off_policy_trainer.OffPolicyTrainer.save_model"></a>
| save_model() -> None
Saves the final training model. Overrides the default implementation to also save the replay buffer.
<a name="mlagents.trainers.trainer.off_policy_trainer.OffPolicyTrainer.save_replay_buffer"></a>
| save_replay_buffer() -> None
Save the training buffer's update buffer to a pickle file.
<a name="mlagents.trainers.trainer.off_policy_trainer.OffPolicyTrainer.load_replay_buffer"></a>
| load_replay_buffer() -> None
Loads the last saved replay buffer from a file.
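The save/load round trip above amounts to pickling the update buffer into the artifact directory and reading it back. A minimal sketch of that mechanism, using a plain list of transitions and a temporary file (the file name and buffer layout here are illustrative, not the library's):

```python
import os
import pickle
import tempfile

# Stand-in for the trainer's update buffer: a list of transition dicts.
buffer = [{"obs": [0.1, 0.2], "action": 1, "reward": 0.5}]
path = os.path.join(tempfile.mkdtemp(), "last_replay_buffer.pkl")

# save_replay_buffer: serialize the buffer to disk.
with open(path, "wb") as f:
    pickle.dump(buffer, f)

# load_replay_buffer: restore the last saved buffer from the same path.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Persisting the buffer this way is what lets an off-policy run resume training without discarding previously collected experience.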
<a name="mlagents.trainers.trainer.off_policy_trainer.OffPolicyTrainer.add_policy"></a>
| add_policy(parsed_behavior_id: BehaviorIdentifiers, policy: Policy) -> None
Adds policy to trainer.
<a name="mlagents.trainers.trainer.rl_trainer"></a>
<a name="mlagents.trainers.trainer.rl_trainer.RLTrainer"></a>
class RLTrainer(Trainer)
This class is the base class for trainers that use Reward Signals.
<a name="mlagents.trainers.trainer.rl_trainer.RLTrainer.end_episode"></a>
| end_episode() -> None
A signal that the episode has ended. The buffer must be reset. Only called when the academy resets.
<a name="mlagents.trainers.trainer.rl_trainer.RLTrainer.create_optimizer"></a>
| @abc.abstractmethod
| create_optimizer() -> TorchOptimizer
Creates an Optimizer object
<a name="mlagents.trainers.trainer.rl_trainer.RLTrainer.save_model"></a>
| save_model() -> None
Saves the policy associated with this trainer.
<a name="mlagents.trainers.trainer.rl_trainer.RLTrainer.advance"></a>
| advance() -> None
Steps the trainer, taking in trajectories and updating if ready. Will block and wait briefly if there are no trajectories.
<a name="mlagents.trainers.trainer.trainer"></a>
<a name="mlagents.trainers.trainer.trainer.Trainer"></a>
class Trainer(abc.ABC)
This class is the base class for the trainers in mlagents.trainers.
<a name="mlagents.trainers.trainer.trainer.Trainer.__init__"></a>
| __init__(brain_name: str, trainer_settings: TrainerSettings, training: bool, load: bool, artifact_path: str, reward_buff_cap: int = 1)
Responsible for collecting experiences and training a neural network model.
Arguments:
- brain_name: Brain name of the brain to be trained.
- trainer_settings: The parameters for the trainer (dictionary).
- training: Whether the trainer is set for training.
- artifact_path: The directory within which to store artifacts from this trainer.
- reward_buff_cap: Max reward history to track in the reward buffer.

<a name="mlagents.trainers.trainer.trainer.Trainer.stats_reporter"></a>
| @property
| stats_reporter()
Returns the stats reporter associated with this Trainer.
<a name="mlagents.trainers.trainer.trainer.Trainer.parameters"></a>
| @property
| parameters() -> TrainerSettings
Returns the trainer parameters of the trainer.
<a name="mlagents.trainers.trainer.trainer.Trainer.get_max_steps"></a>
| @property
| get_max_steps() -> int
Returns the maximum number of steps. Is used to know when the trainer should be stopped.
Returns:
The maximum number of steps of the trainer
<a name="mlagents.trainers.trainer.trainer.Trainer.get_step"></a>
| @property
| get_step() -> int
Returns the number of steps the trainer has performed
Returns:
the step count of the trainer
<a name="mlagents.trainers.trainer.trainer.Trainer.threaded"></a>
| @property
| threaded() -> bool
Whether or not to run the trainer in a thread. True allows the trainer to update the policy while the environment is taking steps. Set to False to enforce strict on-policy updates (i.e. don't update the policy when taking steps.)
<a name="mlagents.trainers.trainer.trainer.Trainer.should_still_train"></a>
| @property
| should_still_train() -> bool
Returns whether or not the trainer should train. A Trainer could stop training if it wasn't training to begin with, or if max_steps is reached.
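The two stopping conditions described above can be sketched as a single boolean expression; the function and parameter names here are stand-ins for the trainer's internal state, not the library's API:

```python
# Hypothetical sketch of the should_still_train check: training must be
# enabled, and the trainer's step count must still be below max_steps.
def should_still_train(is_training: bool, step: int, max_steps: int) -> bool:
    return is_training and step < max_steps

assert should_still_train(True, 10, 100)        # still training
assert not should_still_train(True, 100, 100)   # max_steps reached
assert not should_still_train(False, 0, 100)    # never set to train
```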
<a name="mlagents.trainers.trainer.trainer.Trainer.reward_buffer"></a>
| @property
| reward_buffer() -> Deque[float]
Returns the reward buffer. The reward buffer contains the cumulative rewards of the most recent episodes completed by agents using this trainer.
Returns:
the reward buffer.
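A capped deque makes the "most recent episodes" behavior concrete: with `maxlen` set to `reward_buff_cap`, appending beyond capacity silently drops the oldest episode's reward. The cap value below is illustrative:

```python
from collections import deque
from typing import Deque

reward_buff_cap = 3  # illustrative cap; the real value comes from TrainerSettings
reward_buffer: Deque[float] = deque(maxlen=reward_buff_cap)

# Completing a fourth episode evicts the first episode's reward.
for episode_reward in [1.0, 2.0, 3.0, 4.0]:
    reward_buffer.append(episode_reward)

recent = list(reward_buffer)  # [2.0, 3.0, 4.0]
```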
<a name="mlagents.trainers.trainer.trainer.Trainer.save_model"></a>
| @abc.abstractmethod
| save_model() -> None
Saves model file(s) for the policy or policies associated with this trainer.
<a name="mlagents.trainers.trainer.trainer.Trainer.end_episode"></a>
| @abc.abstractmethod
| end_episode()
A signal that the episode has ended. The buffer must be reset. Only called when the academy resets.
<a name="mlagents.trainers.trainer.trainer.Trainer.create_policy"></a>
| @abc.abstractmethod
| create_policy(parsed_behavior_id: BehaviorIdentifiers, behavior_spec: BehaviorSpec) -> Policy
Creates a Policy object
<a name="mlagents.trainers.trainer.trainer.Trainer.add_policy"></a>
| @abc.abstractmethod
| add_policy(parsed_behavior_id: BehaviorIdentifiers, policy: Policy) -> None
Adds policy to trainer.
<a name="mlagents.trainers.trainer.trainer.Trainer.get_policy"></a>
| get_policy(name_behavior_id: str) -> Policy
Gets policy associated with name_behavior_id
Arguments:
- name_behavior_id: Fully qualified behavior name.

Returns:
Policy associated with name_behavior_id
<a name="mlagents.trainers.trainer.trainer.Trainer.advance"></a>
| @abc.abstractmethod
| advance() -> None
Advances the trainer. Typically, this means grabbing trajectories from all subscribed trajectory queues (self.trajectory_queues), and updating a policy using the steps in them, and if needed pushing a new policy onto the right policy queues (self.policy_queues).
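The contract described above can be sketched as a drain-update-publish loop. All names below are stand-ins (plain `queue.Queue` instead of AgentManagerQueue, a dict instead of a Policy), shown only to illustrate the data flow:

```python
import queue
from typing import List

def advance(trajectory_queues: List[queue.Queue],
            policy_queues: List[queue.Queue]) -> int:
    """Drain every subscribed trajectory queue, then publish a new 'policy'."""
    steps = 0
    for tq in trajectory_queues:
        while not tq.empty():
            trajectory = tq.get_nowait()
            steps += len(trajectory)       # consume the trajectory's steps
    if steps:                              # an update happened...
        for pq in policy_queues:
            pq.put({"version": steps})     # ...so push the updated "policy"
    return steps

tq, pq = queue.Queue(), queue.Queue()
tq.put([1, 2, 3])                          # one trajectory of three steps
n_steps = advance([tq], [pq])
new_policy = pq.get_nowait()
```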
<a name="mlagents.trainers.trainer.trainer.Trainer.publish_policy_queue"></a>
| publish_policy_queue(policy_queue: AgentManagerQueue[Policy]) -> None
Adds a policy queue to the list of queues to publish to when this Trainer makes a policy update
Arguments:
- policy_queue: Policy queue to publish to.

<a name="mlagents.trainers.trainer.trainer.Trainer.subscribe_trajectory_queue"></a>
| subscribe_trajectory_queue(trajectory_queue: AgentManagerQueue[Trajectory]) -> None
Adds a trajectory queue to the list of queues for the trainer to ingest Trajectories from.
Arguments:
- trajectory_queue: Trajectory queue to read from.

<a name="mlagents.trainers.settings"></a>
<a name="mlagents.trainers.settings.deep_update_dict"></a>
deep_update_dict(d: Dict, update_d: Mapping) -> None
Similar to dict.update(), but works for nested dicts of dicts as well.
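A minimal sketch of that behavior: like `dict.update()`, but recursing into values that are themselves dicts instead of replacing them wholesale. This mirrors the documented contract; it is not the library's exact source:

```python
from collections.abc import Mapping

def deep_update_dict(d: dict, update_d: Mapping) -> None:
    """Update d in place, merging nested dicts instead of overwriting them."""
    for key, value in update_d.items():
        if key in d and isinstance(d[key], dict) and isinstance(value, Mapping):
            deep_update_dict(d[key], value)  # recurse into nested dicts
        else:
            d[key] = value                   # plain update for leaf values

config = {"hyperparameters": {"learning_rate": 3e-4, "batch_size": 64}}
deep_update_dict(config, {"hyperparameters": {"batch_size": 128}})
# learning_rate is preserved; only batch_size changes
```

This is why a partial trainer-config override can change one hyperparameter without wiping out the rest of the nested defaults.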
<a name="mlagents.trainers.settings.RewardSignalSettings"></a>
@attr.s(auto_attribs=True)
class RewardSignalSettings()
<a name="mlagents.trainers.settings.RewardSignalSettings.structure"></a>
| @staticmethod
| structure(d: Mapping, t: type) -> Any
Helper method to structure a Dict into RewardSignalSettings classes. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure(). This is needed to handle the special Enum selection of RewardSignalSettings classes.
<a name="mlagents.trainers.settings.ParameterRandomizationSettings"></a>
@attr.s(auto_attribs=True)
class ParameterRandomizationSettings(abc.ABC)
<a name="mlagents.trainers.settings.ParameterRandomizationSettings.__str__"></a>
| __str__() -> str
Helper method to output sampler stats to console.
<a name="mlagents.trainers.settings.ParameterRandomizationSettings.structure"></a>
| @staticmethod
| structure(d: Union[Mapping, float], t: type) -> "ParameterRandomizationSettings"
Helper method to structure a ParameterRandomizationSettings class. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure(). This is needed to handle the special Enum selection of ParameterRandomizationSettings classes.
<a name="mlagents.trainers.settings.ParameterRandomizationSettings.unstructure"></a>
| @staticmethod
| unstructure(d: "ParameterRandomizationSettings") -> Mapping
Helper method to unstructure a ParameterRandomizationSettings class. Meant to be registered with cattr.register_unstructure_hook() and called with cattr.unstructure().
<a name="mlagents.trainers.settings.ParameterRandomizationSettings.apply"></a>
| @abc.abstractmethod
| apply(key: str, env_channel: EnvironmentParametersChannel) -> None
Helper method to send sampler settings over the EnvironmentParametersChannel. Calls the appropriate sampler type's set method.
Arguments:
- key: The environment parameter to be sampled.
- env_channel: The EnvironmentParametersChannel used to communicate sampler settings to the environment.

<a name="mlagents.trainers.settings.ConstantSettings"></a>
@attr.s(auto_attribs=True)
class ConstantSettings(ParameterRandomizationSettings)
<a name="mlagents.trainers.settings.ConstantSettings.__str__"></a>
| __str__() -> str
Helper method to output sampler stats to console.
<a name="mlagents.trainers.settings.ConstantSettings.apply"></a>
| apply(key: str, env_channel: EnvironmentParametersChannel) -> None
Helper method to send sampler settings over the EnvironmentParametersChannel. Calls the constant sampler type's set method.
Arguments:
- key: The environment parameter to be sampled.
- env_channel: The EnvironmentParametersChannel used to communicate sampler settings to the environment.

<a name="mlagents.trainers.settings.UniformSettings"></a>
@attr.s(auto_attribs=True)
class UniformSettings(ParameterRandomizationSettings)
<a name="mlagents.trainers.settings.UniformSettings.__str__"></a>
| __str__() -> str
Helper method to output sampler stats to console.
<a name="mlagents.trainers.settings.UniformSettings.apply"></a>
| apply(key: str, env_channel: EnvironmentParametersChannel) -> None
Helper method to send sampler settings over the EnvironmentParametersChannel. Calls the uniform sampler type's set method.
Arguments:
- key: The environment parameter to be sampled.
- env_channel: The EnvironmentParametersChannel used to communicate sampler settings to the environment.

<a name="mlagents.trainers.settings.GaussianSettings"></a>
@attr.s(auto_attribs=True)
class GaussianSettings(ParameterRandomizationSettings)
<a name="mlagents.trainers.settings.GaussianSettings.__str__"></a>
| __str__() -> str
Helper method to output sampler stats to console.
<a name="mlagents.trainers.settings.GaussianSettings.apply"></a>
| apply(key: str, env_channel: EnvironmentParametersChannel) -> None
Helper method to send sampler settings over the EnvironmentParametersChannel. Calls the Gaussian sampler type's set method.
Arguments:
- key: The environment parameter to be sampled.
- env_channel: The EnvironmentParametersChannel used to communicate sampler settings to the environment.

<a name="mlagents.trainers.settings.MultiRangeUniformSettings"></a>
@attr.s(auto_attribs=True)
class MultiRangeUniformSettings(ParameterRandomizationSettings)
<a name="mlagents.trainers.settings.MultiRangeUniformSettings.__str__"></a>
| __str__() -> str
Helper method to output sampler stats to console.
<a name="mlagents.trainers.settings.MultiRangeUniformSettings.apply"></a>
| apply(key: str, env_channel: EnvironmentParametersChannel) -> None
Helper method to send sampler settings over the EnvironmentParametersChannel. Calls the multirangeuniform sampler type's set method.
Arguments:
- key: The environment parameter to be sampled.
- env_channel: The EnvironmentParametersChannel used to communicate sampler settings to the environment.

<a name="mlagents.trainers.settings.CompletionCriteriaSettings"></a>
@attr.s(auto_attribs=True)
class CompletionCriteriaSettings()
CompletionCriteriaSettings contains the information needed to figure out if the next lesson must start.
<a name="mlagents.trainers.settings.CompletionCriteriaSettings.need_increment"></a>
| need_increment(progress: float, reward_buffer: List[float], smoothing: float) -> Tuple[bool, float]
Given measures, this method returns a boolean indicating if the lesson needs to change now, and a float corresponding to the new smoothed value.
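One plausible reading of that contract is exponential smoothing of the mean episode reward, with the lesson advancing once the smoothed value clears a threshold. The threshold and the 0.25/0.75 weights below are illustrative assumptions, not the library's actual constants:

```python
from typing import List, Tuple

def need_increment(progress: float, reward_buffer: List[float],
                   smoothing: float, threshold: float = 0.9) -> Tuple[bool, float]:
    """Return (advance_lesson?, new_smoothed_value). Sketch only."""
    if not reward_buffer:
        return False, smoothing            # no completed episodes yet
    measure = sum(reward_buffer) / len(reward_buffer)
    smoothing = 0.25 * smoothing + 0.75 * measure  # damp the noisy measure
    return smoothing >= threshold, smoothing

should_advance, new_smoothing = need_increment(0.5, [1.0, 1.0], smoothing=0.8)
# 0.25 * 0.8 + 0.75 * 1.0 = 0.95, which clears the 0.9 threshold
```

Smoothing matters here because per-episode rewards are noisy; without it a single lucky episode could trigger a premature lesson change.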
<a name="mlagents.trainers.settings.Lesson"></a>
@attr.s(auto_attribs=True)
class Lesson()
Gathers the data of one lesson for one environment parameter including its name, the condition that must be fulfilled for the lesson to be completed and a sampler for the environment parameter. If the completion_criteria is None, then this is the last lesson in the curriculum.
<a name="mlagents.trainers.settings.EnvironmentParameterSettings"></a>
@attr.s(auto_attribs=True)
class EnvironmentParameterSettings()
EnvironmentParameterSettings is an ordered list of lessons for one environment parameter.
<a name="mlagents.trainers.settings.EnvironmentParameterSettings.structure"></a>
| @staticmethod
| structure(d: Mapping, t: type) -> Dict[str, "EnvironmentParameterSettings"]
Helper method to structure a Dict of EnvironmentParameterSettings class. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure().
<a name="mlagents.trainers.settings.TrainerSettings"></a>
@attr.s(auto_attribs=True)
class TrainerSettings(ExportableSettings)
<a name="mlagents.trainers.settings.TrainerSettings.structure"></a>
| @staticmethod
| structure(d: Mapping, t: type) -> Any
Helper method to structure a TrainerSettings class. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure().
<a name="mlagents.trainers.settings.CheckpointSettings"></a>
@attr.s(auto_attribs=True)
class CheckpointSettings()
<a name="mlagents.trainers.settings.CheckpointSettings.prioritize_resume_init"></a>
| prioritize_resume_init() -> None
Prioritize explicit command-line resume/init over conflicting YAML options. If both resume and init are set in the same place, use resume.
<a name="mlagents.trainers.settings.RunOptions"></a>
@attr.s(auto_attribs=True)
class RunOptions(ExportableSettings)
<a name="mlagents.trainers.settings.RunOptions.from_argparse"></a>
| @staticmethod
| from_argparse(args: argparse.Namespace) -> "RunOptions"
Takes an argparse.Namespace as specified in parse_command_line, loads input configuration files
from file paths, and converts to a RunOptions instance.
Arguments:
- args: Collection of command-line parameters passed to mlagents-learn.

Returns:
RunOptions representing the passed in arguments, with trainer config, curriculum and sampler configs loaded from files.
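The merge described above (defaults, overridden by whatever the CLI actually supplied) can be sketched with a plain dict; the real RunOptions is an attrs class structured via cattr with many more fields, so everything below is a simplified stand-in:

```python
import argparse

def from_argparse_sketch(args: argparse.Namespace) -> dict:
    """Merge CLI-supplied values over a dict of defaults. Sketch only."""
    run_options = {"seed": -1, "num_envs": 1}                 # illustrative defaults
    cli = {k: v for k, v in vars(args).items() if v is not None}
    run_options.update(cli)                                   # CLI wins over defaults
    return run_options

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=None)
ns = parser.parse_args(["--seed", "7"])
opts = from_argparse_sketch(ns)
```

Leaving the argparse defaults as `None` is the trick that lets "user didn't pass this flag" be distinguished from "user passed the default value".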