Agents
======
CrossEntropyAgent
-----------------

.. autoclass:: numpy_ml.rl_models.agents.CrossEntropyAgent
    :members:
    :undoc-members:
    :inherited-members:
DynaAgent
---------

.. autoclass:: numpy_ml.rl_models.agents.DynaAgent
    :members:
    :undoc-members:
    :inherited-members:
MonteCarloAgent
---------------

Monte Carlo methods solve RL problems by averaging the sample returns observed
for each state-action pair. Parameters are updated only at the completion of an
episode.
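To illustrate the episode-level update, here is a minimal sketch of a
first-visit Monte Carlo update for a state-value table. The ``V`` and
``returns`` tables and the ``(state, reward)`` episode format are illustrative
assumptions, not the internals of ``MonteCarloAgent``:

.. code-block:: python

    from collections import defaultdict

    def first_visit_mc_update(V, returns, episode, gamma=0.99):
        """Update state-value estimates from one completed episode.

        `episode` is a list of (state, reward) pairs, where `reward` is the
        reward received after leaving `state`. Each state's value is the
        average of the returns observed after its first visit per episode.
        """
        # Index of the first time step at which each state appears.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        # Sweep backwards so G is the discounted return from step t onward.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])

    V, returns = defaultdict(float), defaultdict(list)
    first_visit_mc_update(V, returns, episode=[("s0", 0.0), ("s1", 1.0)])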
In on-policy learning, the agent maintains a single policy that it updates over
the course of training. For this policy to converge to a (near-) optimal
policy, it must assign non-zero probability to *all* state-action pairs
throughout training so that exploration never stops.
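A standard way to meet this requirement is an :math:`\epsilon`-soft (here,
:math:`\epsilon`-greedy) policy, which reserves probability mass for every
action. A sketch, assuming a hypothetical tabular ``Q`` keyed by
``(state, action)`` (not part of the agent's public API):

.. code-block:: python

    from collections import defaultdict

    import numpy as np

    def epsilon_soft_probs(Q, state, n_actions, epsilon=0.1):
        """Action probabilities under an epsilon-greedy (epsilon-soft) policy.

        Every action keeps at least epsilon / n_actions probability, so all
        state-action pairs continue to be visited during training.
        """
        probs = np.full(n_actions, epsilon / n_actions)
        greedy = int(np.argmax([Q[(state, a)] for a in range(n_actions)]))
        probs[greedy] += 1.0 - epsilon
        return probs

    Q = defaultdict(float)
    Q[("s0", 2)] = 1.0
    print(epsilon_soft_probs(Q, "s0", n_actions=4))  # [0.025 0.025 0.925 0.025]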
In off-policy learning, the agent maintains two separate policies:

1. A **behavior policy**, used to select actions and generate experience during
   training.
2. A **target policy**, which the agent evaluates and improves using the
   experience collected under the behavior policy.
Off-policy methods typically exhibit greater variance and converge more slowly
than on-policy methods; in exchange, they are more powerful and general.
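Much of that extra variance comes from the importance-sampling correction that
reweights returns generated under the behavior policy so they remain valid
estimates for the target policy. A sketch of ordinary importance sampling over
a single episode; the policy callables and episode format here are assumptions
for illustration:

.. code-block:: python

    def importance_weighted_return(episode, pi_target, pi_behavior, gamma=0.99):
        """Ordinary importance sampling over one episode.

        `episode` is a list of (state, action, reward) tuples generated by the
        behavior policy; `pi_target(s, a)` and `pi_behavior(s, a)` return the
        probability each policy assigns to action `a` in state `s`. The ratio
        `rho` reweights the return so its expectation matches the target policy.
        """
        G, rho = 0.0, 1.0
        for t, (s, a, r) in enumerate(episode):
            G += gamma ** t * r
            rho *= pi_target(s, a) / pi_behavior(s, a)
        return rho * G

Because ``rho`` multiplies one likelihood ratio per step, its variance grows
with episode length, which is one reason off-policy Monte Carlo estimates tend
to converge slowly.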
.. autoclass:: numpy_ml.rl_models.agents.MonteCarloAgent
    :members:
    :undoc-members:
    :inherited-members:
TemporalDifferenceAgent
-----------------------

Temporal difference methods are examples of bootstrapping in that they update
their estimate for the value of state :math:`s` on the basis of a previous
estimate.
Advantages of TD algorithms:
1. They can be applied online, after every step, rather than only at the
   completion of an episode.
2. In practice, they tend to converge faster than constant-:math:`\alpha`
   Monte Carlo methods on stochastic tasks.

.. autoclass:: numpy_ml.rl_models.agents.TemporalDifferenceAgent
    :members:
    :undoc-members:
    :inherited-members:
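For contrast with the Monte Carlo update above, here is a minimal sketch of the
tabular TD(0) state-value update; the variable names are illustrative, not the
agent's internals:

.. code-block:: python

    from collections import defaultdict

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
        """Tabular TD(0): move V(s) toward the target r + gamma * V(s').

        The current estimate V(s') stands in for the rest of the actual
        return, so the update can be applied online after every step.
        """
        td_target = r + gamma * V[s_next]
        V[s] += alpha * (td_target - V[s])

    V = defaultdict(float)
    td0_update(V, s="s0", r=1.0, s_next="s1")
    print(V["s0"])  # 0.1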