Implementation of Decision Making Algorithms in OpenAI Parking Environment




3.1. Markov Decision Process (MDP)

An MDP is defined by a tuple (S, A, Pₐ, Rₐ), where:

  • S is the set of states, called the state space,
  • A is the action space, the set of all possible actions,
  • Pₐ(s, s’) is the probability that action a in state s at time t will lead to state s’ at time t+1,
  • Rₐ(s, s’) is the immediate reward (or expected immediate reward) received after transitioning from state s to state s’ due to action a.
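As a minimal sketch, the tuple above can be written out for a toy problem; the states, actions, transition probabilities, and rewards below are purely illustrative and are not the parking environment's:

```python
import random

# A toy MDP: states, actions, transition probabilities P_a(s, s'),
# and rewards R_a(s, s') for moving from s to s' under action a.
states = ["far", "near", "parked"]
actions = ["steer", "accelerate"]

# P[(s, a)] -> list of (s_next, probability) pairs summing to 1
P = {
    ("far", "accelerate"): [("near", 0.8), ("far", 0.2)],
    ("far", "steer"): [("far", 1.0)],
    ("near", "steer"): [("parked", 0.6), ("near", 0.4)],
    ("near", "accelerate"): [("far", 1.0)],
    ("parked", "steer"): [("parked", 1.0)],
    ("parked", "accelerate"): [("parked", 1.0)],
}

# R[(s, a, s_next)] -> immediate reward (sparse: zero unless listed)
R = {("near", "steer", "parked"): 1.0}

def step(s, a, rng=random):
    """Sample s' from P_a(s, .) and return (s', immediate reward)."""
    nexts, probs = zip(*P[(s, a)])
    s_next = rng.choices(nexts, weights=probs)[0]
    return s_next, R.get((s, a, s_next), 0.0)
```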

Model-Based Reinforcement Learning

Model-based RL proceeds in two steps:

  1. Model learning: a model f_θ ≃ f of the environment dynamics is learned by regression on interaction data.
  2. Planning: the learned dynamics model f_θ is then leveraged to compute the optimal trajectory.
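Step 1 can be sketched as an ordinary least-squares regression from (state, action) pairs to next states; the linear dynamics and noise level below are illustrative assumptions, not the parking environment's kinematics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth (hidden) linear dynamics: s' = A s + B a + noise.
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])

# Interaction data: observed states, applied actions, resulting states.
S = rng.normal(size=(500, 2))
U = rng.normal(size=(500, 1))
S_next = S @ A_true.T + U @ B_true.T + 0.01 * rng.normal(size=(500, 2))

# Model learning: fit f_theta by least squares on [s, a] -> s'.
X = np.hstack([S, U])
theta, *_ = np.linalg.lstsq(X, S_next, rcond=None)
A_hat, B_hat = theta[:2].T, theta[2:].T  # recovered dynamics matrices
```

With enough data the recovered matrices closely match the true ones; a learned f_θ of this kind is what the planner queries instead of the real environment.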


Whether a model-based or a model-free approach is preferable depends on several factors:

  • Sparse rewards
  • The complexity of the policy/value function vs. the dynamics

Is it easier to decide which action is best, or to predict what is going to happen?

  • Some problems can have complex dynamics but a simple optimal policy or value function. For instance, consider the problem of learning to swim: predicting the movement requires understanding fluid dynamics and vortices, while the optimal policy simply consists of moving the limbs in sync.
  • Conversely, other problems can have simple dynamics but complex policies/value functions. Think of the game of Go: its rules are simple (placing a stone merely changes the board state at that location), but the corresponding optimal policy is very complicated.
Beyond this trade-off, other considerations favor model-based approaches:

  • Inductive bias
  • Sample efficiency
  • Interpretability


  • The policy/value function is highly dependent on the goal, which adds significant complexity to a model-free learning process, whereas the dynamics are independent of the goal and hence can be simpler to learn.
  • In the context of an industrial application, we can reasonably expect for safety reasons that the planned trajectory is required to be known in advance, before execution.

For planning, we use the cross-entropy method (CEM), which iterates two steps:

  • Draw samples from a probability distribution. We use Gaussian distributions over sequences of actions.
  • Minimize the cross-entropy between this distribution and a target distribution to produce better samples in the next iteration. We define this target distribution by selecting the top-k performing sampled sequences.
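These two steps can be sketched in NumPy on a toy one-dimensional planning problem; the quadratic objective and all hyperparameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def cem_plan(score, horizon=5, n_samples=64, top_k=8, n_iters=20):
    """Cross-entropy method over sequences of scalar actions."""
    mu = np.zeros(horizon)      # mean of the Gaussian over action sequences
    sigma = np.ones(horizon)    # per-step standard deviation
    for _ in range(n_iters):
        # 1. Draw candidate action sequences from the current Gaussian.
        samples = rng.normal(mu, sigma, size=(n_samples, horizon))
        # 2. Refit the Gaussian to the top-k (elite) sequences, which
        #    minimizes the cross-entropy to the elite distribution.
        elites = samples[np.argsort([score(s) for s in samples])[-top_k:]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

# Toy objective: reward peaks when every action in the sequence is 0.5.
best = cem_plan(lambda seq: -np.sum((seq - 0.5) ** 2))
```

In the parking task the score of a sequence would come from rolling it out through the learned dynamics model f_θ rather than from a closed-form objective.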


The algorithm and the math are all fine, but can the MDP park the car now?

3.2. Deep Q-Network (DQN)

DQN algorithm. Reference: Mnih et al. [9]

We experimented with several configurations; in each case the network overfit:

  1. 3 steering actions, 3 acceleration actions, hidden size 128, 4 layers, gamma = 0.1
  2. 5 steering actions, 2 acceleration actions, hidden size 128, 4 layers, gamma = 0.1
  3. 5 steering actions, 2 acceleration actions, hidden size 128, 3 layers
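The Bellman update these configurations share can be sketched with plain NumPy; the discrete action space (steering values × acceleration values) and the example tensors below are illustrative:

```python
import numpy as np

def dqn_targets(rewards, next_q, dones, gamma=0.1):
    """Bellman targets y = r + gamma * max_a' Q_target(s', a'),
    with the bootstrap term zeroed on terminal transitions."""
    return rewards + gamma * next_q.max(axis=1) * (1.0 - dones)

# A batch of 3 transitions over 9 discrete actions
# (3 steering values x 3 acceleration values).
rewards = np.array([0.0, 1.0, -0.5])
next_q = np.zeros((3, 9)); next_q[0, 4] = 2.0; next_q[1, 0] = 1.0
dones = np.array([0.0, 1.0, 0.0])

y = dqn_targets(rewards, next_q, dones)  # -> [0.2, 1.0, -0.5]
```

The network is then regressed toward these targets; a gamma as low as 0.1 makes the agent very short-sighted, which is one plausible contributor to the poor results above.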

3.3. Soft Actor-Critic (SAC)

SAC has several properties that make it attractive for this task:

  • Sample efficiency: Learning skills in the real world can take a huge amount of time, and adding a new task to the agent requires many trials. Compared to other algorithms, SAC has good sample efficiency, which makes learning new skills comparatively fast.
  • Lack of sensitive hyperparameters: In real-world scenarios, tuning hyperparameters regularly is not practical, so we need a model that is not very sensitive to its hyperparameters.
  • Off-policy learning: An algorithm is off-policy if it can learn from data collected by a different policy, for instance for a different task. Because SAC is off-policy, it can be adapted to other tasks simply by tuning the reward function and parameters accordingly.
We used the following hyperparameters:

  1. n_sampled_goal = 4
  2. goal_selection_strategy = 'future'
  3. verbose = 1
  4. buffer_size = int(1e6)
  5. learning_rate = 1e-3
  6. gamma = 0.9
  7. batch_size = 256
  8. training steps = 5e4
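The first two hyperparameters configure hindsight experience replay [6]: each stored transition is relabeled with n_sampled_goal substitute goals drawn from the 'future' of its own episode. A minimal sketch of that relabeling, with an illustrative episode layout and a sparse match/no-match reward:

```python
import random

def her_relabel(episode, n_sampled_goal=4, rng=random):
    """Relabel each transition with goals achieved later in the same
    episode (the 'future' goal-selection strategy)."""
    relabeled = []
    for t, (obs, action, achieved, goal) in enumerate(episode):
        future = [step[2] for step in episode[t:]]  # goals achieved from t on
        for new_goal in rng.choices(future, k=n_sampled_goal):
            # The reward is recomputed against the substituted goal:
            # success if the achieved goal matches it (sparse reward).
            reward = 0.0 if achieved == new_goal else -1.0
            relabeled.append((obs, action, new_goal, reward))
    return relabeled

# A toy 3-step episode of (obs, action, achieved_goal, desired_goal).
episode = [(0, "a", "p1", "g"), (1, "b", "p2", "g"), (2, "c", "p3", "g")]
extra = her_relabel(episode)
```

Because some relabeled transitions are guaranteed successes, the replay buffer always contains informative rewards even when the agent never reaches the real parking goal.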


3.4. Deep Deterministic Policy Gradient (DDPG)

DDPG maintains four networks:

  1. Q network: the critic takes a state and an action as input and outputs the Q-value.
  2. Deterministic policy network: the actor takes the state as input and outputs an exact (continuous) action, instead of a probability distribution over actions; DDPG therefore works in deterministic settings.
  3. Target Q network: a time-delayed copy of the Q network.
  4. Target policy network: a time-delayed copy of the policy network.
We used the following DDPG hyperparameters:

  1. actor learning rate = 1e-4
  2. critic learning rate = 1e-3
  3. gamma = 0.99
  4. tau = 0.005
  5. batch size = 128
  6. hidden size (in all 4 networks) = 256
  7. maximum memory size = 50000
  8. steps per episode = 500
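The target networks are "time-delayed" via Polyak averaging with the tau above; a sketch in NumPy, where the parameter shapes are illustrative:

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a small step toward the online
    network: theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(online_params, target_params)]

online = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]
target = soft_update(target, online)  # each target entry becomes 0.005
```

Keeping the targets slowly moving stabilizes the bootstrapped Q targets during training.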
For DDPG combined with HER, we used:

  1. n_sampled_goal = 4
  2. goal_selection_strategy = 'future'
  3. online_sampling = True
  4. buffer_size = int(1e6)
  5. learning_rate = 1e-3
  6. gamma = 0.95
  7. batch_size = 256
  8. max_episode_length = 100
  9. noise_std = 0.2
  10. training steps = 2e5
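noise_std controls the Gaussian exploration noise added to the actor's deterministic actions. A sketch of that perturbation, assuming (as an illustration) actions normalized to [-1, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)

def explore(action, noise_std=0.2, low=-1.0, high=1.0):
    """Perturb a deterministic action with Gaussian noise and clip it
    back into the valid action range."""
    noisy = action + rng.normal(0.0, noise_std, size=np.shape(action))
    return np.clip(noisy, low, high)

a = explore(np.array([0.95, -0.2]))  # stays within [-1, 1]
```

Without this noise a deterministic policy would never try actions off its current trajectory; reference [4] discusses the alternative Ornstein-Uhlenbeck noise process often used with DDPG.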


Model-based approaches also come with drawbacks:

  • Model bias
  • The computational cost of planning


  3. Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR abs/1509.02971 (2015).
  4. Edouard Leurent's answer to the Quora post "Why do we use the Ornstein Uhlenbeck Process in the exploration of DDPG?"
  6. Marcin Andrychowicz et al. Hindsight Experience Replay. 2017.
  9. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  11. Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bull. 2, 4 (Aug. 1991), 160–163.



