A Beginner's Guide to Q-Learning (2024)


Have you ever scolded or punished your dog for something wrong it did? Or have you ever trained a pet and rewarded it for every command it followed correctly? If you are a pet owner, your answer is probably ‘Yes’. You may have noticed that if you do this consistently from a young age, its bad behaviour reduces day by day: it learns from its mistakes and trains itself well.

As humans, we have experienced the same thing. Can you remember how, in primary school, our teachers rewarded us with stars when we did our schoolwork properly? :D

This is exactly what happens in Reinforcement Learning (RL).

Reinforcement Learning is one of the most beautiful branches of Artificial Intelligence.

The objective of RL is to maximize the reward of an agent by taking a series of actions in response to a dynamic environment.


Reinforcement Learning is the science of making optimal decisions from experience. Breaking it down, the process of Reinforcement Learning involves these simple steps (sketched in code just after the list):

  1. Observation of the environment
  2. Deciding how to act using some strategy
  3. Acting accordingly
  4. Receiving a reward or penalty
  5. Learning from the experiences and refining our strategy
  6. Iterating until an optimal strategy is found
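
As a rough sketch of that loop in Python, using the open-source gymnasium package as a stand-in environment (an assumption for illustration; the article's own grid world comes later), with a random policy in place of a real strategy:

    import gymnasium as gym

    # A toy run of the observe-decide-act-reward loop on a ready-made
    # environment (FrozenLake). The strategy here is random; the learning
    # step is what Q-learning, explained below, fills in.
    env = gym.make("FrozenLake-v1", is_slippery=False)
    state, info = env.reset()                     # 1. observe the environment
    done = False
    while not done:
        action = env.action_space.sample()        # 2-3. decide and act (random for now)
        state, reward, terminated, truncated, info = env.step(action)  # 4. reward or penalty
        # 5. a learning rule such as Q-learning would refine its estimates here
        done = terminated or truncated            # 6. iterate until the episode ends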

There are two main types of RL algorithms: model-based and model-free.

A model-free algorithm estimates the optimal policy without using or estimating the dynamics (transition and reward functions) of the environment, whereas a model-based algorithm uses the transition function (and the reward function) to estimate the optimal policy.

Q-learning is a model-free reinforcement learning algorithm.

Q-learning is a value-based learning algorithm. Value-based algorithms update the value function based on an equation (in particular, the Bellman equation), whereas the other type, policy-based algorithms, estimates the value function with a greedy policy obtained from the last policy improvement.

Q-learning is an off-policy learner. This means it learns the value of the optimal policy independently of the agent’s actions. An on-policy learner, on the other hand, learns the value of the policy being carried out by the agent, including the exploration steps, and it will find a policy that is optimal taking into account the exploration inherent in that policy.
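
To make the distinction concrete, compare the update targets of Q-learning and of SARSA, the classic on-policy counterpart (SARSA is not covered further in this article; γ is the discount factor):

    Q-learning (off-policy) target:  r + γ · max over a' of Q(s', a')
    SARSA (on-policy) target:        r + γ · Q(s', a'), where a' is the action the agent actually takes next

Q-learning bootstraps from the best possible next action regardless of what the agent actually does next, which is what makes it off-policy.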

What’s this ‘Q’?

The ‘Q’ in Q-learning stands for quality. Quality here represents how useful a given action is in gaining some future reward.

Q-learning Definition

  • Q*(s,a) is the expected value (cumulative discounted reward) of doing a in state s and then following the optimal policy.
  • Q-learning uses Temporal Differences (TD) to estimate the value of Q*(s,a). Temporal-difference learning means the agent learns from the environment through episodes, with no prior knowledge of the environment.
  • The agent maintains a table of Q[S, A], where S is the set of states and A is the set of actions.
  • Q[s, a] represents its current estimate of Q*(s,a).
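
In symbols (γ is the discount factor and r_t the reward received at step t; this notation is mine, not from the article), that expected cumulative discounted reward is:

    Q*(s, a) = E[ r_1 + γ·r_2 + γ²·r_3 + … | s_0 = s, a_0 = a, optimal actions thereafter ]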

Q-learning Simple Example

In this section, Q-learning is explained along with a demo.

Let’s say an agent has to move from a starting point to an ending point along a path that has obstacles. The agent needs to reach the target by the shortest possible path without hitting the obstacles, staying within the boundary marked out by the obstacles. For our convenience, I have introduced this in a customized grid environment as follows.

[Figure: the customized grid environment with the start point, goal and obstacles]

Introducing the Q-Table

The Q-Table is the data structure used to hold the estimated maximum expected future reward for each action at each state. Basically, this table will guide us to the best action at each state. The Q-Learning algorithm is used to learn each value of the Q-Table.

Q-function

The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a).

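Written out in the standard form (α is the learning rate, γ the discount factor, and s' the state reached after taking action a in state s):

    New Q(s, a) = Q(s, a) + α · [ R(s, a) + γ · max over a' of Q(s', a') − Q(s, a) ]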

Q-learning Algorithm Process


Step 1: Initialize the Q-Table

First, the Q-table has to be built. There are n columns, where n = the number of actions, and m rows, where m = the number of states.

In our example, n = {Go Left, Go Right, Go Up, Go Down} and m = {Start, Idle, Correct Path, Wrong Path, End}. First, let’s initialize all the values to 0.
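
A minimal sketch of this initialization in Python, using NumPy (the state and action names are the ones listed above):

    import numpy as np

    # One row per state, one column per action, all values starting at 0
    states = ["Start", "Idle", "Correct Path", "Wrong Path", "End"]
    actions = ["Go Left", "Go Right", "Go Up", "Go Down"]
    q_table = np.zeros((len(states), len(actions)))
    print(q_table)   # a 5 x 4 table of zeros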


Step 2: Choose an Action

Step 3: Perform an Action

The combination of steps 2 and 3 is repeated for an arbitrary amount of time: these steps run until training is stopped, or until the training loop terminates as defined in the code.

First, an action (a) in the state (s) is chosen based on the Q-Table. Note that, as mentioned earlier, when the episode initially starts, every Q-value should be 0.

Then, update the Q-values for being at the start and moving right using the Bellman equation which is stated above.

The epsilon-greedy strategy comes into play here. In the beginning, the epsilon rate is higher: the agent explores the environment and chooses actions at random. This makes sense, since the agent does not yet know anything about the environment. As the agent explores the environment, the epsilon rate decreases and the agent starts to exploit the environment.
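
Continuing the sketch from Step 1, epsilon-greedy selection and decay could look like this (the epsilon values are illustrative assumptions, not taken from the article):

    import numpy as np

    rng = np.random.default_rng(0)
    epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

    def choose_action(q_table, state_index, epsilon):
        """Explore with probability epsilon, otherwise exploit the best known action."""
        if rng.random() < epsilon:
            return int(rng.integers(q_table.shape[1]))    # explore: random action
        return int(np.argmax(q_table[state_index]))       # exploit: current best action

    # After each episode, shift gradually from exploration towards exploitation
    epsilon = max(epsilon_min, epsilon * epsilon_decay)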

During the process of exploration, the agent progressively becomes more confident in estimating the Q-values.

In our example, when training starts the agent is completely unaware of the environment. So let’s say it takes a random action and moves to the ‘right’.


We can now update the Q-values for being at the start and moving right using the Bellman equation.

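Continuing the same sketch, that single update for (Start, Go Right) could be written as follows; the learning rate, discount factor, resulting state and reward are assumed values for illustration:

    alpha, gamma = 0.1, 0.9                  # assumed learning rate and discount factor

    s = states.index("Start")
    a = actions.index("Go Right")
    s_next = states.index("Correct Path")    # assumed outcome of moving right
    reward = 1                               # a step closer to the goal (see the scheme below)

    # Bellman / Q-learning update for the (Start, Go Right) entry
    q_table[s, a] += alpha * (reward + gamma * np.max(q_table[s_next]) - q_table[s, a])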

Step 4: Measure Reward

Now we have taken an action and observed an outcome and reward.

Step 5: Evaluate

We need to update the function Q(s,a).

This process is repeated again and again until learning is stopped. In this way the Q-Table is updated and the value function Q is maximized. Here, Q(state, action) returns the expected future reward of that action at that state.


In the example, I have defined the rewarding scheme as follows.

Reward when moving a step closer to the goal = +1

Reward when hitting an obstacle = -1

Reward when idle = 0
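
As a tiny lookup in the same sketch (the key names are placeholders of my choosing):

    # Reward scheme from the example above
    REWARDS = {
        "step_closer_to_goal": +1,
        "hit_obstacle": -1,
        "idle": 0,
    }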

Initially, we explore the agent’s environment and update the Q-Table. When the Q-Table is ready, the agent starts to exploit the environment and takes better actions. The final Q-Table can look like the following (for example).

[Figure: an example final Q-Table after training]

Following are the outcomes showing the agent’s shortest path towards the goal after training.

[Figure: the agent’s learned shortest path to the goal after training]
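
Finally, for readers who want to experiment right away, here is a minimal, self-contained Python sketch of the whole procedure on a small toy grid. The grid layout, hyper-parameter values and helper names are illustrative assumptions rather than the exact environment from the figures, and for simplicity the +1 reward is given only on reaching the goal instead of on every step closer to it:

    import numpy as np

    # Toy grid world (an assumption for illustration, not the article's exact grid)
    # 'S' = start, 'G' = goal, '#' = obstacle, '.' = free cell
    GRID = [
        "S..#",
        ".#..",
        "...#",
        "#..G",
    ]
    N_ROWS, N_COLS = len(GRID), len(GRID[0])
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

    def step(state, action):
        """Apply an action; return (next_state, reward, done)."""
        r, c = state
        dr, dc = ACTIONS[action]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < N_ROWS and 0 <= nc < N_COLS) or GRID[nr][nc] == "#":
            return state, -1, False        # hit the boundary or an obstacle
        if GRID[nr][nc] == "G":
            return (nr, nc), +1, True      # reached the goal
        return (nr, nc), 0, False          # ordinary move

    # Q-learning hyper-parameters (assumed values)
    alpha, gamma = 0.1, 0.9
    epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
    rng = np.random.default_rng(0)
    q_table = np.zeros((N_ROWS, N_COLS, len(ACTIONS)))   # one Q-value per (cell, action)

    for episode in range(2000):
        state, done = (0, 0), False
        while not done:
            # Steps 2-3: choose and perform an action (epsilon-greedy)
            if rng.random() < epsilon:
                action = int(rng.integers(len(ACTIONS)))
            else:
                action = int(np.argmax(q_table[state]))
            # Step 4: measure the reward
            next_state, reward, done = step(state, action)
            # Step 5: evaluate, i.e. update Q(s, a) with the Bellman equation
            best_next = np.max(q_table[next_state])
            q_table[state][action] += alpha * (reward + gamma * best_next - q_table[state][action])
            state = next_state
        epsilon = max(epsilon_min, epsilon * epsilon_decay)   # explore less over time

    # Exploit the learned Q-Table: follow the greedy action from the start
    state, path, done = (0, 0), [(0, 0)], False
    while not done and len(path) < 20:
        state, _, done = step(state, int(np.argmax(q_table[state])))
        path.append(state)
    print("Learned path:", path)

After training, the greedy rollout at the end should print a shortest obstacle-free path from the start cell to the goal cell of this toy grid.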

Please drop me a mail to get the Python implementation code for the concept explained here.
