Introduction to Reinforcement Learning

Value Functions and Bellman Equations

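In reinforcement learning, a value function gives the expected cumulative discounted reward an agent can collect from a given starting point. The state-value function V(s) measures how good it is to be in state s, while the action-value function Q(s,a) measures how good it is to take action a in state s. The Bellman equation is a recursive formula that relates the value of a state to the values of its successor states, which makes it the basis for computing value functions.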

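The basic idea behind the Bellman equation is that the value of a state is the immediate reward you expect to receive plus the expected value of the next state, multiplied by a discount factor γ between 0 and 1. With γ < 1 and bounded rewards, the infinite sum of future rewards stays finite, so the value function is well defined. Smaller values of γ also make the agent weigh near-term rewards more heavily than distant ones.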
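To make the role of the discount factor concrete, here is a minimal sketch that sums a discounted reward sequence. The function name discounted_return and the constant reward of 1 per step are illustrative assumptions, not part of the course material. With γ = 0.9, the partial sums approach 1 / (1 - γ) = 10 as the horizon grows.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


gamma = 0.9
# Assumed toy setting: a constant reward of 1 at every step.
# The partial sums approach the geometric-series limit 1 / (1 - gamma).
for horizon in (10, 100, 1000):
    print(horizon, discounted_return([1.0] * horizon, gamma))
```

With γ = 1 the same sum would grow without bound as the horizon increases, which is why discounting matters for infinite-horizon problems.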

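The Bellman optimality equation, which gives the value of a state when the agent acts as well as possible, is as follows:

V(s) = max_a [ R(s,a) + γ Σ_s' P(s'|s,a) V(s') ]

Where:

V(s) is the optimal value of state s
R(s,a) is the reward for taking action a in state s
γ is the discount factor, with 0 ≤ γ < 1
a ranges over the actions available in state s
s' ranges over the possible next states
P(s'|s,a) is the probability of transitioning to state s' when taking action a in state s

When the reward depends only on the state, R(s,a) = R(s) and the reward can be moved outside the maximization: V(s) = R(s) + γ max_a Σ_s' P(s'|s,a) V(s').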
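The right-hand side of the equation can be read as a single "backup" for one state. The sketch below is one way to write it, assuming a hypothetical dictionary layout: P[s][a] maps next states to probabilities, and R[s][a] holds the reward for each state-action pair. A reward that depends only on the state fits the same layout by repeating the value for every action.

```python
def bellman_backup(s, V, P, R, gamma):
    """One Bellman optimality backup for state s.

    P[s][a] maps each next state s2 to P(s2 | s, a).
    R[s][a] is the reward for taking action a in state s (assumed layout).
    """
    return max(
        R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
        for a in P[s]
    )


# Hypothetical one-state example: "stay" pays 1 and loops back with certainty.
P = {"s0": {"stay": {"s0": 1.0}}}
R = {"s0": {"stay": 1.0}}
V = {"s0": 0.0}
print(bellman_backup("s0", V, P, R, gamma=0.9))  # 1.0
```

Applying this backup to every state, over and over, is the core loop of value iteration.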

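The Bellman equation underlies most value-based reinforcement learning algorithms. Q-learning is derived from the optimality form of the equation applied to action values, while SARSA is derived from the corresponding equation for the policy the agent is actually following.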
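As a rough illustration of how the two algorithms relate to the Bellman equation, the sketch below shows the standard tabular update targets. The function names and the nested-dictionary Q-table are assumptions made for this example. The only difference between the two targets is which next-state action value they bootstrap from.

```python
def q_learning_target(Q, reward, next_state, gamma):
    # Off-policy: bootstrap from the best action value in the next state.
    return reward + gamma * max(Q[next_state].values())


def sarsa_target(Q, reward, next_state, next_action, gamma):
    # On-policy: bootstrap from the action the agent actually chose next.
    return reward + gamma * Q[next_state][next_action]


def td_update(Q, state, action, target, alpha):
    # Move Q(state, action) a fraction alpha of the way toward the target.
    Q[state][action] += alpha * (target - Q[state][action])


# Hypothetical Q-table, for illustration only.
Q = {"A": {"left": 0.0, "right": 2.0}, "B": {"left": 1.0, "right": 0.0}}
target = q_learning_target(Q, reward=1.0, next_state="A", gamma=0.9)
td_update(Q, "B", "left", target, alpha=0.5)
print(Q["B"]["left"])
```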

Example

Let's say we have a robot that can move left or right in a 1D world with two states: A and B. If the robot moves left from state A, it receives a reward of -1 and ends up in state B. If it moves right from state A, it receives a reward of +1 and stays in state A. If it moves left from state B, it receives a reward of +1 and ends up in state A. If it moves right from state B, it receives a reward of -1 and stays in state B. The discount factor γ is 0.9.

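The value function for each state can be calculated using the Bellman equation. Because the reward here depends on the action taken, it goes inside the maximization. Every move is deterministic, so each transition probability is 1:

V(A) = max{ -1 + 0.9 V(B), 1 + 0.9 V(A) }
V(B) = max{ 1 + 0.9 V(A), -1 + 0.9 V(B) }

In each pair, the first term corresponds to moving left and the second to moving right.

Starting from any initial guess, repeatedly applying these two equations as updates converges to V(A) = V(B) = 10. The best policy moves right in state A and left in state B, so the robot collects +1 at every step, and the discounted sum of that reward stream is 1 / (1 - 0.9) = 10.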
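The iterative solution can be sketched in a few lines of Python. This is a minimal value-iteration sketch that uses the rewards and transitions from the robot description above. The MODEL dictionary, the tolerance, and the iteration cap are assumptions chosen for this illustration.

```python
GAMMA = 0.9

# (state, action) -> (reward, next_state); every transition is deterministic.
MODEL = {
    ("A", "left"): (-1.0, "B"),
    ("A", "right"): (1.0, "A"),
    ("B", "left"): (1.0, "A"),
    ("B", "right"): (-1.0, "B"),
}


def value_iteration(model, gamma, tol=1e-9, max_iters=10_000):
    """Apply the Bellman optimality backup to all states until values settle."""
    states = sorted({s for s, _ in model})
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_V = {
            s: max(
                r + gamma * V[s2]
                for (s0, _), (r, s2) in model.items()
                if s0 == s
            )
            for s in states
        }
        # Largest change in any state's value during this sweep.
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    return V


V = value_iteration(MODEL, GAMMA)
print({s: round(v, 3) for s, v in V.items()})
```

Both values settle at 10 (up to the tolerance), which is what an agent earns by collecting +1 at every step with γ = 0.9.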

All courses were automatically generated using OpenAI's GPT-3. Your feedback helps us improve as we cannot manually review every course. Thank you!