Markov Decision Process: A Framework for Decision Making
December 22, 2018
A Markov Decision Process (MDP) provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. Let’s dive into how MDPs help us understand different types of environments and decision-making scenarios.
Types of Decision Environments
Decision environments can be categorized along several dimensions:
- Complete vs Incomplete Information
- Fully Observable vs Partially Observable
- Competitive vs Collaborative
- Static vs Dynamic
- Deterministic vs Stochastic
Today, we’ll focus on observability in environments and how it impacts decision-making.
Fully Observable Environments
In a fully observable environment, the agent has complete information about the current state of the system. Mathematically, we can represent this as:
\[s_t = f(o_t)\]
where:
- \(s_t\) is the true state at time t
- \(o_t\) is the observation at time t
- \(f\) is a deterministic mapping that recovers the state from the observation (in the simplest case, the identity)
Example: Chess
Consider a chess game. The state space can be represented as:
\[S = \{s_1, s_2, ..., s_n\}\]
where each \(s_i\) represents a possible board configuration. The optimal policy \(\pi^*\) depends only on the current state:
\[\pi^*(s) = \arg\max_a Q(s,a)\]
where \(Q(s,a)\) is the action-value function representing the expected future reward.
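To make this concrete, here is a minimal Python sketch of greedy action selection from a tabular action-value function. The Q dictionary, state label, and move names below are illustrative placeholders, not a real chess engine.

```python
# Minimal sketch: greedy action selection from a tabular action-value function.
# Q maps (state, action) pairs to expected returns; values here are made up.

def greedy_policy(Q, state, actions):
    """Return the action maximizing Q(s, a) for the given fully observed state."""
    return max(actions, key=lambda a: Q.get((state, a), float("-inf")))

# Toy usage with hypothetical values
Q = {("s1", "advance_pawn"): 0.4, ("s1", "develop_knight"): 0.7}
print(greedy_policy(Q, "s1", ["advance_pawn", "develop_knight"]))  # develop_knight
```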
Partially Observable Environments
In partially observable environments, the agent cannot directly observe the true state. Instead, it maintains a belief state based on past observations. This can be represented as:
\[b_t = P(s_t | o_1, o_2, ..., o_t)\]
where:
- \(b_t\) is the belief state at time t
- \(s_t\) is the true state
- \(o_1, o_2, ..., o_t\) are observations up to time t
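One common way to maintain such a belief state is a discrete Bayes filter. The sketch below is a minimal illustration assuming a finite state space and hypothetical transition and observation models supplied as nested dictionaries; it is not tied to any particular library.

```python
# Sketch of a discrete belief update:
#   b_t(s') ∝ P(o_t | s') * Σ_s P(s' | s, a) * b_{t-1}(s)
# transition[s][a] maps next states to probabilities; observation_model[s] maps
# observations to probabilities. Both models are assumed, illustrative inputs.

def update_belief(belief, action, observation, transition, observation_model, states):
    """One step of Bayesian filtering over a finite state space."""
    new_belief = {}
    for s_next in states:
        predicted = sum(transition[s][action].get(s_next, 0.0) * belief[s] for s in states)
        new_belief[s_next] = observation_model[s_next].get(observation, 0.0) * predicted
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()} if total > 0 else new_belief
```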
Example: Poker
In poker, the optimal policy depends on both the current observation and the history of past observations, summarized by the belief state:
\[\pi^*(b_t) = \arg\max_a \sum_{s \in S} b_t(s)Q(s,a)\]
where \(b_t(s)\) represents the probability of being in state \(s\) given the history of observations.
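As a rough illustration, this belief-weighted action selection can be written in a few lines of Python, again assuming a tabular Q and a belief stored as a dictionary of state probabilities (both are placeholders, not a poker solver).

```python
# Sketch: choose the action maximizing the belief-weighted action value,
# i.e. argmax_a Σ_s b(s) * Q(s, a). Q and the belief are assumed tabular.

def best_action(belief, Q, actions):
    return max(actions,
               key=lambda a: sum(p * Q.get((s, a), 0.0) for s, p in belief.items()))
```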
The Markov Property
The key distinction lies in the Markov property. In fully observable environments:
\[P(s_{t+1}|s_t, s_{t-1}, ..., s_1) = P(s_{t+1}|s_t)\]
This means the future is independent of the past given the present. However, in partially observable environments, we need to maintain a history:
\[P(s_{t+1}|o_t, o_{t-1}, ..., o_1) \neq P(s_{t+1}|o_t)\]
Real-World Applications
Consider a self-driving car:
- Fully Observable: Highway driving with clear weather and perfect sensor data
- Partially Observable: Urban driving in fog where past observations help predict hidden obstacles
The mathematical framework for decision-making changes based on observability:
\[V(s) = \max_a \left[R(s,a) + \gamma \sum_{s'} P(s'|s,a)V(s')\right]\]
where:
- \(V(s)\) is the value function
- \(R(s,a)\) is the reward function
- \(\gamma\) is the discount factor
- \(P(s'|s,a)\) is the transition probability
This equation, known as the Bellman equation, takes different forms depending on whether we’re dealing with fully or partially observable environments.
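For the fully observable case, a short value-iteration sketch shows how this Bellman backup is applied in practice. The reward model R and transition model P are assumed to be dictionaries keyed by (state, action) pairs; this is an illustrative sketch under those assumptions, not a production implementation.

```python
# Sketch of value iteration applying the Bellman optimality backup:
#   V(s) <- max_a [ R(s, a) + gamma * Σ_s' P(s'|s, a) * V(s') ]
# R[(s, a)] is a scalar reward; P[(s, a)] maps next states to probabilities.

def value_iteration(states, actions, R, P, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

The sweep updates values in place, so later states in the same iteration already see the newer estimates; this is a common choice that typically speeds convergence without changing the fixed point.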
Conclusion
Understanding the observability of an environment is crucial for designing effective decision-making systems. While fully observable environments like chess can rely solely on current state information, partially observable environments like poker require sophisticated belief state tracking and historical information processing.
The mathematical frameworks we’ve explored help us design better algorithms for each type of environment, leading to more effective decision-making systems in real-world applications.