DR.GEEK

Each observation o∈O takes its share ρ(o | ha) = ρ(hao) / ∑o'∈O ρ(hao') of the total probability of 1.0.
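Here is a minimal sketch of that normalization in Python; the function rho, the string encoding of histories, and the observation set are assumptions made for illustration, not part of the original text.

```python
def observation_prob(rho, h, a, o, observations):
    """Return rho(o | ha): the share of rho(hao) in the total
    probability over all possible next observations o'."""
    total = sum(rho(h + a + o2) for o2 in observations)
    return rho(h + a + o) / total
```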

(8th-January-2021)


• Utility-Maximizing Agents


• Numerical values assigned to outcomes can express any set of preferences among outcomes that obey two reasonable assumptions, called completeness and transitivity (explained later in this section). This numerical function of outcomes is called a utility function, and an agent achieves its most preferred outcomes by maximizing it. However, as in the example of the hitman and the victim, agent actions generally lead to uncertain outcomes, so the agent must instead maximize a sum of outcome utilities weighted by their probabilities, known as the expected utility.
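• As a toy illustration of maximizing such a probability-weighted sum (the outcomes, utilities, and probabilities below are invented for the example):

```python
UTILITY = {"win": 1.0, "draw": 0.5, "lose": 0.0}  # a utility function over outcomes

def expected_utility(outcome_probs):
    """Sum of outcome utilities weighted by outcome probabilities."""
    return sum(p * UTILITY[o] for o, p in outcome_probs.items())

# A risky action (60% win, 40% lose) versus a safe action (a certain draw):
risky = expected_utility({"win": 0.6, "lose": 0.4})  # 0.6
safe = expected_utility({"draw": 1.0})               # 0.5
best_action = "risky" if risky > safe else "safe"    # "risky"
```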

• Think of the agent's interactions with the environment as a chess game in which the agent and the environment alternate actions and observations. Let v(h) denote the total value of history h, combining present utility u(h) with discounted, predicted future utilities, and let v(ha) denote the value after action a. We can compute v(h) and v(ha) by:

• (2.3) v(h) = u(h) + γ max a∈A v(ha),

• (2.4) v(ha) = ∑o∈O ρ(o | ha) v(hao).

• In equation (2.3), v(h) is computed as u(h) plus the maximum value the agent can achieve with its next action, discounted by γ. In equation (2.4), v(ha) is the expected value after the environment's response: a sum of values for the different possible observations, each multiplied by the probability of that observation. Equations (2.3) and (2.4) are applied alternately and recursively. That is, the value v(hao) in (2.4) is v(h') where h' = hao, and v(h') is evaluated using (2.3). The sum in (2.4) produces many different histories hao that must be evaluated in (2.3), and the maximization in (2.3) produces many different histories ha that must be evaluated in (2.4), so equations (2.3) and (2.4) generate a bushy tree of computations evaluating all possible future histories. If there are only a finite number, T, of time steps, then the recursion ends at that final time. The values n time steps in the future are multiplied by γⁿ, which converges toward 0 as n increases, so that even if there is no final time, the recursive sum converges. The function v(·) is called a value function.
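• The following Python sketch implements this alternating recursion under stated assumptions: histories are tuples of alternating actions and observations, u is a given utility function, rho_cond(o, h, a) computes ρ(o | ha), and a finite horizon T ends the recursion. All names and values are illustrative.

```python
GAMMA = 0.9   # temporal discount gamma
T = 3         # finite number of time steps ending the recursion

ACTIONS = ("left", "right")
OBSERVATIONS = ("good", "bad")

def v_history(h, u, rho_cond, t=0):
    """Equation (2.3): v(h) = u(h) + gamma * max_{a in A} v(ha)."""
    if t == T:  # final time step: no further actions to evaluate
        return u(h)
    return u(h) + GAMMA * max(
        v_after_action(h, a, u, rho_cond, t) for a in ACTIONS)

def v_after_action(h, a, u, rho_cond, t):
    """Equation (2.4): v(ha) = sum over o of rho(o | ha) * v(hao)."""
    return sum(rho_cond(o, h, a) * v_history(h + (a, o), u, rho_cond, t + 1)
               for o in OBSERVATIONS)
```

Each call to v_history branches over the |A| actions and each call to v_after_action branches over the |O| observations, which is exactly the bushy tree of future histories described above.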

• We can use this method of computing the values of future histories to define the actions of a rational agent, denoted by the symbol π:

• (2.5) π(h) := a_{|h|+1} = argmax_{a∈A} v(ha).

• Here argmax means that π picks the action a∈A that maximizes v(ha); the subscript |h|+1 indicates that this is the action at the next time step, where |h| is the number of time steps in history h. The agent π is defined in terms of a utility function u, an environment model ρ, and a temporal discount γ. The function π(h) is also called a policy.
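• Continuing the sketch above, the policy in (2.5) simply picks the next action whose value v(ha) is largest; the toy utility and environment model in the usage example are invented for illustration.

```python
def pi(h, u, rho_cond):
    """Equation (2.5): pi(h) = argmax_{a in A} v(ha)."""
    return max(ACTIONS, key=lambda a: v_after_action(h, a, u, rho_cond, t=0))

# Usage with a toy utility and a uniform environment model:
u = lambda h: 1.0 if h and h[-1] == "good" else 0.0
rho_cond = lambda o, h, a: 1.0 / len(OBSERVATIONS)
print(pi((), u, rho_cond))  # both actions tie under this uniform model
```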
