
Modifications of the Agent Framework

(12th-January-2021)



• Universal AI learns an environment model in the form of a probability distribution ρ(h) over interaction histories. Conditional probabilities ρ(o | ha) are derived from ρ(h) for use in equations (2.3)−(2.5), repeated here for convenience:

• (3.4) v(h) = u(h) + γ max a∈A v(ha),

• (3.5) v(ha) = ∑o∈O ρ(o | ha) v(hao),

• (3.6) π(h) := a_{|h|+1} = argmax a∈A v(ha).
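As a concrete illustration of how these three equations fit together, here is a minimal Python sketch of the finite-horizon recursion, assuming the learned model ρ(o | ha) is available as a callable and that the action and observation sets are small enough to enumerate. The names (rho, utility, actions, observations, horizon) and the discount value are illustrative assumptions, not taken from the original text.

```python
def value(h, rho, utility, actions, observations, horizon, gamma=0.9):
    # Equation (3.4): v(h) = u(h) + γ max_a v(ha); horizon bounds the recursion.
    if horizon == 0:
        return utility(h)
    return utility(h) + gamma * max(
        action_value(h, a, rho, utility, actions, observations, horizon, gamma)
        for a in actions)

def action_value(h, a, rho, utility, actions, observations, horizon, gamma=0.9):
    # Equation (3.5): v(ha) = Σ_o ρ(o | ha) v(hao).
    return sum(
        rho(o, h, a) * value(h + ((a, o),), rho, utility, actions,
                             observations, horizon - 1, gamma)
        for o in observations)

def policy(h, rho, utility, actions, observations, horizon, gamma=0.9):
    # Equation (3.6): π(h) = argmax_a v(ha).
    return max(actions, key=lambda a: action_value(
        h, a, rho, utility, actions, observations, horizon, gamma))
```

Here the history h is represented as a tuple of (action, observation) pairs; any representation the model and utility functions agree on would work.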

• Given that the AIXI agent learns an abstract and complex function such as ρ(h), other agents could learn the value function v(ha) or the policy π(h) rather than learning ρ(h). In fact, some AI systems do just that.


• Learning π(h) directly is the approach taken by policy iteration and by evolutionary programming. For example, the Hayek system of Eric Baum and Igor Durdanovic (Baum 2004) learned to solve the block-stacking puzzle illustrated in Figure 3.2. Stack 0 contains several types of blocks with different designs. Together, stacks 1, 2, and 3 contain the same number of blocks of each design as stack 0. The goal is to get all the blocks from stacks 1, 2, and 3 into stack 1, with the order of block designs exactly matching stack 0. The solver may move only one block at a time, from the top of stack 1, 2, or 3 to the top of another of those three stacks. Blocks cannot be moved to or from stack 0.
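To make the puzzle's rules concrete, here is a minimal Python sketch of the block-stacking environment as just described. The state representation (Python lists as stacks, index 0 at the bottom) and the example instance are illustrative assumptions, not taken from Baum's Hayek implementation.

```python
def legal_moves(stacks):
    # Return (src, dst) pairs: move the top block of stack src onto stack dst.
    # Only stacks 1-3 may be touched; stack 0 is the read-only target pattern.
    return [(src, dst)
            for src in (1, 2, 3) if stacks[src]
            for dst in (1, 2, 3) if dst != src]

def apply_move(stacks, move):
    # Return a new configuration with the top block of src moved onto dst.
    src, dst = move
    new = [list(s) for s in stacks]
    new[dst].append(new[src].pop())
    return new

def solved(stacks):
    # Solved when stack 1 reproduces stack 0's order exactly
    # and stacks 2 and 3 are empty.
    return stacks[1] == stacks[0] and not stacks[2] and not stacks[3]

# Illustrative instance with block designs 'a', 'b', 'c' (list index 0 = bottom).
stacks = [["a", "b", "c"], ["c"], ["a"], ["b"]]
print(legal_moves(stacks))  # e.g. [(1, 2), (1, 3), (2, 1), ...]
print(solved(stacks))       # False
```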

• In the context of our agent-environment framework, for each puzzle Hayek made a single observation of the initial configuration of the block stacks. Its sequence of moves to solve (or not solve) the puzzle constituted a single action, followed by an observation of a reward (1 if the puzzle was solved, 0 if not). Hayek's efforts to solve a sequence of puzzles constituted an interaction history, during which Hayek learned a policy π(h).
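This episode structure can be sketched as a simple loop, reusing apply_move and solved from the sketch above. The solve_puzzle argument stands in for whatever move-selection procedure the agent has learned; it is a hypothetical placeholder, not Hayek's actual mechanism.

```python
def run_interactions(puzzles, solve_puzzle):
    # Each puzzle contributes one (observation, action, reward) triple to the
    # interaction history: observe the initial stacks, emit a whole sequence
    # of moves as a single action, then observe reward 1 (solved) or 0 (not).
    history = []
    for puzzle in puzzles:
        observation = puzzle
        action = solve_puzzle(history, observation)  # list of (src, dst) moves
        state = observation
        for move in action:
            state = apply_move(state, move)
        reward = 1 if solved(state) else 0
        history.append((observation, action, reward))
    return history
```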
