DR.GEEK

Learning Finite Stochastic Models

(16th-January-2021)


• While Hutter studied using MDPs (Hutter 2009a) and dynamic Bayesian networks (DBNs) (Hutter 2009b; Ghahramani 1997) for environment models, I prefer modeling environments with finite stochastic loop programs because they can factor logic common to multiple observations into unified functions (Hibbard 2012a). Doing so requires a new way to compute ρ(h). Let Q be the set of all finite stochastic loop programs in some prefix-free procedural language. Let φ(q) = 2^-|q| be the prior probability of program q, where |q| is the length of program q.
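The prefix-free requirement is what makes φ a valid (sub)probability: by the Kraft inequality, any prefix-free set of codewords satisfies ∑ 2^-|q| ≤ 1. A minimal sketch with a toy prefix-free set of bitstring "programs" (the particular code is my assumption, not from the text):

```python
from itertools import combinations

# A toy prefix-free set of "programs" encoded as bitstrings (assumed example).
programs = ["0", "10", "110", "111"]

# Prefix-free: no program is a prefix of another.
assert all(not p.startswith(q) and not q.startswith(p)
           for p, q in combinations(programs, 2))

# Kraft inequality: the prior phi(q) = 2**-|q| sums to at most 1.
kraft_sum = sum(2.0 ** -len(p) for p in programs)
print(kraft_sum)  # 0.5 + 0.25 + 0.125 + 0.125 = 1.0
```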


• Let P(h | q) be the probability that program q computes the history h (that is, produces the observations o_i in response to the actions a_i for 1 ≤ i ≤ |h|). For a simple example, let A = {a, b}, O = {0, 1}, and h = (a, 1, a, 0, b, 1), and let q generate observation 0 with probability 0.2 and observation 1 with probability 0.8, with no internal state and no dependence on the agent's actions. Then the probability that the interaction history h is generated by program q is the product of the probabilities of the 3 observations in h: P(h | q) = 0.8 × 0.2 × 0.8 = 0.128.
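The likelihood for this memoryless program can be computed directly; the variable names below are mine:

```python
# History h as (action, observation) pairs; this q ignores actions and state.
h = [("a", 1), ("a", 0), ("b", 1)]

# q emits observation 1 with probability 0.8 and observation 0 with probability 0.2.
obs_prob = {1: 0.8, 0: 0.2}

# P(h | q) is the product of the per-step observation probabilities.
p_h_given_q = 1.0
for _action, obs in h:
    p_h_given_q *= obs_prob[obs]

print(p_h_given_q)  # 0.8 * 0.2 * 0.8 ≈ 0.128
```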


• For example, Table 4.1 defines transition probabilities for a program q' that generates observations dependent on internal state and the agent's actions. The left column gives the current state of q' (s0 or s1) and the agent's action (a or b). The top row gives the next state of q' and the generated observation (0 or 1). The entries are probabilities, with each row summing to 1.0. As before, let the history be h = (a, 1, a, 0, b, 1). The program q' starts in state s0. There are two possible state sequences consistent with the history h: (s0, s0, s0, s1) and (s0, s1, s0, s1). The product of the transition probabilities for the first sequence is 0.3 × 0.2 × 0.4 = 0.024, while the product of the transition probabilities for the second sequence is 0.5 × 1.0 × 0.4 = 0.2. These sequences are mutually exclusive, hence P(h | q') is the sum 0.2 + 0.024 = 0.224.
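P(h | q') can also be computed without enumerating state sequences, using the standard forward recursion over a distribution on current states. The transition table below reproduces the probabilities quoted in the text from Table 4.1; the entries not mentioned in the text are fill-ins of mine (an assumption) chosen only so each row sums to 1.0:

```python
# trans[(state, action)] maps (next_state, observation) -> probability.
# Entries marked "quoted" appear in the text; the others are assumed fill-ins.
trans = {
    ("s0", "a"): {("s0", 0): 0.2,   # quoted (step 2 of the first sequence)
                  ("s0", 1): 0.3,   # quoted (step 1 of the first sequence)
                  ("s1", 1): 0.5},  # quoted (step 1 of the second sequence)
    ("s1", "a"): {("s0", 0): 1.0},  # quoted (step 2 of the second sequence)
    ("s0", "b"): {("s1", 1): 0.4,   # quoted (step 3 of both sequences)
                  ("s0", 0): 0.6},  # assumed fill-in
    ("s1", "b"): {("s1", 0): 1.0},  # assumed fill-in
}

h = [("a", 1), ("a", 0), ("b", 1)]

# Forward recursion: alpha[s] = P(observations so far, current state = s).
alpha = {"s0": 1.0, "s1": 0.0}  # q' starts in state s0
for action, obs in h:
    new_alpha = {"s0": 0.0, "s1": 0.0}
    for s, p in alpha.items():
        if p == 0.0:
            continue
        for (s_next, o), t in trans[(s, action)].items():
            if o == obs:
                new_alpha[s_next] += p * t
    alpha = new_alpha

p_h_given_qprime = sum(alpha.values())
print(p_h_given_qprime)  # ≈ 0.224 = 0.2 + 0.024
```

Summing the two state-sequence products by hand and running the recursion give the same 0.224; the recursion just shares the common prefix work between sequences.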


• Given an interaction history h, the environment model is the single program that provides the most probable explanation of h, that is, the q that maximizes P(q | h). By Bayes' theorem:


• (4.1) P(q | h) = P(h | q) φ(q) / P(h).


• Because it is constant over all q, P(h) can be eliminated. Thus, given a history h, we define λ(h) as the most probable program modeling h by:


• (4.2) λ(h) := argmax_{q∈Q} P(h | q) φ(q).
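Over any concrete enumeration of candidate programs, equation (4.2) is a straightforward argmax. A minimal sketch, assuming each candidate is represented by its length |q| and a likelihood function for P(h | q); the two-candidate family here is hypothetical, not from the text:

```python
h = [("a", 1), ("a", 0), ("b", 1)]

def bernoulli_likelihood(p1):
    # Likelihood of a history under a memoryless program emitting 1 with probability p1.
    def lik(history):
        out = 1.0
        for _a, o in history:
            out *= p1 if o == 1 else 1.0 - p1
        return out
    return lik

# Hypothetical candidate set: (name, program length |q|, likelihood function).
candidates = [
    ("q_fair", 8, bernoulli_likelihood(0.5)),     # shorter program, looser fit
    ("q_biased", 12, bernoulli_likelihood(0.8)),  # longer program, better fit to h
]

def score(cand):
    _name, length, lik = cand
    return lik(h) * 2.0 ** -length  # P(h | q) * phi(q), as in equation (4.2)

lam_h = max(candidates, key=score)
rho = lam_h[2]  # per equation (4.3), rho(h') = P(h' | lambda(h))
print(lam_h[0])
```

With these (made-up) lengths, q_fair wins even though q_biased fits h better (0.125 × 2^-8 > 0.128 × 2^-12), illustrating how the length prior trades fit against simplicity.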


• Given an environment model λ(h), the following can be used for the prior probability of an observation history h' extending h (i.e., h is an initial sub-interval of h'):

• (4.3) ρ(h') = P(h' | λ(h)).


• This distribution ρ(h') can be used in the framework defined by equations (2.1) and (2.3)−(2.5). Not only are the environment models finite, but they can be finitely computed by the agent (Hibbard 2012b):


• Proposition 4.1. Given a finite history h, the model λ(h) can be finitely computed.


• Proof. Given h = (a_1, o_1, ..., a_t, o_t), let q0 be the program that produces observation o_i at time step i for 1 ≤ i ≤ t. That is, q0 is a simple "table-lookup" program that produces the observations o_i in sequence. Let n = |q0|. Since the behavior of q0 is deterministic, P(h | q0) φ(q0) = 1 × 2^-n = 2^-n. Hence the maximum value in equation (4.2) must be at least 2^-n, that is, P(h | λ(h)) φ(λ(h)) ≥ 2^-n. For any program q with |q| > n, P(h | q) φ(q) < 1 × 2^-n = 2^-n, so λ(h) ≠ q. Thus one algorithm for finitely computing λ(h) is an exhaustive search of the finite number of programs q with |q| ≤ n.
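The proof's pruning bound can be sketched with a toy program space: take "programs" to be bitstrings decoded by some fixed toy interpreter, and search only lengths up to n = |q0|, since any longer program scores below the table-lookup baseline 2^-n. Both the bitstring encoding and the interpreter below are assumptions for illustration, not the book's construction:

```python
from itertools import product

h = [1, 0, 1]  # just the observations o_1..o_t of h

def interpret(bits):
    # Toy interpreter (assumed): a program is a lookup table of t bits,
    # emitting bits[i] deterministically at step i; other lengths are invalid.
    if len(bits) != len(h):
        return None
    return lambda history: 1.0 if list(bits) == history else 0.0

n = len(h)  # |q0| for the table-lookup program that encodes h directly

best, best_score = None, 0.0
# Exhaustive search over the finitely many programs with |q| <= n;
# programs with |q| > n cannot beat the baseline score 2**-n.
for length in range(1, n + 1):
    for bits in product([0, 1], repeat=length):
        lik = interpret(bits)
        if lik is None:
            continue
        s = lik(h) * 2.0 ** -length
        if s > best_score:
            best, best_score = bits, s

print(best, best_score)  # the table-lookup program (1, 0, 1), scoring 2**-3 = 0.125
```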


• As the computing time of this exhaustive search for λ(h) is exponential in |h|, this procedure is not practical for learning environment models. However, it is useful to have a finitely computable theoretical framework.


• An agent that is part of a finite universe will not have the necessary computational resources to compute the expressions defined by equations (4.2), (4.3), (2.1) and (2.3)−(2.5). Chapter 8 will address the issue of agents that must approximate these equations.


• The derivation of ρ(h) in equations (4.1)−(4.3) takes Occam's razor into account, favoring the simplest explanation of the observations, while ignoring Epicurus' principle of including all explanations. An alternate way to compute ρ(h) accounts for all finite stochastic loop programs:


• (4.4) ρ(h) = ∑_{q∈Q} P(h | q) φ(q).


• Because this equation requires a sum over an infinite number of programs, it is not finitely computable. However, it can be approximated to any desired precision in finite computing time.
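The approximation claim follows from the prior: the total prior mass of programs longer than L is at most 2^-L, and since P(h | q) ≤ 1, truncating the sum in (4.4) at length L leaves an error of at most 2^-L. A toy sketch, assuming a hypothetical family with exactly one program per length k:

```python
# Hypothetical family (an assumption): the single program of length k is a
# memoryless emitter of observation 1 with probability p_k = 1 - 2**-k.
def likelihood(k, obs):
    p1 = 1.0 - 2.0 ** -k
    out = 1.0
    for o in obs:
        out *= p1 if o == 1 else 1.0 - p1
    return out

obs = [1, 0, 1]

def rho_truncated(L):
    # Truncated version of equation (4.4): sum over programs of length <= L.
    return sum(likelihood(k, obs) * 2.0 ** -k for k in range(1, L + 1))

# Since P(h | q) <= 1, the neglected tail is below sum_{k>L} 2**-k = 2**-L,
# so choosing L with 2**-L < eps guarantees the desired precision.
eps = 1e-6
L = 20  # 2**-20 < 1e-6
approx = rho_truncated(L)
print(approx)
```

Refining the cutoff (say to L + 10) changes the value by less than eps, matching the claim that the sum is approximable to any desired precision in finite time.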
