
Predictive Science

a Tautology

Peter Nordin
The Answer is:
The asymmetry of similarity

!  What thing is this like?


!  And what is this like?
A heuristic measure of amount of
information: Shannon’s guessing
game…

1. Pony?
2. Cow?
3. Dog?
…
345. Pegasus!
Science is Prediction

!  When does the next solar eclipse in Europe occur?

!  The next solar eclipse in Europe will happen on August 12, 2026.
Science is Compression
The “Model”: Science and Prediction

The Model
The Turkey and the issue with inductive predictions (1)
The Turkey and the issue with inductive predictions (2)
Mandatory Reading
All Real Science is
Predictive Science

!  Predict when the sun will set tomorrow

!  Predict if you will be sick or well by taking this medicine

!  Predict what will happen in this project if this methodology is used
How to predict
anything:

1.  Collect facts

2.  Find a short model fitting all the facts

3.  Extrapolate that model into the future; a model's probability is given by its length, so shorter models are more probable (see the sketch below)

4.  Meta loop: collect and include facts about your model-finding adventures, go to step 2, and use the result for planning
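Below is a minimal, illustrative Python sketch of steps 1-3, assuming a toy data set and a crude two-part description-length score (parameter bits plus residual bits); the data and scoring constants are invented for illustration, not taken from the slides.

```python
# Illustrative sketch of steps 1-3: fit candidate models of increasing
# complexity, score each by a crude two-part "description length"
# (bits for the parameters + bits for the residual errors), and
# extrapolate with the shortest-description model.
import numpy as np

facts_x = np.arange(10.0)                                   # step 1: collected facts
facts_y = 3.0 * facts_x + 1.0 + np.random.normal(0, 0.1, size=10)

def description_length(degree, residuals, bits_per_param=32):
    # model cost: parameters at fixed precision; data cost: Gaussian code length
    model_bits = (degree + 1) * bits_per_param
    data_bits = 0.5 * len(residuals) * np.log2(np.maximum(np.var(residuals), 1e-12))
    return model_bits + data_bits

best = None
for degree in range(0, 6):                                  # step 2: search over short models
    coeffs = np.polyfit(facts_x, facts_y, degree)
    residuals = facts_y - np.polyval(coeffs, facts_x)
    dl = description_length(degree, residuals)
    if best is None or dl < best[0]:
        best = (dl, degree, coeffs)

dl, degree, coeffs = best
print(f"shortest model: degree {degree}, description length {dl:.1f} bits")
print("prediction for x=10:", np.polyval(coeffs, 10.0))     # step 3: extrapolate
```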
Companies and
Prediction

!  A company is a collection of people predicting risk from actions

!  No risk - no gain
Recent progress
Recent advances:

Universal Learning Algorithms. There is a theoretically optimal way of predicting the future, given the past. It can be used to define an optimal (though noncomputable) rational agent that maximizes its expected reward in almost arbitrary environments sampled from computable probability distributions.
Recent advances:

All scientists: Physicists, economists, and other scientists make predictions based on observations. So does everybody in daily life. Did you know that there is a theoretically optimal way of predicting? Every scientist should know about it.

Normally we do not know the true conditional probability distribution p(next event | past). But assume we do know that p is in some set P of distributions. Choose a fixed weight w_q for each q in P such that the w_q add up to 1 (for simplicity, let P be countable). Then construct the Bayesmix M(x) = Sum_q w_q q(x), and predict using M instead of the optimal but unknown p.
How wrong is it to do that? The recent exciting work of Marcus Hutter
(funded through Juergen Schmidhuber's SNF research grant
"Unification of Universal Induction and Sequential Decision Theory")
provides general and sharp loss bounds:
Let L_M(n) and L_p(n) be the total expected losses of the M-predictor and the p-predictor, respectively, for the first n events. Then L_M(n) − L_p(n) is at most of the order of sqrt(L_p(n)). That is, M is not much worse than p.
And in general, no other predictor can do better than that!
In particular, if p is deterministic, then the M-predictor soon won't make
any errors any more!
If P contains ALL computable distributions, then M becomes the
celebrated enumerable universal prior. That is, after decades of
somewhat stagnating research we now have sharp loss bounds for Ray
Solomonoff's universal (but incomputable) induction scheme (1964,
1978).
Alternatively, reduce M to what you get if you just add up weighted
estimated future finance data probabilities generated by 1000
commercial stock-market prediction software packages. If only one of
them happens to work fine (but you do not know which) you still should
get rich.
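As a rough illustration of the Bayesmix construction described above, here is a small Python sketch assuming a finite class P of Bernoulli models with uniform weights w_q; the hypothesis class and the data are invented stand-ins for the countable class in the text.

```python
# Sketch of the Bayesmix M(x) = sum_q w_q q(x) for a finite class P of
# Bernoulli models. Predicting with M means averaging the models'
# next-symbol probabilities, weighted by prior weight times how well
# each model has explained the past (its likelihood so far).
thetas = [i / 10 for i in range(1, 10)]      # hypothesis class P (assumed)
weights = [1.0 / len(thetas)] * len(thetas)  # w_q, summing to 1
likelihoods = [1.0] * len(thetas)            # q(x_1..x_t) for each q in P

def predict_next_one(ws, ls, ths):
    # M(next=1 | past) = sum_q w_q q(past) q(1|past) / sum_q w_q q(past)
    num = sum(w * l * th for w, l, th in zip(ws, ls, ths))
    den = sum(w * l for w, l in zip(ws, ls))
    return num / den

def observe(bit, ls, ths):
    # update each model's likelihood of the sequence seen so far
    for i, th in enumerate(ths):
        ls[i] *= th if bit == 1 else (1.0 - th)

sequence = [1, 1, 0, 1, 1, 1, 0, 1]           # invented example data
for bit in sequence:
    p1 = predict_next_one(weights, likelihoods, thetas)
    print(f"M(next=1 | past) = {p1:.3f}, observed {bit}")
    observe(bit, likelihoods, thetas)
```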
Intelligence…

!  …Is compression

!  If used for prediction


= Art?
Theory Pyramid

Undecidable stuff, etc.

Optimal Cognition

Algorithmic Information Theory

Optimal prediction

Experimental planning

Turing-complete representations

Bayes etc.

Multivariate distribution statistics

Single-variable distribution statistics


Agent
Formal Agent Model
Gödel machine
Artificial Intelligence

!  Information-theoretic, statistical, and philosophical foundations of Artificial Intelligence
Universal AI

Universal Artificial Intelligence = Decision Theory + Universal Induction

Decision Theory = Probability + Utility Theory

Universal Induction = Ockham + Bayes + Turing
Pieces of the puzzle

!  Philosophical issues: the common principle behind their solution is Occam's simplicity principle. Based on Occam's and Epicurus' principles, Bayesian probability theory, and Turing's universal machine, Solomonoff developed a formal theory of induction.

!  We focus on the sequential/online setup considered in this presentation and place it into the wider machine-learning context.
What is Intelligence?

!  Informal definition of (artificial) intelligence?

!  Intelligence measures an agent's ability to achieve goals in a wide range of environments.

!  Emergent: features such as the ability to learn and adapt, or to understand, are implicit in the above definition, as these capacities enable an agent to succeed in a wide range of environments.

!  The science of Artificial Intelligence is concerned with the construction of intelligent systems/artifacts/agents and their analysis.
The Hierarchy

!  Induction → Prediction → Decision → Action

!  Having or acquiring or learning or inducing a model of the environment an agent interacts with allows the agent to make predictions and utilize them in its decision process of finding a good next action.

!  Induction infers general models from specific observations/facts/data, usually exhibiting regularities or properties or relations in the latter.

!  Example. Induction: find a model of the world economy.

!  Prediction: use the model for predicting the future stock market.

!  Decision: decide whether to invest assets in stocks or bonds.

!  Action: trading large quantities of stocks influences the market.
Will the Sun Rise Tomorrow

!  Example 1: Probability of Sunrise Tomorrow. What is the probability p(1 | 1^d) that the sun will rise tomorrow? (d = number of past days the sun rose, 1 = sun rises, 0 = sun does not rise)

!  p is undefined, because there has never been an experiment that tested the existence of the sun tomorrow (reference class problem).

!  p = 1, because the sun rose in all past experiments.

!  p = 1 − ϵ, where ϵ is the proportion of stars that explode per day.

!  p = (d+1)/(d+2), which is Laplace's rule, derived from Bayes' rule (see the sketch below).

!  Derive p from the type, age, size and temperature of the sun, even though we never observed another star with those exact properties.

!  Conclusion: We predict that the sun will rise tomorrow with high probability, independent of the justification.
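A tiny check of Laplace's rule p = (d+1)/(d+2) from the list above; the day counts are arbitrary examples.

```python
# Laplace's rule: after the sun has risen on d consecutive past days,
# predict sunrise tomorrow with probability (d + 1) / (d + 2).
def laplace_rule(d):
    return (d + 1) / (d + 2)

for d in (1, 10, 365, 2_000_000):            # arbitrary example day counts
    print(f"d = {d:>9}: p(sunrise tomorrow) = {laplace_rule(d):.8f}")
```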
Sequence

!  Example 2: Digits of a Computable Number

!  Extend 14159265358979323846264338327950288419716939937?

!  Looks random?! Frequency estimate: n = length of the sequence, k_i = number of occurrences of digit i ⇒ the probability of the next digit being i is k_i/n. Asymptotically k_i/n → 1/10, which seems to be true (see the sketch below).

!  But we have the strong feeling that (i.e. with high probability) the next digit will be 5, because the previous digits were the expansion of π.

!  Conclusion: We prefer the answer 5, since we see more structure in the sequence than just random digits.
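A short sketch of the frequency estimate k_i/n on the digit string shown on the slide; the structural prediction of 5 is noted in a comment.

```python
# Frequency estimate on the given prefix of pi's digits: k_i = count of
# digit i, n = length; the naive estimate of the next digit being i is k_i/n.
from collections import Counter

digits = "14159265358979323846264338327950288419716939937"
n = len(digits)
counts = Counter(digits)
for i in "0123456789":
    print(f"digit {i}: k_i/n = {counts[i]}/{n} = {counts[i]/n:.3f}")
# The structural explanation (these are digits of pi) predicts the next
# digit is 5, regardless of these frequencies.
```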
Sequence 2

!  Example 3: Number Sequences. Sequence x1, x2, x3, x4, x5, ... = 1, 2, 3, 4, ?, ...

!  x5 = 5, since xi = i for i = 1..4.

!  x5 = 29, since xi = i^4 − 10i^3 + 35i^2 − 49i + 24. Conclusion: We prefer 5, since the linear relation involves fewer arbitrary parameters than the 4th-order polynomial (see the quick check below).
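A quick check that both candidate rules reproduce the observed terms 1, 2, 3, 4 and then disagree at i = 5.

```python
# Both models fit the observed terms 1, 2, 3, 4 exactly, but the linear
# rule predicts x_5 = 5 while the 4th-order polynomial predicts x_5 = 29.
def linear(i):
    return i

def quartic(i):
    return i**4 - 10 * i**3 + 35 * i**2 - 49 * i + 24

for i in range(1, 6):
    print(i, linear(i), quartic(i))   # identical for i = 1..4, then 5 vs 29
```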
Sequence: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, ?

!  61, since this is the next prime.

!  60, since this is the order of the next simple group.

!  Conclusion: We prefer the answer 61, since primes are a more familiar concept than simple groups. (See the On-Line Encyclopedia of Integer Sequences.)
Occam?

!  Occam's Razor to the Rescue

!  Is there a unique principle which allows us to formally arrive at a prediction which coincides (always?) with our intuitive guess, or, even better, which is (in some sense) most likely the best or correct answer?

!  Yes! Occam's razor: use the simplest explanation consistent with past data (and use it for prediction). It works, for the examples presented and for many more. Actually, Occam's razor can serve as a foundation of machine learning in general, and is even a fundamental principle (or maybe even the mere definition) of science.

!  Problem: it is not a formal/mathematical objective principle. What is simple for one may be complicated for another.
Blue Emeralds?

!  Grue Emerald Paradox

!  Hypothesis 1: All emeralds are green.

!  Hypothesis 2: All emeralds found until the year 2010 are green; thereafter all emeralds are blue.

!  Which hypothesis is more plausible? H1! Justification?

!  Occam's razor: take the simplest hypothesis consistent with the data. This is the most important principle in machine learning and science.
Views on probabilities

!  Uncertainty and Probability

!  The aim of probability theory is to describe uncertainty. Sources/interpretations of uncertainty:

!  Frequentist: probabilities are relative frequencies. (e.g. the relative frequency of tossing heads)

!  Objectivist: probabilities are real aspects of the world. (e.g. the probability that some atom decays in the next hour)

!  Subjectivist: probabilities describe an agent's degree of belief. (e.g. it is (im)plausible that extraterrestrials exist)
What we need

!  Kolmogorov complexity

!  Universal Distribution

!  Inductive Learning
Principle of Indifference (Epicurus)

!  Keep all hypotheses that are consistent with the facts
Occam’s Razor

!  Among all hypotheses consistent with the facts, choose the simplest

!  Newton’s rule #1 for doing natural philosophy

!  “We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances”
Question

!  What does “simplest” mean?

!  How to define simplicity?

!  Can a thing be simple under one definition and not under another?
Bayes’ Rule

!  P(H|D) = P(D|H) * P(H) / P(D)

!  P(H) is often considered as the initial degree of belief in H

!  In essence, Bayes’ rule is a mapping from the prior probability P(H) to the posterior probability P(H|D), determined by D
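A minimal numerical illustration of the rule; the prior and likelihood values are invented.

```python
# Bayes' rule: P(H|D) = P(D|H) * P(H) / P(D), with
# P(D) = P(D|H) P(H) + P(D|not H) P(not H). All numbers are invented.
p_h = 0.01             # prior degree of belief in H
p_d_given_h = 0.9      # likelihood of the data if H is true
p_d_given_not_h = 0.1  # likelihood of the data if H is false

p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
p_h_given_d = p_d_given_h * p_h / p_d
print(f"posterior P(H|D) = {p_h_given_d:.4f}")   # ~0.083
```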
How to get P(H)

!  By the law of large numbers, we can get P(H|D) if we use many examples

!  But we want to extract as much information as possible from only a limited number of data points

!  P(H) may be unknown, uncomputable, or may not even exist

!  Can we find a single probability distribution to use as the prior distribution in every case, with approximately the same result as if we had used the real distribution?
Hume on Induction

!  Induction is impossible, because we can only reach conclusions by using known data and methods.

!  So the conclusion is logically already contained in the starting configuration.
Only one algorithm?
Solomonoff’s Theory of Induction

!  Maintain all hypotheses consistent with the data

!  Incorporate Occam’s Razor: assign the simplest hypotheses the highest probability

!  Combine them using Bayes’ rule
Kolmogorov Complexity

!  k(s) is the length of the shortest program which, on no input, prints out s

!  k(s) <= |s| (up to a constant)

!  For every n there is a string s of length n with k(s) >= n

!  k(s) is objective (programming-language independent) by the Invariance Theorem
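k(s) itself is uncomputable, but a general-purpose compressor gives a computable upper bound on one particular description length of s, which is often used as a rough stand-in. The sketch below assumes zlib as that compressor and is only indicative, not k(s) itself.

```python
# K(s) is uncomputable; the length of a zlib-compressed encoding is a
# computable *upper bound* on one description length of s (up to the
# constant overhead of the decompressor), not K(s) itself.
import os
import zlib

def compressed_bits(s: bytes) -> int:
    return 8 * len(zlib.compress(s, 9))

structured = b"ab" * 500          # highly regular string
random_ish = os.urandom(1000)     # incompressible with high probability

print("structured:", compressed_bits(structured), "bits for", 8 * len(structured), "raw bits")
print("random-ish:", compressed_bits(random_ish), "bits for", 8 * len(random_ish), "raw bits")
```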
Universal Distribution

!  P(s) = 2^(-k(s))

!  We use k(s) to describe the complexity of an object. By Occam’s Razor, the simplest objects should have the highest probability.
Problem: Σ P(s) > 1

!  For every n, there exists an n-bit string s with k(s) ≈ log n, so P(s) = 2^(-log n) = 1/n

!  1/2 + 1/3 + 1/4 + … > 1 (the harmonic series diverges)
Levin’s improvement

!  Use prefix-free programs

!  A set of programs, no one of which is a prefix of any other

!  Kraft’s inequality: let l_1, l_2, … be a sequence of natural numbers. There is a prefix code with this sequence as the lengths of its binary code words iff Σ_n 2^(-l_n) <= 1 (see the check below)
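A tiny check of Kraft's inequality for an example prefix-free code; the code words are chosen for illustration.

```python
# Kraft's inequality: for a prefix-free binary code with word lengths
# l_1, l_2, ..., the sum of 2^(-l_n) is at most 1.
codewords = ["0", "10", "110", "111"]        # example prefix-free code

def is_prefix_free(words):
    return not any(a != b and b.startswith(a) for a in words for b in words)

kraft_sum = sum(2.0 ** -len(w) for w in codewords)
print("prefix-free:", is_prefix_free(codewords))   # True
print("sum 2^(-l_n) =", kraft_sum)                 # 0.5 + 0.25 + 0.125 + 0.125 = 1.0
```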
Multiplicative domination

!  Levin proved that there exists a constant c such that c * p(s) >= p′(s) for every computable distribution p′, where c depends on p′ but not on s

!  If the true prior distribution is computable, then using the single fixed universal distribution p is almost as good as using the actually true distribution itself
!  Turing’s thesis: the universal Turing machine can compute all intuitively computable functions

!  Kolmogorov’s thesis: the Kolmogorov complexity gives the shortest description length among all description lengths that can be effectively approximated, according to intuition

!  Levin’s thesis: the universal distribution gives the largest probability (up to a constant) among all distributions that can be effectively approximated, according to intuition
Universal Bet

!  Street gambler Bob tosses a coin and offers:

!  If the next toss is heads (“1”), Bob gives Alice $2
!  If the next toss is tails (“0”), Alice pays Bob $1

!  Is Bob honest?

!  Side bet: flip the coin 1000 times and record the result as a string s
!  Alice pays $1, Bob pays Alice $2^(1000 - k(s)) (see the rough simulation below)

!  Good offer: Σ_{|s|=1000} 2^(-1000) * 2^(1000 - k(s)) = Σ_{|s|=1000} 2^(-k(s)) <= 1

!  If Bob is honest, Alice increases her money polynomially

!  If Bob cheats, Alice increases her money exponentially
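A rough simulation of the side bet, using a compressor as a computable stand-in for k(s); since the compressor only upper-bounds a description length and adds overhead, the payoffs are indicative rather than exact.

```python
# Side-bet sketch: Alice pays $1 and receives 2^(1000 - k(s)) for the
# 1000-flip record s. Here zlib's compressed length stands in for k(s).
import os
import zlib

def proxy_k_bits(bits: str) -> int:
    return 8 * len(zlib.compress(bits.encode()))

fair = "".join("01"[b & 1] for b in os.urandom(1000))   # honest-looking flips
cheat = "01" * 500                                      # suspiciously regular flips

for name, s in [("fair coin", fair), ("patterned coin", cheat)]:
    k = proxy_k_bits(s)
    print(f"{name}: proxy k(s) = {k} bits, payoff = 2^{1000 - k} dollars")
```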
Notice

!  The complexity of a string is non-computable
Conclusion

!  Kolmogorov complexity – optimal effective descriptions of objects

!  Universal Distribution – optimal effective probability of objects

!  Both are objective and absolute

The most neutral possible prior…

!  Suppose we want a prior so neutral that it never rules out a model

!  Possible, if we limit ourselves to computable models

!  Mixture of all (computable) priors, with weights that decline fairly fast

!  Then this mixture multiplicatively dominates all priors; such m(x) are “universal” priors

!  Though neutral priors will mean slow learning
The most neutral possible coding language

!  Universal programming languages (Java, Matlab, UTMs, etc.)

!  K(x) = length of the shortest program in Java, Matlab, a UTM, … that generates x (K is uncomputable)

!  Invariance theorem: for any languages L1, L2, there exists a constant c such that for all x, |K_L1(x) − K_L2(x)| <= c

!  Mathematically justifies talk of K(x), rather than K_Java(x), K_Matlab(x), …
So does this mean that choice of language doesn’t matter?

!  Not quite!

!  c can be large

!  And, for any L1 and any c0, there exist L2 and x such that |K_L1(x) − K_L2(x)| >= c0

!  The problem of the one-instruction code for the entire data set…

!  But Kolmogorov complexity can be made concrete…
Compact Universal Turing machines

!  210 bits, λ-calculus

!  272 bits, combinators

Not much room to hide, here!
Neutral priors and Kolmogorov complexity

!  A key result: K(x) = −log2 m(x) ± O(1), where m is a universal prior

!  And for any computable q: K(x) <= −log2 q(x) ± O(1) for typical x drawn from q(x)

!  So any data x that is likely under any sensible probability distribution has low K(x)

!  Analogous to Shannon’s source coding theorem
Prediction by simplicity

!  Find the shortest ‘program/explanation’ for the current ‘corpus’ (binary string)

!  Predict using that program

!  Strictly, use a ‘weighted sum’ of explanations, weighted by brevity
Prediction is possible (Solomonoff, 1978)

Summed error has a finite bound

!  s_j is the summed squared error between the prediction and the true probability on item j

!  So prediction converges (faster than 1/(n log n)) for corpus size n

!  Computability assumptions only (no stationarity needed)
Summary so far…

!  Simplicity/Occam: close and deep connections with Bayes

!  Defines a universal prior (i.e., based on simplicity)

!  Can be made “concrete”

!  General prediction results

!  A convenient “dual” framework to Bayes, when codes are easier than probabilities
Methods…
Infrastructure
