
Predictive Science

a Tautology

Peter Nordin
The Answer is:
The asymmetry of similarity

!  What thing is this like?


!  And what is this like?
A heuristic measure of amount of
information: Shannon’s guessing
game…

1. Pony?
2. Cow?
3. Dog?
…
345. Pegasus!
Science is Prediction

!  When does the next solar eclipse in Europe occur?

!  The next solar eclipse in Europe will happen on August 12, 2026.
Science is Compression
The “Model”: Science and Prediction

The Model
The Turkey and the issue with inductive predictions (1)
The Turkey and the issue with inductive predictions (2)
Mandatory Reading
All Real Science is
Predictive Science

!  Predict when the sun will set tomorrow

!  Predict if you will be sick or well by taking this medicine

!  Predict what will happen in this project if this methodology is used
How to predict
anything:

1.  Collect facts

2.  Find a short model fitting all the facts

3.  Extrapolate that model into the future; a model's probability is given by its length, so shorter models are more probable (see the sketch below)

4.  Meta loop: collect and include facts about your model-finding adventures, go to step 2, and use the result for planning
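Below is a minimal, illustrative Python sketch of steps 1-3, assuming a toy data set and a crude two-part description-length score (parameter bits plus residual bits); the data and scoring constants are invented for illustration, not taken from the slides.

```python
# Illustrative sketch of steps 1-3: fit candidate models of increasing
# complexity, score each by a crude two-part "description length"
# (bits for the parameters + bits for the residual errors), and
# extrapolate with the shortest-description model.
import numpy as np

facts_x = np.arange(10.0)                                   # step 1: collected facts
facts_y = 3.0 * facts_x + 1.0 + np.random.normal(0, 0.1, size=10)

def description_length(degree, residuals, bits_per_param=32):
    # model cost: parameters at fixed precision; data cost: Gaussian code length
    model_bits = (degree + 1) * bits_per_param
    data_bits = 0.5 * len(residuals) * np.log2(np.maximum(np.var(residuals), 1e-12))
    return model_bits + data_bits

best = None
for degree in range(0, 6):                                  # step 2: search over short models
    coeffs = np.polyfit(facts_x, facts_y, degree)
    residuals = facts_y - np.polyval(coeffs, facts_x)
    dl = description_length(degree, residuals)
    if best is None or dl < best[0]:
        best = (dl, degree, coeffs)

dl, degree, coeffs = best
print(f"shortest model: degree {degree}, description length {dl:.1f} bits")
print("prediction for x=10:", np.polyval(coeffs, 10.0))     # step 3: extrapolate
```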
Companies and
Prediction

!  A company is a collection of people predicting risk from actions

!  No risk - no gain
Recent progress
Recent advances:

Universal Learning Algorithms. There is a theoretically optimal way of predicting the future, given the past. It can be used to define an optimal (though noncomputable) rational agent that maximizes its expected reward in almost arbitrary environments sampled from computable probability distributions.
Recent advances:

All scientists: Physicists, economists, and other scientists make predictions based on observations. So does everybody in daily life. Did you know that there is a theoretically optimal way of predicting? Every scientist should know about it.

Normally we do not know the true conditional probability distribution p(next event | past). But assume we do know that p is in some set P of distributions. Choose a fixed weight w_q for each q in P such that the w_q add up to 1 (for simplicity, let P be countable). Then construct the Bayesmix M(x) = Sum_q w_q q(x), and predict using M instead of the optimal but unknown p.
How wrong is it to do that? The recent exciting work of Marcus Hutter
(funded through Juergen Schmidhuber's SNF research grant
"Unification of Universal Induction and Sequential Decision Theory")
provides general and sharp loss bounds:
Let L_M(n) and L_p(n) be the total expected losses of the M-predictor and the p-predictor, respectively, for the first n events. Then L_M(n) − L_p(n) is at most of the order of sqrt(L_p(n)). That is, M is not much worse than p.
And in general, no other predictor can do better than that!
In particular, if p is deterministic, then the M-predictor soon won't make
any errors any more!
If P contains ALL computable distributions, then M becomes the
celebrated enumerable universal prior. That is, after decades of
somewhat stagnating research we now have sharp loss bounds for Ray
Solomonoff's universal (but incomputable) induction scheme (1964,
1978).
Alternatively, reduce M to what you get if you just add up weighted
estimated future finance data probabilities generated by 1000
commercial stock-market prediction software packages. If only one of
them happens to work fine (but you do not know which) you still should
get rich.
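As a rough illustration of the Bayesmix construction described above, here is a small Python sketch assuming a finite class P of Bernoulli models with uniform weights w_q; the hypothesis class and the data are invented stand-ins for the countable class in the text.

```python
# Sketch of the Bayesmix M(x) = sum_q w_q q(x) for a finite class P of
# Bernoulli models. Predicting with M means averaging the models'
# next-symbol probabilities, weighted by prior weight times how well
# each model has explained the past (its likelihood so far).
thetas = [i / 10 for i in range(1, 10)]      # hypothesis class P (assumed)
weights = [1.0 / len(thetas)] * len(thetas)  # w_q, summing to 1
likelihoods = [1.0] * len(thetas)            # q(x_1..x_t) for each q in P

def predict_next_one(ws, ls, ths):
    # M(next=1 | past) = sum_q w_q q(past) q(1|past) / sum_q w_q q(past)
    num = sum(w * l * th for w, l, th in zip(ws, ls, ths))
    den = sum(w * l for w, l in zip(ws, ls))
    return num / den

def observe(bit, ls, ths):
    # update each model's likelihood of the sequence seen so far
    for i, th in enumerate(ths):
        ls[i] *= th if bit == 1 else (1.0 - th)

sequence = [1, 1, 0, 1, 1, 1, 0, 1]           # invented example data
for bit in sequence:
    p1 = predict_next_one(weights, likelihoods, thetas)
    print(f"M(next=1 | past) = {p1:.3f}, observed {bit}")
    observe(bit, likelihoods, thetas)
```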
Intelligence…

!  …Is compression

!  If used for prediction


= Art?
Theory Pyramid

Undecidable stuff, etc.

Optimal Cognition

Algorithmic Information Theory

Optimal prediction

Experimental planning

Turing-complete representations

Bayes etc.

Multivariate distribution statistics

Single-variable distribution statistics


Agent
Formal Agent Model
Gödel machine
Artificial Intelligence

!  Information-theoretic, statistical, and philosophical foundations of Artificial Intelligence
Universal AI

Universal Artificial Intelligence = Decision Theory + Universal Induction

Decision Theory = Probability + Utility Theory

Universal Induction = Ockham + Bayes + Turing
Pieces of the puzzle

!  Philosophical issues: the common principle behind their solution is Occam's simplicity principle. Based on Occam's and Epicurus' principles, Bayesian probability theory, and Turing's universal machine, Solomonoff developed a formal theory of induction.

!  We focus on the sequential/online setup considered in this presentation and place it into the wider machine-learning context.
What is Intelligence?

!  Informal definition of (artificial) intelligence?

!  Intelligence measures an agent's ability to achieve goals in a wide range of environments.

!  Emergent: features such as the ability to learn and adapt, or to understand, are implicit in the above definition, as these capacities enable an agent to succeed in a wide range of environments.

!  The science of Artificial Intelligence is concerned with the construction of intelligent systems/artifacts/agents and their analysis.
The Hierarchy

!  Induction → Prediction → Decision → Action

!  Having or acquiring or learning or inducing a model of the environment an agent interacts with allows the agent to make predictions and utilize them in its decision process of finding a good next action.

!  Induction infers general models from specific observations/facts/data, usually exhibiting regularities or properties or relations in the latter.

!  Example. Induction: find a model of the world economy.

!  Prediction: use the model for predicting the future stock market.

!  Decision: decide whether to invest assets in stocks or bonds.

!  Action: trading large quantities of stocks influences the market.
Will the Sun Rise Tomorrow

!  Example 1: Probability of Sunrise Tomorrow. What is the probability p(1 | 1^d) that the sun will rise tomorrow? (d = number of past days the sun rose, 1 = sun rises, 0 = sun does not rise)

!  p is undefined, because there has never been an experiment that tested the existence of the sun tomorrow (reference class problem).

!  p = 1, because the sun rose in all past experiments.

!  p = 1 − ϵ, where ϵ is the proportion of stars that explode per day.

!  p = (d+1)/(d+2), which is Laplace's rule, derived from Bayes' rule (see the sketch below).

!  Derive p from the type, age, size and temperature of the sun, even though we never observed another star with those exact properties.

!  Conclusion: We predict that the sun will rise tomorrow with high probability, independent of the justification.
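A tiny check of Laplace's rule p = (d+1)/(d+2) from the list above; the day counts are arbitrary examples.

```python
# Laplace's rule: after the sun has risen on d consecutive past days,
# predict sunrise tomorrow with probability (d + 1) / (d + 2).
def laplace_rule(d):
    return (d + 1) / (d + 2)

for d in (1, 10, 365, 2_000_000):            # arbitrary example day counts
    print(f"d = {d:>9}: p(sunrise tomorrow) = {laplace_rule(d):.8f}")
```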
Sequence

!  Example 2: Digits of a Computable Number

!  Extend 14159265358979323846264338327950288419716939937?

!  Looks random?! Frequency estimate: n = length of the sequence, k_i = number of occurrences of digit i ⇒ the probability of the next digit being i is k_i/n. Asymptotically k_i/n → 1/10, which seems to be true (see the sketch below).

!  But we have the strong feeling that (i.e. with high probability) the next digit will be 5, because the previous digits were the expansion of π.

!  Conclusion: We prefer the answer 5, since we see more structure in the sequence than just random digits.
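A short sketch of the frequency estimate k_i/n on the digit string shown on the slide; the structural prediction of 5 is noted in a comment.

```python
# Frequency estimate on the given prefix of pi's digits: k_i = count of
# digit i, n = length; the naive estimate of the next digit being i is k_i/n.
from collections import Counter

digits = "14159265358979323846264338327950288419716939937"
n = len(digits)
counts = Counter(digits)
for i in "0123456789":
    print(f"digit {i}: k_i/n = {counts[i]}/{n} = {counts[i]/n:.3f}")
# The structural explanation (these are digits of pi) predicts the next
# digit is 5, regardless of these frequencies.
```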
Sequence 2

!  Example 3: Number Sequences. Sequence x1, x2, x3, x4, x5, ... = 1, 2, 3, 4, ?, ...

!  x5 = 5, since xi = i for i = 1..4.

!  x5 = 29, since xi = i^4 − 10i^3 + 35i^2 − 49i + 24. Conclusion: We prefer 5, since the linear relation involves fewer arbitrary parameters than the 4th-order polynomial (see the quick check below).
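A quick check that both candidate rules reproduce the observed terms 1, 2, 3, 4 and then disagree at i = 5.

```python
# Both models fit the observed terms 1, 2, 3, 4 exactly, but the linear
# rule predicts x_5 = 5 while the 4th-order polynomial predicts x_5 = 29.
def linear(i):
    return i

def quartic(i):
    return i**4 - 10 * i**3 + 35 * i**2 - 49 * i + 24

for i in range(1, 6):
    print(i, linear(i), quartic(i))   # identical for i = 1..4, then 5 vs 29
```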
Sequence: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, ?

!  61, since this is the next prime.

!  60, since this is the order of the next simple group.

!  Conclusion: We prefer the answer 61, since primes are a more familiar concept than simple groups. (See the On-Line Encyclopedia of Integer Sequences.)
Occam?

!  Occam's Razor to the Rescue

!  Is there a unique principle which allows us to formally arrive at a prediction which coincides (always?) with our intuitive guess, or, even better, which is (in some sense) most likely the best or correct answer?

!  Yes! Occam's razor: use the simplest explanation consistent with past data (and use it for prediction). It works, for the examples presented and for many more. Actually, Occam's razor can serve as a foundation of machine learning in general, and is even a fundamental principle (or maybe even the mere definition) of science.

!  Problem: it is not a formal/mathematical objective principle. What is simple for one may be complicated for another.
Blue Emeralds?

!  Grue Emerald Paradox

!  Hypothesis 1: All emeralds are green.

!  Hypothesis 2: All emeralds found until the year 2010 are green; thereafter all emeralds are blue.

!  Which hypothesis is more plausible? H1! Justification?

!  Occam's razor: take the simplest hypothesis consistent with the data. This is the most important principle in machine learning and science.
Views on probabilities

!  Uncertainty and Probability

!  The aim of probability theory is to describe uncertainty. Sources/interpretations of uncertainty:

!  Frequentist: probabilities are relative frequencies. (e.g. the relative frequency of tossing heads)

!  Objectivist: probabilities are real aspects of the world. (e.g. the probability that some atom decays in the next hour)

!  Subjectivist: probabilities describe an agent's degree of belief. (e.g. it is (im)plausible that extraterrestrials exist)
What we need

!  Kolmogorov complexity

!  Universal Distribution

!  Inductive Learning
Principle of Indifference (Epicurus)

!  Keep all hypotheses that are consistent with the facts
Occam’s Razor

!  Among all hypotheses consistent with the facts, choose the simplest

!  Newton’s rule #1 for doing natural philosophy

!  “We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances”
Question

!  What does “simplest” mean?

!  How to define simplicity?

!  Can a thing be simple under one definition and not under another?
Bayes’ Rule

!  P(H|D) = P(D|H) * P(H) / P(D)

!  P(H) is often considered as the initial degree of belief in H

!  In essence, Bayes’ rule is a mapping from the prior probability P(H) to the posterior probability P(H|D), determined by D
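A minimal numerical illustration of the rule; the prior and likelihood values are invented.

```python
# Bayes' rule: P(H|D) = P(D|H) * P(H) / P(D), with
# P(D) = P(D|H) P(H) + P(D|not H) P(not H). All numbers are invented.
p_h = 0.01             # prior degree of belief in H
p_d_given_h = 0.9      # likelihood of the data if H is true
p_d_given_not_h = 0.1  # likelihood of the data if H is false

p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
p_h_given_d = p_d_given_h * p_h / p_d
print(f"posterior P(H|D) = {p_h_given_d:.4f}")   # ~0.083
```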
How to get P(H)

!  By the law of large numbers, we can get P(H|D) if we use many examples

!  But we want to extract as much information as possible from only a limited number of data points

!  P(H) may be unknown, uncomputable, or may not even exist

!  Can we find a single probability distribution to use as the prior distribution in every case, with approximately the same result as if we had used the real distribution?
Hume on Induction

!  Induction is impossible, because we can only reach conclusions by using known data and methods.

!  So the conclusion is logically already contained in the starting configuration.
Only one algorithm?
Solomonoff’s Theory of Induction

!  Maintain all hypotheses consistent with the data

!  Incorporate Occam’s Razor: assign the simplest hypotheses the highest probability

!  Combine them using Bayes’ rule
Kolmogorov Complexity

!  k(s) is the length of the shortest program which, on no input, prints out s

!  k(s) <= |s| (up to a constant)

!  For every n there is a string s of length n with k(s) >= n

!  k(s) is objective (programming-language independent) by the Invariance Theorem
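k(s) itself is uncomputable, but a general-purpose compressor gives a computable upper bound on one particular description length of s, which is often used as a rough stand-in. The sketch below assumes zlib as that compressor and is only indicative, not k(s) itself.

```python
# K(s) is uncomputable; the length of a zlib-compressed encoding is a
# computable *upper bound* on one description length of s (up to the
# constant overhead of the decompressor), not K(s) itself.
import os
import zlib

def compressed_bits(s: bytes) -> int:
    return 8 * len(zlib.compress(s, 9))

structured = b"ab" * 500          # highly regular string
random_ish = os.urandom(1000)     # incompressible with high probability

print("structured:", compressed_bits(structured), "bits for", 8 * len(structured), "raw bits")
print("random-ish:", compressed_bits(random_ish), "bits for", 8 * len(random_ish), "raw bits")
```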
Universal Distribution

!  P(s) = 2^(-k(s))

!  We use k(s) to describe the complexity of an object. By Occam’s Razor, the simplest objects should have the highest probability.
Problem: Σ P(s) > 1

!  For every n, there exists an n-bit string s with k(s) ≈ log n, so P(s) = 2^(-log n) = 1/n

!  1/2 + 1/3 + 1/4 + … > 1 (the harmonic series diverges)
Levin’s improvement

!  Use prefix-free programs

!  A set of programs, no one of which is a prefix of any other

!  Kraft’s inequality: let l_1, l_2, … be a sequence of natural numbers. There is a prefix code with this sequence as the lengths of its binary code words iff Σ_n 2^(-l_n) <= 1 (see the check below)
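A tiny check of Kraft's inequality for an example prefix-free code; the code words are chosen for illustration.

```python
# Kraft's inequality: for a prefix-free binary code with word lengths
# l_1, l_2, ..., the sum of 2^(-l_n) is at most 1.
codewords = ["0", "10", "110", "111"]        # example prefix-free code

def is_prefix_free(words):
    return not any(a != b and b.startswith(a) for a in words for b in words)

kraft_sum = sum(2.0 ** -len(w) for w in codewords)
print("prefix-free:", is_prefix_free(codewords))   # True
print("sum 2^(-l_n) =", kraft_sum)                 # 0.5 + 0.25 + 0.125 + 0.125 = 1.0
```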
Multiplicative domination

!  Levin proved that there exists a constant c such that c * p(s) >= p′(s) for every computable distribution p′, where c depends on p′ but not on s

!  If the true prior distribution is computable, then using the single fixed universal distribution p is almost as good as using the actually true distribution itself
!  Turing’s thesis: the universal Turing machine can compute all intuitively computable functions

!  Kolmogorov’s thesis: the Kolmogorov complexity gives the shortest description length among all description lengths that can be effectively approximated, according to intuition

!  Levin’s thesis: the universal distribution gives the largest probability (up to a constant) among all distributions that can be effectively approximated, according to intuition
Universal Bet

!  Street gambler Bob tosses a coin and offers:

!  If the next toss is heads (“1”), Bob gives Alice $2
!  If the next toss is tails (“0”), Alice pays Bob $1

!  Is Bob honest?

!  Side bet: flip the coin 1000 times and record the result as a string s
!  Alice pays $1, Bob pays Alice $2^(1000 - k(s)) (see the rough simulation below)

!  Good offer: Σ_{|s|=1000} 2^(-1000) * 2^(1000 - k(s)) = Σ_{|s|=1000} 2^(-k(s)) <= 1

!  If Bob is honest, Alice increases her money polynomially

!  If Bob cheats, Alice increases her money exponentially
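A rough simulation of the side bet, using a compressor as a computable stand-in for k(s); since the compressor only upper-bounds a description length and adds overhead, the payoffs are indicative rather than exact.

```python
# Side-bet sketch: Alice pays $1 and receives 2^(1000 - k(s)) for the
# 1000-flip record s. Here zlib's compressed length stands in for k(s).
import os
import zlib

def proxy_k_bits(bits: str) -> int:
    return 8 * len(zlib.compress(bits.encode()))

fair = "".join("01"[b & 1] for b in os.urandom(1000))   # honest-looking flips
cheat = "01" * 500                                      # suspiciously regular flips

for name, s in [("fair coin", fair), ("patterned coin", cheat)]:
    k = proxy_k_bits(s)
    print(f"{name}: proxy k(s) = {k} bits, payoff = 2^{1000 - k} dollars")
```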
Notice

!  The complexity of a string is non-computable
Conclusion

!  Kolmogorov complexity – optimal effective descriptions of objects

!  Universal Distribution – optimal effective probability of objects

!  Both are objective and absolute

The most neutral possible prior…

!  Suppose we want a prior so neutral that it never rules out a model

!  Possible, if we limit ourselves to computable models

!  Mixture of all (computable) priors, with weights that decline fairly fast

!  Then this mixture multiplicatively dominates all priors; such m(x) are “universal” priors

!  Though neutral priors will mean slow learning
The most neutral possible coding language

!  Universal programming languages (Java, Matlab, UTMs, etc.)

!  K(x) = length of the shortest program in Java, Matlab, a UTM, … that generates x (K is uncomputable)

!  Invariance theorem: for any languages L1, L2, there exists a constant c such that for all x, |K_L1(x) − K_L2(x)| <= c

!  Mathematically justifies talk of K(x), rather than K_Java(x), K_Matlab(x), …
So does this mean that choice of language doesn’t matter?

!  Not quite!

!  c can be large

!  And, for any L1 and any c0, there exist L2 and x such that |K_L1(x) − K_L2(x)| >= c0

!  The problem of the one-instruction code for the entire data set…

!  But Kolmogorov complexity can be made concrete…
Compact Universal Turing machines

!  210 bits, λ-calculus

!  272 bits, combinators

Not much room to hide, here!
Neutral priors and Kolmogorov complexity

!  A key result: K(x) = −log2 m(x) ± O(1), where m is a universal prior

!  And for any computable q: K(x) <= −log2 q(x) ± O(1) for typical x drawn from q(x)

!  So any data x that is likely under any sensible probability distribution has low K(x)

!  Analogous to Shannon’s source coding theorem
Prediction by simplicity

!  Find the shortest ‘program/explanation’ for the current ‘corpus’ (binary string)

!  Predict using that program

!  Strictly, use a ‘weighted sum’ of explanations, weighted by brevity
Prediction is possible (Solomonoff, 1978)

Summed error has a finite bound

!  s_j is the summed squared error between the prediction and the true probability on item j

!  So prediction converges (faster than 1/(n log n)) for corpus size n

!  Computability assumptions only (no stationarity needed)
Summary so far…

!  Simplicity/Occam: close and deep connections with Bayes

!  Defines a universal prior (i.e., based on simplicity)

!  Can be made “concrete”

!  General prediction results

!  A convenient “dual” framework to Bayes, when codes are easier than probabilities
Methods…
Infrastructure
