
Sequential Models

DL4Sci School, Berkeley Lab

Luke de Oliveira

Twilio AI & Berkeley Lab

@lukede0
@lukedeo
lukedeo@ldo.io
https://ldo.io
Welcome!
Schedule

• Setting up sequence learning

• A taxonomy of sequential models

• Deep learning building blocks

• RNNs

• CNNs

• Transformers (high level)


Basics of Sequence Learning
Typical (supervised) ML Setup

• You have a fixed set of observables (sensor readouts, pixels in images, etc.) and you want to predict a fixed set of outputs (signal/background, energy level, time-to-event, etc.)

• Formally, X is the input domain and Y is the output codomain

• We want to learn a model h : X → Y from a hypothesis space H such that h approximates some true map f : X → Y

• We learn this map through data!

Typical (supervised) ML Setup - shortcomings

• When does a fixed-dimensional vector fail? Variadic size

• Collections, ordered sequences, temporal components, unbounded spatial components

• Building models with “summary features”

What are sequences?

• What is sequence data?

• A problem domain where at least one of the input or output spaces is naturally represented as an ordered sequence

• What is an order?

• Intrinsic (strong) - purely sequential data (time series, genomic sequences, text)

• Extrinsic (weak) - an imposed, ad hoc ordering (by age, ordering particles by energy level, etc.)
Representing sequences?

• A sequence of vectors x_i in {x_i}, i = 1…n

• Can index on time, order, or any other extrinsic ordering

• Without new models, how can we use the “typical ML setup” to deal with sequences?

• Answer: reductions!
Reductions (or, “bags”)

• Assume I have an input sequence {x_i}, i = 1…K, with individual features x_i^j (an individual sensor readout, or a one-hot encoding of a specific word, for example)

• Q: How can I summarize {x_i^j}, i = 1…K, with one number for any K?

• Mean, median, max, sum, product, etc.

• These lose ordering, which may or may not be OK, and each of them imposes an inductive bias!
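
A minimal sketch of such reductions, assuming NumPy and made-up shapes: every sequence, whatever its length K, collapses to the same fixed-size vector.

```python
import numpy as np

# Two hypothetical sequences of different lengths K, each element
# a 3-dimensional feature vector x_i.
seq_a = np.random.randn(5, 3)    # K = 5
seq_b = np.random.randn(12, 3)   # K = 12

def bag_features(seq):
    """Order-invariant reductions over the sequence axis: a (K, d)
    sequence becomes a fixed-size vector regardless of K."""
    return np.concatenate([seq.mean(axis=0),
                           seq.max(axis=0),
                           seq.sum(axis=0)])

print(bag_features(seq_a).shape)  # (9,)
print(bag_features(seq_b).shape)  # (9,) -- same size, ordering lost
```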
Reductions

• Pros:

• Interpretable (sometimes…)

• …?

• Cons:

• Brittle! How do I know my inductive bias is correct?

• Inherently limits performance! Information is thrown away

• Has a very hard time actually representing the sequential nature!

• Doesn’t help us when we want to predict a sequence!

Desiderata

• Consider the whole sequence as a sequence, don’t summarize it!

• We need to be able to predict sequences (more later on what this means)

• Be able to handle truly variable-length sequences (no fixed windows, etc.)
Constructing Sequence Inputs

• For a non-temporal sequence, this is trivial (word order in a sentence, gene sequencing, etc.)

• For time series, it is not trivial!

• If we’ve got constant sampling (audio, etc.), there is no need to worry

• Models (as we’ll see) don’t know about time; they process sequence elements one by one
Timeseries

• Consider this sequence:

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:45 T = 2:10 T = 2:15 T = 3:00 T = 3:25

• How can we handle the time component? Three options:

• Ignore, re-sample, or use the time-delta as a feature


Ignore time

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:45 T = 2:10 T = 2:15 T = 3:00 T = 3:25

X1 X2 X3 X4 X5 X6
Ignore time

• Avoids complex processing and alignment of sequences; can directly feed into most deep learning-based sequence models

• But you lose a critical feature of your data!

• Example: when predicting the time to failure of a sensor, we may want to know the frequency with which the sensor performs readouts

• Generally not a bad idea to try as a baseline!

Resample

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:45 T = 2:10 T = 2:15 T = 3:00 T = 3:25

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:54 T = 2:18 T = 2:42 T = 3:06 T = 3:30


Resample

• Lets the model ingest equally spaced points in a time series (what deep models are designed for!)

• Trades model complexity for data pipeline & engineering complexity!

• How do I aggregate multiple samples within a window?

• How do I choose the sampling frequency?

• How do I fill in resampled ranges with no real samples?
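
A sketch of those choices with pandas (the timestamps and values are made up to match the toy example above): an aggregation per window and a fill rule for empty windows.

```python
import pandas as pd

# Hypothetical sensor readings at irregular timestamps.
ts = pd.Series(
    [0.1, 0.4, 0.3, 0.9, 0.7, 0.2],
    index=pd.to_datetime([
        "2024-01-01 01:30", "2024-01-01 01:45", "2024-01-01 02:10",
        "2024-01-01 02:15", "2024-01-01 03:00", "2024-01-01 03:25",
    ]),
)

# Uniform 15-minute grid: average samples that share a window,
# then forward-fill windows that contained no real samples.
uniform = ts.resample("15min").mean().ffill()
print(uniform)
```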


Time-delta as feature

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:45 T = 2:10 T = 2:15 T = 3:00 T = 3:25

0 Δ1 Δ2 Δ3 Δ4 Δ5

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:45 T = 2:10 T = 2:15 T = 3:00 T = 3:25


Time-delta as feature

• As with resampling, lets the model treat the series like equally spaced points (what deep models are designed for!) while retaining timing information

• Over-indexes on the time difference as a useful feature

• Additional data pipeline & engineering complexity

• How do I do this time-delta calculation when data is big?

• Should I do a single delta, or should I also compute time deltas k samples back?
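
A sketch of the time-delta feature with pandas (made-up values; only a single lag-1 delta is shown):

```python
import pandas as pd

# Hypothetical irregular timestamps paired with a scalar feature.
df = pd.DataFrame({
    "t": pd.to_datetime([
        "2024-01-01 01:30", "2024-01-01 01:45", "2024-01-01 02:10",
        "2024-01-01 02:15", "2024-01-01 03:00", "2024-01-01 03:25",
    ]),
    "x": [0.1, 0.4, 0.3, 0.9, 0.7, 0.2],
})

# Minutes since the previous sample; 0 for the first element.
df["dt"] = df["t"].diff().dt.total_seconds().fillna(0.0) / 60.0

# The model now sees [x, dt] at every timestep instead of x alone.
features = df[["x", "dt"]].to_numpy()   # shape (6, 2)
```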
Representing Features in Sequence

• Variable understanding & normalization are even more important in sequence models due to compounding effects

• Normalize features to reasonably comparable ranges

• [0, 1], [-1, 1], and [0, 3]? Fine! Then adding [-99, 120]? Not ideal…

• Relevant for non-sequence models as well: use embeddings for discrete variables!
Representing Discrete Variables

• Use lookup tables (embeddings) to learn a vector for each class! Similar to Word2Vec / word embeddings

• The vector space will hold more semantic meaning than one-hot-encoded representations.
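
A minimal sketch of such a trainable lookup table, here with tf.keras (the vocabulary size and embedding dimension are arbitrary choices):

```python
import tensorflow as tf

# Map a discrete variable with 10 categories to a learned
# 4-dimensional vector per category.
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=4)

# A batch of 2 sequences, each 6 timesteps of integer category ids.
ids = tf.constant([[1, 3, 3, 0, 7, 2],
                   [4, 4, 9, 1, 0, 5]])
vectors = embedding(ids)   # shape (2, 6, 4), trained with the rest of the model
```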
Archetypes of Sequence Models
Archetypes of Sequence Models

“Predictive” “Abstractive” “Labeling” “Captioning”




Many-to-one

• Key idea: given an (ordered) sequence of vectors representing an example or datapoint, predict a fixed vector.

• Agnostic to the task: classification / regression / survival / etc.

• Flexible! Inputs can be images (video understanding), vectors, or anything of a fixed size
Many-to-many (sequence-to-sequence)

• Key idea: can we perform sequence transduction, i.e., read in a sequence and then output another sequence, potentially of a different length?

• Most examples come from NLP (translation, summarization)

• Can predict the time evolution of a sequence into the future (forecasting style)

• Key question: how do we predict a variable-length entity?

(Diagram: an encoder reads the input sequence and a decoder produces the output sequence.)
Many-to-many (sequence-to-sequence + decoding)

• How can we predict a variable-length sequence?

• Answer: we can learn when to stop decoding!

• For each timestep in our output, predict whether or not it is the last timestep

• (In NLP, this is called a “stop” token)

Many-to-many (sequence labeling)

• Key idea: for every “time-step” in the input, we predict something! This could be labeling the current time-step, predicting the next timestep, or both.
One-to-many

• Key idea: given a fixed representation, produce a sequential output

• Example: given a video frame, predict the next 24 frames

• Canonical problem in ML: image captioning
How can we build these?
Deep Learning Building Blocks
Building blocks for building sequence models

• Very active area of research!

• We will cover the three “main” ones:

• Recurrent neural networks

• Convolutional neural networks

• Transformers

• Many other fancy variants build on these:

• QRNNs, dynamic transformers, etc., etc.


Recurrent Neural Networks
Recurrent Neural Networks

• Neural network unit to learn sequence-based dependencies over arbitrary lengths!

• Key insight: can we have an internal state that gets modified as we read a sequence?

• The same transformation is applied to every element of the sequence

• A recurrence relation feeds the output back into the input. Issues?

Recurrent Neural Networks

h_t = f(h_{t-1}, x_t)
(next state ← previous state, this timestep’s input)

Sounds great, right?
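
A minimal sketch of that recurrence in plain NumPy (dimensions and initialization are arbitrary choices): the same weights are reused at every timestep, so any sequence length works.

```python
import numpy as np

d_in, d_hidden = 3, 8
W = 0.1 * np.random.randn(d_hidden, d_hidden)   # state -> state
U = 0.1 * np.random.randn(d_hidden, d_in)       # input -> state
b = np.zeros(d_hidden)

def rnn_forward(xs):
    """xs: (T, d_in) sequence; returns all hidden states, shape (T, d_hidden)."""
    h = np.zeros(d_hidden)
    states = []
    for x_t in xs:                          # one element at a time
        h = np.tanh(W @ h + U @ x_t + b)    # h_t = f(h_{t-1}, x_t)
        states.append(h)
    return np.stack(states)

hs = rnn_forward(np.random.randn(6, d_in))  # works for any length T
```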
Recurrent Neural Networks, issues

• Vanilla RNN formulation

• What happens when we take a gradient?


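A sketch of the gradient through the recurrence, assuming the standard vanilla parameterization h_t = tanh(W h_{t-1} + U x_t + b):

```latex
\frac{\partial h_T}{\partial h_1}
  \;=\; \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}
  \;=\; \prod_{t=2}^{T} \operatorname{diag}\!\big(1 - \tanh^2(a_t)\big)\, W,
\qquad a_t = W h_{t-1} + U x_t + b .
```

Backpropagating through T timesteps multiplies T - 1 of these Jacobians together: if the spectrum of W sits mostly below 1 the product shrinks toward zero (vanishing gradients), and if it sits above 1 the product blows up (exploding gradients).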
How do we fix this?

• This is referred to as the vanishing/exploding gradient problem

• Intuitively: the model will either forget too frequently or have vivid yet incorrect memories

• We need to control how updates affect the Jacobian product that the chain rule gives us for the state update
Hochreiter & Schmidhuber: the LSTM
The Long Short-Term Memory Unit

• Allows for a trainable decision function for retaining knowledge

• Trainable ability to forget

• Internally, can read, write, and reset state!
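In practice one rarely writes the gates by hand; a minimal sketch with tf.keras (layer sizes are arbitrary) for a many-to-one setup:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 16)),   # None = any sequence length
    tf.keras.layers.LSTM(32),                  # returns only the last hidden state
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```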
Going Backwards in Time

• As we have defined things, we only process state going forward

• Can we have a state for the backwards direction?

• Yes! Simply have another RNN and reverse the input sequence. For the outputs, we can concatenate, add, or multiply the two directions
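
A minimal sketch with tf.keras (shapes are made up): the wrapper runs one LSTM forward and one over the reversed sequence, then merges the two per-timestep outputs.

```python
import tensorflow as tf

bidir = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, return_sequences=True),
    merge_mode="concat",   # also "sum", "mul", or "ave"
)

x = tf.random.normal((4, 10, 16))   # (batch, timesteps, features)
y = bidir(x)                        # (4, 10, 64): forward and backward states
```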
How does this fit with our archetypes?

• Many-to-one: take the last hidden state

• Sequence-to-sequence: take the last hidden state of the encoder and feed it as the initial state to the decoder

• One-to-many: use the fixed input as the initial state of the RNN

• Sequence labeling: output every hidden state, then use them to predict the targets
RNNs (summary)

• Process timesteps one at a time, modifying an internal state to keep track of what is useful for the task at hand

• Time-invariant transformation per timestep (recurrence)

• Can traverse sequences backwards as well!

Convolutional Neural Networks (CNNs)

• Usually thought about in terms of computer vision

• Learn weights that are applied across the entire signal (2D patches for images)
CNNs for Sequence Learning

• We aren’t dealing with 2D data here, so 1-dimensional convolutions will do

• Stack the vectors of a sequence into an “image” so the convolution operator can broadcast across the time dimension: a sliding window

• Parameterized by filter size and stride

X1 X2 X3 X4 X5 X6          X1 X2 X3 X4 X5 X6
Stride = 1, filter size = 3          Stride = 3, filter size = 4

CNN Filters (simplified math)

A filter of width 3 is parameterized by vectors (w, u, v) and applied to (X1, X2, X3):

O1 = w^T X1 + u^T X2 + v^T X3
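
A sketch of that single filter in NumPy (the feature dimension is arbitrary), sliding over the sequence with stride 1:

```python
import numpy as np

d = 4                                   # feature dimension per timestep
X = np.random.randn(6, d)               # the sequence x_1 .. x_6
w, u, v = (np.random.randn(d) for _ in range(3))

# o_t = w^T x_t + u^T x_{t+1} + v^T x_{t+2} for every window of 3 timesteps
O = np.array([w @ X[t] + u @ X[t + 1] + v @ X[t + 2]
              for t in range(len(X) - 2)])
print(O.shape)                          # (4,): one output per window
```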


CNNs for Sequence Learning - Trends

• Recent trend: move from RNNs to CNNs

• Machine translation, summarization, …

• Returning to signal-processing roots!

• Big question: do we need to retain state?

• Convolutions are translation invariant - how can we encode ordering?

• Positional encoding
Positional Encoding

• Use an embedding as part of the sequence to indicate the position

• Each integer 1…N has a trainable embedding concatenated to the features, which allows the model to learn sequential dependence

P1 P2 P3 P4 P5 P6
X1 X2 X3 X4 X5 X6
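
A minimal sketch in tf.keras (dimensions are arbitrary): each position index gets a trainable vector P_t that is concatenated onto the features X_t.

```python
import tensorflow as tf

max_len, d_feat, d_pos = 6, 16, 8
pos_embed = tf.keras.layers.Embedding(max_len, d_pos)

x = tf.random.normal((2, max_len, d_feat))       # (batch, T, d_feat)
p = pos_embed(tf.range(max_len))                 # (T, d_pos), trainable
p = tf.broadcast_to(p, (2, max_len, d_pos))      # repeat across the batch
x_with_pos = tf.concat([x, p], axis=-1)          # (2, T, d_feat + d_pos)
```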
How does this fit with our archetypes?

• Use a reduction to get a fixed-length vector:

X1 X2 X3 X4 X5 X6
{max, min, mean, sum}

• Use dilated convolutions to obtain sequential dependence in the decoder

• CNNs are a natural fit for this flavor!

• Use dilated convolutions and positional encoding
CNNs (summary)

• Use filters without a notion of state to process windowed groupings of sequence elements

• Time-invariant transformation per timestep (convolution)

• Use positional encodings to allow learning of position-aware relationships

• Need a reduction mechanism for the many-to-one use case
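
Putting the pieces together, a sketch of a many-to-one 1-D CNN in tf.keras (layer widths, dilation, and causal padding are arbitrary choices): convolutions over time, then a global pooling reduction.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 16)),          # any sequence length
    tf.keras.layers.Conv1D(32, kernel_size=3, padding="causal",
                           activation="relu"),
    tf.keras.layers.Conv1D(32, kernel_size=3, padding="causal",
                           dilation_rate=2, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),              # the reduction step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```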


Paying Attention in Sequence Models
Paying Attention in Sequence Models

• Attention has gotten lots of hype - intuitively, it gives a deep learning model an explicit mechanism to look through its history

• Can be used with RNNs, CNNs, etc.

• Gained popularity in neural machine translation (NMT); used in most areas now too

• Self-attention: use a representation to attend to itself

• Sentence example: “The dog had a treat”

• In self-attention, the word ‘treat’ can pay attention to the word ‘dog’ and report that out to another layer
Transformers

• Transformers are (now) the classical fully attentive model (no recurrence)

• Key idea: let sequences look at themselves (every timestep computes a “similarity” with all other timesteps)

• Differentiable key-value lookup with vectors

• The model has “read heads”, allowing a single model to focus on different things
Transformers

(Diagram: the Transformer’s encoder and decoder stacks.)
Transformers (overly simplified encoder)

Every X looks at every other X and computes a similarity; the similarity-weighted values are summed up to become H, a context-aware representation at each timestep.

H1 H2 H3 H4 H5 H6
X1 X2 X3 X4 X5 X6

This is just one “head”
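
A minimal sketch of one such head in NumPy (dimensions are arbitrary; no masking, scaling follows the usual 1/sqrt(d_k) convention):

```python
import numpy as np

T, d_model, d_k = 6, 16, 8
X = np.random.randn(T, d_model)                    # X_1 .. X_6
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)                    # (T, T) pairwise similarities
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
H = weights @ V                                    # context-aware H_1 .. H_6
print(H.shape)                                     # (6, 8)
```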


Where are Transformers useful?

• Long-term dependencies

• Explicit attention helps! No need to store this in a state.

• Is, or is near, the state of the art on most sequence-to-sequence modeling tasks

• Probably not very useful for many-to-one unless you have lots of extra compute on your hands :)
Picking building blocks

• The choice of RNN vs CNN vs Transformer is fairly company- / research-group-driven

• Everyone claims that “you only need X”, where X is in {RNN, CNN, Transformer}

• No free lunch! Try, test, and evaluate for specific problems.

• Certain problem characteristics lend themselves better to a certain type of architecture - better to test than assume
Concluding Remarks
In conclusion

• Sequence modeling is a flexible tool!

• Four main flavors: many-to-one, many-to-many (seq2seq & sequence labeling), and one-to-many

• Many building blocks to choose from: the main ones are RNNs, CNNs, and Transformers

• No best model! Evaluate whether any individual model is relevant for your domain
Thanks!

@lukede0
@lukedeo
lukedeo@ldo.io
https://ldo.io
