
Sequential Models

DL4Sci School, Berkeley Lab

Luke de Oliveira

Twilio AI & Berkeley Lab

@lukede0
@lukedeo
lukedeo@ldo.io
https://ldo.io
Welcome!
Schedule

• Setting up sequence learning

• A taxonomy of sequential models

• Deep learning building blocks

• RNNs

• CNNs

• Transformers (high level)


Basics of Sequence Learning
Typical (supervised) ML Setup

• You have a fixed set of observables (sensor readouts, pixels in images, etc.) and you want to predict a fixed set of outputs (signal/background, energy level, time-to-event, etc.)

• Formally, X is the input domain and Y is the output codomain

• We want to learn a model h : X → Y from a hypothesis space H such that h approximates some true map f : X → Y

• We learn this map through data!

Typical (supervised) ML Setup - shortcomings

• When does a fixed-dimensional vector fail? Variadic size

• Collections, ordered sequences, temporal components, unbounded spatial components

• Building models with “summary features”

What are sequences?

• What is sequence data?

• A problem domain where at least one of the input or output spaces is naturally represented as an ordered sequence

• What is an order?

• Intrinsic (strong) - purely sequential data (time series, genomic sequences, text)

• Extrinsic (weak) - an imposed, ad hoc ordering (by age, ordering particles by energy level, etc.)
Representing sequences?

• A sequence of vectors x_i in {x_i}, i = 1…n

• Can index on time, order, or any other extrinsic ordering

• Without new models, how can we use the “typical ML setup” to deal with sequences?

• Answer: reductions!
Reductions (or, “bags”)

• Assume I have an input sequence {x_i}, i = 1…K, with individual features x_i^j (an individual sensor readout, or a one-hot encoding of a specific word, for example)

• Q: How can I summarize {x_i^j}, i = 1…K, with one number for any K?

• Mean, median, max, sum, product, etc.

• These lose ordering, which may or may not be OK, and each of them imposes an inductive bias!
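
A minimal sketch of such reductions, assuming NumPy and made-up shapes: every sequence, whatever its length K, collapses to the same fixed-size vector.

```python
import numpy as np

# Two hypothetical sequences of different lengths K, each element
# a 3-dimensional feature vector x_i.
seq_a = np.random.randn(5, 3)    # K = 5
seq_b = np.random.randn(12, 3)   # K = 12

def bag_features(seq):
    """Order-invariant reductions over the sequence axis: a (K, d)
    sequence becomes a fixed-size vector regardless of K."""
    return np.concatenate([seq.mean(axis=0),
                           seq.max(axis=0),
                           seq.sum(axis=0)])

print(bag_features(seq_a).shape)  # (9,)
print(bag_features(seq_b).shape)  # (9,) -- same size, ordering lost
```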
Reductions

• Pros:

• Interpretable (sometimes…)

• …?

• Cons:

• Brittle! How do I know my inductive bias is correct?

• Inherently limits performance! Information is thrown away

• Has a very hard time actually representing the sequential nature!

• Doesn’t help us when we want to predict a sequence!

Desiderata

• Consider the whole sequence as a sequence, don’t summarize it!

• We need to be able to predict sequences (more later on what this means)

• Be able to handle truly variable-length sequences (no fixed windows, etc.)
Constructing Sequence Inputs

• For a non-temporal sequence, this is trivial (word order in a sentence, gene sequencing, etc.)

• For time series, it is not trivial!

• If we’ve got constant sampling (audio, etc.), there is no need to worry

• Models (as we’ll see) don’t know about time; they process sequence elements one by one
Timeseries

• Consider this sequence:

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:45 T = 2:10 T = 2:15 T = 3:00 T = 3:25

• How can we handle the time component? Three options:

• Ignore, re-sample, or use the time-delta as a feature


Ignore time

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:45 T = 2:10 T = 2:15 T = 3:00 T = 3:25

X1 X2 X3 X4 X5 X6
Ignore time

• Avoids complex processing and alignment of sequences; can directly feed into most deep learning-based sequence models

• But you lose a critical feature of your data!

• Example: when predicting the time to failure of a sensor, we may want to know the frequency with which the sensor performs readouts

• Generally not a bad idea to try as a baseline!

Resample

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:45 T = 2:10 T = 2:15 T = 3:00 T = 3:25

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:54 T = 2:18 T = 2:42 T = 3:06 T = 3:30


Resample

• Lets the model ingest equally spaced points in a time series (what deep models are designed for!)

• Trades model complexity for data pipeline & engineering complexity!

• How do I aggregate multiple samples within a window?

• How do I choose the sampling frequency?

• How do I fill in resampled ranges with no real samples?
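
A sketch of those choices with pandas (the timestamps and values are made up to match the toy example above): an aggregation per window and a fill rule for empty windows.

```python
import pandas as pd

# Hypothetical sensor readings at irregular timestamps.
ts = pd.Series(
    [0.1, 0.4, 0.3, 0.9, 0.7, 0.2],
    index=pd.to_datetime([
        "2024-01-01 01:30", "2024-01-01 01:45", "2024-01-01 02:10",
        "2024-01-01 02:15", "2024-01-01 03:00", "2024-01-01 03:25",
    ]),
)

# Uniform 15-minute grid: average samples that share a window,
# then forward-fill windows that contained no real samples.
uniform = ts.resample("15min").mean().ffill()
print(uniform)
```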


Time-delta as feature

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:45 T = 2:10 T = 2:15 T = 3:00 T = 3:25

0 Δ1 Δ2 Δ3 Δ4 Δ5

X1 X2 X3 X4 X5 X6

T = 1:30 T = 1:45 T = 2:10 T = 2:15 T = 3:00 T = 3:25


Time-delta as feature

• As with resampling, lets the model treat the series like equally spaced points (what deep models are designed for!) while retaining timing information

• Over-indexes on the time difference as a useful feature

• Additional data pipeline & engineering complexity

• How do I do this time-delta calculation when data is big?

• Should I do a single delta, or should I also compute time deltas k samples back?
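
A sketch of the time-delta feature with pandas (made-up values; only a single lag-1 delta is shown):

```python
import pandas as pd

# Hypothetical irregular timestamps paired with a scalar feature.
df = pd.DataFrame({
    "t": pd.to_datetime([
        "2024-01-01 01:30", "2024-01-01 01:45", "2024-01-01 02:10",
        "2024-01-01 02:15", "2024-01-01 03:00", "2024-01-01 03:25",
    ]),
    "x": [0.1, 0.4, 0.3, 0.9, 0.7, 0.2],
})

# Minutes since the previous sample; 0 for the first element.
df["dt"] = df["t"].diff().dt.total_seconds().fillna(0.0) / 60.0

# The model now sees [x, dt] at every timestep instead of x alone.
features = df[["x", "dt"]].to_numpy()   # shape (6, 2)
```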
Representing Features in Sequence

• Variable understanding & normalization are even more important in sequence models due to compounding effects

• Normalize features to reasonably comparable ranges

• [0, 1], [-1, 1], and [0, 3]? Fine! Then adding [-99, 120]? Not ideal…

• Relevant for non-sequence models as well: use embeddings for discrete variables!
Representing Discrete Variables

• Use lookup tables (embeddings) to learn a vector for each class! Similar to Word2Vec / word embeddings

• The vector space will hold more semantic meaning than one-hot-encoded representations.
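
A minimal sketch of such a trainable lookup table, here with tf.keras (the vocabulary size and embedding dimension are arbitrary choices):

```python
import tensorflow as tf

# Map a discrete variable with 10 categories to a learned
# 4-dimensional vector per category.
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=4)

# A batch of 2 sequences, each 6 timesteps of integer category ids.
ids = tf.constant([[1, 3, 3, 0, 7, 2],
                   [4, 4, 9, 1, 0, 5]])
vectors = embedding(ids)   # shape (2, 6, 4), trained with the rest of the model
```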
Archetypes of Sequence Models
Archetypes of Sequence Models

“Predictive” “Abstractive” “Labeling” “Captioning”




Many-to-one

• Key idea: given an (ordered) sequence of vectors representing an example or datapoint, predict a fixed vector.

• Agnostic to the task: classification / regression / survival / etc.

• Flexible! Inputs can be images (video understanding), vectors, or anything of a fixed size
Many-to-many (sequence-to-sequence)

• Key idea: can we perform sequence transduction, i.e., read in a sequence and then output another sequence, potentially of a different length?

• Most examples come from NLP (translation, summarization)

• Can predict the time evolution of a sequence into the future (forecasting style)

• Key question: how do we predict a variable-length entity?

(Diagram: an encoder reads the input sequence and a decoder produces the output sequence.)
Many-to-many (sequence-to-sequence + decoding)

• How can we predict a variable-length sequence?

• Answer: we can learn when to stop decoding!

• For each timestep in our output, predict whether or not it is the last timestep

• (In NLP, this is called a “stop” token)

Many-to-many (sequence labeling)

• Key idea: for every “time-step” in the input, we predict something! This could be labeling the current time-step, predicting the next timestep, or both.
One-to-many

• Key idea: given a fixed representation, produce a sequential output

• Example: given a video frame, predict the next 24 frames

• Canonical problem in ML: image captioning
How can we build these?
Deep Learning Building Blocks
Building blocks for building sequence models

• Very active area of research!

• We will cover the three “main” ones:

• Recurrent neural networks

• Convolutional neural networks

• Transformers

• Many other fancy variants build on these:

• QRNNs, dynamic transformers, etc., etc.


Recurrent Neural Networks
Recurrent Neural Networks

• Neural network unit to learn sequence-based dependencies over arbitrary lengths!

• Key insight: can we have an internal state that gets modified as we read a sequence?

• The same transformation is applied to every element of the sequence

• A recurrence relation feeds the output back into the input. Issues?

Recurrent Neural Networks

h_t = f(h_{t-1}, x_t)
(next state ← previous state, this timestep’s input)

Sounds great, right?
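
A minimal sketch of that recurrence in plain NumPy (dimensions and initialization are arbitrary choices): the same weights are reused at every timestep, so any sequence length works.

```python
import numpy as np

d_in, d_hidden = 3, 8
W = 0.1 * np.random.randn(d_hidden, d_hidden)   # state -> state
U = 0.1 * np.random.randn(d_hidden, d_in)       # input -> state
b = np.zeros(d_hidden)

def rnn_forward(xs):
    """xs: (T, d_in) sequence; returns all hidden states, shape (T, d_hidden)."""
    h = np.zeros(d_hidden)
    states = []
    for x_t in xs:                          # one element at a time
        h = np.tanh(W @ h + U @ x_t + b)    # h_t = f(h_{t-1}, x_t)
        states.append(h)
    return np.stack(states)

hs = rnn_forward(np.random.randn(6, d_in))  # works for any length T
```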
Recurrent Neural Networks, issues

• Vanilla RNN formulation

• What happens when we take a gradient?


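A sketch of the gradient through the recurrence, assuming the standard vanilla parameterization h_t = tanh(W h_{t-1} + U x_t + b):

```latex
\frac{\partial h_T}{\partial h_1}
  \;=\; \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}
  \;=\; \prod_{t=2}^{T} \operatorname{diag}\!\big(1 - \tanh^2(a_t)\big)\, W,
\qquad a_t = W h_{t-1} + U x_t + b .
```

Backpropagating through T timesteps multiplies T - 1 of these Jacobians together: if the spectrum of W sits mostly below 1 the product shrinks toward zero (vanishing gradients), and if it sits above 1 the product blows up (exploding gradients).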
How do we fix this?

• This is referred to as the vanishing/exploding gradient problem

• Intuitively: the model will either forget too frequently or have vivid yet incorrect memories

• We need to control how updates affect the Jacobian product that the chain rule gives us for the state update
Hochreiter & Schmidhuber: the LSTM
The Long Short-Term Memory Unit

• Allows for a trainable decision function for retaining knowledge

• Trainable ability to forget

• Internally, can read, write, and reset state!
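In practice one rarely writes the gates by hand; a minimal sketch with tf.keras (layer sizes are arbitrary) for a many-to-one setup:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 16)),   # None = any sequence length
    tf.keras.layers.LSTM(32),                  # returns only the last hidden state
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```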
Going Backwards in Time

• As we have defined things, we only process state going forward

• Can we have a state for the backwards direction?

• Yes! Simply have another RNN and reverse the input sequence. For the outputs, we can concatenate, add, or multiply the two directions
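
A minimal sketch with tf.keras (shapes are made up): the wrapper runs one LSTM forward and one over the reversed sequence, then merges the two per-timestep outputs.

```python
import tensorflow as tf

bidir = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(32, return_sequences=True),
    merge_mode="concat",   # also "sum", "mul", or "ave"
)

x = tf.random.normal((4, 10, 16))   # (batch, timesteps, features)
y = bidir(x)                        # (4, 10, 64): forward and backward states
```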
How does this fit with our archetypes?

• Many-to-one: take the last hidden state

• Sequence-to-sequence: take the last hidden state of the encoder and feed it as the initial state to the decoder

• One-to-many: use the fixed input as the initial state of the RNN

• Sequence labeling: output every hidden state, then use them to predict the targets
RNNs (summary)

• Process timesteps one at a time, modifying an internal state to keep track of what is useful for the task at hand

• Time-invariant transformation per timestep (recurrence)

• Can traverse sequences backwards as well!

Convolutional Neural Networks (CNNs)

• Usually thought about in terms of computer vision

• Learn weights that are applied across the entire signal (2D patches for images)
CNNs for Sequence Learning

• We aren’t dealing with 2D data here, so 1-dimensional convolutions will do

• Stack the vectors of a sequence into an “image” so the convolution operator can broadcast across the time dimension: a sliding window

• Parameterized by filter size and stride

X1 X2 X3 X4 X5 X6          X1 X2 X3 X4 X5 X6
Stride = 1, filter size = 3          Stride = 3, filter size = 4

CNN Filters (simplified math)

A filter of width 3 is parameterized by vectors (w, u, v) and applied to (X1, X2, X3):

O1 = w^T X1 + u^T X2 + v^T X3
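
A sketch of that single filter in NumPy (the feature dimension is arbitrary), sliding over the sequence with stride 1:

```python
import numpy as np

d = 4                                   # feature dimension per timestep
X = np.random.randn(6, d)               # the sequence x_1 .. x_6
w, u, v = (np.random.randn(d) for _ in range(3))

# o_t = w^T x_t + u^T x_{t+1} + v^T x_{t+2} for every window of 3 timesteps
O = np.array([w @ X[t] + u @ X[t + 1] + v @ X[t + 2]
              for t in range(len(X) - 2)])
print(O.shape)                          # (4,): one output per window
```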


CNNs for Sequence Learning - Trends

• Recent trend: move from RNNs to CNNs

• Machine translation, summarization, …

• Returning to signal-processing roots!

• Big question: do we need to retain state?

• Convolutions are translation invariant - how can we encode ordering?

• Positional encoding
Positional Encoding

• Use an embedding as part of the sequence to indicate the position

• Each integer 1…N has a trainable embedding concatenated to the features, which allows the model to learn sequential dependence

P1 P2 P3 P4 P5 P6
X1 X2 X3 X4 X5 X6
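
A minimal sketch in tf.keras (dimensions are arbitrary): each position index gets a trainable vector P_t that is concatenated onto the features X_t.

```python
import tensorflow as tf

max_len, d_feat, d_pos = 6, 16, 8
pos_embed = tf.keras.layers.Embedding(max_len, d_pos)

x = tf.random.normal((2, max_len, d_feat))       # (batch, T, d_feat)
p = pos_embed(tf.range(max_len))                 # (T, d_pos), trainable
p = tf.broadcast_to(p, (2, max_len, d_pos))      # repeat across the batch
x_with_pos = tf.concat([x, p], axis=-1)          # (2, T, d_feat + d_pos)
```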
How does this fit with our archetypes?

• Use a reduction to get a fixed-length vector:

X1 X2 X3 X4 X5 X6
{max, min, mean, sum}

• Use dilated convolutions to obtain sequential dependence in the decoder

• CNNs are a natural fit for this flavor!

• Use dilated convolutions and positional encoding
CNNs (summary)

• Use filters without a notion of state to process windowed groupings of sequence elements

• Time-invariant transformation per timestep (convolution)

• Use positional encodings to allow learning of position-aware relationships

• Need a reduction mechanism for the many-to-one use case
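
Putting the pieces together, a sketch of a many-to-one 1-D CNN in tf.keras (layer widths, dilation, and causal padding are arbitrary choices): convolutions over time, then a global pooling reduction.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 16)),          # any sequence length
    tf.keras.layers.Conv1D(32, kernel_size=3, padding="causal",
                           activation="relu"),
    tf.keras.layers.Conv1D(32, kernel_size=3, padding="causal",
                           dilation_rate=2, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),              # the reduction step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```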


Paying Attention in Sequence Models
Paying Attention in Sequence Models

• Attention has gotten lots of hype - intuitively, it gives a deep learning model an explicit mechanism to look through its history

• Can be used with RNNs, CNNs, etc.

• Gained popularity in neural machine translation (NMT); used in most areas now too

• Self-attention: use a representation to attend to itself

• Sentence example: “The dog had a treat”

• In self-attention, the word ‘treat’ can pay attention to the word ‘dog’ and report that out to another layer
Transformers

• Transformers are (now) the classical fully attentive model (no recurrence)

• Key idea: let sequences look at themselves (every timestep computes a “similarity” with all other timesteps)

• Differentiable key-value lookup with vectors

• The model has “read heads”, allowing a single model to focus on different things
Transformers

(Diagram: the Transformer’s encoder and decoder stacks.)
Transformers (overly simplified encoder)

Every X looks at every other X and computes a similarity; the similarity-weighted values are summed up to become H, a context-aware representation at each timestep.

H1 H2 H3 H4 H5 H6
X1 X2 X3 X4 X5 X6

This is just one “head”
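
A minimal sketch of one such head in NumPy (dimensions are arbitrary; no masking, scaling follows the usual 1/sqrt(d_k) convention):

```python
import numpy as np

T, d_model, d_k = 6, 16, 8
X = np.random.randn(T, d_model)                    # X_1 .. X_6
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)                    # (T, T) pairwise similarities
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
H = weights @ V                                    # context-aware H_1 .. H_6
print(H.shape)                                     # (6, 8)
```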


Where are Transformers useful?

• Long-term dependencies

• Explicit attention helps! No need to store this in a state.

• Is, or is near, the state of the art on most sequence-to-sequence modeling tasks

• Probably not very useful for many-to-one unless you have lots of extra compute on your hands :)
Picking building blocks

• The choice of RNN vs CNN vs Transformer is fairly company- / research-group-driven

• Everyone claims that “you only need X”, where X is in {RNN, CNN, Transformer}

• No free lunch! Try, test, and evaluate for specific problems.

• Certain problem characteristics lend themselves better to a certain type of architecture - better to test than assume
Concluding Remarks
In conclusion

• Sequence modeling is a flexible tool!

• Four main flavors: many-to-one, many-to-many (seq2seq & sequence labeling), and one-to-many

• Many building blocks to choose from: the main ones are RNNs, CNNs, and Transformers

• No best model! Evaluate whether any individual model is relevant for your domain
Thanks!

@lukede0
@lukedeo
lukedeo@ldo.io
https://ldo.io
