1904 11574

TVQA+: Spatio-Temporal Grounding for Video Question Answering
Jie LeiLicheng Yu Tamara L. Berg Mohit Bansal

Department of Computer Science
University of North Carolina at Chapel Hill
{jielei, licheng, tlberg, mbansal}@cs.unc.edu
arXiv:1904.11574v1 [cs.CV] 25 Apr 2019
Abstract
We present the task of Spatio-Temporal Video Question 00:02.314 → 00:06.732

Howard: Sheldon, he’s got Raj. Use
Answering, which requires intelligent systems to simultane- your sleep spell. Sheldon! Sheldon!
ously retrieve relevant moments and detect referenced vi- 00:06.902 → 00:10.992
Sheldon: I’ve got the Sword of Azeroth.
sual concepts (people and objects) to answer natural lan-
Question: What is Sheldon holding when he is talking to Howard about the sword?
guage questions about videos. We first augment the TVQA Correct Answer: A computer.
dataset with 310.8k bounding boxes, linking depicted ob-

jects to visual concepts in questions and answers. We name
this augmented version as TVQA+. We then propose Spatio-
Temporal Answerer with Grounded Evidence (STAGE), a 00:17.982 → 00:20.532
Howard: That's really stupid advice.
unified framework that grounds evidence in both the spatial 00:20.534 → 00:22.364
and temporal domains to answer questions about videos. Raj: You know that hurts my feelings.
Comprehensive experiments and analyses demonstrate the

Question: Who is talking to Howard when he is in the kitchen upset?
effectiveness of our framework and how the rich annota- Correct Answer: Raj is talking to Howard.
tions in our TVQA+ dataset can contribute to the question Figure 1. Sample QA pairs from TVQA+ dataset. Questions are
answering task. As a side product, by performing this joint both temporally localized to clips, and spatially localized with
task, our model is able to produce more insightful interme- frame-level bounding box annotations for visual concepts (objects
diate results. Dataset1 and code2 are publicly available. and people) that appear in questions and correct answers. Colors
indicate corresponding box-object pairs. Text inside red dashed
blocks are subtitles. For brevity, the wrong answers are omitted.
1. Introduction
We have witnessed great progress in recent years on
scenarios, such as natural language guided spatio-temporal
image-based visual question answering (QA) tasks [2, 43,
localization, and adding explainability to video question an-
48]. One key to this success has been spatial atten-
swering, which is potentially useful for decision making
tion [1, 34, 23], where neural models learn to attend to rel-
and model debugging. To enable this line of research, we
evant regions for predicting the correct answer. Compared
collect new annotations for an existing video QA dataset.
to image-based QA, there has been less progress on the per-
In the past few years, several video QA datasets have
formance of video-based QA tasks. One possible reason is
been proposed, e.g., MovieFIB [25], MovieQA [35], TGIF-
that attention techniques are hard to generalize to the tem-
QA [14], PororoQA [17], and TVQA [19]. Among them,
poral nature of videos. Moreover, due to the high cost of
TVQA was released most recently, providing a large video
annotation, most existing video QA datasets only contain
QA dataset built on top of 6 famous TV series. Because
question-answer pairs, without providing labels for the key
TVQA was collected on television shows, it is built on
moments or regions needed to answer the question. Inspired
natural video content with rich dynamics and realistic so-
by previous work on grounded image and video caption-
cial interactions, where question-answer pairs are written
ing [24, 47, 46], we propose methods that explicitly local-
by people observing both videos and their accompanying
ize video moments as well as spatial regions for answering
dialogues, encouraging the questions to require both vision
video-based questions. Such methods are useful in many
and language understanding to answer. One key property of
1 http://tvqa.cs.unc.edu TVQA is it provides temporal annotations denoting which
2 https://github.com/jayleicn/TVQA-PLUS parts of a video clip are necessary for answering a pro-
1
posed question. However, none of the video QA datasets • We design a novel video question answering frame-
(including TVQA) provide spatial annotation for the an- work, Spatio-Temporal Answerer with Grounded Ev-
swers. Actually, grounding spatial regions correctly could idence (STAGE), to jointly localize moments, ground
be as important as grounding temporal moments for answer- objects, and answer questions. By performing all three
ing a given question. For example, in Fig. 1, to answer the sub-tasks together, our model achieves significant per-
question of ‘What is Sheldon holding when he is talking to formance gains over the state-of-the-art, as well as pre-
Howard about the sword?’, we need to localize the moment senting insightful visualized results.
when ‘he is talking to Howard about the sword?’, as well as
looking at the specific region of ‘What is Sheldon holding’. 2. Related Work
In this paper, we first augment one show, “The Big Bang
Theory”, from the TVQA dataset with grounded bound- Question Answering Teaching machines to answer ques-
ing boxes, resulting in a spatio-temporally grounded video tions is an important problem for AI. In recent years, mul-
QA dataset, TVQA+. TVQA+ consists of 29.4K multiple- tiple question answering datasets and tasks have been pro-
choice questions grounded in both the temporal and spatial posed to facilitate research towards this goal, in both the
domains. To collect spatial groundings, we start by identi- vision and language communities, in the form of visual
fying a set of visual concept words, i.e. objects and people, question answering [2, 43, 14] and textual question answer-
mentioned in the question or correct answer. Next, we asso- ing [30, 29, 39, 38], respectively. Video question answer-
ciate the referenced visual concepts with object regions in ing [19, 35, 17] with naturally occurring subtitles are par-
individual frames, if there are any, by annotating bounding ticularly interesting, as it combines both visual and textual
boxes for each referred concept. One example QA pair is information for question answering. Different from exist-
shown in Fig. 1. The TVQA+ dataset has a total of 310.8K ing video QA tasks, where a system is only required to pre-
bounding boxes linked with referred objects and people, dict an answer, we propose a novel task that additionally
spanning across 2.5K categories, more details in Section 3. grounds the answer in both spatial and temporal domains.
With such richly annotated data, we propose the task of Language-Guided Retrieval Grounding language in im-
spatio-temporal video question answering, which requires ages/videos is an interesting problem that requires jointly
intelligent systems to localize relevant moments, detect re- understanding both text and visual data. Earlier works [16,
ferred objects and people, and answer questions. 13, 45, 44, 42, 32] focused on identifying the referred object
We further design several metrics to evaluate the perfor- in an image. Recently, there has been a growing interest in
mance of the proposed task, including QA accuracy, object moment retrieval tasks [12, 11, 9], where the goal is to lo-
grounding precision, and a joint temporal localization and calize a short clip from a long video via a natural language
answering accuracy. We find that the performance of ques- query. Our work integrates the goal of both tasks, requir-
tion answering benefits from both temporal moment and ing a system to ground the referred moments and objects
spatial region supervision. Additionally, the visualization of simultaneously.
temporal and spatial localization is helpful for understand- Temporal and Spatial Attention Attention has shown
ing what the model has learned. great success on many vision and language tasks, such as
To address spatio-temporal video question answering, image captioning [1, 40], visual question answering [1, 36],
we propose a novel end-to-end trainable model, Spatio- language grounding [42], etc. However, sometimes the at-
Temporal Answerer with Grounded Evidence (STAGE), tention learned by the model itself may not accord with hu-
which effectively combines moment localization, object man expectations [22, 5]. Recent works on grounded image
grounding, and question answering in a unified framework. captioning and video captioning [46, 22, 47] show better
Comprehensive ablation studies demonstrate how each of performance can be achieved by explicitly supervising the
our annotations and model components helps to improve the attention. In this work, we use annotated frame-wise bound-
performance of video question answering. ing box annotations to supervise both temporal and spatial
attention. Experimental results demonstrate the effective-
To summarize, our contributions are:
ness of supervising both domains in video QA.
• We collect TVQA+, a large-scale spatio-temporal
video question answering dataset, which augments the 3. Dataset
original TVQA dataset with frame-level bounding box In this section, we describe the TVQA+ Dataset, the first
annotations. To our knowledge, this is the first dataset video question answering dataset with both spatial and tem-
that combines moment localization, object grounding, poral annotations. TVQA+ is built on the TVQA dataset
and question answering. introduced in [19]. TVQA is a large-scale video QA dataset
• We propose a set of metrics to evaluate the perfor- based on 6 popular TV shows, containing 152.5K multi-
mance of both spatio-temporal localization and ques- ple choice questions from 21.8K, 60-90 second long video
tion answering. clips. The questions in the TVQA dataset are composi-
2
Avg Span Len Avg Video Len
Split #QAs #Clips #Annotated Images #Boxes #Categories
(secs) (secs)
Train 23,545 3,364 7.20 61.49 118,930 249,236 2,281
Val 3,017 431 7.26 61.48 15,350 32,682 769
Test 2,821 403 7.18 61.48 14,188 28,908 680
Total 29,383 4,198 7.20 61.49 148,468 310,826 2,527
Table 1. Data Statistics for TVQA+ dataset.
annotations could provide additional useful training source

for models to gain a deeper understanding of visual infor-
mation in TVQA. Therefore, to complement the original
TVQA dataset, we collect frame-wise bounding boxes for
visual concepts mentioned in the questions and correct an-
swers. Since the full TVQA dataset is very large, we start
by collecting bounding box annotations for QA pairs asso-
ciated with one of the 6 TV shows - The Big Bang Theory,
which contains 29,383 QA pairs from 4,198 clips.
3.1. Data Collection
Identify Visual Concepts: To annotate the visual concepts
in video frames, the first step is to identify them in the QA
pairs. We use the Stanford CoreNLP part-of-speech (POS)
Figure 2. User interface for bounding box annotation. Here, the tagger [26] to extract all nouns in the questions and correct
worker is asked to draw a box around the highlighted “laptop”. answers; this gives us a total of 152,722 words from a vo-
cabulary of 9,690 words. We manually label the non-visual
Dataset Origin Task
#Clips/#QAs
#Boxes
Temporal nouns (e.g., ‘plan’, ‘time’, etc.) in the top 600 nouns, re-
(#Querys) Annotation
MovieFIB [25] Movie QA 118.5K/349K 0 -
moving 165 frequent non-visual nouns from the vocabulary.
MovieQA [35] Movie QA 6.8K/6.5K 0 X Bounding Box Annotation: For the selected The Big Bang
TGIF-QA [14] Tumblr QA 71.7K/165.2K 0 -
PororoQA [17] Cartoon QA 16.1K/8.9K 0 - Theory videos from TVQA, we first ask Amazon Mechan-
DiDeMo [12] Flickr TL 10.5K/40.5K 0 X ical Turk (AMT) workers to adjust the start and end times-
Charades-STA [9] Home TL -/19.5K 0 X
TVQA [19] TV Show QA/TL 21.8K/152.5K 0 X tamps to refine the temporal annotation.3 We then sample
TVQA+ TV Show QA/TL/SL 4.2K/29.4K 310.8K X one frame every two seconds from each span for annota-
Table 2. Comparison of TVQA+ dataset with other video-language tion. For each frame, we collect the bounding boxes for the
datasets. QA = Question Answering, TL = Temporal Localization, objects/people mentioned in each QA pair. In this step, we
SL = Spatial Localization. show a question, its correct answer, and the sampled video
frames to an AMT worker (illustrated in Figure 2). As each
QA pair has multiple visual concepts as well as multiple
tional, where each question is comprised of two parts, a frames, each task shows one pair of a concept word and a
question part (“where was Sheldon sitting”), joined via a sampled frame. For example, in Figure 2, the word “lap-
link word, (“before”, “when”, “after”), to a localization top” is highlighted, and workers are instructed to draw a
part that temporally locates when the question occurs (“he box around it. Note, it is possible that the highlighted word
spilled the milk”). Models should answer questions using will be a non-visual word or a visual word that is not present
both visual information from the video, as well as language in the frame being shown. In that case, the workers are al-
information from the naturally associated dialog (subtitles). lowed to check the box indicating the object is not present.
Since the video clips on which the questions were collected During annotation, we also provide the original videos (with
are usually much longer than the context needed for answer- subtitles) in case they have trouble understanding the given
ing the questions, the TVQA dataset also provides a tem- QA pair.
poral timestamp annotation indicating the minimum span
3.2. Dataset Analysis
(context) needed to answer each question.
While the TVQA dataset provides a novel question for- TVQA+ contains 29,383 QA pairs from 4,198 video
mat and temporal annotations, it lacks spatial grounding in- clips, with 148,468 images annotated with 310,826 bound-
formation, i.e., bounding boxes of the concepts (objects and 3 We provide results of our model trained with original and refined tem-
people) mentioned in the QA pair. We hypothesize object poral annotation in the supplementary file.
3
10^5
Object
People
#boxes per category
10^4
10^3
10^2
10^1
10^0 hand
hands
sheldon
leonard
penny
howard
raj
amy
bernadette
apartment
table
couch
laptop
door
guys
stuart
phone
head
car
bed
wine
glass
chair
cup
room
shirt
everyone
food
kitchen
man
friends
computer
people
book
box
face
stairs
bottle
drink
cooper
priya
emily
store
arthur
group
dinner
board
clothes
bag
jacket
guy
girls
coffee
woman
leslie
girl
water
arm
desk
glasses
Figure 3. Box distributions for top 60 categories in TVQA+ train set.
moment localization, object grounding and video QA. First,

30%
STAGE encodes the video and text (subtitle, QA pairs) via
Percentage of questions
Percentage of boxes
20% 20% frame-wise regional visual representations and neural lan-

guage representations, respectively. The encoded video and
10% 10% text representations are then contextualized using a Con-
volutional Encoder. Second, STAGE computes attention
0% 0%
0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 1 3 5 7 9 11 13 15 17 19 scores from each QA word to the object regions and sub-
box area to image area ratio span lengths (seconds)
Figure 4. Bounding box/image ratios (left), and span length distri-

title words. Leveraging the attention scores, STAGE is
butions (right) in TVQA+ The majority of the boxes are very small able to generate QA-aware representations, as well as au-
compared to image size and most spans are less than 10 seconds. tomatically detecting the referred objects and people. The
attended QA-aware video representation and subtitle rep-
resentation are then fused together to obtain a frame-wise
ing boxes. Statistics of the full dataset are shown in Table 1. joint representation. Third, taking the frame-wise represen-
Note, we follow the same data splits as the original TVQA tation as input, STAGE learns to predict temporal spans that
dataset [19]. Table 2 compares the TVQA+ dataset with are relevant to the QA pair, then combines the global and
other video-language based datasets. The TVQA+ dataset local (span localized) video information to answer the ques-
is unique as it contains three different annotations: question tions. Next, we explain each step in detail.
answering, temporal localization, and spatial localization.
On average, we obtain 2.09 boxes per image and 10.58 4.1. Formulation
boxes per question. The annotated boxes cover 2,527 cat- In our tasks, the inputs are: (1) a question with 5 candi-
egories. We show the number of boxes (in log scale) for date answers; (2) a 60-second long video; (3) a set of sub-
each of the top 60 categories in Figure 3. The distribution title sentences, and our goal is to predict the correct answer
has a long tail, e.g., the number of boxes for the most fre- as well as ground the answer both spatially and temporally.
quent category ‘sheldon’ is around 2 orders of magnitude Given the question, q, and the answers, {ak }5k=1 , we first
larger than the 60th category ‘glasses’. We also show the formulate them as 5 hypotheses (QA-pair) hk = [q, ak ] and
distribution of ratio of bounding box area over image area predict their correctness scores based on the video and sub-
ratio in Figure 4 (left). The majority of boxes are fairly title context, which is similar to [27, 19]. We denote the
small compared to the image, which makes object ground- ground-truth (GT) answer index as y ans and thus the GT
ing challenging. Figure 4 (right) shows the distribution of hypothesis as hyans . We then extract video frames {vt }Tt=1
localized span length. While most of the spans are less than at 0.5 FPS (T is the number of frames for each video), align-
10 seconds, the largest spans are up to 20 seconds. The av- ing the subtitle sentences temporally with the video frames.
erage span length is 7.2 seconds, which is short compared Specifically, for each frame vt , we pair it with two neigh-
to the average length of the full video clip (61.2 seconds). boring subtitle sentences based on the subtitle timestamp.
We choose two neighbors since this keeps most of the sen-
4. Methods tences at our current frame rate, and also avoids severe mis-
Our proposed method, Spatio-Temporal Answerer with alignment between the frames and the sentences. The set of
Grounded Evidence (STAGE), is a unified framework for aligned subtitle sentences are denoted as {st }Tt=1 .
4
<latexit sha1_base64="9oUIamfaOEF+W8iGquoB9yICg98=">AAAB8XicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeCF48V+oVtKJvNpl262YTdiVBK/4UXD4p49d9489+4bXPQ1gcDj/dmmJkXpFIYdN1vp7CxubW9U9wt7e0fHB6Vj0/aJsk04y2WyER3A2q4FIq3UKDk3VRzGgeSd4Lx3dzvPHFtRKKaOEm5H9OhEpFgFK302CR9FDE3JByUK27VXYCsEy8nFcjRGJS/+mHCspgrZJIa0/PcFP0p1SiY5LNSPzM8pWxMh7xnqaJ2jT9dXDwjF1YJSZRoWwrJQv09MaWxMZM4sJ0xxZFZ9ebif14vw+jWnwqVZsgVWy6KMkkwIfP3SSg0ZygnllCmhb2VsBHVlKENqWRD8FZfXiftWtW7qnoP15V6LY+jCGdwDpfgwQ3U4R4a0AIGCp7hFd4c47w4787HsrXg5DOn8AfO5w/HKZBH</latexit>
T ⇥d
Linear Max
Span Predictor
Concat Linear Answer Loss
<latexit sha1_base64="9oUIamfaOEF+W8iGquoB9yICg98=">AAAB8XicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeCF48V+oVtKJvNpl262YTdiVBK/4UXD4p49d9489+4bXPQ1gcDj/dmmJkXpFIYdN1vp7CxubW9U9wt7e0fHB6Vj0/aJsk04y2WyER3A2q4FIq3UKDk3VRzGgeSd4Lx3dzvPHFtRKKaOEm5H9OhEpFgFK302CR9FDE3JByUK27VXYCsEy8nFcjRGJS/+mHCspgrZJIa0/PcFP0p1SiY5LNSPzM8pWxMh7xnqaJ2jT9dXDwjF1YJSZRoWwrJQv09MaWxMZM4sJ0xxZFZ9ebif14vw+jWnwqVZsgVWy6KMkkwIfP3SSg0ZygnllCmhb2VsBHVlKENqWRD8FZfXiftWtW7qnoP15V6LY+jCGdwDpfgwQ3U4R4a0AIGCp7hFd4c47w4787HsrXg5DOn8AfO5w/HKZBH</latexit>
T ⇥d
Max
T ⇥ Lh ⇥ d
Max
<latexit sha1_base64="6grhipJsJ/qaylDsj4w3rDeRMS8=">AAAB/nicbZDLSsNAFIYnXmu9RcWVm8EiuCpJFXRZcOPCRYXeoA1hMpm0QyeTMHMilFDwVdy4UMStz+HOt3HaRtDWHwY+/nMO58wfpIJrcJwva2V1bX1js7RV3t7Z3du3Dw7bOskUZS2aiER1A6KZ4JK1gINg3VQxEgeCdYLRzbTeeWBK80Q2YZwyLyYDySNOCRjLt4+buA88Zhrf+cMfDH274lSdmfAyuAVUUKGGb3/2w4RmMZNABdG65zopeDlRwKlgk3I/0ywldEQGrGdQErPGy2fnT/CZcUIcJco8CXjm/p7ISaz1OA5MZ0xgqBdrU/O/Wi+D6NrLuUwzYJLOF0WZwJDgaRY45IpREGMDhCpubsV0SBShYBIrmxDcxS8vQ7tWdS+q7v1lpV4r4iihE3SKzpGLrlAd3aIGaiGKcvSEXtCr9Wg9W2/W+7x1xSpmjtAfWR/f2G6Utw==</latexit>
Conv Encoder (ed

<latexit sha1_base64="JiWdoZjJBpMFQ4v9+/v22p4+jRM=">AAACF3icbZDJSgNBEIZ7XGPcoh69NAYhIoaZKOgx4MVjBLNAJoSenpqkSc9Cd40YhryFF1/FiwdFvOrNt7GzHDSxoOHj/6uort9LpNBo29/W0vLK6tp6biO/ubW9s1vY22/oOFUc6jyWsWp5TIMUEdRRoIRWooCFnoSmN7ge+817UFrE0R0OE+iErBeJQHCGRuoWyq6EAEvURXjADPzR2ZQ0jk4d6irR6+OJcUUImvr5bqFol+1J0UVwZlAks6p1C1+uH/M0hAi5ZFq3HTvBTsYUCi5hlHdTDQnjA9aDtsGImT2dbHLXiB4bxadBrMyLkE7U3xMZC7Uehp7pDBn29bw3Fv/z2ikGV51MREmKEPHpoiCVFGM6Don6QgFHOTTAuBLmr5T3mWIcTZTjEJz5kxehUSk752Xn9qJYrcziyJFDckRKxCGXpEpuSI3UCSeP5Jm8kjfryXqx3q2PaeuSNZs5IH/K+vwB8xifEw==</latexit>
st + 1) ⇥ d
Span Predictor
Slice
Start and
Attention Loss
Linear Span End indices
T ⇥ Lh ⇥ 3d
Start and End Proposal
Linear
<latexit sha1_base64="TrHQjlTsyLAZyjKPtK4TWwkoSrs=">AAAB/3icbZDLSsNAFIZPvNZ6iwpu3AwWwVVJWkGXBTcuXFToDdoQJpNpO3QyCTMTocQufBU3LhRx62u4822cthG09YeBj/+cwznzBwlnSjvOl7Wyura+sVnYKm7v7O7t2weHLRWnktAmiXksOwFWlDNBm5ppTjuJpDgKOG0Ho+tpvX1PpWKxaOhxQr0IDwTrM4K1sXz7uIF6mkVUoVt/+IPV0LdLTtmZCS2Dm0MJctV9+7MXxiSNqNCEY6W6rpNoL8NSM8LppNhLFU0wGeEB7RoU2Ozxstn9E3RmnBD1Y2me0Gjm/p7IcKTUOApMZ4T1UC3WpuZ/tW6q+1dexkSSairIfFE/5UjHaBoGCpmkRPOxAUwkM7ciMsQSE20iK5oQ3MUvL0OrUnarZffuolSr5HEU4ARO4RxcuIQa3EAdmkDgAZ7gBV6tR+vZerPe560rVj5zBH9kfXwDUgqU9A==</latexit>
probabilities Linear
Video-Text Fusion Span Loss
Linear
T ⇥ Lh ⇥ d <latexit sha1_base64="6grhipJsJ/qaylDsj4w3rDeRMS8=">AAAB/nicbZDLSsNAFIYnXmu9RcWVm8EiuCpJFXRZcOPCRYXeoA1hMpm0QyeTMHMilFDwVdy4UMStz+HOt3HaRtDWHwY+/nMO58wfpIJrcJwva2V1bX1js7RV3t7Z3du3Dw7bOskUZS2aiER1A6KZ4JK1gINg3VQxEgeCdYLRzbTeeWBK80Q2YZwyLyYDySNOCRjLt4+buA88Zhrf+cMfDH274lSdmfAyuAVUUKGGb3/2w4RmMZNABdG65zopeDlRwKlgk3I/0ywldEQGrGdQErPGy2fnT/CZcUIcJco8CXjm/p7ISaz1OA5MZ0xgqBdrU/O/Wi+D6NrLuUwzYJLOF0WZwJDgaRY45IpREGMDhCpubsV0SBShYBIrmxDcxS8vQ7tWdS+q7v1lpV4r4iihE3SKzpGLrlAd3aIGaiGKcvSEXtCr9Wg9W2/W+7x1xSpmjtAfWR/f2G6Utw==</latexit>
T ⇥ Lh ⇥ d
QA Guided QA Guided
<latexit sha1_base64="6grhipJsJ/qaylDsj4w3rDeRMS8=">AAAB/nicbZDLSsNAFIYnXmu9RcWVm8EiuCpJFXRZcOPCRYXeoA1hMpm0QyeTMHMilFDwVdy4UMStz+HOt3HaRtDWHwY+/nMO58wfpIJrcJwva2V1bX1js7RV3t7Z3du3Dw7bOskUZS2aiER1A6KZ4JK1gINg3VQxEgeCdYLRzbTeeWBK80Q2YZwyLyYDySNOCRjLt4+buA88Zhrf+cMfDH274lSdmfAyuAVUUKGGb3/2w4RmMZNABdG65zopeDlRwKlgk3I/0ywldEQGrGdQErPGy2fnT/CZcUIcJco8CXjm/p7ISaz1OA5MZ0xgqBdrU/O/Wi+D6NrLuUwzYJLOF0WZwJDgaRY45IpREGMDhCpubsV0SBShYBIrmxDcxS8vQ7tWdS+q7v1lpV4r4iihE3SKzpGLrlAd3aIGaiGKcvSEXtCr9Wg9W2/W+7x1xSpmjtAfWR/f2G6Utw==</latexit>
Attention Attention Start and End

probabilities
T ⇥ No ⇥ d Lh ⇥ d T ⇥ Ls ⇥ d
Conv Conv Conv
<latexit sha1_base64="2H8F5EdG9qZNO8N7b25gavRWLKI=">AAAB/nicbZDLSsNAFIYnXmu9RcWVm8EiuCpJFXRZcONKKvQGbQiTyaQdOsmEmROhhIKv4saFIm59Dne+jdM2grb+MPDxn3M4Z/4gFVyD43xZK6tr6xubpa3y9s7u3r59cNjWMlOUtagUUnUDopngCWsBB8G6qWIkDgTrBKObab3zwJTmMmnCOGVeTAYJjzglYCzfPm7iPvCYaXznyx8MfbviVJ2Z8DK4BVRQoYZvf/ZDSbOYJUAF0brnOil4OVHAqWCTcj/TLCV0RAasZzAhZo2Xz86f4DPjhDiSyrwE8Mz9PZGTWOtxHJjOmMBQL9am5n+1XgbRtZfzJM2AJXS+KMoEBomnWeCQK0ZBjA0Qqri5FdMhUYSCSaxsQnAXv7wM7VrVvai695eVeq2Io4RO0Ck6Ry66QnV0ixqohSjK0RN6Qa/Wo/VsvVnv89YVq5g5Qn9kfXwD5meUwA==</latexit> <latexit sha1_base64="Yr6dmSxk2ndlg8wYLdI1S6MylVE=">AAAB83icbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fjw4sFDBfsBTSibzaZdutmE3YlQSv+GFw+KePXPePPfuG1z0NYHA4/3ZpiZF2ZSGHTdb2dtfWNza7u0U97d2z84rBwdt02aa8ZbLJWp7obUcCkUb6FAybuZ5jQJJe+Eo9uZ33ni2ohUPeI440FCB0rEglG0kn/fHxIfRcINifqVqltz5yCrxCtIFQo0+5UvP0pZnnCFTFJjep6bYTChGgWTfFr2c8MzykZ0wHuWKmrXBJP5zVNybpWIxKm2pZDM1d8TE5oYM05C25lQHJplbyb+5/VyjG+CiVBZjlyxxaI4lwRTMguAREJzhnJsCWVa2FsJG1JNGdqYyjYEb/nlVdKu17zLmvdwVW3UizhKcApncAEeXEMD7qAJLWCQwTO8wpuTOy/Ou/OxaF1zipkT+APn8wc6X5Ea</latexit> <latexit sha1_base64="oBJl/cWUfpHessmnS3QreSkaEtk=">AAAB/nicbZDLSsNAFIYnXmu9RcWVm8EiuCpJFXRZcOPCRYXeoA1hMpm0QyeTMHMilFDwVdy4UMStz+HOt3HaRtDWHwY+/nMO58wfpIJrcJwva2V1bX1js7RV3t7Z3du3Dw7bOskUZS2aiER1A6KZ4JK1gINg3VQxEgeCdYLRzbTeeWBK80Q2YZwyLyYDySNOCRjLt4+buA88Zhrf+foHQ9+uOFVnJrwMbgEVVKjh25/9MKFZzCRQQbTuuU4KXk4UcCrYpNzPNEsJHZEB6xmUxKzx8tn5E3xmnBBHiTJPAp65vydyEms9jgPTGRMY6sXa1Pyv1ssguvZyLtMMmKTzRVEmMCR4mgUOuWIUxNgAoYqbWzEdEkUomMTKJgR38cvL0K5V3Yuqe39ZqdeKOEroBJ2ic+SiK1RHt6iBWoiiHD2hF/RqPVrP1pv1Pm9dsYqZI/RH1sc36X2Uwg==</latexit>
Encoder Encoder Encoder

Linear Linear Linear Conv Encoder
RCNN BERT BERT

Add & Norm
00:00.243 → 00:01.473
Howard: Where are the guys? Conv
Who is at NASA to pick up 00:01.678 → 00:02.768 repeat
Howard after he comes home ? Bernadette: Oh, it's just me.
Just Bernadette. 00:07.350 → 00:08.820
Positional Encoding
Howard: Don't worry. I can act surprised.
<latexit sha1_base64="4ur7Sd/nAEmZxpjHPkeaAz1FUYg=">AAACBXicbVDJSgNBEO1xjXEb9aiHwSB4CjMqmIsQ8OIxQjbIxKGnU0ma9Cx01wTDMB68+CtePCji1X/w5t/YWQ6a+KDg8V4VVfX8WHCFtv1tLC2vrK6t5zbym1vbO7vm3n5dRYlkUGORiGTTpwoED6GGHAU0Ywk08AU0/MH12G8MQSoehVUcxdAOaC/kXc4oaskzj1yEe0yHvANR9uCmQw/dzEvxysnuqp5ZsIv2BNYicWakQGaoeOaX24lYEkCITFClWo4dYzulEjkTkOXdREFM2YD2oKVpSANQ7XTyRWadaKVjdSOpK0Rrov6eSGmg1CjwdWdAsa/mvbH4n9dKsFtqpzyME4SQTRd1E2FhZI0jsTpcAkMx0oQyyfWtFutTSRnq4PI6BGf+5UVSPys650Xn9qJQLs3iyJFDckxOiUMuSZnckAqpEUYeyTN5JW/Gk/FivBsf09YlYzZzQP7A+PwBurWZTg==</latexit>
video {vt }Tt=1 hypothesis hk
<latexit sha1_base64="UEbl0f3NiY3Gt18YFftf31XUJxI=">AAAB/nicbVBNS8NAEN3Ur1q/ouLJS7AInkqigj0WvHisYD+gDWGznTZLNx/sTsQQKv4VLx4U8erv8Oa/cdvmoK0PBh7vzTAzz08EV2jb30ZpZXVtfaO8Wdna3tndM/cP2ipOJYMWi0Usuz5VIHgELeQooJtIoKEvoOOPr6d+5x6k4nF0h1kCbkhHER9yRlFLnnnUR3jAPMiSGANQXE0eA2/smVW7Zs9gLROnIFVSoOmZX/1BzNIQImSCKtVz7ATdnErkTMCk0k8VJJSN6Qh6mkY0BOXms/Mn1qlWBtYwlroitGbq74mchkploa87Q4qBWvSm4n9eL8Vh3c15lKQIEZsvGqbCwtiaZmENuASGItOEMsn1rRYLqKQMdWIVHYKz+PIyaZ/XnIuac3tZbdSLOMrkmJyQM+KQK9IgN6RJWoSRnDyTV/JmPBkvxrvxMW8tGcXMIfkD4/MHr+2Wjw==</latexit>
subtitle {st }Tt=1
<latexit sha1_base64="SRhGuScV9o2Ssj1xxRfLklDbxb8=">AAACCHicbVDLSsNAFJ3UV62vqEsXBovgqiQq2I1QcOOyQl/QxDCZTtqhkwczN2IJcefGX3HjQhG3foI7/8ZpmoW2Hhg4nHPPzNzjxZxJMM1vrbS0vLK6Vl6vbGxube/ou3sdGSWC0DaJeCR6HpaUs5C2gQGnvVhQHHicdr3x1dTv3lEhWRS2YBJTJ8DDkPmMYFCSqx/aQO8hlYmXh7MHO5Uu2JmbwqWV3bZcvWrWzBzGIrEKUkUFmq7+ZQ8ikgQ0BMKxlH3LjMFJsQBG1P0VO5E0xmSMh7SvaIgDKp00XyQzjpUyMPxIqBOCkau/EykOpJwEnpoMMIzkvDcV//P6Cfh1J2VhnAANyewhP+EGRMa0FWPABCXAJ4pgIpj6q0FGWGACqruKKsGaX3mRdE5r1lnNujmvNupFHWV0gI7QCbLQBWqga9REbUTQI3pGr+hNe9JetHftYzZa0orMPvoD7fMHURyavg==</latexit>
Figure 5. Overview of the proposed framework, Spatio-Temporal Answerer with Grounded Evidence (STAGE) for spatio-temporal video
question answering. Given 5 hypotheses (question + answer pairs), a set of video frames and aligned subtitle sentences, STAGE is able
to answer the question, as well as providing detections for referred visual concepts (objects and people), and predicting a temporal span
for localizing the relevant moment in the video. The full model is trained end-to-end using a combination of attention loss, span loss and
answer loss. For brevity, we only show one hypothesis in the figure. We provide output dimensions for some of the modules for clarity.
We denote the number of words in each hypothesis as Lh the bottom-right corner of Fig. 5, this is comprised of a po-
and the aligned subtitle sentence pair as Ls respectively. We sitional encoding layer and multiple convolutional layers,
use No to denote the number of object regions in a frame, each with a residual connection [10] and layer normaliza-
and d = 128 as the hidden size. tion. Specifically, we use Layernorm(ReLU(Conv(x)) +
x) as a single Conv unit and stack Nconv of such units as
4.2. STAGE Architecture
the convolutional encoder. x is the input after PE, Conv is a
Input Embedding Layer: One of our goals is to local- depthwise separable convolution [4]. We use two convolu-
ize visual concepts. For each frame vt , we use Faster R- tional encoders at two different levels of STAGE, one with
CNN [31] pre-trained on Visual Genome [18] to detect ob- kernel size 7 to encode the raw inputs, and another with ker-
jects and extract their regional embeddings as our visual in- nel size 5 to encode the fused video-text representation. For
put feature [1]. We keep top-20 object proposals and use both encoders, we set Nconv = 2.
PCA to reduce the feature dimension from 2048 to 300. We QA-Guided Attention: For each hypothesis hk = [q, ak ],
denote ot,r ∈ R300 as the r-th object embedding in the we compute its attention scores w.r.t. the object embeddings
t-th frame. To encode the text input, we use BERT [7], a in each frame and the words in each subtitle sentence, re-
transformer [37] based language model that achieves state- spectively. Given the encoded hypothesis Hk ∈ RLh ×d for
of-the-art performance on various NLP tasks. Specifically, the hypothesis hk with Lh words, and encoded visual fea-
we first fine-tune the BERT-base model using a masked lan- ture Vt ∈ RNo ×d for the frame vt with No objects, we com-
guage model and next sentence prediction on the subtitles pute their matching scores Mk,t ∈ RLh ×No = Hk VtT . We
and QA pairs from the TVQA+ train set. Then, we fix its then apply softmax at the second dimension of Mk,t to get
parameters and use it to extract 768-dimensional word-level the normalized scores M̄k,t . Finally, we compute the QA-
embeddings from the second-to-last layer for the subtitles att
aware visual representation Vk,t ∈ RLh ×d = M̄k,t Vt . Sim-
and each hypothesis. Both the object-level embeddings and ilarly, we compute QA-aware subtitle representation Sk,t att
.
the word-level embeddings are then projected into a 128- Video-Text Fusion: The above two QA-aware representa-
dimensional space using a linear layer with ReLU activa- tions are then fused together as:
tion.
Convolutional Encoder: Inspired by the recent trend of att att att att
Fk,t = [Sk,t ; Vk,t ; Sk,t Vk,t ]WF + bF , (1)
replacing recurrent networks with CNNs [6, 15, 41] and
Transformers [37, 7] for sequence modeling, we use po- where denotes element-wise multiplication, WF ∈
sitional encoding (PE) [37], CNNs, and layer normaliza- R3d×d and bF ∈ Rd are trainable weights and bias, Fk,t ∈
tion [3] to build our basic encoding block. As shown in RLh ×d is the fused video-text representation. Note that the
5
frame and subtitle representations are temporally aligned, respectively. During training, we randomly sample two neg-
which is essential for the downstream span prediction task. atives for each positive box. We use Latti to denote the at-
att
Collecting Fk,t at all time steps, we have Fkatt ∈ RT ×Lh ×d . tention loss for the i-th example, which is obtained by sum-
We then apply another convolutional encoder with a max- ming over all the annotated frames {vt } and concepts {wj }
pooling layer to obtain the output Ak ∈ RT ×d . for Latt
t,j in the example. We define the overall attention loss
PN
Span Predictor: To predict temporal spans, we follow ex- L = N1 i=1 Llse
att
i . At inference time, we choose the box
isting works [21, 33, 41] to predict the probability of each with the highest score as the prediction.
position being the start or end of the span. Given the fused Span Prediction: Given the softmax normalized start and
input Ak ∈ RT ×d , we produce start probabilities p1k ∈ RT end probabilities p1 and p2 , we use cross-entropy loss:
and end probabilities p2k ∈ RT using two linear layers with
N
softmax, as shown in the top-right corner of Fig. 5. 1 X
Lspan = − log p1y1 + log p2y2 ,

Span Proposal and Answer Prediction: Given the max- (3)
2N i=1 i i
pooled video-text representation Ak , we use a linear layer

to further encode it. We run max-pool across all the time where yi1 and yi2 are the indices of the GT start and end
steps to get a global hypothesis representation Ggk ∈ Rd . positions, respectively. To predict the span (st, ed), st ≤ ed
With the start and end probabilities from the span predictor, for each QA pair, we follow previous works [33, 41] to find
we generate span proposals using dynamic programming the one with maximum p1st p2ed .
as [41, 33]. At training time, we combine the set of pro- Answer Prediction: Similar to span prediction loss, given
posals with IoU ≥ 0.5 with the GT spans, as well as the answer probabilities pans , answer prediction loss is:
GT spans to form the final proposals {stp , edp } [31]. At
N
inference time, we take the proposals with the highest con- 1 X
Lans = − log pans
yians , (4)
fidence scores for each hypothesis. For each proposal, we N i=1
generate a local representation Glk ∈ Rd by max-pooling
Ak,stp :edp . The local and the global representations are con- where yians is the index of the GT answer.
catenated to obtain Gk ∈ R2d . We then forward {Gk }5k=1 Finally, the overall loss is a weighted combination of the
through softmax to get the answer scores pans ∈ R5 . above three objectives: L = Lans +watt Latt +wspan Lspan ,
where watt and wspan are set as 0.1 and 0.5 based on vali-
4.3. Training and Inference Objective Functions dation set tuning.
In this section, we describe the objective functions used
in the STAGE framework. Since our spatial and temporal
5. Experiments
annotations are collected based on the question and GT an- Our task is spatio-temporal video question answering,
swer, we only apply the attention loss and span loss on the requiring systems to temporally localize relevant moments,
targets associated with the GT hypothesis (question + GT spatially detect referred objects and people, and answer
answer), i.e., Mk=yans ,t , p1k=yans and p2k=yans . For brevity, questions. In this section, we first introduce our metrics,
we omit the subscript k=y ans in the following. then compare STAGE against several baselines, and finally
Explicit Attention Supervision: While the attention de- provide a comprehensive analysis of our model. Addition-
scribed in Section 4.2 can be learned in a weakly supervised ally, we evaluate our STAGE on the full original TVQA
end-to-end manner, we can also train it with the supervision dataset and achieve rank-1 in the TVQA Codalab leader-
of available GT boxes. We define a box as positive if it board4 at the time of submission, outperforming the second
has an IoU ≥ 0.5 with the GT box. Consider the attention best method by 1.5%.
scores Mt,j ∈ RNo from a concept word wj in GT hy- 5.1. Metrics
pothesis hyans to the set of proposal boxes’ representations
No To measure question answering performance, we use
{ot,r }r=1 at frame vt . We expect the attention on positive classification accuracy (QA Acc.). We evaluate span
boxes to be higher than the negative ones, thus we use a prediction using temporal mean Intersection-over-Union
ranking loss for the supervision. Recent work [20] suggests (Temp. mIoU) following previous works [12, 11] on
using log-sum-exp (LSE) as a smooth approximation of the language-guided video moment retrieval. Since the span
non-smooth hinge loss, as it is easier to optimize. The LSE depends on the hypothesis (QA pair), each QA pair pro-
formulation of ranking loss is: vides a predicted span, but we only evaluate the span of the
X predicted answer. Additionally, we propose a new metric,
Llse
t,j = log 1+exp(Mt,j,rn −Mt,j,rp ) , (2) Answer-Span joint Accuracy (ASA), that jointly evaluates
rp ∈Ωp ,rn ∈Ωn
both answer prediction and span prediction. For this metric,
where Mt,j,rp is the rp -th element of the vector Mt,j . Ωp 4 https://competitions.codalab.org/competitions/
and Ωn denote the set of positive and negative box indices, 20687#results
6
Model vfeat tfeat QA Acc. Grd. mAP Temp. mIoU ASA
1 Longest Answer [19] - - 33.32 - - -
2 TFIDF Answer-Subtitle [19] - - 50.97 - - -
3 two-stream [19] reg GloVe 64.73 - - -
4 two-stream [19] cpt GloVe 66.47 - - -
5 backbone + Attn. Sup. + Temp. Sup. + local (STAGE) reg BERT 74.83 27.34 32.49 22.23
6 Human Performance [19] - - 90.46 - - -
Table 3. Comparison with existing methods on TVQA+ test set. vfeat = video feature, tfeat = text feature. We follow the convention in [19]
to use reg to denote detected object embeddings, cpt to denote detected object labels and attributes. Grd. mAP = grounding mAP, Temp.
mIoU = temporal mIoU, ASA = Answer-Span joint Accuracy.
Model vfeat tfeat QA Acc. Grd. mAP Temp. mIoU ASA

1 two-stream [19] reg GloVe 62.28 - - -
2 two-stream [19] cpt GloVe 62.25 - - -
3 backbone reg GloVe 67.29 4.46 - -
4 backbone reg BERT 68.31 7.31 - -
5 backbone + Attn. Sup. reg BERT 71.03 24.8 - -
6 backbone + Temp. Sup. reg BERT 71.4 10.86 30.77 20.09
7 backbone + Attn. Sup. + Temp. Sup. reg BERT 71.99 24.1 31.16 20.42
8 backbone + Attn. Sup. + Temp. Sup. + local (STAGE) reg BERT 72.56 25.22 31.67 20.78
9 STAGE with GT Span reg BERT 73.28 - - -
Table 4. Ablation study of our proposed STAGE framework on TVQA+ val set. In row 9, we show a model with GT spans at inference.
Attn. Sup. = spatial attention supervision, Temp. Sup. = span predictor with temporal supervision. Models in row 3-7 use only global
feature Gg for question answering, while the one in row 8 additionally use local feature Gl .
we define a prediction to be correct if the predicted span has 5.3. Model Analysis
an IoU ≥ 0.5 with the GT span, provided that the answer
Backbone Model: Given the full STAGE model defined in
prediction is correct. Finally, to evaluate object grounding
Sec. 4, we define the backbone model as the ablated version
performance, we follow the standard metric from the PAS-
of it, where we removed span predictor along with the span
CAL VOC challenge [8] and report the mean Average Pre-
proposal module, as well as the explicit attention supervi-
cision (Grd. mAP) at IoU threshold 0.5. We only consider
sion. Different from the baseline two-stream model [19]
the annotated words and frames when calculating the mAP.
which uses RNNs to model text and video sequences, in our
5.2. Comparison with Baseline Methods backbone model, we use CNN to encode both modalities.
The two-stream [19] model interacts QA pairs with subtitles
We consider the previous two-stream model [19] as our and videos separately, then sums the confidence score from
main baseline. In this model, two streams are used to pre- each modality, while we align subtitles with video frames
dict answer scores from subtitles and videos respectively from the start, fusing their representation conditioned on the
and final answer scores are produced by summing scores input QA pair, as in Fig. 5. We believe this aligned fusion is
from the two streams. We retrain the model using the offi- essential for improving QA performance, as the latter part
cial code5 on TVQA+ data. We also evaluate the two most of STAGE has a joint understanding of both video and subti-
representative non-neural baselines from [19], i.e., Longest tles. Using the same visual and text features, we observe our
Answer and TFIDF Answer-Subtitle matching. backbone model (row 3) far outperforms two-stream (row
Table 3 shows the test results of STAGE and the baseline 1) in Table 4.
methods. Our best QA model (row 5) outperforms previous
BERT as Feature: BERT [7] has primarily been used in
state-of-the-art (row 4) by a large margin in QA accuracy,
NLP tasks. In Table 4, we show it is also useful for video
with 12.58% relative gains. Additionally, our model also
QA task. Compared to the model with GloVe [28] as a text
localizes the relevant moments and detect referred objects
feature, BERT improves the backbone model by 1.52% in
and people. Table 3 shows our model achieves the best mAP
QA Acc. (row 4 vs row 3). We also find it improves the
of 27.34% on object grounding, and the best temporal mIoU
grounding performance of the model by 63.9%, relatively.
of 32.49% on temporal localization. However, a large gap
is still observed between our best model and humans (row Spatial Attention Supervision: On top of the backbone
6), showing there is space for further improvement. model, we use annotated bounding boxes to provide atten-
tion supervision. We compare the model with attention su-
5 https://github.com/jayleicn/TVQA pervision (row 5) with the backbone model (row 4) in Ta-
7
00:16.897 → 00:20.067
00:01.509 → 00:04,539 Grab a napkin, homey,
Leonard: Said the premise is intriguing. you just got served.
00:04.545 → 00:06.475 00:22.236 → 00:23.776

Sheldon: Good to see you again. Leonard: It's fine. You win.
Q: What is Leonard wearing when he says said the premise is intriguing? Q: What did Leonard tell Howard after Howard said that Leonard just got served?
A1: Glasses. Pred GT A1: Leonard told Howard that he really hates that game.
A2: Coffee. A2: Leonard told Howard that Howard isn't very good.
A3: Rosary. A3: Leonard told Howard that Sheldon will beat his score.
A4: Gang Collars. A4: Leonard told Howard that it was fine, he wins. Pred GT
A5: Hat. A5: Leonard told Howard that he will beat him.
(a) (b)
Figure 6. Example predictions from STAGE. The span predictions are shown on the top of each example, each block represents a frame,
the color indicates the model’s confidence for the predicted spans. For each QA, we show grounding examples and scores for one frame in
GT span, GT boxes are shown in green. Model predicted answers are labeled by Pred, GT answers are labeled by GT.
backbone
ble 4. After adding such supervision, we observe a relative Model two-stream [19]
+C1 +C2 +C3 +C4
gain of 3.98% in QA Acc. and 239.26% in Grd. mAP. vfeat reg cpt reg reg reg reg reg reg
tfeat GloVe GloVe GloVe BERT BERT BERT BERT BERT
Temporal Supervision: In Table 4, we also show the re- what (60.52%) 62.71 62.60 67.63 67.58 69.99 70.76 71.25 72.34
sults of our model with span prediction under temporal su- who (10.24%) 53.07 55.66 61.49 64.72 72.60 72.17 73.14 74.11
where (9.68%) 51.37 55.82 62.33 68.49 71.52 71.58 71.58 74.32
pervision. For the backbone model with span prediction why (9.55%) 78.46 75.35 76.74 77.43 79.86 79.86 78.12 76.39
(with global feature Gg for question answering), we have how (9.05%)
total (100%)
65.20
62.28
61.90
62.25
67.40
67.29
69.23
68.31
68.50
71.03
66.30
71.40
69.96
71.99
67.03
72.56
a relative gain of 4.52% in QA Acc. and 48.56% in Grd.
Table 5. QA accuracy breakdown for different approaches across
mAP (row 6 vs row 4). For our backbone model with both
each question type on TVQA+ val set. For brevity, we only show
attention supervision and span predictor, we have a relative
top-5 question types (with the percentage of each type). C1 = Attn.
gain of 1.35% in QA Acc. (row 7 vs row 5). Sup., C2 = Temp. Sup., C3 = Attn. Sup. + Temp. Sup., C4 = Attn.
Span Proposal and Local Feature: In row 8 and row 7 Sup. + Temp. Sup. + local (STAGE).
of Table 4, we compare the models with and without lo-
cal features Gl for answer classification. Local features are
obtained by max-pooling the span proposal regions, which QA Acc.
Model
should contain more relevant cues for answering the ques- val test-public
tions. With additional local features, we achieve the best 1 two-stream [19] 65.85 66.46
performance across all metrics, indicating the benefit of us- 2 anonymous 1 (JunyeongKim) 66.22 67.05
ing a span proposal module, as well as its provided local 3 anonymous 2 (jeyki) 68.90 68.77
4 backbone 68.56 69.67
features.
5 backbone + Temp. Sup. + local 70.50 70.23
Inference with GT Span: The last row of Table 4 shows
our model with GT spans instead of predicted spans at in- Table 6. Model performance on full TVQA dataset. The results
are from TVQA Codalab leaderboard.
ference time. We observe better QA Acc. with GT spans.
Accuracy by Question Type: In Table 5 we show a break-
down of QA Acc. by different question types. We observe
a clear increasing trend on “what”, “who”, and “where” (including failure cases) are provided in the supplementary.
questions after replacing the backbone net and adding at- TVQA Leaderboard Results: We also conduct experi-
tention/span modules in each column. Interestingly, for ments on the Leaderboard’s full TVQA dataset (Table 6),
“why” and “how” question types, our full model fails to without relying on the bounding box annotations and re-
present overwhelming performance, indicating some rea- fined timestamps in TVQA+. Without span predictor (row
soning (textual) module to be incorporated as future work. 4), STAGE backbone is able to achieve 4.83% relative gain
Qualitative Examples: We show two correct predictions in from the best published result (row 1) on TVQA test-public
Fig. 6, where Fig. 6(a) uses text to answer the question, and set. Adding span predictor (row 5), performance is im-
Fig. 6(b) uses grounded objects to answer. More examples proved to 70.23%, a new state-of-the-art for the task.
8
6. Conclusion
We presented the TVQA+ dataset and corresponding
spatio-temporal video question answering task. The pro-
posed task requires intelligent systems to localize relevant
moments, detect referred objects and people, and answer
questions. We further introduced STAGE, a novel, end-to-
end trainable framework to jointly perform all three tasks.
Comprehensive experiments show that temporal and spatial
predictions help improve question answering performance
as well as producing more explainable results. Though
STAGE performs well, there is still a large gap to human
performance that we hope will inspire future research.
Acknowledgments
This research is supported by NSF Awards #1633295, Figure 7. Timestamp refinement interface.
1562098, 1405822, Google Faculty Research Award, Sales-
Original
force Research Deep Learning Grant, Facebook Faculty Re- Refined
20%
Percentage of questions
search Award, and ARO-YIP Award #W911NF-18-1-0336.
A. Appendix
10%
A.1. Timestamp Refinement.
During our initial analysis, we find the original times-
tamp annotations from the TVQA [19] dataset to be some-
what loose, i.e., around 8.7% of 150 randomly sampled 0%
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39
training questions had a span that was at least 5 seconds span lengths (seconds)
longer than what is needed. To have better timestamps, we Figure 8. Comparison between the original and the refined times-
asked a set of Amazon Mechanical Turk (AMT) workers tamps in the TVQA+ train set. The refined timestamps are gener-
to refine the original timestamps. Specifically, we take the ally tighter than the original timestamps
questions that have a localized span length of more than
QA Acc.
10 seconds (41.33% of the questions) for refinement while Model
leaving the rest unchanged. During annotation, we show a Original Refined
backbone 68.56 68.56
question, its correct answer, its associated video (with sub- backbone + Attn. Sup. 71.03 71.03
title), as well as the original timestamp to the AMT workers backbone + Temp. Sup. 70.87 71.40
backbone + Attn. Sup. + Temp. Sup. 71.23 71.99
(illustrated in Fig. 7, with instructions omitted). The work- backbone + Attn. Sup. + Temp. Sup. + local (STAGE) 70.63 72.56
ers are asked to adjust the start and end timestamps to make
the span as small as possible, but need to contain all the Table 7. Model performance comparison between the original
information mentioned in the QA pair. timestamps and the refined timestamps on TVQA+ val set. We use
reg as video feature, BERT as text feature for all the experiments
We show span length distributions of the original and
in this table.
the refined timestamps from TVQA+ train set in Fig. 8. The
average span length of the original timestamps is 14.41 secs,
while the average for the refined timestamps is 7.2 secs. the model is able to localize common objects, it is difficult
In Table 7 we show model performance on TVQA+ val for it to localize unusual objects (Fig. 10(a, d)), small ob-
set using the original timestamps and the refined times- jects (Fig. 10(b)). Incorrect temporal localization is another
tamps. Models with the refined timestamps perform con- most frequent failure reason, e.g., Fig. 10(c, f). There are
sistently better than the ones with the original timestamps. also cases where the objects being referred are not present
in the sampled frame, as in Fig. 10(e). Such failures indi-
A.2. More Examples
cate that using more densely sampled frames for question
We show 6 correct prediction examples from STAGE in answering would be advantageous.
Fig. 9. As can be seen from the figure, correct examples
usually have correct temporal and spatial localization. In
Fig. 10 we show 6 incorrect examples. Incorrect object lo-
calization is one of the most frequent failure reason, while
9
00:34.309 → 00:37.019 00:10.268 → 00:11.848
- What's that? Lesley: ...no extraneous spittle.
- Tea.
00:13.146 → 00:15.356
00:37.729 → 00:38.899 Lesley: On the other hand, no arousal.
Sheldon: When people are upset...
Q: Where is Leonard sitting before Sheldon brings him the tea ? Q: What does Lesley say there was none of when Leonard asked about the kiss?
A1: Sheldon's bed. A1: Lesley says there was no arousal. Pred GT
A2: The armchair. A2: Lesley says there was no passion.
A3: The floor. A3: There was no kiss.
A4: His bed. A4: Lesley says the kiss lacked a certain fire.
A5: The couch. Pred GT A5: Lesley says there was no excitement in the kiss.
(a) (b)
00:00.060 → 00:02.020 00:00.141 → 00:01.391

- Oh, hey, Leonard. Raj: I don't believe it.
- Good afternoon, Penny.
00:01.559 → 00:02.599
00:02.187 → 00:04.567 Howard: Neither do I.
Leonard: So, hi, hey. Uh...
Q: Who visited Penny in her house before dinner? Q: What is Leonard holding when he is listening to Raj?
A1: No one visited Penny in her house. A1: A notepad.
A2: Howard visited Penny in her house. A2: A book.
A3: Raj visited Penny in her house. A3: A yellow cup . Pred GT
A4: Sheldon visited Penny in her house. A4: A cell phone.
A5: Leonard visited Penny in her house. Pred GT A5: A set of keys.
(c) (d)
00:41.444 → 00:43.274
Leonard: - What's up? 00:50.790 → 00:52.290
- Yeah, well, I'm at work too. Leonard, it's 2 in the morning.
00:43.446 → 00:46.656 00:53.918 → 00:59.018

Penny: And you'll never guess who's - So?
here infecting my entire station. - So it's my turn.
Q: Where was Penny when she called to Leonard? Q: Where was Leonard when Sheldon walked into the living room at 2am?
A1: Penny was working at a restaurant. Pred GT A1: On the couch.
A2: Penny was at home. A2: In the time machine. Pred GT
A3: Penny was walking in the street. A3: In the kitchen.
A4: Penny was at bed. A4: In his room.
A5: Penny was in the kitchen. A5: He wasn't there.
(e) (f)
Figure 9. Correct prediction examples from STAGE. The span predictions are shown on the top of each example, each block represents a
frame, the color indicates the model’s confidence for the predicted spans. For each QA, we show grounding examples and scores for one
frame in GT span, GT boxes are shown in green. Model predicted answers are labeled by Pred, GT answers are labeled by GT.
10
00:00.343 → 00:03.763
00:27,095 → 00:42.315 Past Howard: I haven't seen your
Leonard: Sheldon?
Oreos!
00:45.072 → 01:08.032
00:03.972 → 00:07.062
Leonard: Hello?
Past Howard: Just take your bath
without them!
Q: What is Leonard holding when he comes out of the bedroom? Q: What was Raj doing when Howard was shouting at someone?
A1: Leonard is holding his cell phone. Pred A1: Raj was playing some music.
A2: Leonard is holding a baseball bat. A2: Raj was seated in the couch.
A3: Leonard is holding a shovel. A3: Raj was taking a shower. Pred
A4: Leonard is holding a coat hanger. A4: Raj was not in the room.
A5: Leonard is holding a mock lightsaber. GT A5: Raj was eating lots of cookies in his mouth as he watched Howard. GT
(a) (b)
00:17.350 → 00:19.640 00:26.568 → 00:27.818

Howard: Plus Superman and Godzilla. Leonard: Sounds like a breakthrough.
00:20.020 → 00:21.690 00:27.986 → 00:30.486

Leonard: No, no, no. Orcs are magic. Should I call Science and tell them to
hold the cover?
Q: Who grab a bottle after Leonard talked? Q: What is Leonard wearing when he is talking to Sheldon?
A1: Sheldon. A1: A scarf.
A2: Howard. A2: A hat.
A3: Penny. A3: A suit. GT
A4: Raj. GT A4: A kilt. Pred
A5: Leonard. Pred A5: Jogging pants.
(c) (d)
00:23.443 → 00:27.403 00:03.743 → 00:06.663

- You gotta take one for the team. Penny: ...something Elton John would
- Yeah. Sack up, dude. drive through the Everglades.
00:28,823 → 00:30.403 00:12.502 → 00:14.332

Leonard: Fine. Sheldon: It only moves in time.
Q: What was Leonard 's drink when they are talking about taking one for the team? Q: What direction did Sheldon turn to when Penny insulted their time machine ?
A1: Fanta. A1: He looked at his hands.
A2: bottle of water. Pred A2: To the left.
A3: Sprite. A3: Up towards the ceiling. Pred
A4: Gatorade. A4: He turned to Penny.
A5: Coke Cola. GT A5: To the right. GT
(e) (f)
Figure 10. Wrong prediction examples from STAGE. The span predictions are shown on the top of each example, each block represents a
frame, the color indicates the model’s confidence for the predicted spans. For each QA, we show grounding examples and scores for one
frame in GT span, GT boxes are shown in green. Model predicted answers are labeled by Pred, GT answers are labeled by GT.
11
References [17] K.-M. Kim, M.-O. Heo, S.-H. Choi, and B.-T. Zhang. Deep-
story: Video story qa by deep embedded memory networks.
[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, and In IJCAI, 2017. 1, 2, 3
S. Gould. Bottom-up and top-down attention for image cap-
[18] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz,
tioning and vqa. CoRR, abs/1707.07998, 2017. 1, 2, 5
S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bern-
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. stein, and L. Fei-Fei. Visual genome: Connecting language
Zitnick, and D. Parikh. Vqa: Visual question answering. In and vision using crowdsourced dense image annotations. In-
ICCV 2015, 2015. 1, 2 ternational Journal of Computer Vision, 123:32–73, 2016.
[3] J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. 5
CoRR, abs/1607.06450, 2016. 5 [19] J. Lei, L. Yu, M. Bansal, and T. L. Berg. Tvqa: Localized,
[4] F. Chollet. Xception: Deep learning with depthwise sep- compositional video question answering. In EMNLP, 2018.
arable convolutions. 2017 IEEE Conference on Computer 1, 2, 3, 4, 7, 8, 9
Vision and Pattern Recognition (CVPR), pages 1800–1807, [20] Y. Li, Y. Song, and J. Luo. Improving pairwise ranking
2017. 5 for multi-label image classification. 2017 IEEE Conference
[5] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. on Computer Vision and Pattern Recognition (CVPR), pages
Human attention in visual question answering: Do humans 1837–1845, 2017. 6
and deep networks look at the same regions? In EMNLP, [21] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. Bsn: Bound-
2016. 2 ary sensitive network for temporal action proposal genera-
[6] Y. Dauphin, A. Fan, M. Auli, and D. Grangier. Language tion. In ECCV, 2018. 6
modeling with gated convolutional networks. In ICML, [22] C. Liu, J. Mao, F. Sha, and A. L. Yuille. Attention correctness
2016. 5 in neural image captioning. In AAAI, 2016. 2
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: [23] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical
Pre-training of deep bidirectional transformers for language question-image co-attention for visual question answering.
understanding. CoRR, abs/1810.04805, 2018. 5, 7 In NIPS, 2016. 1
[8] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, [24] J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk.
J. Winn, and A. Zisserman. The pascal visual object classes 2018 IEEE/CVF Conference on Computer Vision and Pat-
challenge: A retrospective. International journal of com- tern Recognition, pages 7219–7228, 2018. 1
puter vision, 111(1):98–136, 2015. 7 [25] T. Maharaj, N. Ballas, A. C. Courville, and C. J. Pal. A
[9] J. Gao, C. Sun, Z. Yang, and R. Nevatia. Tall: Temporal dataset and exploration of models for understanding video
activity localization via language query. 2017 IEEE Interna- data through fill-in-the-blank question-answering. 2017
tional Conference on Computer Vision (ICCV), pages 5277– IEEE Conference on Computer Vision and Pattern Recog-
5285, 2017. 2, 3 nition (CVPR), pages 7359–7368, 2017. 1, 3
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- [26] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J.
ing for image recognition. 2016 IEEE Conference on Com- Bethard, and D. McClosky. The Stanford CoreNLP natural
puter Vision and Pattern Recognition (CVPR), pages 770– language processing toolkit. In Association for Computa-
778, 2016. 5 tional Linguistics (ACL) System Demonstrations, pages 55–
[11] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, 60, 2014. 3
and B. Russell. Localizing moments in video with temporal [27] T. Onishi, H. Wang, M. Bansal, K. Gimpel, and
language. In EMNLP, 2018. 2, 6 D. McAllester. Who did what: A large-scale person-centered
[12] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, cloze dataset. EMNLP, 2016. 4
and B. C. Russell. Localizing moments in video with natural [28] J. Pennington, R. Socher, and C. D. Manning. Glove: Global
language. 2017 IEEE International Conference on Computer vectors for word representation. In EMNLP, 2014. 7
Vision (ICCV), pages 5804–5813, 2017. 2, 3, 6 [29] P. Rajpurkar, R. Jia, and P. S. Liang. Know what you don’t
[13] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Dar- know: Unanswerable questions for squad. In ACL, 2018. 2
rell. Natural language object retrieval. 2016 IEEE Confer- [30] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. S. Liang. Squad:
ence on Computer Vision and Pattern Recognition (CVPR), 100, 000+ questions for machine comprehension of text. In
pages 4555–4564, 2016. 2 EMNLP, 2016. 2
[14] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim. Tgif-qa: To- [31] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster r-cnn:
ward spatio-temporal reasoning in visual question answer- Towards real-time object detection with region proposal net-
ing. 2017 IEEE Conference on Computer Vision and Pattern works. IEEE Transactions on Pattern Analysis and Machine
Recognition (CVPR), pages 1359–1367, 2017. 1, 2, 3 Intelligence, 39:1137–1149, 2015. 5, 6
[15] L. Kaiser, A. N. Gomez, and F. Chollet. Depthwise sep- [32] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and
arable convolutions for neural machine translation. CoRR, B. Schiele. Grounding of textual phrases in images by re-
abs/1706.03059, 2018. 5 construction. In ECCV, 2016. 2
[16] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. [33] M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidi-
Referitgame: Referring to objects in photographs of natural rectional attention flow for machine comprehension. CoRR,
scenes. In EMNLP, 2014. 2 abs/1611.01603, 2017. 6
12
[34] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus
regions for visual question answering. 2016 IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
pages 4613–4621, 2016. 1
[35] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Ur-
tasun, and S. Fidler. Movieqa: Understanding stories in
movies through question-answering. 2016 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pages
4631–4640, 2016. 1, 2, 3
[36] A. Trott, C. Xiong, and R. Socher. Interpretable counting for
visual question answering. CoRR, abs/1712.08697, 2018. 2
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all
you need. In NIPS, 2017. 5
[38] J. Welbl, P. Stenetorp, and S. Riedel. Constructing datasets
for multi-hop reading comprehension across documents.
Transactions of the Association of Computational Linguis-
tics, 06:287–302, 2018. 2
[39] J. Weston, A. Bordes, S. Chopra, and T. Mikolov. Towards
ai-complete question answering: A set of prerequisite toy
tasks. CoRR, abs/1502.05698, 2016. 2
[40] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. R.
Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend
and tell: Neural image caption generation with visual atten-
tion. In ICML, 2015. 2
[41] A. W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen,
M. Norouzi, and Q. V. Le. Qanet: Combining local convo-
lution with global self-attention for reading comprehension.
CoRR, abs/1804.09541, 2018. 5, 6
[42] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L.
Berg. Mattnet: Modular attention network for referring ex-
pression comprehension. In The IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), June 2018. 2
[43] L. Yu, E. Park, A. C. Berg, and T. L. Berg. Visual madlibs:
Fill in the blank description generation and question answer-
ing. 2015 IEEE International Conference on Computer Vi-
sion (ICCV), pages 2461–2469, 2015. 1, 2
[44] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Mod-
eling context in referring expressions. In ECCV, 2016. 2
[45] L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-
listener-reinforcer model for referring expressions. In CVPR,
2017. 2
[46] Y. Yu, J. Choi, Y. Kim, K. Yoo, S.-H. Lee, and G. Kim. Su-
pervising neural attention models for video captioning by hu-
man gaze data. 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 6119–6127, 2017.
1, 2
[47] L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, and
M. Rohrbach. Grounded video description. CoRR,
abs/1812.06587, 2018. 1, 2
[48] Y. Zhu, O. Groth, M. S. Bernstein, and L. Fei-Fei. Visual7w:
Grounded question answering in images. 2016 IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
pages 4995–5004, 2016. 1
13

1904 11574

Diunggah oleh

Informasi Dokumen

Judul Asli

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

1904 11574

Diunggah oleh

Hak Cipta:

Format Tersedia

TVQA+: Spatio-Temporal Grounding for Video Question Answering

Jie LeiLicheng Yu Tamara L. Berg Mohit Bansal

We present the task of Spatio-Temporal Video Question 00:02.314 → 00:06.732

dataset with 310.8k bounding boxes, linking depicted ob-

Comprehensive experiments and analyses demonstrate the

Table 1. Data Statistics for TVQA+ dataset.

annotations could provide additional useful training source

moment localization, object grounding and video QA. First,

20% 20% frame-wise regional visual representations and neural lan-

Figure 4. Bounding box/image ratios (left), and span length distri-

Conv Encoder (ed

Attention Attention Start and End

Encoder Encoder Encoder

RCNN BERT BERT

pooled video-text representation Ak , we use a linear layer

Model vfeat tfeat QA Acc. Grd. mAP Temp. mIoU ASA

00:04.545 → 00:06.475 00:22.236 → 00:23.776

00:00.060 → 00:02.020 00:00.141 → 00:01.391

00:43.446 → 00:46.656 00:53.918 → 00:59.018

00:17.350 → 00:19.640 00:26.568 → 00:27.818

00:20.020 → 00:21.690 00:27.986 → 00:30.486

00:23.443 → 00:27.403 00:03.743 → 00:06.663

00:28,823 → 00:30.403 00:12.502 → 00:14.332

Anda mungkin juga menyukai