
CS3244-17 Reading Comprehension On Lecture Notes

Nguyen Van Hoang, Lee Pei Xuan, Kevin Leonardo, Calvin Tantio, Luong Quoc Trung, Tan Joon Kai Daniel
School of Computing, National University of Singapore

Abstract

This project explores the application of open-domain Question Answering (QA) to learning materials, contributing a lecture note dataset, called LNQA, annotated with question-answer pairs. Our approach is to improve the overall pipeline of lecture note reading comprehension, which involves context retrieval (finding the relevant slides) and text reading (identifying the correct information). Experiments show that initializing our text reader model with a version pre-trained on SQuAD significantly improves its performance on the much more limited lecture note dataset, compared with both training from scratch and inferring directly from the pre-trained model. Narrowing down the search space by specifying the departments of questions also improves document retriever results, so we additionally examine state-of-the-art sentence classifiers for predicting the departments of questions.

Motivation

• Recent success in QA (SQuAD)
• A system answering students' queries by searching the provided lecture notes → effectively assists revision

Modified DrQA Model

Figure 1: An overview of our modified question-answering system based on DrQA [1].

Document Retriever

Retrieving contexts containing candidate answers by returning the top n contexts with the highest similarity to a given question:

c* = argmax_c tfidf(q) · tfidf(c)   (1)
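Equation (1) scores each context c by the dot product of its TF-IDF vector with the question's. The following is a minimal sketch of this retrieval step, not the project's implementation (DrQA itself uses hashed bigram TF-IDF); scikit-learn and the toy slide texts are assumptions for illustration:

# Minimal sketch of the Eq. (1) retriever, assuming scikit-learn.
# `slides` is a hypothetical stand-in for the LNQA lecture-note contexts.
from sklearn.feature_extraction.text import TfidfVectorizer

slides = [
    "Estimating the prior: maximum likelihood estimate (MLE).",
    "Gradient descent updates weights along the negative gradient.",
    "A decision tree splits the feature space by information gain.",
]

vectorizer = TfidfVectorizer()
slide_vectors = vectorizer.fit_transform(slides)        # tfidf(c) for every context

def retrieve(question, n=2):
    # Return the top-n slides ranked by tfidf(q) · tfidf(c).
    q_vector = vectorizer.transform([question])         # tfidf(q)
    scores = (slide_vectors @ q_vector.T).toarray().ravel()
    return [(slides[i], scores[i]) for i in scores.argsort()[::-1][:n]]

print(retrieve("What does MLE represent?"))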
Document Reader

Signals:

• Input: question q, paragraph p
• Output: best answer span

Word representations (a minimal feature-building sketch follows this list):

• GloVe word embedding (the only feature used for q)
• Exact match: 1 if the paragraph token exactly matches a question word, 0 otherwise
• Linguistic features: POS, NER, TF
• Aligned question embedding: similarity between p and q
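To make the feature list concrete, the sketch below assembles the per-token paragraph features. It is an illustration only, assuming spaCy for the POS/NER tags; the GloVe and aligned-question-embedding components are omitted for brevity:

# Illustrative per-token reader features, assuming spaCy
# (python -m spacy download en_core_web_sm). Not the project's code.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def token_features(paragraph, question):
    q_words = {t.text.lower() for t in nlp(question)}
    doc = nlp(paragraph)
    tf = Counter(t.text.lower() for t in doc)
    return [{
        "exact_match": int(t.text.lower() in q_words),  # 1 if token appears in q
        "pos": t.pos_,                                  # part-of-speech tag
        "ner": t.ent_type_,                             # entity type, '' if none
        "tf": tf[t.text.lower()] / len(doc),            # normalized term frequency
    } for t in doc]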
Paragraph and question tokens are encoded with bidirectional LSTMs:

p_1, ..., p_m = BiLSTM(p̃_1, ..., p̃_m)   (2)
q_1, ..., q_l = BiLSTM(q̃_1, ..., q̃_l)   (3)

where p̃_i and q̃_i are the feature vectors of the i-th paragraph and question tokens. Two independent classifiers predict the answer start and end:

P_start(i) ∝ exp(p_i W_s q)   (4)
P_end(i) ∝ exp(p_i W_e q)   (5)

where W_s and W_e are the weight matrices to be trained. We choose the best span from token i to token i′ such that i ≤ i′ ≤ i + 15 and P_start(i) × P_end(i′) is maximized, as in the sketch below.
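The span-selection rule amounts to a small search over all (i, i′) pairs. A minimal sketch assuming NumPy and toy probabilities (in the real reader these scores come from Eqs. (4) and (5)):

# Minimal sketch of span selection: maximize P_start(i) * P_end(i')
# subject to i <= i' <= i + max_len, with max_len = 15 as in the text.
import numpy as np

def best_span(p_start, p_end, max_len=15):
    best_score, best_pair = -1.0, (0, 0)
    for i in range(len(p_start)):
        for j in range(i, min(i + max_len, len(p_end) - 1) + 1):
            if p_start[i] * p_end[j] > best_score:
                best_score, best_pair = p_start[i] * p_end[j], (i, j)
    return best_pair, best_score

# Toy scores over a 6-token paragraph; the best span is tokens 1-3.
p_start = np.array([0.05, 0.60, 0.10, 0.10, 0.10, 0.05])
p_end = np.array([0.05, 0.10, 0.10, 0.60, 0.10, 0.05])
print(best_span(p_start, p_end))   # -> ((1, 3), score ~0.36)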
Approach & Results References
q1, ..., ql = BiLST M (qei, ...qel ) (3) sdDrQA S S 62.9 72.6
2 independent classifiers predicting for the LNQA dataset: iQA S L 13.5 43.3 [1] Danqi Chen, Adam Fisch, Jason Weston, and
answer start and end: csQA L L 9.9 41.6
• Outsourcing the data gathering process Antoine Bordes.
Pstart(i) ∝ exp(piWsq ) (4) to the public using MTurk wsQA DrQA L L 26.1 56.2 Reading wikipedia to answer open-domain
Pend(i) ∝ exp(piWeq ) (5) sd-wsQA sdDrQA L L 28.7 56.9 questions.
Department specification on Document Table 4: Exact match and F1 scores of different
arXiv preprint arXiv:1704.00051, 2017.
Where W is the weight matrix to be Retriever - Experiments: QA models. Acknowledgements
trained.
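As a minimal sketch of how a fastText department classifier can be trained and queried, assuming the official fasttext Python bindings and a hypothetical training file dept.train with one labelled question per line (e.g. "__label__EECS What does MLE represent?"):

# Minimal fastText sketch; `dept.train` is a hypothetical file with one
# "__label__<DEPT> <question>" example per line.
import fasttext

model = fasttext.train_supervised(input="dept.train", epoch=25, wordNgrams=2)

# Rec@1 / Rec@5 check whether the true department is among the top-1/top-5 labels.
labels, probs = model.predict("What does MLE represent?", k=5)
print(labels, probs)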
Transfer learning on Document Reader - Experiments:

• Datasets: SQuAD (S), LNQA (L)

Model     Pre      Train   Test   EM     F1
DrQA      –        S       S      69.5   78.8
sdDrQA    –        S       S      62.9   72.6
iQA       –        S       L      13.5   43.3
csQA      –        L       L      9.9    41.6
wsQA      DrQA     L       L      26.1   56.2
sd-wsQA   sdDrQA   L       L      28.7   56.9

Table 4: Exact match and F1 scores of different QA models.

Hyper-parameter tuning (a minimal TPE sketch follows Table 5):

• Grid search (GS)
• Tree-structured Parzen Estimator (TPE)

Method   Time     EM     F1
GS       3d 20h   24.4   55.22
TPE      <1d      28.7   57.79

Table 5: Performance of different tuning methods.
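The tuning setup of Table 5 can be sketched with any TPE implementation; the example below assumes the hyperopt library, with train_and_eval as a hypothetical placeholder for training the reader and returning its F1:

# Minimal TPE sketch with hyperopt; `train_and_eval` is a hypothetical
# placeholder for the real training loop returning an F1 score.
from hyperopt import fmin, tpe, hp

def train_and_eval(learning_rate, dropout, hidden_size):
    # Placeholder objective; replace with actual reader training + evaluation.
    return 56.0 - 10 * abs(dropout - 0.3)

space = {
    "learning_rate": hp.loguniform("learning_rate", -9, -3),
    "dropout": hp.uniform("dropout", 0.1, 0.5),
    "hidden_size": hp.choice("hidden_size", [64, 128, 256]),
}

def objective(params):
    return -train_and_eval(**params)   # fmin minimizes, so negate F1

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)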
Analysis

Macroscopic level - Retriever (Table 2):

• The tf-idf retriever performs worse on larger datasets (Rec@1: 0.93 on LN-test, 0.81 on LN, 0.74 on SQuAD).
• Specifying the department of queries improves the retriever (0.81 Rec@1 vs. 0.91 Rec@1) → beneficial to build a department classifier on questions in the long run, to eliminate the need to input the department.
• The baseline SOTA classifier fastText could only achieve 0.72 Rec@1, while approx. 0.9 Rec@1 is required to outperform the baseline retriever without departments → improve by adding more data or exploring other classification methods.

Macroscopic level - Reader (Table 4):

• Warm-start models (wsQA, sd-wsQA) outperform the cold-start and direct-inference models (csQA, iQA) → beneficial to initialize training on the smaller dataset (LNQA) with models pre-trained on the larger dataset (SQuAD).
• The scaled-down model (sd-wsQA) outperforms the full model (wsQA) → beneficial to scale down the originally complex model trained on the larger dataset (SQuAD) for better generalization on the smaller dataset (LNQA).
• The sequential model-based optimization approach (TPE) improves hyper-parameter tuning in both speed and performance (Table 5).

Microscopic level - Reader (Table 1):

• Both warm-start models (wsQA and sd-wsQA) give the best results on the sample test point.

Context: estimating the prior maximum likelihood estimate (MLE)
Question: What does MLE represent?
iQA: likelihood estimate
csQA: estimating the prior
wsQA: maximum likelihood estimate
sd-wsQA: maximum likelihood estimate
Truth: maximum likelihood estimate

Table 1: A sample QA pair in the test set.

Contributions

• Constructing LNQA, a QA dataset on lecture notes
• Examining transfer learning from a QA model pre-trained on a larger dataset (SQuAD) to a smaller dataset (LNQA)
• Examining the improvement in context retrieval when specifying the departments of questions
• Examining a SOTA sentence classifier for department prediction
• Examining the improvement in time taken for hyper-parameter tuning when using the Tree-structured Parzen Estimator

References

[1] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051, 2017.

Acknowledgements

We would like to extend our gratitude to the CS3244 teaching team for the opportunity to embark on this project, our anonymous reviewers for their invaluable feedback, and our Mechanical Turk respondents for their great work on our HITs. Special thanks go out to MIT OpenCourseWare for making educational materials openly accessible, without which LNQA could not have been built.
