
Statement of purpose

Timur Garipov
I wish to pursue a Ph.D. in Computer Science at the Massachusetts Institute of Technology with the intention of working as a researcher after graduation. My research interests lie in machine learning, deep learning, and probabilistic modelling.
Since my early school years, I have been passionate about computer science and mathematics. My interest originated
from coding simple video games and other entertainment-oriented software in middle school and later continued
with algorithms and data structures. In high school, I participated very actively in programming contests, attended specialized summer schools, and took 5-hour training sessions four days a week to prepare for national and
international competitions. As a result, I won prizes at the Russian Olympiad in Informatics (the main national
programming competition) three years in a row, being among the top 16 at the national level in my graduation
year.
During my Bachelor’s studies at Lomonosov Moscow State University, I became deeply interested in another area of
computer science — machine learning. I was fascinated by its ability to solve problems that are infeasible for classical
algorithms but are easy for humans. Machine learning is one of the most dynamic areas of modern applied mathematics: it combines abstract mathematical concepts with practical scientific computing tools and provides solutions for a wide range of application domains, which made me even more eager to contribute to the field. Therefore, I
decided to join the department of Mathematical Methods of Forecasting, specifically the Bayesian Methods Research
Group led by Dr. Dmitry Vetrov, which is, in my opinion, the strongest machine learning group in Russia. After a very competitive selection process, I was admitted to the group and began doing research under Dr. Vetrov's supervision.
Research Experience
Since joining Dr. Vetrov's group, I have considered research the most exciting and important part of my studies.
The first project I worked on was devoted to the compression of deep neural networks (DNNs) via a multidimensional
tensor decomposition called Tensor Train (TT). I proposed an effective way of representing the kernels of convolutional layers in the TT format, which reduces the number of parameters in convolutional layers by a factor of 2-3 with negligible effect on predictive accuracy. I was particularly excited about the possibility of training extremely wide networks (for example, with 10^5 units in fully-connected layers) using the proposed approach, which is impossible
otherwise. This project developed into my Bachelor’s thesis, and we also published our results at the NIPS workshop
on tensor methods in Machine Learning in 2016. Attending the NIPS conference and seeing the pulse of the ML
community reinforced my motivation to pursue a PhD and become a researcher in this field.
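
To illustrate the kind of factorization involved, below is a minimal sketch of the TT-SVD idea applied to a small convolutional kernel, written for this statement rather than taken from the paper: the four kernel axes are treated directly as TT modes, which simplifies the channel reshaping scheme actually used in our work, and the random kernel is not low-rank, so the numbers only demonstrate the mechanics.

import numpy as np

rng = np.random.default_rng(0)

# A small convolutional kernel viewed as a 4-way tensor (out_ch, in_ch, k, k).
# Treating the four axes directly as TT modes is a simplification of the
# reshaping used in the actual project.
kernel = rng.normal(size=(16, 16, 3, 3))

def tt_svd(tensor, max_rank):
    """Decompose a tensor into Tensor Train cores via sequential truncated SVDs."""
    shape = tensor.shape
    cores, c, r_prev = [], tensor, 1
    for n in shape[:-1]:
        c = c.reshape(r_prev * n, -1)
        u, s, vt = np.linalg.svd(c, full_matrices=False)
        r = min(max_rank, s.size)                   # truncate to the TT-rank cap
        cores.append(u[:, :r].reshape(r_prev, n, r))
        c = s[:r, None] * vt[:r]                    # carry the remainder forward
        r_prev = r
    cores.append(c.reshape(r_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores, shape):
    """Contract the TT cores back into a full tensor of the given shape."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.reshape(shape)

cores = tt_svd(kernel, max_rank=8)
approx = tt_reconstruct(cores, kernel.shape)
n_full = kernel.size
n_tt = sum(core.size for core in cores)
rel_err = np.linalg.norm(approx - kernel) / np.linalg.norm(kernel)
print(f"parameters: {n_full} -> {n_tt}, relative reconstruction error: {rel_err:.3f}")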
During my work on tensor decompositions in neural networks, I learned about Bayesian deep learning and was extremely excited about this research direction, as it combines principled probabilistic ideas with highly practical, impactful methods. I told Dr. Vetrov that I wanted to work on Bayesian deep learning, and we decided to develop a framework for Bayesian incremental training of DNNs, i.e., continual learning in a setting where training data arrives sequentially in chunks. Working on this project, I gained practical experience with complex probabilistic models and efficient inference methods and implemented them in a framework of stochastic computational graphs; I created the code base with the methods and experimental routines. The results of this research were published as a
workshop paper at ICLR 2018.
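
As a flavor of the incremental setting, the sketch below (written for illustration here, not code from the project) shows the core idea on a conjugate Bayesian linear-regression model, where the posterior after each chunk of data becomes the prior for the next chunk; the DNN case we studied required approximate inference instead of these exact Gaussian updates.

import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
noise_std = 0.1

def update(mean, cov, X, y):
    """One Bayesian update: combine the Gaussian prior (mean, cov) with a data chunk."""
    prec = np.linalg.inv(cov) + X.T @ X / noise_std**2
    cov_new = np.linalg.inv(prec)
    mean_new = cov_new @ (np.linalg.inv(cov) @ mean + X.T @ y / noise_std**2)
    return mean_new, cov_new

mean, cov = np.zeros(2), np.eye(2)          # broad prior before seeing any data
for chunk in range(5):                      # training data arrives sequentially in chunks
    X = rng.normal(size=(20, 2))
    y = X @ true_w + rng.normal(scale=noise_std, size=20)
    mean, cov = update(mean, cov, X, y)     # the previous posterior serves as the new prior
    print(f"after chunk {chunk}: posterior mean = {mean.round(3)}")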
Working on Bayesian deep learning, I became curious about the structure of the modes of the posterior distribution
and consequently the geometry of DNN loss landscapes. I posed the following question: are different local optima of
the loss isolated or can they be connected with a non-linear curve so that the loss remains low along the curve? I
came up with a possible way to find such curves, inspired by applications of the re-parametrization trick, Monte Carlo estimation, and stochastic optimization in Bayesian machine learning. I proposed to define a parametric curve
in the weight space and find parameters which minimize the line integral of the loss along the curve. After obtaining
positive preliminary results, my mentor and I started a collaboration with colleagues from Cornell University and worked hard to turn this idea into a top-tier conference paper. In multiple experimental setups (on
fully-connected, convolutional and recurrent networks), we consistently observed that the local optima of the loss are
in fact connected by very simple paths of low loss. I was fascinated by this discovery, as it meant that Bayesian deep
learning may be much simpler than we previously thought. For example, MCMC methods do not need to cross barriers of high loss in order to explore the high-posterior-density region, which is an extremely hard task. Further, the fact that even such a simple observation had not been made before showed how little we know about deep neural networks
and how much there is yet to discover. Inspired by mode-connectivity, we proposed a practical ensembling method
which averages the predictions of models obtained from the SGD trajectory; we showed that this method can construct
a high-performing ensemble in the time needed to train a single network. It was extremely exciting to work on this
project, and we put all our effort into making it succeed. As a result, the paper was accepted for a spotlight presentation at NIPS 2018.
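
To convey the curve-finding procedure, here is a minimal toy sketch written for this statement (not the paper's implementation): on a 2D loss whose zero-loss valley is the unit circle, a quadratic Bezier curve with one trainable bend is fit by sampling t uniformly and taking stochastic gradient steps on the loss along the curve, so the straight segment between the two optima crosses a barrier while the trained curve stays in the valley.

import numpy as np

# Toy 2D "loss surface": L(w) = (||w||^2 - 1)^2.  The unit circle is a zero-loss
# valley, so the two optima below are connected by a curved low-loss path, while
# the straight segment between them passes through a barrier at the origin.
def loss(w):
    return (w @ w - 1.0) ** 2

def loss_grad(w):
    return 4.0 * (w @ w - 1.0) * w

w1 = np.array([1.0, 0.0])                    # first "trained solution"
w2 = np.array([-1.0, 0.0])                   # second "trained solution"

def curve_point(theta, t):
    """Quadratic Bezier curve between w1 and w2 with a single trainable bend theta."""
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2

rng = np.random.default_rng(0)
theta = np.array([0.0, 0.1])                 # start near the straight line, slightly off-axis
lr = 0.05
for step in range(5000):
    t = rng.uniform()                        # Monte Carlo sample t ~ U[0, 1]
    point = curve_point(theta, t)
    grad_theta = 2 * t * (1 - t) * loss_grad(point)   # chain rule: d(phi)/d(theta) = 2t(1-t)
    theta -= lr * grad_theta                 # stochastic gradient step on the bend

ts = np.linspace(0.0, 1.0, 101)
straight = np.mean([loss((1 - t) * w1 + t * w2) for t in ts])
curved = np.mean([loss(curve_point(theta, t)) for t in ts])
print(f"average loss along the straight segment: {straight:.3f}")   # crosses the barrier
print(f"average loss along the trained curve:    {curved:.3f}")     # stays near the valley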
Immediately after submitting the mode connectivity paper, we started to work on a follow-up project inspired by our
ensembling results. Analyzing the ensembling method that we proposed, we found that averaging weights instead of
predictions of the models along the SGD trajectory gives very similar predictive performance. Further, it incurs almost no
computational overhead compared to SGD, and consistently leads to better generalization. We called the new method
Stochastic Weight Averaging (SWA). A large fraction of our work on this paper was devoted to understanding why
SWA works so well. Motivated by the discussion of flat and sharp optima in the literature, we measured the width of
SWA solutions along random directions and found that SWA significantly increases the width compared to the SGD
solution. I was very excited by both the practicality of SWA and the geometrical observations behind it. In just one
month after the mode-connectivity paper was submitted, we wrote a paper on SWA and it was accepted to UAI 2018
for an oral presentation. We also published two follow-up papers on the applications of SWA to uncertainty
estimation and reinforcement learning at the UAI workshop on Uncertainty in Deep Learning.
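
The averaging step at the heart of SWA is easy to sketch. The toy example below, written for illustration here rather than taken from our code, runs SGD on a noisy quadratic loss and keeps a running average of weight snapshots collected after a warm-up period; it demonstrates only the weight-averaging mechanics, not the flatness and generalization effects we studied in the paper.

import numpy as np

rng = np.random.default_rng(0)
optimum = np.array([3.0, -2.0])

def stochastic_grad(w):
    """Noisy gradient of a simple quadratic loss, standing in for minibatch noise."""
    return (w - optimum) + rng.normal(scale=0.5, size=w.shape)

w = np.zeros(2)
lr = 0.1
swa_w, n_snapshots = None, 0
for step in range(1, 2001):
    w -= lr * stochastic_grad(w)             # ordinary SGD step
    if step > 1000 and step % 50 == 0:       # after a warm-up, snapshot periodically
        n_snapshots += 1
        # Running average of the snapshots: the core of Stochastic Weight Averaging.
        swa_w = w.copy() if swa_w is None else swa_w + (w - swa_w) / n_snapshots

print("distance to optimum, final SGD iterate:", np.linalg.norm(w - optimum))
print("distance to optimum, averaged weights: ", np.linalg.norm(swa_w - optimum))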
Concurrently with my research, I worked on a data science project for Sberbank (the largest bank in Russia) and did
an internship at Google in summer 2018. In these projects I applied the knowledge and skills acquired in machine
learning courses and gained experience working with real-world data, paying particular attention to the interpretation of the results. I also came to appreciate the crucial role of system design, including data pipelines, and improved my software development skills, which helped me a lot in my research.
Research interests
Although deep learning has become a ubiquitous tool that has proved extremely effective in a very wide range of
applications, we still understand surprisingly little about the mechanics of DNN training and generalization. While I
am open to venturing into new areas, in my PhD I would be excited to contribute to developing a better understanding
of deep learning models and the techniques we use to train them. So far I have focused on the empirical analysis of loss surfaces,
and I believe there is still a lot to be discovered in this area, as even simple results such as mode connectivity and
the effectiveness of weight averaging were unknown in the community before. I would also be excited to take a more
theoretical approach to better understand geometrical properties of deep nets as well as their relation to generalization.
I am fascinated by probabilistic modelling and Bayesian Machine Learning. While Bayesian deep learning has yet to
prove its practicality, it shows great promise for modelling predictive uncertainty and for creating powerful generative and
discriminative models for structured data. I believe that in order to achieve strong empirical results with Bayesian
deep learning, we need to look deeper into the geometry of loss surfaces and posterior distributions, and I would be excited to work on this during my PhD.
There are many outstanding faculty members at the Massachusetts Institute of Technology whose research interests align perfectly with the directions I want to pursue in my PhD. It would be a privilege for me to join the group of Professor Tommi Jaakkola and to extend his work on probabilistic inference and deep learning. The work of Professor Tamara Broderick on Bayesian inference is also closely related to my research interests, and I am excited about the work of Professor David Sontag on complex probabilistic models and inference algorithms.
I believe that my passion for research in machine learning, my past research and industry experience, and my strong academic background will be beneficial for my graduate studies.
