I have joined Google’s Search team at its Bangalore office. I am working in areas related to Machine Learning and Natural Language Processing.

Before that, I was a Visiting Research Scholar at the Machine Learning and Perception lab at Georgia Tech, led by Prof. Dhruv Batra, where I worked on Machine Learning problems at the intersection of vision and language. Prior to that, I was on a research internship at the Statistics and Machine Learning Group at the Indian Institute of Science, Bangalore, and earlier I worked as a Platform Engineer at Soroco. I completed my Bachelor’s in Computer Science and Engineering at the Indian Institute of Technology Kharagpur in May 2016.

Select Awards / Honors

  • Aug 2009 – NTSE Scholarship (National Talent Search Examination) by Department of Science and Technology, Government of India
    • Awarded to fewer than 1% of applicants based on their scholastic achievements.
  • Aug 2011 – KVPY Fellow (Kishore Vaigyanik Protsahan Yojana Fellow) by Department of Science and Technology, Government of India
    • Awarded to just ~100 students across India with high academic potential.
  • Sep 2012 – Aditya Birla Scholarship by Aditya Birla Group
    • Awarded to just 15 students from the best engineering and management colleges of India.
  • Feb 2013 – OP Jindal Engineering and Management Scholarship by OP Jindal Group
    • Awarded to just 4 students from each of the few top-tier engineering and management colleges in India.


Visual Landmark Selection for Generating Grounded and Interpretable Navigation Instructions.
S. Agarwal, D. Parikh, D. Batra, P. Anderson, S. Lee
CVPR 2019 Workshop on Deep Learning for Semantic Visual Navigation
[pdf] / code (to be released soon)
Two-stage instruction generation

Instruction following for vision-and-language navigation (VLN) has prompted significant research efforts developing more powerful “follower” models since its inception. However, the inverse task of generating visually grounded instructions given a trajectory – or learning a “speaker” model – has been largely unexamined. We present a “speaker” model that generates navigation instructions in two stages: it first selects a series of discrete visual landmarks along a trajectory using hard attention, and then generates language instructions conditioned on these landmarks. This two-stage approach improves over prior work, while also permitting greater interpretability. We hope to extend this to a reinforcement learning setting where landmark selection is optimized to maximize a follower’s performance without disrupting the model’s language fluency.

AppTechMiner: Mining Applications and Techniques from Scientific Articles
M. Singh, S. Dan, S. Agarwal, P. Goyal, A. Mukherjee
Joint Conference on Digital Libraries (JCDL) 2017: 6th International Workshop on Mining Scientific Publications
A sample Phrase-Cloud representing the proportion of papers for an area.

A rule-based information extraction framework that automatically constructs a knowledge base of all application areas and problem solving "techniques", given a text corpus of research papers. "Techniques" include tools, methods, datasets or evaluation metrics. We also categorize individual research articles based on their application areas and the techniques proposed/improved in the article. Our system achieves high average precision (~82%) and recall (~84%) in knowledge base creation. It also performs well in application and technique assignment to an individual article (average accuracy ~66%). We demonstrated the framework for the domain of computational linguistics but this can be easily generalized to any other field of research.



Publication number: 20180189093

Abstract: Techniques for executing one or more instances of a computer program using virtual machines, the computer program comprising multiple computer program portions including a first computer program portion. The techniques include determining whether an instance of any of the multiple computer program portions is to be executed; when it is determined that a first instance of the first computer program portion is to be executed, accessing first information specifying a first set of one or more virtual machine resources required for executing the first instance of the first computer program portion; determining whether any one of the plurality of virtual machines has at least the first set of virtual machine resources available; and when it is determined that a first of the plurality of virtual machines has the first set of virtual machine resources available, causing the first virtual machine to execute the first instance of the first computer program portion.

Filed: January 5, 2018

Publication date: July 5, 2018

Inventors: Sanyam Agarwal, Rohan Narayan Murty, George Peter Nychis, Wolfgang Richter, Nishant Kumar Jain, Surabhi Mour, Shreyas H. Karanth, Shashank Anand
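
The resource-matching step described in the abstract above can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the patented method: the `VirtualMachine` class, the `find_vm` helper, and the resource names `cpu` and `memory_gb` are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VirtualMachine:
    """Hypothetical VM record: a name plus the resources it still has free."""
    name: str
    available: Dict[str, int] = field(default_factory=dict)

    def can_run(self, required: Dict[str, int]) -> bool:
        # The VM qualifies only if it has at least the required amount
        # of every resource named in the requirement.
        return all(self.available.get(res, 0) >= amount
                   for res, amount in required.items())

    def run(self, required: Dict[str, int]) -> None:
        # Reserve the resources for the program portion being executed.
        for res, amount in required.items():
            self.available[res] = self.available.get(res, 0) - amount


def find_vm(vms: List[VirtualMachine],
            required: Dict[str, int]) -> Optional[VirtualMachine]:
    """Return the first VM that has the required resources, or None."""
    for vm in vms:
        if vm.can_run(required):
            return vm
    return None


# Hypothetical example: one program portion needs 2 CPUs and 4 GB of memory.
vms = [
    VirtualMachine("vm-a", {"cpu": 1, "memory_gb": 2}),
    VirtualMachine("vm-b", {"cpu": 4, "memory_gb": 8}),
]
required = {"cpu": 2, "memory_gb": 4}

chosen = find_vm(vms, required)
if chosen is not None:
    chosen.run(required)
```

A first-fit search is used here purely for simplicity; the abstract only requires that some virtual machine with sufficient resources is selected.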

Select Projects

Fast GPU-Based Simulator for Room-to-Room dataset
Sample trajectory in Room-to-Room dataset

The Room-to-Room dataset is a common dataset used in several vision-and-language navigation tasks. It contains real-world panoramic scans of building interiors provided by the Matterport3D dataset. An agent moves through this 3D environment by taking actions. I optimized the original Room-to-Room simulator to use the GPU for state-update operations whenever the agent takes an action. Combined with caching of repeated computations, this resulted in a simulator with a 17x higher frame rate on a single GPU! Training time for most vision-and-language models can be brought down significantly by using this simulator instead of the original.
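
The caching idea can be sketched in a few lines. This is a minimal illustration, not the simulator's actual code: `render_view`, its arguments, and the placeholder return value are hypothetical stand-ins for an expensive per-view computation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def render_view(scan_id: str, viewpoint_id: str, heading_index: int) -> tuple:
    """Hypothetical stand-in for an expensive per-view computation.

    A real simulator would render or load image features here; this
    sketch just returns its arguments so the caching is easy to observe.
    """
    return (scan_id, viewpoint_id, heading_index)


# The second call with identical arguments is served from the cache,
# so the function body runs only once for this view.
first = render_view("scan0", "vp1", 3)
second = render_view("scan0", "vp1", 3)
stats = render_view.cache_info()
```

Agents revisit the same viewpoints many times during training, which is why memoizing per-view results can pay off.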

ICLR 2018 reproducibility challenge: Interpretable Counting for Visual Question Answering
Ques: How many sheepskin are grazing?

The paper improves upon the state-of-the-art accuracy for counting-based questions in VQA. It does so by enforcing the prior that each count corresponds to a well-defined region in the image and is not diffused all over it. The authors hard-attend over a fixed set of candidate regions (taken from a pre-trained Faster R-CNN network) in the image by fusing them with information from the question. They use a variant of REINFORCE, Self-Critical Training, which is well suited for generating sequences.
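
As a rough illustration of self-critical training as it is commonly formulated, the REINFORCE baseline is the reward obtained by the model's own greedy output, so a sampled sequence is reinforced only when it beats greedy decoding. The sketch below assumes scalar rewards and per-step log-probabilities; the function name and the numbers are hypothetical, and the paper's exact reward may differ.

```python
import math
from typing import Sequence


def self_critical_loss(sample_log_probs: Sequence[float],
                       sample_reward: float,
                       greedy_reward: float) -> float:
    """Surrogate loss for one sampled sequence.

    advantage = reward(sampled sequence) - reward(greedy sequence)
    loss      = -advantage * sum of log-probabilities of the sampled steps
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_log_probs)


# Hypothetical two-step sample whose reward beats the greedy baseline.
log_probs = [math.log(0.5), math.log(0.25)]
loss = self_critical_loss(log_probs, sample_reward=1.0, greedy_reward=0.0)
```

Minimizing this loss raises the probability of samples that outperform the greedy decode and lowers it for samples that do worse; when the two rewards are equal the loss is zero.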

The first public PyTorch implementation of Skip-Thought Vectors
[code] [blog]
Skip-Thoughts model

Skip-Thoughts uses the order of sentences to “self-supervise” itself. The underlying assumption is that whatever in the content of a sentence leads to a better reconstruction of the neighbouring sentences is also the essence of that sentence. The decoders are trained to minimise the reconstruction error of the previous and next sentences given the embedding of the current sentence. This reconstruction error is back-propagated to the encoder, which must now pack in as much information about the current sentence as will help the decoders minimise the error in generating the previous and next sentences.
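
The self-supervision signal comes entirely from sentence order, which can be shown with a small data-preparation sketch. The helper name `skip_thought_triples` and the example sentences are hypothetical, and the encoder and decoders themselves are omitted.

```python
from typing import List, Tuple


def skip_thought_triples(sentences: List[str]) -> List[Tuple[str, str, str]]:
    """Build (previous, current, next) training triples from ordered text.

    The current sentence is the encoder input; the previous and next
    sentences are the reconstruction targets for the two decoders.
    The first and last sentences lack one neighbour and are skipped
    as encoder inputs.
    """
    return [
        (sentences[i - 1], sentences[i], sentences[i + 1])
        for i in range(1, len(sentences) - 1)
    ]


# Hypothetical four-sentence passage.
passage = [
    "I got home late.",
    "The cat was waiting at the door.",
    "It wanted to be fed.",
    "I opened a can of food.",
]
triples = skip_thought_triples(passage)
```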

The first public PyTorch implementation of Attentive Recurrent Comparators
[code] [blog]
ARC comparing between two images of the same character

Attentive Recurrent Comparators achieve state-of-the-art performance in one-shot learning on the Omniglot dataset. The key idea is to make the training phase as close as possible to the evaluation phase. Since the evaluation phase is comparative, making the training phase comparative as well leads to better generalization. In training, the model learns to compare two randomly chosen images of either the same or different characters.

Robot Soccer, IIT Kharagpur
Bronze medal at FIRA 2015, South Korea

Led the Artificial Intelligence team. Created a framework for learning robot control policies for playing soccer using Q-Learning. Our group won a bronze medal at FIRA 2015, South Korea.
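
For readers unfamiliar with Q-Learning, the standard tabular update rule can be sketched as follows. This is a generic illustration, not the team's framework: the state and action names, the learning rate `alpha`, and the discount `gamma` are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Sequence, Tuple

QTable = DefaultDict[Tuple[str, str], float]


def q_update(q: QTable, state: str, action: str, reward: float,
             next_state: str, actions: Sequence[str],
             alpha: float = 0.5, gamma: float = 0.9) -> float:
    """Apply one tabular Q-Learning update and return the new Q-value.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical soccer-like example with two actions.
q: QTable = defaultdict(float)
actions = ["kick", "dribble"]

# Scoring from near the goal yields a reward of 1.
v1 = q_update(q, "near_goal", "kick", 1.0, "scored", actions)
# Dribbling from midfield to near the goal yields no immediate reward,
# but inherits value from the state it leads to.
v2 = q_update(q, "midfield", "dribble", 0.0, "near_goal", actions)
```

The second update shows how value propagates backwards: the midfield action gains value only because it leads to a state from which scoring has already been learned.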