Robot Learning Proseminar/Seminar SS 2021

(1)

Albert-Ludwigs-Universität Freiburg

Robot Learning Proseminar/Seminar

SS 2021

Robot Learning Lab

(2)

Evaluation

▪ Abstract →

At most 2 pages

▪ _{Presentation →}

_{At most 20 minutes}

▪ _{Summary →}

_{At most 7 pages}

_{excluding bibliography and ﬁgures}

▪ _{Final grade → Abstract, Presentation, Summary, Seminar participation}

Evaluation Bachelor’s Due Date Master’s Due Date

Paper Abstract 18/06/2021 18/06/2021 Seminar Presentation 23/07/2021 16/07/2021 Paper Summary 30/07/2021 30/07/2021

Bachelor’s Seminar:https://rl.uni-freiburg.de/teaching/ss21/robotlearning-bachelor/

(3)

Enrollment Procedure

Select three papers in decreasing order of preference

Register for the seminar in HISInOne

Students selected based on HISInOne Priority Suggestion

Students assigned a paper based on their paper preference

Bachelor’s Form Master’s Form

On 29/04/2021 On 29/04/2021

(4)

● Tremendous progress on complex, high dimensional data

○ Speech Recognition

○ _{Computer Vision}

○ Natural Language Understanding

● _{Autonomous systems smart enough to operate in the real world}

Robot Learning

(5)

Perception

● Complex environments

● Noisy observations and sensors

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow, Waleed et. al., 2017

Rohit Mohan and Abhinav Valada, "EfficientPS: Efficient Panoptic Segmentation", arXiv preprint arXiv:2004.02307, 2020.Patil, Ashok Kumar & Kumar, G Ajay & Kim, Tae-Hyoung & Chai, Young-Ho. (2018). Hybrid approach for alignment of a pre-processed three-dimensional point cloud, video, and CAD model using partial point cloud in retrofitting applications.

International Journal of Distributed Sensor Networks. 14. 155014771876645. 10.1177/1550147718766452.

(6)

Unknown, Open World

● Unknown world → Many unlabelled samples

● _{Uncertainty estimation}

● _{Adversarial attacks}

Towards Open Set Recognition, Scheirer et. al., 2012

Explaining and Harnessing Adversarial Examples, Goodfellow et. al., 2014

(7)

Autonomous Decision Making

● Reinforcement learning for short- and long-term decision making

(8)

Reinforcement Learning

● Model free RL

○ Adapts to complex scenarios

○ _{Directly optimise policy}

○ Data intensive

● Model-based RL

○ Learns a world model

○ _{Promise of better generalisation}

Learning Latent Dynamics for Planning from Pixels, Hafner et. al., 2019

(9)

Expensive Real World Data

● Sim2Real

○ Domain adaptation

○ _{Action and dynamics noise}

● Oﬄine RL

○ Large amounts of unstructured data

○ Little annotated / expert data

Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. Sergey Levine, Peter Pastor, Alex Krizhevsky, Deirdre Quillen D4RL: Datasets for Deep Data-Driven Reinforcement Learning, Justin Fu, Aviral Kumar, Oﬁr Nachum, George Tucker, Sergey Levine

(10)

Weak- and Self-Supervision

● Provide labels for simpler tasks

○ Object presence and absence

○ _{Consistency over time}

○ Viewpoint invariance

● _{Reduce oversight}

○ Automatic resets

○ _{Reward labelling}

Time-Contrastive Networks: Self-Supervised Learning from Video, Sermanet et. al., 2018 TossingBot: Learning to Throw Arbitrary Objects, Zeng et. al., 2019.

(11)

Bachelor Seminar Topics

(12)

1. PolarNet: An Improved Grid Representation for

Online LiDAR Point Cloud Semantic Segmentation

● Uses polar representation instead of spherical or BEV

● Proposes a novel ring-based CNN for use with the polar representation

● _{Beat the SOTA while using lesser parameters}

Supervisor: Nikhil Gosala - Link to paper

Source: Zhang et al. 2020

(13)

▪ Point clouds quantised and projected into a polar grid

▪ _{Ring CNN applied to polar grid to learn the semantic mapping.}

1. PolarNet: An Improved Grid Representation for

Online LiDAR Point Cloud Semantic Segmentation

Source: Zhang et al. 2020

(14)

● _{Addresses the challenge of drone navigation in urban environments}

● Uses data from cars and bicycles to address lack of drone-relevant data

● _{Learns the steering angle and collision probability to control the drone}

2. DroNet: Learning to Fly by Driving

Source: Loquercio et al.

(15)

2. DroNet: Learning to Fly by Driving

Video Link

(16)

Supervisor: Juana Valeria Hurtado - Link to paper

3. Let there be Color: Joint End to End Learning of

Global and Local Image Priors for Automatic Image

Colorization with Simultaneous Classiﬁcation

● Approach to introduce colour into grayscale images

● Merges patch-level local features with image-level global features

Source: IIzuka et al.

(17)

3. Let there be Color: Joint End to End Learning of

Global and Local Image Priors for Automatic Image

Colorization with Simultaneous Classiﬁcation

● Local features capture texture or objects

● Global features capture time of day, location (indoors/outdoors)

● _{Changing the global features can help re-imagine the image}

(18)

4. Context Encoders - Feature Learning by

Inpainting

● _{Approach to hallucinate missing / blacked out regions in image}

● Validates the ability of a network to understand structure in scenes

● _{L2 loss → Captures overall structure}

● Adversarial loss → Chooses one mode out of several possible ones

(19)

4. Context Encoders - Feature Learning by

Inpainting

(20)

Supervisor: Dr. Daniele Cattaneo - Link to paper

5. R2D2: Repeatable and Reliable Detector and

Descriptor

● _{Approach to generate reliable and repeatable keypoints}

● Detect-and-Describe instead of Detect-then-Describe

● _{Good keypoints:}

○ Repeatable (generate strong cues, like corners)

○ Reliable (ability to uniquely match across images)

● _{Predict discriminativeness along with keypoints}

Source: Revaud et al.

(21)

5. R2D2: Repeatable and Reliable Detector and

Descriptor

Source: Revaud et al.

Ground Truth Keypoints Keypoint Repeatability Keypoint Reliability

(22)

6. SegMatch: Segment Based Place Recognition

in 3D Point Clouds

● _{A place recognition approach for 3D point clouds for use in SLAM}

● Generates and matches 3D segments:

○ Alleviates the need for accurate object detection or notion of “object”

○ More informative as opposed to keypoint descriptors

(23)

6. SegMatch: Segment Based Place Recognition

in 3D Point Clouds

▪

Video Link

Segment Matching Loop Closure Detection

(24)

Supervisor: Eugenio Chisari - Link to paper

7. Sim-to-Real Transfer of Robotic Control with

Dynamics Randomization

● _{Bridge the “reality-gap” between simulation and real-world}

● Randomizes the robot simulator dynamics for policy generalisation

● _{Policies trained in simulation can directly be used in the real world!}

● LSTM used to model the memory nodes of the DeepRL network

Source: Peng et al.

Policy Network Value Network

(25)

7. Sim-to-Real Transfer of Robotic Control with

Dynamics Randomization

Video Link

(26)

8. Continuous Control for High-Dimensional State

Spaces: An Interactive Learning Approach

● _{Introduces human corrective feedback into the policy optimisation phase}

● No reward function needed

● _{Sample size greatly reduced because of expert intervention}

● Human actively corrects the errors of the policy and physically interacts with

the system during the learning process

Source: Perez-Dattari et al.

(27)

8. Continuous Control for High-Dimensional State

Spaces: An Interactive Learning Approach

(28)

Supervisor: Daniel Honerkamp - Link to paper

9. Playing Atari with Deep Reinforcement

Learning

● _{CNN to directly learn a control policy from high-dimensional input using RL}

● Network is trained with a variant of Q-Learning

● _{Single model trained once plays multiple Atari games}

● This paper led to a $500 million acquisition of DeepMind by Google

Source: Mnih et al.

(29)

9. Playing Atari with Deep Reinforcement

Learning

(30)

10. Recurrent Models of Visual Attention

● _{CNNs have high computational complexity}

● CNNs replaced with RNNs to control the computation complexity

○ RNN adaptively selects only the most relevant features for a given task

○ The next location to be attended to depends on the past regions and the task

● The computation time is independent of the input size

● _{Model is non-diﬀerentiable and is trained using RL for a speciﬁc task}

(31)

10. Recurrent Models of Visual Attention

Source: kevinzakka GitHub

● _{The network incrementally combines the diﬀerent regions to learn a}

dynamic representation of the image

(32)

Supervisor: Dr. Tim Welschehold- Link to paper

11. Eﬃciency and Equity are Both Essential: A

Generalized Traﬃc Signal Controller with Deep

Reinforcement Learning

● _{An RL method to learn a traffic signal control policy to optimise traffic flow}

● Reward function accounts for both Equity (Variance) and Eﬃciency (Mean)

● _{Adaptive discounting used to reduce the reward for transition phases}

Source: Yan et al.

(33)

Supervisor: Dr. Tim Welschehold - Link to paper

12. Use the Force, Luke! Learning to Predict

Physical Forces by Simulating Eﬀects

● _{Approach to infer forces exerted on objects from a video}

● Predicts physical information that can be used to recreate the interaction

● _{Force and contact point prediction used to deform object mesh to recreate}

the action

Source: Ehsani et al.

(34)

12. Use the Force, Luke! Learning to Predict

Physical Forces by Simulating Eﬀects

Video Link

(35)

Master Seminar Topics

(36)

1. Perceive, Attend, and Drive: Learning Spatial

Attention for Safe Self-Driving

Source: Wei et al.

● _{Lot of computation in perception is wasted on regions that don’t matter!}

● An approach to automatically attend to important regions of the image for

use with motion planning

● The attention mask increases safety by allowing focused computation

(37)

1. Perceive, Attend, and Drive: Learning Spatial

Attention for Safe Self-Driving

Video Link

(38)

2. Orthographic Feature Transform for Monocular

3D Object Detection

● _{An approach to convert a 2D frontal perspective view to a 3D orthographic}

view for 3D object detection in monocular images

● _{Orthographic Feature Transform maps features in frontal view to features in}

the Bird’s eye view

● Generates a holistic view where object scale and distance are meaningful

Source: Roddick et al.

(39)

2. Orthographic Feature Transform for Monocular

3D Object Detection

Source: Roddick et al.

Video Link

(40)

3. Neural Scene Graphs for Dynamic Scenes

Source: Ost et. al.

● _{A neural rendering approach for accurate view synthesis of dynamic}

scenes using scene graph representations

● _{The network decomposes the scene into static and dynamic objects and}

encodes their representations using scene graphs

● Novel scenes can be generated by manipulating the learnt scene graphs

(41)

3. Neural Scene Graphs for Dynamic Scenes

Source: Ost et. al.

(42)

4. Semantic Object Prediction and Spatial Sound

Super-Resolution with Binaural Sounds

● _{Approach to perform semantic object prediction using binaural sound}

● Network trained using a teacher-student approach

○ A vision network acts as a teacher

○ Audio network is the student and learns to generate the same results as the teacher

● Depth estimation and spatial sound super resolution used to boost results

Source: Vasudevan et al.

(43)

4. Semantic Object Prediction and Spatial Sound

Super-Resolution with Binaural Sounds

Source: Vasudevan et al.

(44)

5. xMUDA: Cross-Modal Unsupervised Domain

Adaptation for 3D Semantic Segmentation

● _{Transfer features learnt on one dataset to another dataset}

● Exploits multiple modalities to improve the result of unsupervised domain

adaptation between datasets

● Each modality learns to mimic the better qualities of the other, making the

overall domain adaptation objective better

Source: Jaritz et al.

(45)

5. xMUDA: Cross-Modal Unsupervised Domain

Adaptation for 3D Semantic Segmentation

Source: Jaritz et al.

(46)

6. STaR: Self-Supervised Tracking and

Reconstruction of Rigid Objects in Motion with

Neural Rendering

● _{Self-supervised tracking and dynamic scene reconstruction approach from}

multi-view RGB videos

● _{Explicitly handles rigid motion, which is the main cause of failure in}

competing approaches

● NeRFs and rigid poses are jointly optimised to achieve the goal

Source: Yuan et al.

(47)

6. STaR: Self-Supervised Tracking and

Reconstruction of Rigid Objects in Motion with

Neural Rendering

Video Link

(48)

7. Self-Supervised Correspondence in Visuomotor

Policy Learning

● _{Approach to improve generalisation of visuomotor policy learning using}

self-supervised correspondences

● _{Self-supervised dense correspondence training generalises across}

diﬀerent objects using very little data

● Also propose a technique for multi-camera time-synchronised dense

spatial correspondence learning

Source: Florence et al.

(49)

7. Self-Supervised Correspondence in Visuomotor

Policy Learning

Video Link

(50)

8. Accelerating Reinforcement Learning with

Learned Skill Priors

● _{Approach taps into previously learned skills to learn a prior over skills}

● Deep Latent Variable model jointly learns embedding space of skills and

skill priors

● Skill priors accelerate downstream policy learning on complex tasks

Source: Pertsh et al.

(51)

8. Accelerating Reinforcement Learning with

Learned Skill Priors

Source: Pertsh et al. Competing approaches Their approach

(52)

9. Asymmetric Self-Play for Automatic Goal

Discovery in Robotic Manipulation

● _{A zero-shot generalisation approach to solve multiple unseen tasks using}

asymmetric self-play

● _{Two agents play against one another - Alice and Bob}

○ Alice proposes a challenge to Bob

○ Bob tries to solve that challenge

● _{Bob learns a policy by observing Alice’s trajectory and cloning its behaviour}

● ASP always gives an achievable goal and comes up with many novel goals

(53)

9. Asymmetric Self-Play for Automatic Goal

Discovery in Robotic Manipulation

Video Link

(54)

10. Language as a Cognitive Tool to Imagine

Goals in Curiosity-Driven Exploration

● _{A DeepRL approach that automatically imagines out-of-distribution goals}

for goal discovery and open-ended learning

● _{The model interprets the OOD goal using separation of:}

○ Learned goal-achievement reward function

○ Policy relying on deep-sets, gated attention and object-centered representation

Source: Colas et al.

(55)

10. Language as a Cognitive Tool to Imagine

Goals in Curiosity-Driven Exploration

Source: Colas et al. Goals provided by the social partner

Imagined goals (Soma goals are meaningless)

(56)

11. Reinforcement Learning with Videos:

Combining Oﬄine Observations with Interaction

● _{Model learns a policy using human}

demonstration videos along with online data collected by the robot

● Human experience videos don’t have reward or

motion cues

○ Model estimates a reward using domain-invariant representation of the images

● The model successfully learns challenging vision

based skills with much less data

Source: Schmeckpeper et al.

(57)

11. Reinforcement Learning with Videos:

Combining Oﬄine Observations with Interaction

Source: Schmeckpeper et al.

(58)

12. A GMM Layer Jointly Optimized with

Discriminative Features within a Deep Neural

Network Architecture

● _{Propose a NN with GMM as its ﬁnal layer and}

jointly trained using Asynchronous SGD

● _{The best of both worlds:}

○ Superior feature extraction of NNs

○ Well-studied and structured classiﬁcation of GMMs

● Evaluate the improvement obtained by:

○ Joint training of the GMM and the other layers

○ Changing network depth in the GMM network and the vanilla one

○ Using a GMM model instead of a vanilla NN

Source: Variani et al.

(59)