Technical reading
Published:
This is a list of select research works that I’ve read and liked, ordered from most to least recently read, much like a record of my stream of consciousness. I would like to think that I could have collaborated on writing some of these, and it is my dream that one day I might produce works like them.
2023
- Boundary loss for highly unbalanced segmentation at PMLR 2019: A loss function that takes the form of a distance metric over the space of contours instead of regions, allowing application to highly unbalanced segmentation tasks with relatively stable training.
- Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations at MICCAI 2017: Generalizes the Dice loss with class re-balancing weights to handle highly unbalanced segmentation (a small sketch follows below).
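The generalized Dice loss is compact enough to write down, so here is a minimal NumPy sketch of how I understand it, with per-class weights set to the inverse squared ground-truth volumes. Shapes and names are my own assumptions, not the paper’s code:

```python
import numpy as np

def generalized_dice_loss(probs, targets, eps=1e-6):
    """Generalized Dice loss, my own sketch.

    probs:   (N, C) predicted class probabilities per pixel (flattened).
    targets: (N, C) one-hot ground truth per pixel (flattened).
    """
    # Class weights: inverse squared ground-truth volume per class,
    # which up-weights rare classes in unbalanced segmentations.
    w = 1.0 / (targets.sum(axis=0) ** 2 + eps)

    intersection = (w * (probs * targets).sum(axis=0)).sum()
    union = (w * (probs + targets).sum(axis=0)).sum()
    return 1.0 - 2.0 * intersection / (union + eps)
```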
2022
- Uncertainty Weighted Losses at CVPR 2018: Simple scheme to learn loss weights in a multi-task framework in a principled manner (sketched after this list).
- GAN-supervised Dense Visual Alignment at CVPR 2022: Really cool way to use GANs for image congealing, with applications in augmented reality and image editing. They jointly optimize their loss with respect to both a transformation given by a spatial transformer network and a latent code that represents pose.
- End-to-End Multi-Person Pose Estimation with Transformers at CVPR 2022: This hits home for me because it’s very closely related to something I worked on at Wrnch. It uses the Hungarian set-based loss and output positional encodings that DETR suggests to perform, for the first time, fast fully-differentiable multi-person pose estimation (the matching step is sketched after this list).
- Formal Algorithms For Transformers on arXiv 2022: Excellent mathematically precise overview of transformer architectures.
- Learning Image Representations with a Deformable Grid at ECCV 2020: Representing images on a deformed grid to better align with high-frequency image content. One reason to be excited about this is the ability to make predictions at lower resolution that translate to higher resolution without sub-pixel error.
- Aligning Semantic Segmentation Maps with Implicit Neural Representations at ECCV 2022: Uses implicit neural representations to align features at different levels of an upsampling pyramid and produce segmentation maps at arbitrary resolution. This work is related to DefGrid learning (paper above) in that both propose strategies for representing images at lower resolution, saving compute while potentially being more precise.
- Pix2PixHD at CVPR 2018: Application of conditional GANs for image synthesis from semantic label maps.
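For the uncertainty-weighted losses entry above, this is the homoscedastic-uncertainty weighting I have in mind, as a rough PyTorch sketch: learn a log-variance per task and combine losses as sum_i exp(-s_i)·L_i + s_i. The exact constants in the paper differ slightly depending on the task type, so treat this as my own approximation:

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Learn per-task loss weights via homoscedastic uncertainty (a sketch)."""

    def __init__(self, num_tasks):
        super().__init__()
        # One learnable log-variance per task, initialized to zero.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        # task_losses: iterable of scalar loss tensors, one per task.
        total = 0.0
        for i, loss in enumerate(task_losses):
            # exp(-s_i) down-weights noisy tasks; + s_i penalizes inflating the uncertainty.
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total
```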
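And since both the multi-person pose paper above and DETR rest on bipartite matching between predictions and ground truth, here is the kind of matching step I picture, using SciPy’s Hungarian solver. The plain L2 cost is a stand-in of mine; the actual papers use richer matching costs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred, gt):
    """Hungarian matching between predicted and ground-truth instances.

    pred: (P, D) predicted instance descriptors (e.g. flattened keypoints).
    gt:   (G, D) ground-truth descriptors.
    Returns (pred_idx, gt_idx) giving the minimum-cost one-to-one matching
    over min(P, G) pairs.
    """
    # Pairwise L2 cost between every prediction and every ground-truth instance.
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx
```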
2021
- “Tutorial: Pay Attention to What You Need: Do Structural Priors Still Matter in the Age of Billion Parameter Models?” at NeurIPS 2021: General principles to bring structure to deep learning systems, partly inspired by Daniel Kahneman’s Thinking, Fast and Slow.
- Masked Autoencoders are Scalable Vision Learners at CVPR 2022: Presents a new self-supervised pre-training method that works surprisingly well for Transformer models in vision tasks.
- Robustness of Vision Transformers to occlusions at NeurIPS 2021: A revealing set of experiments showing that vision transformers are curiously more robust to occlusions than ConvNets.
- Spectral Norm for GAN training at NeurIPS 2021: A study on why spectral normalization has been successful in stabilizing GAN training.
- Knowledge Distillation for Object Detection at CVPR 2021: A knowledge distillation scheme for object detection that is based on distilling information in regions of most disagreement between teacher and student.
2020
- CoordConv in NeurIPS 2018: brilliant exposition of a curious way in which ConvNets fail and a blatantly simple solution that allows a conv kernel to be aware of its position in terms of pixel coordinates (sketched after this list). Note that this paper weirdly mixes up translation invariance with translation equivariance a lot.
- SOLO and SOLOv2 on arXiv 2020: Simple and elegant idea of segmenting object instances by separating mask predictions into channels according to location in the image and object size; improved in the follow-up work with dynamic convolutions and a fast Matrix NMS scheme.
- A Metric Learning Reality Check at ECCV 2020: Benchmarks and compares deep metric learning methods from the past few years, exposing flaws in the experimental methodology of several of them.
- What Matters in Unsupervised Optical Flow at ECCV 2020: A thorough study of the different losses, occlusion handling and smoothness regularization strategies in optical flow learning, leading to a new and simple unsupervised optical flow technique that sets a new state of the art on multiple benchmarks.
- DETR at ECCV 2020: Object detection with Transformer networks by directly performing fixed-size set prediction, without the need for RPNs and NMS.
- 3D pose estimation with 2D marginal heatmaps at WACV 2019: A beautiful alternative to representing 3D pose ground truth with memory- and computation-heavy volumetric heatmaps.
- Associative Embeddings at NIPS 2017: A way to supervise networks for simultaneous detection and grouping.
- Transformers for Image Recognition on arXiv 2020: The first successful application of a Transformer directly to images, by breaking them down into sequences of image patches, for recognition tasks (the patch embedding is sketched below).
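The CoordConv fix from the first item in this list is short enough to sketch from memory: concatenate normalized x and y coordinate channels to the input before a regular convolution. A rough PyTorch version of mine, not the authors’ code:

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Conv layer whose input is augmented with normalized (x, y) coordinate channels."""

    def __init__(self, in_channels, out_channels, **kwargs):
        super().__init__()
        # Two extra input channels for the x and y coordinate grids.
        self.conv = nn.Conv2d(in_channels + 2, out_channels, **kwargs)

    def forward(self, x):
        n, _, h, w = x.shape
        # Coordinate grids in [-1, 1], broadcast over the batch.
        ys = torch.linspace(-1.0, 1.0, h, device=x.device).view(1, 1, h, 1).expand(n, 1, h, w)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device).view(1, 1, 1, w).expand(n, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))
```

Something like `CoordConv2d(3, 16, kernel_size=3, padding=1)` then drops in wherever a plain conv would go.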
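And the patch-sequence idea in the vision Transformer paper boils down to a reshape plus a linear projection. A minimal patch-embedding sketch as I understand it, with placeholder hyperparameters and the class token left out:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to an embedding."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided conv is equivalent to flattening each patch and applying a linear layer.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):
        x = self.proj(x)                  # (N, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)  # (N, num_patches, dim)
        return x + self.pos_embed
```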
2019
- Domain Adversarial Training of Neural Networks at JMLR 2017: Elegant idea to use gradient ascent on a “domain discriminator head”, enabled by a gradient reversal layer, to learn domain-invariant features (sketched after this list).
- Sampling Matters in Deep Embedding Learning at ICCV 2017: A study on embeddings for metric learning, emphasizing the importance of training data sampling for learning good embeddings. Also proposes a simple margin-based metric learning loss.
- Rethinking the Inception Architecture for Computer Vision at CVPR 2016: I actually like this paper most for something they only very briefly touch upon, the concept of label smoothing regularization. LSR is a way to regularize classifiers by adding noise to the class probability targets in the cross-entropy loss. By preventing the correct class logit from becoming much larger than all other logits, it makes the model more adaptable (a small sketch follows below).
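The gradient reversal layer behind domain-adversarial training (first item in this list) is almost a one-liner. This is how I would sketch it in PyTorch, my own version rather than the authors’ code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed (and scaled) gradient flows back into the feature extractor,
        # pushing it toward domain-invariant features.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    # Insert between the feature extractor and the domain discriminator head.
    return GradReverse.apply(x, lambd)
```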
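Label smoothing itself is just a tweak to the cross-entropy targets: replace the one-hot target with (1 − eps)·one-hot + eps/K. A small sketch of the uniform-smoothing version:

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, eps=0.1):
    """Cross-entropy against smoothed targets: (1 - eps) * one_hot + eps / K."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes).float()
    smooth_targets = (1.0 - eps) * one_hot + eps / num_classes
    # Standard cross-entropy, but the target distribution is never exactly one-hot,
    # so the correct-class logit cannot grow unboundedly larger than the rest.
    return -(smooth_targets * log_probs).sum(dim=-1).mean()
```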
pre 2018
- Deconvolution and checkerboard artifacts.
- Spatial Pyramid Pooling in TPAMI 2015.
- Batch renormalization from NIPS 2017.