International Conference on Computer Vision (ICCV) 2023: A retrospective

Michelle Yi (Yulle)
15 min read · Dec 5, 2023
Photo of the Eiffel Tower taken by the author after the first night of ICCV 2023 in Paris.

Earlier this month, I had the privilege of attending the International Conference on Computer Vision (ICCV) in Paris. After years of attending research and other technology conferences virtually, I was excited to revisit the experience of meeting the people behind the state of the art in an open, research-oriented forum like ICCV. It did not disappoint.

General layout of ICCV 2023. Photo taken by author.

In this article, I share a retrospective covering:

  1. Some of the state-of-the-art research presented at the conference.
  2. What excited me the most given other momentum in the Bay Area right now, and how this research can be incorporated into startups.
  3. How I personally experienced the conference and what impressed me about the inclusiveness at ICCV 2023.
  4. What we can take away from all of it to apply elsewhere.

1. A glimpse into the state-of-the-art

It can be daunting to try to keep up with any given research area, especially AI and machine learning right now, when so many papers are published to arXiv daily. This makes it extremely difficult to understand either the macro direction of research trends or what is pertinent to the area an individual is working on.

Looking at the growth of submissions to ICCV alone, interest and research in the computer vision space have significantly increased since 2017.

Slide on trends in submissions from the Opening Remarks. Link here.

Thankfully, the conference organizers and committees review a massive number of submissions and distill top papers by area into the orals presented at the event, with other relevant papers available via posters. Everyone has an opportunity to ask questions or discuss the papers with the authors.

This year, over 8,260 papers were submitted; roughly 7,000 reviewers narrowed the pool down to 2,161 accepted papers, and only 152 of those made it to orals.

Slide from the Technical Program section of the Opening Remarks. Link here.

Taking a look at the accepted papers across the top ten topic areas, it is interesting to observe the following (from slide 21 here) as we think about research trends in computer vision:

  1. 3D from multiview and sensors — Includes simultaneous localization and mapping (SLAM) problems used in autonomous vehicles, robotics, 3D scene generation (neural radiance fields), visual querying of video, augmented reality, and other use cases.
  2. Image and video synthesis — Generating relevant images and video.
  3. Transfer, low-shot, continual, long-tail learning — Creating performant deep models from a large number of images that follow a long-tailed class distribution. For example, these models can be easily biased towards dominant classes (e.g. cats and dogs) and perform poorly on tail classes (e.g. elephants).
  4. Low-level and physics-based vision — Detailed features used to describe an image. Low-level vision considers things like corners, angles, and colors, while physics-based vision examines lighting, weathering, reflections, shape, etc.
  5. Vision and language — Targeting truly multimodal models where both vision augments language and language augments vision.
  6. Segmentation, grouping, and shape analysis — Grouping and segmentation processes tied to issues of shape perception and representation, including topics such as abstraction, interpolation between early and middle vision, occluded figures, and contours.
  7. 3D from a single image and shape-from-x — Reconstructing 3D structure from a single image or from shape cues using point clouds, geometry estimation, and other methods.
  8. Self-, semi-, meta-, unsupervised learning — Developing better models using self-supervised (no labeled data), semi-supervised (some labeled data), meta-learning (adapting the model based on how it learns), or unsupervised methods.
  9. Recognition: Detection — Conducting object recognition and distinguishing subjects from the background.
  10. Adversarial attack and defense — Crafting perturbations to inputs that are difficult for humans to perceive but that cause models to produce incorrect predictions (attack), and the corresponding methods to defend against this (defense).

We will not delve into all of these papers, but there are a few trends and topic areas I want to highlight in more detail below.

Video and robotics as the future

Video is really hard tech, but an area ripe for research and innovation.

Video is used in two ways:

  • Exteroception teaches us about the external world. We build mental models of behavior (physical, social, etc.) and use them to interpret, predict, and control.
  • Proprioception tells us about our own current state in the world. It helps produce an episodic memory situated in space and time and guides action in a context-specific way (egocentric views are an active area of research).

Scaling token-based, LLM-like models is not the answer, as it is not feasible in its current state: capturing the 4D world and its complexity translates into a much higher token count than text, as shown in the workshop slide example below. New methods need to be developed to address this.

Slide image on the future of computer vision workshop taken by the author at ICCV 2023.
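To make the scale concrete, here is a quick back-of-envelope sketch. The per-frame and per-page token counts are my own illustrative assumptions, not figures from the workshop.

```python
# Back-of-envelope arithmetic (illustrative assumptions, not workshop figures):
# tokenizing video the way we tokenize text explodes the sequence length.
TOKENS_PER_FRAME = 256        # assume a 16x16 grid of patch tokens per frame
FPS = 30                      # standard video frame rate
TOKENS_PER_TEXT_PAGE = 500    # rough token count for one page of prose

one_minute_video = TOKENS_PER_FRAME * FPS * 60
print(one_minute_video)                          # 460800 tokens
print(one_minute_video // TOKENS_PER_TEXT_PAGE)  # ~921 "pages" for one minute
```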

Robotics is also a good focus area for further research because it combines everything computer vision does with the difficulties of the physical world. For example, robotics must handle multiple spatiotemporal scales to take even a single action: the finest scale can be understood as physical movement, while the larger scales are best understood in terms of goals and intentionality. A single action can thus be thought of as a combination of movements and goals within the complex constraints of an environment.

Data poisoning, safety, and adversarial attacks

Foundation and deep learning models are vulnerable to multiple forms of attack, including data poisoning and adversarial attacks. While I do not see as much discourse on this subject as on generative AI in practice, it is critical that we pay attention to research in this space to ensure applications built with (generative) AI are secure.

In fact, it is commonly known that CNNs are highly vulnerable to adversarial attacks, so there has been ongoing research into making them more robust. One paper proposes an adjustment to Mixture of Experts that adds adversarial training, in which the “robustness of routers (i.e., gating functions to select data-specific experts) and robustness of experts (i.e., the router-guided pathways defined by the subnetworks of the backbone CNN)” are alternately optimized.

“Robust Mixture-of-Expert Training for Convolutional Neural Networks” ICCV 2023 slides on bi-level optimization. Paper link.

Using this framework, which the authors call “AdvMoE” (Adversarial Mixture of Experts), they see robustness increase by up to 4% and training cost drop by more than 50% while maintaining accuracy.

“Robust Mixture-of-Expert Training for Convolutional Neural Networks” ICCV 2023 slides on performance comparisons. Paper link.
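For intuition, here is a minimal sketch of that alternating, bi-level routine. The MoE model and its two optimizers (one over router parameters, one over expert parameters) are hypothetical stand-ins rather than the authors' implementation, and the attack is plain PGD.

```python
# Sketch of AdvMoE-style alternating adversarial training (PyTorch).
# `model`, `router_opt`, and `expert_opt` are assumed stand-ins: two
# optimizers, one over router parameters and one over expert parameters.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Standard PGD: find a bounded perturbation that maximizes the loss."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        F.cross_entropy(model(x + delta), y).backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    model.zero_grad()  # discard gradients the attack left on the model
    return (x + delta).detach()

def advmoe_style_step(model, router_opt, expert_opt, x, y):
    # Phase 1: adversarially train the routers (gating functions),
    # holding the experts fixed.
    x_adv = pgd_attack(model, x, y)
    router_opt.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    router_opt.step()

    # Phase 2: regenerate adversarial examples, then train the experts
    # (the router-guided subnetworks), holding the routers fixed.
    x_adv = pgd_attack(model, x, y)
    expert_opt.zero_grad()
    F.cross_entropy(model(x_adv), y).backward()
    expert_opt.step()
```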

Vision-Language Pre-trained (VLP) models are also highly subject to adversarial attacks. As text-to-image models become more popular, it is important to recognize that multimodal encoders can now be exploited to transfer vulnerabilities across modalities.

The paper “Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models” surveys the current state of vulnerabilities (below) and proposes a new attack that further exploits the multimodal nature of modern foundation models and goes beyond white-box (full access to the model) settings.

“Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models” ICCV 2023 slides on the current state of adversarial attacks on VLP. Paper link.

Compared to existing adversarial attacks, the proposed method, Set-level Guidance Attack (SGA), improves the transferability of adversarial examples by taking advantage of complex cross-modal interactions:

“Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models” ICCV 2023 slides on the approach of Set-level Guidance Attacks. Paper link.

The paper reports impressive results: against fused (ALBEF) and aligned (TCL) multimodal architectures, the attack can be up to 30% more effective in both white-box and black-box settings, with some variation by architecture (CLIP ViT versus CLIP CNN, with the CNN variants faring worse).
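For intuition only, here is a much simpler single-pair, cross-modal PGD baseline against CLIP, not the authors' set-level SGA method. The model name is real; applying the perturbation budget in preprocessed pixel space is a simplification on my part.

```python
# Toy cross-modal attack on CLIP via Hugging Face transformers. This is NOT
# SGA; it perturbs one image away from its own caption as a baseline.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def attack(image, caption, eps=8/255, alpha=1/255, steps=10):
    inputs = processor(text=[caption], images=image, return_tensors="pt")
    pixels = inputs["pixel_values"]          # note: already normalized
    text_emb = F.normalize(
        model.get_text_features(input_ids=inputs["input_ids"]), dim=-1
    ).detach()
    delta = torch.zeros_like(pixels, requires_grad=True)
    for _ in range(steps):
        img_emb = F.normalize(
            model.get_image_features(pixel_values=pixels + delta), dim=-1
        )
        loss = -(img_emb * text_emb).sum()   # push image away from its caption
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (pixels + delta).detach()
```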

On the other hand, some of these adversarial techniques are now also helping artists and others protect their art from being wrongfully used for training.

There is also research evaluating how we assess adversarial purification methods and how to apply more memory-efficient techniques to improve purification. One paper in this space that stood out to me was “Robust Evaluation of Diffusion-Based Adversarial Purification”: instead of relying on the adjoint method, which depends on the performance of an underlying numerical solver, the authors propose a surrogate backpropagation process to approximate gradients through iterative procedures, revealing that some recent works may be less robust than claimed.

Example of the full gradient calculation which can be challenging from a memory perspective. The surrogate process proposes an iterative approach. Link.
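The memory trade-off can be sketched in a few lines. The denoiser, classifier, and step counts below are placeholders, and a truncated unroll is just one possible surrogate, not necessarily the paper's exact estimator.

```python
# Sketch: approximating the attack gradient through an iterative purifier.
# `denoise` and `classifier` are placeholder callables.
import torch
import torch.nn.functional as F

def purify(x, denoise, steps):
    for _ in range(steps):  # e.g., a diffusion-style denoising loop
        x = denoise(x)
    return x

def surrogate_grad(x, y, denoise, classifier, surrogate_steps=5):
    # Unrolling a full 100+ step purification stores every intermediate
    # activation; a short surrogate unroll approximates the gradient at a
    # fraction of the memory cost.
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(classifier(purify(x, denoise, surrogate_steps)), y)
    loss.backward()
    return x.grad
```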

Attacks and defenses like these highlight the need to ensure models are robust to adversarial attacks as adoption of text-to-image and other multimodal models increases, and the need for further conversation on responsible applications of AI.

NeRF, poses, and the art of 3D

When looking at augmented reality applications or images generated by vision models, some results can look like something straight out of a horror movie, given how distorted the figures are. Humans appear in virtual reality with half of their bodies passing through a bench, and generated human poses can have limbs appearing out of nowhere.

Many of the posters and papers presented showed real progress in this space, such as realistic reconstructions of poses like here, incorporating actual constraints, or those that incorporate real-world physics through kinematic observation, as below.

An example of a physically impossible generated model. Paper link.
Example of how taking into account knowledge of physics can produce correct models. Paper link.

The progress in generating scenes (e.g., NeRF), camera angles, models, and more has exciting implications for the future of video, robotics, augmented and virtual reality, and other applied areas of computer vision.

2. Translating research into fuel for startups

Robotics as an allegory for teaching machines

Networks have traditionally been used in robotics only to encode states for policy learning, but now roboticists are thinking about how to use the embedding space more effectively and creatively. That includes using LLMs to interact with the world, interpreting the world through hierarchical representations, and efficiently representing image observations.

Improving robotics also teaches us about teaching machines more generally: these are all capabilities we need if we want to apply foundation models more creatively, whether through reinforcement learning or other forms of automated decision-making based on observations.

Personalization on another level

Advancements in computer vision will also improve personalization on multiple levels. One creative direction is egocentric views, captured from a given person’s perspective of a scene. Example papers include:

  • Egocentric video — EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries, by Jinjie Mai, Abdullah Hamdi, Silvio Giancola, Chen Zhao, Bernard Ghanem
  • EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity, by Zijie Jiang, Masatoshi Okutomi
  • ClothPose: A Real-world Benchmark for Visual Analysis of Garment Pose via An Indirect Recording Solution, by Wenqiang Xu, Wenxin Du, Han Xue, Yutong Li, Ruolin Ye, Yan-Feng Wang, Cewu Lu
Example of realistically fitted clothing based on this paper.

Metadata and data management — retail analytics and understanding which kinds of images containing x are viewed

With enhancements in computer vision, it also seems like there will be improved metadata and collection of metadata about how users are interacting with images or models. For example, the Segment Anything paper results will allow retailers and others to automatically extract and catalogue data about images that consumers are interacting with.

Illustration of how the Segment Anything model works. Link.
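As a rough illustration of the cataloguing idea, here is a minimal sketch using the released segment-anything package. The checkpoint filename is the public ViT-H release; the image path and the catalogue schema are my own assumptions.

```python
# Minimal sketch: extracting region-level metadata with Segment Anything.
# Requires the `segment-anything` package and the released ViT-H checkpoint;
# the catalogue schema below is hypothetical.
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("product_photo.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # one dict per detected region

catalogue = [
    {
        "bbox": m["bbox"],                  # XYWH box for the region
        "area": m["area"],                  # region size in pixels
        "confidence": m["predicted_iou"],   # model's own quality estimate
    }
    for m in masks
]
print(f"extracted {len(catalogue)} regions from the image")
```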

Other papers to look at include:

  • Towards Open-Vocabulary Video Instance Segmentation, by Haochen Wang, Cilin Yan, Shuai Wang, Xiaolong Jiang, Xu Tang, Yao Hu, Weidi Xie, Efstratios Gavves
  • Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval, by Pandeng Li, Chen-Wei Xie, Liming Zhao, Hongtao Xie, Jiannan Ge, Yun Zheng, Deli Zhao, Yongdong Zhang

It is pretty exciting to consider how analytics like these can be improved to enhance the user experience.

Including fairness, privacy, and explainability in vision applications

A common approach to applying machine learning models now is to download an existing model (there are over 10,000 on Hugging Face alone) and then fine-tune it. However, it is commonly recognized that all pre-trained models carry social biases, and it is not apparent what those biases are. Nonprofits and research institutions are manually expert-testing some large language models (LLMs) to identify, for example, political biases in popular LLMs. This is ultimately because we have no visibility into the data used to train the original model (as opposed to the fine-tuning data).

I found several interesting papers at ICCV related to these topics around mitigating bias, enhancing privacy, and practical methods we can apply as we implement generative AI systems to ensure fairer outcomes.

The authors of Overwriting Pretrained Bias with Finetuning Data asked whether the biases in pre-trained models transfer when you fine-tune them for downstream tasks and, if so, how people can mitigate them. They specifically tackled the two most common forms of bias as defined in computer vision, shown in the image below from their presentation.

The type of biases that are common in computer vision and that the authors try to address. Link.

This is also a useful framework for us more generally to consider, as many struggle to identify, quantify, and then create testing methods to address bias in generative AI and other types of machine learning.

What stands out in the paper is the scientific approach the authors apply to setting up the experiment, testing, and then proposing a solution that mitigates the underlying biases of the original pre-trained model through careful fine-tuning. This is the kind of rigorous bias testing we should do before putting any out-of-the-box model into production.

Example of how the authors test for bias in pre-trained models and the associated metrics for measuring success: FPR (false positive rate) for spurious correlations and AUC (area under the curve) for underrepresentation. Link.

For the experiment, the study starts by analyzing bias as spurious correlations, where a sensitive attribute, like gender, might be incorrectly correlated with other attributes in a model. Specifically, they look at images in the CelebA dataset labeled “male” that have eye bags as a spuriously correlated attribute. They also note that the “male” label is not gender inclusive, since it relies on human or semi-automated (human-input-based) labeling. By reducing the strength of the correlation with only 100 images in the eye-bag example, they cut the bias in half.

The key takeaway for this paper is that while pre-trained models may introduce bias into finetuned models, careful curation and manipulation of the finetuning dataset can correct for these biases, thus reconciling performance with fairness in model outcomes.
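A minimal sketch of that data-side intervention might look like the following. The metadata keys are hypothetical, and equal-cell resampling is a generic decorrelation recipe rather than the authors' exact procedure.

```python
# Sketch: weakening a spurious correlation by rebalancing the finetuning set.
import random
from collections import defaultdict

def decorrelated_subset(examples, n_per_cell, seed=0):
    """examples: dicts with hypothetical 'label' and 'attribute' keys.
    Returns a subset with equal counts in every (label, attribute) cell,
    so the attribute carries no information about the label."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for ex in examples:
        cells[(ex["label"], ex["attribute"])].append(ex)
    subset = []
    for members in cells.values():
        subset.extend(rng.sample(members, min(n_per_cell, len(members))))
    return subset
```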

Another paper that impressed me at ICCV on this topic was ITI-GEN: Inclusive Text-to-Image Generation. The authors also target the biases in pre-trained models, but through a different lens. Their goal is to align models with a human definition of inclusiveness, which they define as generated images being uniformly distributed across attributes of interest. Below is an example the authors provide of the type of bias they attempt to tackle:

Example of poor distribution of attributes (e.g. eyeglasses, male, age, skin tone) from the authors. Link.

Many current solutions rely strictly on text-based debiasing. However, as previous research shows, there are limits to this kind of naive solution. Since text-based debiasing can only convey so much information, it struggles with things like negation (not wearing glasses) or skin tone.

Example of the limitations of text-based debiasing such as negation. Link.

The novel approach of this paper is that, whereas many multimodal approaches have text augmenting images, the authors use images to augment text, since detailed attributes (e.g., glasses) can be conveyed via information-rich images.

Example of the curated image datasets now available that can convey more diverse attributes to augment text-to-image instead of relying only on text to label image attributes. Link.

These visual differences can be translated, through the method below, into natural language; the output is a set of inclusive tokens that can be used for alignment across models.

Using these embeddings could help build more inclusive models in a much more scalable and generalizable way.
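Conceptually, the training signal can be sketched as a direction-alignment loss: the shift that a learnable inclusive token induces in text-embedding space should mirror the difference between reference image sets (say, with and without glasses). The encoders, tensor shapes, and exact loss form here are schematic assumptions, not the authors' code.

```python
# Schematic direction-alignment loss in the spirit of ITI-GEN (PyTorch).
# Inputs are precomputed embeddings; only the inclusive token is trainable.
import torch
import torch.nn.functional as F

def direction_alignment_loss(text_with, text_without, imgs_with, imgs_without):
    # Direction the learnable inclusive token induces in text space.
    text_dir = F.normalize(text_with - text_without, dim=-1)
    # Direction between mean embeddings of the two reference image sets.
    img_dir = F.normalize(imgs_with.mean(0) - imgs_without.mean(0), dim=-1)
    # Maximize cosine similarity between the two directions.
    return 1 - (text_dir * img_dir).sum(-1).mean()
```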

That being said, text-based debiasing and testing will still be a key component of any AI safety strategy. Red teaming is still necessary to identify bias in foundation models. However, one of the hardest things to do is to actually uncover bugs: models likely hallucinate more than we think they do; we simply do not recognize every hallucination.

The authors of Adaptive Testing of Computer Vision Models propose a way to improve text-based testing and debiasing by using multimodal models to prompt us, the humans, with more creative and out-of-distribution (OOD) tests.

The authors summarize the current challenges in identifying bugs, bias, etc. in computer vision models. Link.

To address this problem, the authors propose adaptive testing, where the model suggests prompts from related embeddings, “hill climbing” toward useful tests that move closer to out-of-distribution questions, as shown in the image below.

Summary of approaches for current-state red-teaming and that which is assisted by a multimodal model. Link.
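In rough pseudocode, the loop might look like this. `propose_neighbors` (for example, a multimodal model suggesting related prompts) and `failure_score` are hypothetical helpers, not the paper's API.

```python
# Toy sketch of the adaptive-testing loop: propose candidate tests near
# previously useful ones, score them against the model under test, and
# keep climbing toward failure-inducing, out-of-distribution prompts.
def adaptive_test(seed_prompts, rounds=5, keep=10, threshold=0.5):
    frontier = list(seed_prompts)
    failures = []
    for _ in range(rounds):
        candidates = [n for p in frontier for n in propose_neighbors(p)]
        scored = sorted(((failure_score(c), c) for c in candidates),
                        reverse=True)
        failures += [c for s, c in scored if s > threshold]
        frontier = [c for _, c in scored[:keep]]  # hill-climb to harder tests
    return failures
```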

These tests can also generalize to other images and models, and fine-tuning on the identified examples can fix those bugs going forward, helping developers determine where to deploy and what is safe to implement through efficient testing.

The last paper I will mention in this section is PGFed: Personalize Each Client’s Global Objective for Federated Learning. Federated learning has not been a prominent topic lately, but the authors propose a more efficient and privacy-preserving method for personalized federated learning with explicit knowledge transfer, in contrast to the conventional approach of training a global model and then distributing it. From the engineering side, this makes federated learning more feasible for those interested in deploying in a privacy-oriented way.

A new approach to federated learning emphasizes personalized models through explicit knowledge transfer for different clients instead of having a traditional global model dispersed to many clients. Link.
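Schematically, a client update under this idea might look like the sketch below. All names are placeholders, and the paper's gradient caching, learned weights, and server bookkeeping are omitted.

```python
# Toy sketch of a PGFed-style client step (PyTorch): descend the local loss
# plus a personalized, weighted sum of other clients' (cached) gradients,
# instead of everyone sharing one global model.
import torch

def pgfed_style_update(model, local_loss, other_grads, alpha, lr=0.01):
    """other_grads: per-parameter lists of other clients' gradient tensors,
    as the server would supply; alpha: this client's personalized weights."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    local_loss(model).backward()
    with torch.no_grad():
        for p, grads_for_p in zip(model.parameters(), other_grads):
            for a, g in zip(alpha, grads_for_p):
                p.grad += a * g
    opt.step()
```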

This section of papers was one of the most interesting to me and offered insights into how, on the applied side, we might translate research into better responsible AI practices.

3. Diversity and inclusion at the conference

Last but not least, I was personally very impressed with how the ICCV organizers supported diversity at the conference. It was significantly better than what I see at industry events run by the private sector.

Their support included everything from having a dedicated committee in the first place, to travel scholarships, on-site healthcare, and even volunteer events at local high schools to teach young women about computer vision.

Slide 8 on the diversity chairs and summary of efforts from the Opening Remarks. Link here.
Slide 14 breakdown of attendance over the years, including 18% that identify as women in 2023 from the Opening Remarks. Link here.

Diversity efforts resulted in a better gender balance not only through the committee’s scholarships and support, but also through highlighting the actual work of underrepresented groups (and not just by gender). This included the Women in Computer Vision workshop, the LatinX in Computer Vision workshop, and even a meetup for Arabic language speakers.

It would be great to see big tech and others adopt similar efforts or policies to support better representation at tech conferences.

Conclusion

This retrospective turned out to be a lot longer than I expected, but I had so many thoughts after the conference that I wanted to capture at least some of them for future reference and to share more broadly.

I hope this provided an overview of some of the key research highlights (I intentionally spent more time on lesser-known areas; otherwise the entire retrospective would have been only about NeRFs and Gaussian splats) and some of the things I’m excited about going forward.

References

Papers

Gao, I., Ilharco, G., Lundberg, S., & Ribeiro, M. T. (2023). Adaptive testing of computer vision models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 4003–4014).

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., … & Girshick, R. (2023). Segment anything. arXiv preprint arXiv:2304.02643.

Lu, D., Wang, Z., Wang, T., Guan, W., Gao, H., & Zheng, F. (2023). Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 102–111).

Luo, J., Mendieta, M., Chen, C., & Wu, S. (2023). PGFed: Personalize Each Client’s Global Objective for Federated Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3946–3956).

Shan, S., Ding, W., Passananti, J., Zheng, H., & Zhao, B. Y. (2023). Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models. arXiv preprint arXiv:2310.13828.

Wang, A., & Russakovsky, O. (2023). Overcoming Bias in Pretrained Models by Manipulating the Finetuning Dataset. arXiv preprint arXiv:2303.06167.

Zhang, Y., Cai, R., Chen, T., Zhang, G., Zhang, H., Chen, P. Y., … & Liu, S. (2023). Robust Mixture-of-Expert Training for Convolutional Neural Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 90–101).

Zhang, C., Chen, X., Chai, S., Wu, C. H., Lagun, D., Beeler, T., & De la Torre, F. (2023). ITI-GEN: Inclusive Text-to-Image Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3969–3980).


Michelle Yi (Yulle)

Technology leader who specializes in AI and machine learning. She is passionate about diversity in STEAM and innovating for a better future.