Explore recent developments in computer vision in this advanced graduate-level course on transformer architectures and their applications to visual understanding. The course opens with the fundamentals of artificial intelligence and deep learning, then moves on to transformer models including Vision Transformers (ViT), Data-efficient Image Transformers (DeiT), and Swin Transformers with their hierarchical shifted-window design. Investigate how DETR and Deformable DETR recast object detection as an end-to-end task without traditional anchor-based methods. Examine video understanding through Video Vision Transformers (ViViT), Video Swin Transformers, and space-time attention mechanisms that capture temporal dynamics in visual sequences. Discover applications in multimodal learning, including CLIP's approach to learning transferable visual models from natural language supervision and zero-shot text-to-image generation. Analyze advanced topics such as high-resolution image synthesis with transformers, semantic segmentation from a sequence-to-sequence perspective, and the properties that emerge in self-supervised vision transformers. Study general perception frameworks such as Perceiver, which uses iterative attention, and see how ConvNets are being redesigned for modern computer vision problems. Student project presentations throughout the course show these concepts applied in practice with transformer-based architectures.

Syllabus

CAP6412 2022: Lecture 27 - Final Project Presentations
CAP6412 2022: Lecture 26 - Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspectic...
CAP6412 2022: Lecture 25 - Emerging Properties in Self-Supervised Vision Transformers
CAP6412 2022: Lecture 24 - Project Presentations
CAP6412 2022: Lecture 23 - Rethinking and Improving Relative Position Encoding for Vision Transformer
CAP6412 2022: Lecture 22 - Multiview Transformers for Video Recognition
CAP6412 2022: Lecture 21 - A ConvNet for the 2020s
CAP6412 2022: Lecture 20 - Intriguing Properties of Vision Transformers
CAP6412 2022: Lecture 19 - Zero-Shot Text-to-Image Generation
CAP6412 2022: Lecture 18 - Taming Transformers for High-Resolution Image Synthesis
CAP6412 2022: Lecture 17b - Project Presentations (continued)
CAP6412 2022: Lecture 17 - Project Presentations
CAP6412 2022: Lecture 16 - Perceiver: General Perception with Iterative Attention
CAP6412 2022: Lecture 15 - Learning Transferable Visual Models From Natural Language Supervision
CAP6412 2022: Lecture 14 - Deformable DETR: Deformable Transformers for End-to-End Object Detection
CAP6412 2022: Lecture 13 - End-to-End Object Detection with Transformers
CAP6412 2022: Lecture 12 - Video Swin Transformer
CAP6412 2022: Lecture 11 - Project Presentations
CAP6412 2022: Lecture 10 - Is Space-Time Attention All You Need for Video Understanding?
CAP6412 2022: Lecture 9 - ViViT: A Video Vision Transformer
CAP6412 2022: Lecture 8 - Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
CAP6412 2022: Lecture 7 - Training data-efficient image transformers & distillation through attention
CAP6412 2022: Lecture 6 - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
CAP6412 2022: Lecture 5 - DETR, TimeSformer, ViViT, Transformers in Vision: A Survey
CAP6412 2022: Lecture 4 - ViT, DeiT, Swin
CAP6412 2022: Lecture 3 - Transformer
CAP6412 2022: Lecture 2 - Advances in Computer Vision Employing Deep Learning
CAP6412 2022: Lecture 1 - Artificial Intelligence Revolution