Understanding Computer Vision: Recognition Techniques and

CPS834/CPS8307 Introduction to Computer Vision
Dr. Omar Falou
Toronto Metropolitan University
Fall 2024

Introduction to Recognition

Where we go from here
What we know: Geometry
  What is the shape of the world?
  How does that shape appear in images?
  How can we infer that shape from one or more images?
What’s next: Recognition
  What are we looking at?
  Representations of visual content
  New representations for 3D geometry
  Generative models

What is “Recognition”?
Next few slides adapted from Li, Fergus, & Torralba’s excellent short course on category and object recognition
Verification: is that a lamp?
Detection: where are the people?
Identification: is that Potala Palace?
Classification: what objects are present? (mountain, tree, banner, street lamp, people)
Scene and context categorization (outdoor, city)
Activity / Event Recognition: what are these people doing?

Object recognition: Is it really so hard?
This is a chair. Find the chair in this image.
Output of normalized correlation: pretty much garbage
Simple template matching is not going to do the trick
A “popular method is that of template matching, by point to point correlation of a model pattern with the image pattern. These techniques are inadequate for three-dimensional scene analysis for many reasons, such as occlusion, changes in viewing angle, and articulation of parts.” Nevatia & Binford, 1977.
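
A minimal sketch of the template matching discussed above, assuming zero-mean normalized correlation over a tiny grayscale image stored as nested lists. The function names and the toy image are illustrative and do not come from the slides. It finds an exact copy of the template easily, which is the easy case; the slides' point is that real objects rarely reappear pixel for pixel.

```python
import math


def normalized_correlation(patch, template):
    """Zero-mean normalized correlation of two equal-length pixel lists, in [-1, 1]."""
    n = len(template)
    mean_p = sum(patch) / n
    mean_t = sum(template) / n
    dp = [p - mean_p for p in patch]
    dt = [t - mean_t for t in template]
    numerator = sum(a * b for a, b in zip(dp, dt))
    denominator = math.sqrt(sum(a * a for a in dp) * sum(b * b for b in dt))
    # Flat patches have zero variance; report no correlation for them.
    return numerator / denominator if denominator > 0 else 0.0


def match_template(image, template):
    """Slide the template over a 2D image (list of rows) and return the score map."""
    th, tw = len(template), len(template[0])
    flat_template = [v for row in template for v in row]
    scores = []
    for r in range(len(image) - th + 1):
        row_scores = []
        for c in range(len(image[0]) - tw + 1):
            patch = [image[r + i][c + j] for i in range(th) for j in range(tw)]
            row_scores.append(normalized_correlation(patch, flat_template))
        scores.append(row_scores)
    return scores


# Toy 4x4 image that contains the 2x2 template at row 1, column 1.
image = [
    [0, 0, 0, 0],
    [0, 9, 1, 0],
    [0, 1, 9, 0],
    [0, 0, 0, 0],
]
template = [[9, 1], [1, 9]]

scores = match_template(image, template)
best_score, best_position = max(
    (s, (r, c)) for r, row in enumerate(scores) for c, s in enumerate(row)
)
print(best_position, round(best_score, 3))
```

Occlusion, viewpoint change, or articulation alters the pixels under the window, so the peak in the score map fades or moves, which is the failure the quote describes.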

Why not use SIFT matching for everything?
Works well for object instances (or distinctive images such as logos)
Not great for generic object categories

And it can get a lot harder
Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422

Applications: Photography

Applications: Shutter-free Photography
Take Your Best Selfie Automatically, with Photobooth on Pixel 3
https://ai.googleblog.com/2019/04/take-your-best-selfie-automatically.html
(Also features “kiss detection”)

Applications: Assisted / autonomous driving

Applications: Robotics
https://arc.cs.princeton.edu/

Applications: Photo organization
Source: Google Photos
Not Pizzas!

Applications: Medical imaging
Dermatologist-level classification of skin cancer
https://cs.stanford.edu/people/esteva/nature/

Why is recognition hard?
Variability: camera position, illumination, shape, etc.
Svetlana Lazebnik

Challenge: lots of potential classes

Challenge: variable viewpoint
Michelangelo 1475-1564

Challenge: variable illumination
image credit: J. Koenderink

Challenge: scale

Challenge: deformation

Challenge: occlusion
Magritte, 1957

Challenge: background clutter
Kilmeny Niland, 1995

Challenge: intra-class variations
Svetlana Lazebnik

A brief history of image recognition
What worked in 2011 (pre-deep-learning era in computer vision)
  Optical character recognition
  Face detection
  Instance-level recognition (what logo is this?)
  Pedestrian detection (sort of)
  … that’s about it

A brief history of image recognition
What works now, post-2012 (deep learning era and beyond)
  Robust object classification across thousands of object categories (rivalling human capabilities), e.g. “Spotted salamander”
  Face recognition at scale
    FaceNet, CVPR 2015
    https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html
  High-quality image/video synthesis
    A Style-Based Generator Architecture for Generative Adversarial Networks
    Tero Karras (NVIDIA), Samuli Laine (NVIDIA), Timo Aila (NVIDIA)
    http://stylegan.xyz/paper
    These people are not real; they were produced by our generator, which allows control over different aspects of the image.
    DALL-E 3: “An illustration of an avocado sitting in a therapist's chair, saying 'I just feel so empty inside' with a pit-sized hole in its center. The therapist, a spoon, scribbles notes.”
    Sora: “Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance…”

Societal impacts
Privacy invasion (e.g., face/person recognition, biometrics)
Bias in AI methods (e.g., recognition systems that perform worse on certain demographics)
Bias in training data (e.g., used to learn or perpetuate biased associations)
Sources of training data (copyright issues, consent issues, etc.)
Generative media (e.g., deepfakes, disinformation)

What Matters in Recognition?
Learning Techniques
  E.g. choice of classifier or inference method
Representation
  Low level: SIFT, HoG, GIST, edges
  Mid level: Bag of words, sliding window, deformable model
  Deep learned features
  Latent diffusion models
Data
  More is always better (as long as it is good data)
  Annotation (labeling data) has historically been a key challenge
  Now we are seeing powerful models trained from more noisy labels

24 Hrs in Photos
Flickr Photos From 1 Day in 2011
https://www.kesselskramer.com/project/24-hrs-in-photos/

Datasets
PASCAL VOC [2005-2012]: not crowdsourced, bounding boxes, 20 categories
CIFAR-10 [2009]: 60000 32x32 color images in 10 classes (6000 images per class)
ImageNet [2010-current]: huge, crowdsourced, hierarchical, iconic objects
COCO (Common Objects in Context) [2014-current]: crowdsourced, large-scale objects
LAION 5B [2022-current]: 5.85 billion noisy image-text pairs

20 object categories (aeroplane to TV/monitor)
Three challenges:
  Classification challenge (is there an X in this image?)
  Detection challenge (draw a box around every X)
  Segmentation challenge (which class is each pixel?)

ImageNet Large Scale Visual Recognition Challenge (ILSVRC), 2010-2017
20 object classes, 22,591 images
1000 object classes, 1,431,167 images
Dalmatian
http://image-net.org/challenges/LSVRC/{2010,2011,2012}

Variety of object classes in ILSVRC

What’s Still Hard?
Few shot learning: How do we generalize from only a small number of examples?
Fine-grained classification: How do we distinguish between more subtle class differences? (Animal -> Bird -> Oriole…)

Image Classification
Some slides from Fei-Fei Li, Justin Johnson, Serena Yeung
http://vision.stanford.edu/teaching/cs231n/

References
Stanford CS231N: http://cs231n.stanford.edu/
Many slides courtesy of Abe Davis

Image classifiers in a nutshell
Input: an image
Output: the class label for that image
Label is generally one or more of the discrete labels used in training
e.g. {cat, dog, cow, toaster, apple, tomato, truck, … }

    def classifier(image):
        # Do some stuff
        return class_label

Example outputs: “Toaster”, “Cat”, “Dog”

Image classification demo
https://cloud.google.com/vision/docs/drag-and-drop
See also:
https://aws.amazon.com/rekognition/
https://www.clarifai.com/
https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/

The Semantic Gap
What we see
What the computer sees

Variation Makes Recognition Hard
The same class of object can appear very differently in different images
  Viewpoint variation
  Lighting variation
  Deformation
  Background clutter
  Occlusion

The Problem is Under-constrained
Distinct realities can produce the same image…
We generally can’t compute the “right” answer, but we can compute the most likely one…
We need some kind of prior to condition on.
I think there may be a spy among us…

Images As High-Dimensional Vectors
An image is just a bunch of numbers
Let’s stack them up into a vector
Our training data is just a bunch of high-dimensional points now
Divide space into different regions for different classes (e.g., Toasters, Cats)
or
Define a distribution over space for each class
(Figure: The Space of All Images)

Image Features and Dimensionality Reduction
How high-dimensional is an image?
Let’s consider an iPhone X photo:
  4032 x 3024 pixels (about 12.2 megapixels)
  Every pixel has 3 colors
  36,578,304 values in total (about 36.6 million dimensions)
In practice, images sit on a lower-dimensional manifold
Think of image features and dimensionality reduction as ways to represent images by their location on such manifolds
Side note: This also lets us deal with images of different sizes, crops, etc.
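
The "stack them up into a vector" step and the dimension count can be sketched as follows. This is an illustrative example only: the nested-list image layout (rows, then pixels, then channels) and the helper name `flatten_image` are assumptions, not part of the slides.

```python
def flatten_image(image):
    """Stack an H x W x C image (nested lists) into one flat vector."""
    return [value for row in image for pixel in row for value in pixel]


# A tiny 2x2 RGB image stands in for a real photo.
tiny = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [10, 20, 30]],
]
vector = flatten_image(tiny)
print(len(vector))  # 2 * 2 * 3 values

# Dimension count for the 4032 x 3024 photo from the slide.
width, height, channels = 4032, 3024, 3
num_pixels = width * height
num_values = num_pixels * channels
print(num_pixels, num_values)
```

The pixel count and the value count differ by the factor of 3 channels, so the photo is a single point in a space with roughly 36.6 million dimensions.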

Training & Testing a Classifier
Collect a database of images with labels
Use ML to train an image classifier
Evaluate the classifier on test images
Slide from Andrej Karpathy and Fei-Fei Li http://vision.stanford.edu/teaching/cs231n/

Classifiers
Nearest Neighbor
kNN (“k-Nearest Neighbors”)
Linear Classifier
Neural Network
Deep Neural Network
Transformers

First idea: Nearest Neighbor (NN) Classifier
Train
  Remember all training images and their labels
Predict
  Find the closest (most similar) training image
  Predict its label as the true label
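
The train and predict steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: images are already flattened into lists of numbers, similarity is measured with an L1 (sum of absolute differences) distance, and the class name `NearestNeighbor` and the toy data are made up for the example.

```python
def l1_distance(a, b):
    """Sum of absolute pixel differences between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b))


class NearestNeighbor:
    def train(self, images, labels):
        # "Training" only memorizes the data.
        self.images = list(images)
        self.labels = list(labels)

    def predict(self, image):
        # Compare against every stored image and copy the closest label.
        distances = [l1_distance(image, stored) for stored in self.images]
        closest = distances.index(min(distances))
        return self.labels[closest]


# Toy 4-pixel "images": dark ones are cats, bright ones are toasters.
train_images = [[0, 0, 10, 10], [255, 250, 240, 255], [5, 0, 0, 20]]
train_labels = ["cat", "toaster", "cat"]

model = NearestNeighbor()
model.train(train_images, train_labels)
prediction = model.predict([250, 255, 235, 250])
print(prediction)
```

All the work happens in `predict`, which scans the whole training set for every query.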

CIFAR-10 and NN results
Slides from Andrej Karpathy and Fei-Fei Li http://vision.stanford.edu/teaching/cs231n/

k-nearest neighbor
Find the k closest points from training data
Take majority vote from k closest points
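
A small sketch of the majority vote, assuming squared Euclidean distance and 2D toy points in place of images. The function names and data are illustrative. One training point carries a deliberately noisy label to show how a larger k smooths over it.

```python
from collections import Counter


def squared_l2(a, b):
    """Squared Euclidean distance; the ordering matches true L2 distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k):
    # Indices of the training points, sorted from nearest to farthest.
    order = sorted(range(len(train_x)), key=lambda i: squared_l2(train_x[i], query))
    # Majority vote among the k nearest labels.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


train_x = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [1, 1]]
train_y = ["cat", "cat", "cat", "dog", "dog", "dog"]  # the last label is noisy
query = [1.2, 1.2]

print(knn_predict(train_x, train_y, query, k=1))  # follows the single noisy neighbor
print(knn_predict(train_x, train_y, query, k=3))  # the vote outnumbers it
```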

How to Define Distance Between Images
Slides from Andrej Karpathy and Fei-Fei Li http://vision.stanford.edu/teaching/cs231n/

Hyperparameter
Choice of distance metric
Slide composited from Andrej Karpathy and Fei-Fei Li http://vision.stanford.edu/teaching/cs231n/
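
The slide's own distance formulas did not survive extraction, so the sketch below assumes two common choices, L1 (Manhattan) and L2 (Euclidean) distance, to illustrate why the metric is a hyperparameter. The points are hypothetical 2D stand-ins for flattened images.

```python
import math


def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [0, 0]
a = [3, 3]
b = [0, 5]

# Which point counts as the nearest neighbor depends on the metric.
print("L1:", l1_distance(query, a), l1_distance(query, b))
print("L2:", l2_distance(query, a), l2_distance(query, b))
```

Under L1 the point `b` is closer to the query, while under L2 the point `a` is closer, so the same data can yield different nearest neighbors.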

Demo: http://vision.stanford.edu/teaching/cs231n-demos/knn/

Hyperparameters
What is the best distance to use?
What is the best value of k to use?
These are hyperparameters: choices about the algorithm that we set rather than learn
How do we set them?
One option: try them all and see what works best
Slides composited from Andrej Karpathy and Fei-Fei Li http://vision.stanford.edu/teaching/cs231n/

Hyperparameter Tuning

Recap: How to pick hyperparameters?
Methodology
  Train and test
  Train, validate, test
Train an initial model
Validate to find hyperparameters
Test to understand generalizability
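
The train / validate / test methodology can be sketched as a small grid search. Everything here is an illustrative assumption: the toy 2D data, the candidate values of k, the two distance functions, and the helper names. The point is only the order of operations: pick hyperparameters on the validation split, then touch the test split once.

```python
from collections import Counter


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, query, k, dist):
    nearest = sorted(train, key=lambda example: dist(example[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, eval_set, k, dist):
    hits = sum(knn_predict(train, x, k, dist) == y for x, y in eval_set)
    return hits / len(eval_set)


train = [([0, 0], "cat"), ([1, 0], "cat"), ([0, 1], "cat"), ([1, 1], "dog"),
         ([5, 5], "dog"), ([6, 5], "dog"), ([5, 6], "dog")]  # ([1, 1], "dog") is noisy
val = [([1.2, 1.2], "cat"), ([5.2, 5.2], "dog")]
test = [([0.4, 0.4], "cat"), ([5.5, 5.5], "dog")]

metrics = {"l1": l1, "l2": l2_squared}
best_k, best_metric, best_val_acc = None, None, -1.0
for k in (1, 3):
    for name, dist in metrics.items():
        val_acc = accuracy(train, val, k, dist)
        if val_acc > best_val_acc:  # keep the first configuration that is strictly better
            best_k, best_metric, best_val_acc = k, name, val_acc

# Only now, with hyperparameters fixed, evaluate once on the test set.
test_acc = accuracy(train, test, best_k, metrics[best_metric])
print(best_k, best_metric, best_val_acc, test_acc)
```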

kNN Complexity and Storage
N training images, M test images
Training: O(1)
Testing: O(MN)
We often need the opposite:
  Slow training is ok
  Fast testing is necessary

k-Nearest Neighbors: Summary
In image classification we start with a training set of images and labels, and must predict labels on the test set
The K-Nearest Neighbors classifier predicts labels based on nearest training examples
Distance metric and K are hyperparameters
Choose hyperparameters using the validation set; only run on the test set once at the very end!

Problems with KNN: Distance Metrics

Problems with KNN: The Curse of Dimensionality
As the number of dimensions increases, the same amount of data becomes more sparse.
Amount of data we need ends up being exponential in the number of dimensions
Animation from https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
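
A rough way to see the exponential growth, assuming we want a regular grid with a fixed number of samples along every axis of the unit cube. The helper name and the choice of 10 samples per axis are illustrative.

```python
def samples_needed(points_per_axis, dims):
    """Grid points required to keep the same per-axis sampling density."""
    return points_per_axis ** dims


for dims in (1, 2, 3, 10):
    print(dims, samples_needed(10, dims))
```

With 10 samples per axis, a 3-dimensional space already needs 1,000 points and a 10-dimensional one needs 10 billion, while a small image has thousands of dimensions.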

Linear Classifiers

Linear Classification vs. Nearest Neighbors
Nearest Neighbors
  Store every image
  Find nearest neighbors at test time, and assign same class
Linear Classifier
  Store hyperplanes that best separate different classes
  We can compute continuous class score by calculating (signed) distance from hyperplane
We can interpret this as a linear “score function” for each class.

Score functions
Parametric Approach
Parametric Approach: Linear Classifier
Linear Classifier
Interpretation: Algebraic
Slides adapted from Andrej Karpathy and Fei-Fei Li http://vision.stanford.edu/teaching/cs231n/

Interpretation: Geometric
Parameters define a hyperplane for each class
We can think of each class score as defining a distribution that is proportional to distance from the corresponding hyperplane
(Figure: The Space of All Images)
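
The slide formulas were lost in extraction; a minimal sketch of a linear score function, assuming the usual form f(x, W) = Wx + b with one row of W and one bias per class, looks like this. The weights, input, and class names are made-up illustrative values.

```python
def linear_scores(W, b, x):
    """Return one score per class: dot(row of W, x) + bias."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


classes = ["cat", "dog", "ship"]
W = [
    [1, 0, -1, 2],   # weights (template) for "cat"
    [0, 2, 1, 0],    # weights (template) for "dog"
    [-1, 1, 0, 1],   # weights (template) for "ship"
]
b = [0, 1, -2]
x = [3, 1, 2, 0]     # a flattened 4-pixel "image"

scores = linear_scores(W, b, x)
predicted = classes[scores.index(max(scores))]
print(scores, predicted)
```

Unlike nearest neighbors, prediction cost depends only on the size of W, not on the number of training images.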

Hard Cases for a Linear Classifier

Interpretation: Template matching
We can think of the rows in W as templates for each class
(Figure: rows of W)

Linear classification
Output scores

Loss functions

Loss function, cost/objective function
Given ground truth labels ($y_i$) and scores $f(x_i, W)$, how unhappy are we with the scores?
Loss function or objective/cost function measures unhappiness
During training, we want to find the parameters $W$ that minimize the loss function

Simpler example: binary classification
Two classes (e.g., “cat” and “not cat”)
AKA “positive” and “negative” classes

Linear classifiers
Find linear function (hyperplane) to separate positive and negative examples
  $\mathbf{x}_i$ positive: $\mathbf{x}_i \cdot \mathbf{w} + b \geq 0$
  $\mathbf{x}_i$ negative: $\mathbf{x}_i \cdot \mathbf{w} + b < 0$
Which hyperplane is best? We need a loss function to decide

What is a good loss function?
One possibility: number of misclassified examples
Problems: discrete, can’t break ties
We want the loss to lead to good generalization
We want the loss to work for more than 2 classes
(Figure: three candidate hyperplanes with Loss: 2, Loss: 0, Loss: 0)
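
The "number of misclassified examples" loss can be sketched as below, assuming labels of +1 and -1 and the sign rule w · x + b >= 0 for the positive class. The data points and the three candidate hyperplanes are invented for illustration; they are chosen so that two different hyperplanes both reach a loss of 0, which is the tie the slide mentions.

```python
def predict(w, b, x):
    """Return +1 if x lies on the positive side of the hyperplane, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def zero_one_loss(w, b, examples):
    """Count the examples whose predicted label differs from the true label."""
    return sum(predict(w, b, x) != y for x, y in examples)


examples = [
    ([2, 1], 1), ([3, 3], 1), ([1, 2], 1),        # positive class
    ([-1, -2], -1), ([-2, 0], -1), ([-3, 1], -1),  # negative class
]

candidates = [
    ([0, 1], -2.5),  # horizontal line, cuts through the positives
    ([1, 0], 0),     # vertical line through the origin
    ([1, 1], 0),     # diagonal line through the origin
]
losses = [zero_one_loss(w, b, examples) for w, b in candidates]
print(losses)
```

The last two hyperplanes are indistinguishable under this loss even though they may generalize differently, and the count changes only in whole steps, which gives an optimizer nothing smooth to follow.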

Softmax classifier
Interpret scores as unnormalized log probabilities of classes
Squashes values into probabilities ranging from 0 to 1
(score function)
Example with three classes:
Softmax “probabilities”: 0.06, 0.82, 0.12
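
A short sketch of the softmax function: exponentiate each score and divide by the sum so the outputs are positive and add up to 1. The input scores below are arbitrary illustrative values, not the ones behind the probabilities on the slide, and subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```

Larger scores receive larger probabilities, and shifting every score by the same constant leaves the output unchanged.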

Cross-entropy loss
Cross-entropy quantifies how well the predicted probability distribution aligns with the actual distribution of classes. The lower the cross-entropy, the better the predictions match the true labels. Cross-entropy loss penalizes confident incorrect predictions heavily, encouraging the model to assign high probabilities to the correct classes.
(score function)
We call $L_i$ the cross-entropy loss
$f_{y_i}$: score of correct class
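
The per-example loss can be sketched as the negative log of the softmax probability assigned to the correct class. The slide's formula was lost in extraction, so this uses the standard definition; the function name and the example scores are illustrative assumptions, and the log-sum-exp form is used only for numerical stability.

```python
import math


def cross_entropy_loss(scores, correct_class):
    """L_i = -log(softmax(scores)[correct_class]), computed stably."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    # -log(exp(f_y) / sum_j exp(f_j)) = log_sum_exp - f_y
    return log_sum_exp - scores[correct_class]


uniform = cross_entropy_loss([0.0, 0.0, 0.0], 0)          # no preference among 3 classes
confident_right = cross_entropy_loss([5.0, 0.0, 0.0], 0)  # high score on the correct class
confident_wrong = cross_entropy_loss([5.0, 0.0, 0.0], 1)  # high score on a wrong class
print(uniform, confident_right, confident_wrong)
```

Equal scores over three classes give a loss of log 3; a confident correct prediction drives the loss toward 0, and a confident wrong one makes it large.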

Losses
Cross-entropy loss is just one possible loss function
One nice property is that it reinterprets scores as probabilities, which have a natural meaning
SVM (max-margin) loss functions also used to be popular
But currently, cross-entropy is the most common classification loss

Summary
Have score function and loss function
Currently, score function is based on linear classifier
Next, will generalize to convolutional neural networks
Find W and b to minimize loss
Loss: average of cross-entropy loss over all training examples, plus a regularization term
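
Putting the pieces of the summary together, a hedged sketch of the full objective: the mean cross-entropy of a linear classifier's scores over the training set plus a regularization term. The slide does not say which regularizer is used; an L2 penalty on W with strength `lam` is assumed here, and all names and numbers are illustrative.

```python
import math


def linear_scores(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def cross_entropy(scores, correct):
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores)) - scores[correct]


def total_loss(W, b, X, y, lam):
    """Mean cross-entropy over the training set plus lam * (sum of squared weights)."""
    data_loss = sum(cross_entropy(linear_scores(W, b, x), yi)
                    for x, yi in zip(X, y)) / len(X)
    regularization = sum(w * w for row in W for w in row)  # assumed L2 penalty
    return data_loss + lam * regularization


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy 2-pixel "images"
y = [0, 1, 2]                              # their class indices
b = [0.0, 0.0, 0.0]
W_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
W_fit = [[2.0, -1.0], [-1.0, 2.0], [1.0, 1.0]]

loss_zero = total_loss(W_zero, b, X, y, lam=0.01)
loss_fit = total_loss(W_fit, b, X, y, lam=0.01)
print(loss_zero, loss_fit)
```

Training then means searching for the W and b that make this number as small as possible; here the hand-picked `W_fit` already scores lower than the all-zero weights.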