CPS 834/CPS 8307: Introduction to Computer Vision
Dr. Omar Falou
Toronto Metropolitan University
Fall 2024
Introduction to Recognition
Where we go from here
• What we know: Geometry
  • What is the shape of the world?
  • How does that shape appear in images?
  • How can we infer that shape from one or more images?
• What’s next: Recognition
  • What are we looking at?
  • Representations of visual content
  • New representations for 3D geometry
  • Generative models
What is “Recognition”?
Next few slides adapted from Li, Fergus, & Torralba’s excellent short course on category and object recognition.
What is “Recognition”?
• Verification: is that a lamp?
• Detection: where are the people?
• Identification: is that Potala Palace?
• Classification: what objects are present? (mountain, tree, banner, street lamp, people, …)
• Scene and context categorization (outdoor, city, …)
• Activity / event recognition: what are these people doing?
Object recognition: Is it really so hard?
• This is a chair. Find the chair in this image: the output of normalized correlation is pretty much garbage. Simple template matching is not going to do the trick.
• “A popular method is that of template matching, by point to point correlation of a model pattern with the image pattern. These techniques are inadequate for three-dimensional scene analysis for many reasons, such as occlusion, changes in viewing angle, and articulation of parts.” (Nevatia & Binford, 1977)
Why not use SIFT matching for everything?
• Works well for object instances (or distinctive images such as logos)
• Not great for generic object categories
And it can get a lot harder
Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422.
Applications: Photography
Applications: Shutter-free Photography
“Take Your Best Selfie Automatically, with Photobooth on Pixel 3”
https://ai.googleblog.com/2019/04/take-your-best-selfie-automatically.html
(Also features “kiss detection”)
A brief history of image recognition
• What worked in 2011 (pre-deep-learning era in computer vision):
  • Optical character recognition
  • Face detection
  • Instance-level recognition (what logo is this?)
  • Pedestrian detection (sort of)
  • … that’s about it
A brief history of image recognition
• What works now, post-2012 (deep learning era and beyond):
  • Robust object classification across thousands of object categories, rivalling human capabilities (e.g., “Spotted salamander”)
  • Face recognition at scale (FaceNet, CVPR 2015; https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html)
  • High-quality image/video synthesis:
    • “A Style-Based Generator Architecture for Generative Adversarial Networks”, Tero Karras, Samuli Laine, Timo Aila (NVIDIA), http://stylegan.xyz/paper: “These people are not real – they were produced by our generator that allows control over different aspects of the image.”
    • DALL-E 3: “An illustration of an avocado sitting in a therapist's chair, saying 'I just feel so empty inside' with a pit-sized hole in its center. The therapist, a spoon, scribbles notes.”
    • Sora: “Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance…”
Societal impacts
• Privacy invasion (e.g., face/person recognition, biometrics)
• Bias in AI methods (e.g., recognition systems that perform worse on certain demographics)
• Bias in training data (e.g., used to learn or perpetuate biased associations)
• Sources of training data (copyright issues, consent issues, etc.)
• Generative media (e.g., deepfakes, disinformation)
• …
What Matters in Recognition?
• Learning techniques
  • E.g., choice of classifier or inference method
• Representation
  • Low level: SIFT, HoG, GIST, edges
  • Mid level: bag of words, sliding window, deformable model
  • Deep learned features
  • Latent diffusion models
• Data
  • More is always better (as long as it is good data)
  • Annotation (labeling data) has historically been a key challenge
  • Now we are seeing powerful models trained from noisier labels
24 Hrs in Photos: Flickr Photos From 1 Day in 2011
https://www.kesselskramer.com/project/24-hrs-in-photos/
Datasets
• PASCAL VOC [2005-2012]
  • Not crowdsourced; bounding boxes; 20 categories
• CIFAR-10 [2009]
  • 60,000 32x32 color images in 10 classes (6,000 images per class)
• ImageNet [2010-current]
  • Huge; crowdsourced; hierarchical; iconic objects
• COCO (Common Objects in Context) [2014-current]
  • Crowdsourced; large-scale objects
• LAION-5B [2022-current]
  • 5.85 billion noisy image-text pairs
The PASCAL VOC challenge
• 20 object categories (aeroplane to TV/monitor)
• Three challenges:
  • Classification challenge (is there an X in this image?)
  • Detection challenge (draw a box around every X)
  • Segmentation challenge (which class is each pixel?)
What’s Still Hard?
• Few-shot learning: how do we generalize from only a small number of examples?
• Fine-grained classification: how do we distinguish between more subtle class differences? (Animal -> Bird -> Oriole …)
Image Classification
Some slides from Fei-Fei Li, Justin Johnson, Serena Yeung, http://vision.stanford.edu/teaching/cs231n/
References
• Stanford CS231N: http://cs231n.stanford.edu/
• Many slides courtesy of Abe Davis
Image classifiers in a nutshell
• Input: an image
• Output: the class label for that image (“Toaster”, “Cat”, “Dog”, …)
• The label is generally one or more of the discrete labels used in training, e.g. {cat, dog, cow, toaster, apple, tomato, truck, …}

    def classifier(image):
        # Do some stuff
        return class_label
Variation Makes Recognition Hard
• The same class of object can appear very differently in different images:
  • Viewpoint variation
  • Lighting variation
  • Deformation
  • Background clutter
  • Occlusion
The Problem is Under-constrained
• Distinct realities can produce the same image
• We generally can’t compute the “right” answer, but we can compute the most likely one
• We need some kind of prior to condition on
(“I think there may be a spy among us…”)
Images As High-Dimensional Vectors
• An image is just a bunch of numbers
• Let’s stack them up into a vector (a short sketch follows)
• Our training data is just a bunch of high-dimensional points in the space of all images
• Divide that space into different regions for different classes (e.g., toasters vs. cats), or
• Define a distribution over the space for each class
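To make the “stack the pixels into a vector” idea concrete, here is a minimal numpy sketch (the 32x32x3 image size matches CIFAR-10, used later in these slides; the array contents are random placeholders):

    import numpy as np

    # A toy 32x32 RGB image (CIFAR-10 sized), values in [0, 255]
    image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

    # Stack all pixel values into a single vector: one point in 3072-D space
    x = image.reshape(-1)              # shape: (3072,)

    # A training set of N such images becomes an N x 3072 matrix;
    # each row is one high-dimensional point
    N = 100
    X_train = np.random.randint(0, 256, size=(N, 32, 32, 3)).reshape(N, -1)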
Image Features and Dimensionality Reduction
• How high-dimensional is an image? Consider an iPhone X photo:
  • 4032 x 3024 pixels (12.2 megapixels), with 3 color values per pixel
  • That is 36,578,304 numbers per image
• In practice, images sit on a lower-dimensional manifold
• Think of image features and dimensionality reduction as ways to represent images by their location on such manifolds
• Side note: this also lets us deal with images of different sizes, crops, etc.
Training & Testing a Classifier
• Collect a database of images with labels
• Use ML to train an image classifier
• Evaluate the classifier on test images
Slide from Andrej Karpathy and Fei-Fei Li, http://vision.stanford.edu/teaching/cs231n/
First idea: Nearest Neighbor (NN) Classifier
• Train
  • Remember all training images and their labels
• Predict
  • Find the closest (most similar) training image
  • Return that training image’s label as the prediction
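As a minimal sketch of this train/predict recipe in numpy (L1 distance; images assumed already flattened to float vectors):

    import numpy as np

    class NearestNeighbor:
        def train(self, X, y):
            # "Training" is just memorization: X is N x D, y holds N labels
            self.X_train = X
            self.y_train = y

        def predict(self, x):
            # L1 distance from the test image x to every training image
            distances = np.sum(np.abs(self.X_train - x), axis=1)
            # Return the label of the single closest training image
            return self.y_train[np.argmin(distances)]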
CIFAR-10 and NN results
Slides from Andrej Karpathy and Fei-Fei Li, http://vision.stanford.edu/teaching/cs231n/
k-nearest neighbor
• Find the k closest points from the training data
• Take a majority vote from the k closest points
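A sketch of the k-nearest-neighbor vote (assumes integer class labels and flattened float images; knn_predict is an illustrative name):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=5):
        # L2 distances from x to all training points
        distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        # Indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # Majority vote over their labels
        votes = np.bincount(y_train[nearest])
        return np.argmax(votes)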
How to Define Distance Between Images
Slides from Andrej Karpathy and Fei-Fei Li, http://vision.stanford.edu/teaching/cs231n/
The choice of distance metric is a hyperparameter.
Slide composited from Andrej Karpathy and Fei-Fei Li, http://vision.stanford.edu/teaching/cs231n/
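The two usual candidates are L1 (Manhattan) and L2 (Euclidean) distance; as a sketch, for flattened image vectors a and b:

    import numpy as np

    def l1_distance(a, b):
        # Sum of absolute pixel-wise differences
        return np.sum(np.abs(a - b))

    def l2_distance(a, b):
        # Square root of the summed squared pixel-wise differences
        return np.sqrt(np.sum((a - b) ** 2))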
Hyperparameters
• What is the best distance to use?
• What is the best value of k to use?
• These are hyperparameters: choices about the algorithm that we set rather than learn
• How do we set them?
  • One option: try them all and see what works best
Slide composited from Andrej Karpathy and Fei-Fei Li, http://vision.stanford.edu/teaching/cs231n/
Hyperparameter Tuning
Recap: How to pick hyperparameters?
• Methodology:
  • Train and test
  • Train, validate, test
• Train an initial model, validate to find hyperparameters, then test to understand generalizability (see the sketch below)
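A minimal sketch of that methodology for choosing k (the split fractions are illustrative, knn_predict is the sketch from earlier, and X, y are assumed pre-shuffled):

    import numpy as np

    n = len(y)
    X_tr, y_tr = X[:int(0.8 * n)], y[:int(0.8 * n)]                              # 80% train
    X_val, y_val = X[int(0.8 * n):int(0.9 * n)], y[int(0.8 * n):int(0.9 * n)]    # 10% validation
    X_te, y_te = X[int(0.9 * n):], y[int(0.9 * n):]                              # 10% test

    def accuracy(k, X_eval, y_eval):
        preds = [knn_predict(X_tr, y_tr, x, k=k) for x in X_eval]
        return np.mean(np.array(preds) == y_eval)

    # Validate to pick the hyperparameter...
    best_k = max([1, 3, 5, 7, 9], key=lambda k: accuracy(k, X_val, y_val))
    # ...then touch the test set exactly once, at the very end
    test_accuracy = accuracy(best_k, X_te, y_te)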
kNN – Complexity and Storage
• N training images, M test images
• Training: O(1)
• Testing: O(MN)
• We often need the opposite: slow training is OK, but fast testing is necessary
k-Nearest Neighbors: Summary
• In image classification we start with a training set of images and labels, and must predict labels on the test set
• The k-nearest neighbors classifier predicts labels based on the nearest training examples
• The distance metric and k are hyperparameters
• Choose hyperparameters using the validation set; only run on the test set once, at the very end!
Problems with KNN: Distance Metrics
Problems with KNN: The Curse of Dimensionality
• As the number of dimensions increases, the same amount of data becomes more sparse
• The amount of data we need ends up being exponential in the number of dimensions
Animation from https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
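A back-of-the-envelope illustration of that exponential growth: covering the unit cube [0, 1]^d with points at a fixed spacing of 0.1 needs 10 points per axis, i.e. 10^d points in total:

    for d in [1, 2, 3, 10]:
        print(d, 10 ** d)   # 10, 100, 1000, 10000000000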
Linear Classifiers
Linear Classification vs. Nearest Neighbors
• Nearest Neighbors
  • Store every training image
  • Find the nearest neighbors at test time, and assign the same class
• Linear Classifier
  • Store hyperplanes that best separate the different classes
  • We can compute a continuous class score by calculating the (signed) distance from the hyperplane
  • We can interpret this as a linear “score function” for each class
Score functions
Parametric Approach
Parametric Approach: Linear Classifier
Linear Classifier
Interpretation: Algebraic
Slides adapted from Andrej Karpathy and Fei-Fei Li, http://vision.stanford.edu/teaching/cs231n/
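Algebraically, the linear score function here is f(x_i; W, b) = W x_i + b; a minimal numpy sketch with CIFAR-10-style shapes (the random values are placeholders):

    import numpy as np

    num_classes, D = 10, 32 * 32 * 3            # 10 classes, 3072-D flattened images

    W = np.random.randn(num_classes, D) * 0.01  # one row of weights per class
    b = np.zeros(num_classes)                   # one bias per class

    x = np.random.randn(D)                      # a flattened input image
    scores = W @ x + b                          # one score per class
    predicted_class = np.argmax(scores)         # highest score wins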
Interpretation: Geometric
• The parameters define a hyperplane for each class
• We can think of each class score as defining a distribution over the space of all images that is proportional to the distance from the corresponding hyperplane
Hard Cases for a Linear Classifier
Interpretation: Template Matching
• We can think of the rows of W as templates, one for each class
Linear classification: output scores
Loss functions
Loss function, cost/objective function
• Given ground truth labels y_i and scores f(x_i, W): how unhappy are we with the scores?
• The loss function (also called objective or cost function) measures that unhappiness
• During training, we want to find the parameters W that minimize the loss function
Simpler example: binary classification
• Two classes (e.g., “cat” and “not cat”)
• AKA the “positive” and “negative” classes
Linear classifiers
• Find a linear function (hyperplane) to separate positive and negative examples:
  w·x_i + b ≥ 0: positive
  w·x_i + b < 0: negative
• Which hyperplane is best? We need a loss function to decide.
What is a good loss function?
• One possibility: the number of misclassified examples (e.g., three candidate hyperplanes with loss 2, 0, and 0)
• Problems: discrete, can’t break ties
• We want the loss to lead to good generalization
• We want the loss to work for more than 2 classes
Softmax classifier
• Interpret scores as unnormalized log probabilities of the classes
• The softmax function squashes values into probabilities ranging from 0 to 1:
  P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j}, where s = f(x_i; W) is the score function
• Example with three classes: see the sketch below
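A numerically stable softmax sketch; the three scores give the kind of three-class example shown on the slide (the exact values are illustrative):

    import numpy as np

    def softmax(scores):
        # Subtracting the max score before exponentiating avoids overflow
        # and does not change the resulting probabilities
        shifted = scores - np.max(scores)
        exp_scores = np.exp(shifted)
        return exp_scores / np.sum(exp_scores)

    print(softmax(np.array([3.2, 5.1, -1.7])))   # ~[0.13, 0.87, 0.00]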
Cross-entropy loss
Cross-entropy quantifies how well the predicted probability distribution aligns with the actual distribution of classes; the lower the cross-entropy, the better the predictions match the true labels. Cross-entropy loss is designed to penalize incorrect predictions more heavily, encouraging the model to assign high probabilities to the correct classes.
Cross-entropy loss
  L_i = -log( e^{f_{y_i}} / Σ_j e^{f_j} )
where f = f(x_i; W) is the score function and f_{y_i} is the score of the correct class. We call L_i the cross-entropy loss.
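Putting the formula into code, a sketch of the per-example loss (computed in log space for numerical stability):

    import numpy as np

    def cross_entropy_loss(scores, y_i):
        # L_i = -log( e^{f_{y_i}} / sum_j e^{f_j} )
        shifted = scores - np.max(scores)
        log_probs = shifted - np.log(np.sum(np.exp(shifted)))
        return -log_probs[y_i]

    # With scores [3.2, 5.1, -1.7] and correct class 0:
    print(cross_entropy_loss(np.array([3.2, 5.1, -1.7]), 0))   # ~2.04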
Losses
• Cross-entropy loss is just one possible loss function
• One nice property is that it reinterprets scores as probabilities, which have a natural meaning
• SVM (max-margin) loss functions also used to be popular
• But currently, cross-entropy is the most common classification loss
Summary
• We have a score function and a loss function
• Currently, the score function is based on a linear classifier
• Next, we will generalize to convolutional neural networks
• Find W and b to minimize the loss:
  L = (1/N) Σ_i L_i + λ R(W)
  (the average of the cross-entropy loss over all training examples, plus a regularization term; a sketch follows)
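A sketch of that full objective, reusing the cross_entropy_loss sketch above (the regularization strength lam and the L2 penalty R(W) = sum of squared weights are common illustrative choices):

    import numpy as np

    def total_loss(W, b, X, y, lam=1e-4):
        # Average cross-entropy loss over all N training examples
        data_loss = np.mean([cross_entropy_loss(W @ x + b, y_i)
                             for x, y_i in zip(X, y)])
        # Regularization term: lambda * R(W), with R(W) = sum of squared weights
        reg_loss = lam * np.sum(W ** 2)
        return data_loss + reg_loss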