Lecture4.pdf
School: Cardiff University
Course: MATH MA3701 (Mathematics)
Date: Dec 21, 2024
Mathematics of Artificial Intelligence
Lecture 4
Alexander Balinsky
Cardiff School of Mathematics
Contents
1. Distance Measures and Measures of Similarity
2. Introduction to Classification: Nearest Neighbour
3. Model Evaluation and Selection
   - Hand-written digit recognition
   - Metrics for Evaluating Classifier Performance
4. Bayes Classification Methods
Distance Measures and Measures of Similarity

We now take a short detour to study the general notion of distance measures.

Definition (Distance Measure)
Suppose we have a set of points, called a space. A distance measure on this space is a function $d(x, y)$ that takes two points in the space as arguments and produces a real number, and satisfies the following axioms:
1. $d(x, y) \ge 0$ (no negative distances).
2. $d(x, y) = 0$ if and only if $x = y$ (distances are positive, except for the distance from a point to itself).
3. $d(x, y) = d(y, x)$ (distance is symmetric).
4. $d(x, y) \le d(x, z) + d(z, y)$ (the triangle inequality).
Example (Distances)
Euclidean distance, or $L_2$-norm:
$$d([x_1, x_2, \dots, x_n], [y_1, y_2, \dots, y_n]) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.$$
$L_r$-norm ($r \ge 1$):
$$d([x_1, x_2, \dots, x_n], [y_1, y_2, \dots, y_n]) = \left( \sum_{i=1}^{n} |x_i - y_i|^r \right)^{1/r}.$$
$L_1$-norm: the Manhattan distance.
$L_\infty$-norm: defined as the maximum of $|x_i - y_i|$ over all dimensions $i$.
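These norms are easy to compute directly. A minimal Python sketch (the function names are mine, not from the lecture):

```python
def minkowski(x, y, r):
    """L_r distance between two equal-length vectors (r >= 1)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def chebyshev(x, y):
    """L_infinity distance: the maximum coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = [0, 3], [4, 0]
print(minkowski(x, y, 1))   # L1 (Manhattan): 7.0
print(minkowski(x, y, 2))   # L2 (Euclidean): 5.0
print(chebyshev(x, y))      # L_inf: 4
```

For these two points the three norms give 7, 5, and 4, matching the formulas above.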
Example (Distances)
Jaccard distance: we define the Jaccard distance of sets by $d(x, y) = 1 - \mathrm{SIM}(x, y)$. That is, the Jaccard distance is 1 minus the ratio of the sizes of the intersection and union of sets $x$ and $y$. We must verify that this function is a distance measure.
Cosine distance: we do not distinguish between a vector and a multiple of that vector. The cosine distance between two points is then the angle that the vectors to those points make.
Edit distance: this distance makes sense when points are strings. The distance between two strings $x = x_1 x_2 \dots x_n$ and $y = y_1 y_2 \dots y_m$ is the smallest number of insertions and deletions of single characters that will convert $x$ to $y$.
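As a quick check that the Jaccard definition behaves as expected, here is a small sketch (the helper names are mine):

```python
def jaccard_sim(x, y):
    """SIM(x, y): size of intersection divided by size of union."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def jaccard_dist(x, y):
    """Jaccard distance: 1 - SIM(x, y)."""
    return 1 - jaccard_sim(x, y)

a, b = {1, 2, 3, 4}, {3, 4, 5}
print(jaccard_sim(a, b))   # 0.4  (intersection size 2, union size 5)
print(jaccard_dist(a, b))  # 0.6
```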
Example (Edit Distance)
The edit distance between the strings $x = \texttt{abcde}$ and $y = \texttt{acfdeg}$ is 3. To convert $x$ to $y$:
- Delete b.
- Insert f after c.
- Insert g after e.
No sequence of fewer than three insertions and/or deletions will convert $x$ to $y$. Thus, $d(x, y) = 3$.

Example (Distances)
Hamming distance: given a space of vectors, we define the Hamming distance between two vectors to be the number of components in which they differ. It should be obvious that Hamming distance is a distance measure.
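The insert/delete-only edit distance can be computed via the longest common subsequence: an optimal conversion deletes everything in $x$ outside a common subsequence and inserts the rest of $y$, so $d(x, y) = n + m - 2\,\mathrm{LCS}(x, y)$. A sketch of both distances (my own implementation, not from the slides):

```python
def edit_distance(x, y):
    """Insert/delete-only edit distance: len(x) + len(y) - 2 * LCS(x, y)."""
    n, m = len(x), len(y)
    # lcs[i][j] = length of the longest common subsequence of x[:i] and y[:j]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return n + m - 2 * lcs[n][m]

def hamming(u, v):
    """Hamming distance: number of components in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))

print(edit_distance("abcde", "acfdeg"))  # 3, as in the worked example
print(hamming("10101", "11001"))         # 2
```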
Introduction to Classification: Nearest Neighbour

Nearest Neighbour Classification (lazy learners)
The idea is to estimate the classification of an unseen instance using the classification of the instance or instances that are closest to it.

Example
Suppose we have a training set with just two instances, each consisting of six attribute values followed by a classification (positive or negative). We are then given a third instance. What should its classification be?
k-Nearest Neighbour or k-NN classification

Definition (Basic k-Nearest Neighbour Classification Algorithm)
- Find the $k$ training instances that are closest to the unseen instance.
- Take the most commonly occurring classification for these $k$ instances.

Issues: there are several key issues that affect the performance of kNN.
- One is the choice of $k$. An estimate of the best value for $k$ can be obtained by cross-validation.
- Another issue is the approach to combining the class labels. The simplest method is to take a majority vote. A more sophisticated approach, which is usually much less sensitive to the choice of $k$, weights each object's vote by its distance. Various choices are possible; for example, the weight factor is often taken to be the reciprocal of the squared distance: $w_i = 1/d(y, z_i)^2$.
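The basic algorithm and its distance-weighted variant can be sketched as follows (a toy implementation; the training data and function names are illustrative, not from the lecture):

```python
from collections import Counter
import math

def knn_predict(train, query, k=3, weighted=True):
    """train: list of (point, label) pairs. Vote among the k nearest
    neighbours, optionally weighting each vote by 1 / d^2."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter()
    for point, label in nearest:
        d = math.dist(point, query)
        # An exact match (d == 0) is given weight 1 here for simplicity.
        votes[label] += 1 / d**2 if weighted and d > 0 else 1
    return votes.most_common(1)[0][0]

train = [((0, 0), "neg"), ((0, 1), "neg"), ((5, 5), "pos"), ((5, 6), "pos")]
print(knn_predict(train, (1, 1), k=3))  # "neg": the two closest points dominate
```

With `weighted=True` the two nearby "neg" points contribute weight 1 + 1/2 = 1.5, swamping the distant "pos" neighbour's 1/32.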
Model Evaluation and Selection

Now that you may have built a classification model, there may be many questions going through your mind. For example, suppose you used data from previous sales to build a classifier to predict customer purchasing behaviour.
- You would like an estimate of how accurately the classifier can predict the purchasing behaviour of future customers.
- You may even have tried different methods to build more than one classifier and now wish to compare their accuracy.

Questions:
- What is accuracy? How can we estimate it?
- Are some measures of a classifier's accuracy more appropriate than others?
- How can we obtain a reliable accuracy estimate?
The misclassification rate on the training set is easy to compute. However, what we care about is the generalization error, which is the expected value of the misclassification rate when averaged over future data. This can be approximated by computing the misclassification rate on a large independent test set, not used during model training.

Overfitting: when we fit highly flexible models, we need to be careful that we do not overfit the data; that is, we should avoid trying to model every minor variation in the input, since this is more likely to be noise than true signal. This is illustrated in the figure below, where we see that using a high-degree polynomial results in a curve that is very "wiggly". It is unlikely that the true function has such extreme oscillations. Thus using such a model is unlikely to result in accurate predictions of future outputs.
As another example, consider the KNN classifier. The value of $K$ can have a large effect on the behaviour of this model. When $K = 1$, the method makes no errors on the training set (since we just return the labels of the original training points), but the resulting prediction surface is very "wiggly". Therefore the method may not work well at predicting future data.
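The $K = 1$ behaviour is easy to verify: each training point is its own nearest neighbour, so the misclassification rate on the training set is exactly zero, regardless of how noisy the labels are (a toy sketch with invented data):

```python
def one_nn(train, query):
    """1-NN: return the label of the closest training point."""
    closest = min(train, key=lambda pl: sum((a - b) ** 2
                                            for a, b in zip(pl[0], query)))
    return closest[1]

train = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((5, 5), 0)]

# Every training point is its own nearest neighbour, so the
# misclassification rate on the training set is 0.
train_err = sum(one_nn(train, x) != y for x, y in train) / len(train)
print(train_err)  # 0.0
```

Zero training error here says nothing about future data; that is exactly the gap between training error and generalization error described above.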
Classifiers and models depend on parameters. A typical split of the data:
- 60% training examples
- 20% cross-validation examples
- 20% test examples.
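A minimal sketch of such a split (the 60/20/20 proportions come from the slide; the helper name and seeding are mine):

```python
import random

def split_60_20_20(data, seed=0):
    """Shuffle and split into 60% train / 20% cross-validation / 20% test."""
    data = data[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)   # fixed seed for reproducibility
    n = len(data)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train, val, test = split_60_20_20(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```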
Hand-written digit recognition

[Slides 13-24: image examples of hand-written digit recognition; the figures are not reproduced in this text.]
Metrics for Evaluating Classifier Performance

Cancer classification example:
1. Train a model.
2. Find that you got 1% error on the test set (99% correct diagnoses).
3. However, only 0.50% of patients have cancer (skewed classes).
4. The trivial classifier that always predicts $y = 0$ has only 0.5% error, i.e. 99.5% accuracy!
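The arithmetic behind point 4 can be checked directly (the 0.5% prevalence comes from the slide; the population size of 10,000 is invented for illustration):

```python
# 10,000 hypothetical patients, 0.5% with cancer (y = 1), the rest healthy.
labels = [1] * 50 + [0] * 9950

# A "classifier" that ignores its input and always predicts y = 0.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.995 -- 99.5% accuracy while detecting no cancer at all
```

This is why plain accuracy is a poor metric under skewed classes, and why the next slide turns to precision and recall.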
Precision/Recall
Let $y = 1$ indicate the presence of the rare class that we want to detect. Precision is the fraction of predicted positives that truly belong to the rare class, and recall is the fraction of the rare class that we correctly detect:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}.$$
Definition (F-measure)
The F-measure is the harmonic mean of precision and recall:
$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$
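Precision, recall, and the F-measure can all be computed from the confusion counts; a sketch with invented labels (function name mine):

```python
def precision_recall_f1(y_true, y_pred):
    """Treat y = 1 as the rare positive class we want to detect."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Here 3 of the 4 predicted positives are correct (precision 3/4) and 3 of the 4 true positives are found (recall 3/4).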
Bayes Classification Methods

What are Bayesian classifiers?
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayesian classification is based on Bayes' theorem.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence.
Let $X$ be a data tuple. As usual, it is described by measurements made on a set of $n$ attributes.
Let $H$ be some hypothesis, such as that the data tuple $X$ belongs to a specified class $C$.
For classification problems, we want to determine $P(H \mid X)$, the probability that the hypothesis $H$ holds given the "evidence" or observed data tuple $X$.

Bayes' Theorem
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}.$$
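A quick numerical illustration of the theorem (the disease and test probabilities below are invented for illustration, not from the lecture):

```python
# Hypothesis H: a patient has a disease affecting 1% of the population.
# Evidence X: a positive test, with P(X|H) = 0.9 and a 5% false-positive rate.
p_h = 0.01
p_x_given_h = 0.9
p_x_given_not_h = 0.05

# P(X) by the law of total probability.
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X).
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 3))  # 0.154
```

Even with a fairly accurate test, the posterior is only about 15% because the prior $P(H)$ is small.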
Naïve Bayesian Classification
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let $D$ be a training set of tuples and their associated class labels. As usual, each tuple is represented by an $n$-dimensional attribute vector, $X = (x_1, x_2, \dots, x_n)$, depicting $n$ measurements made on the tuple from $n$ attributes, respectively, $A_1, A_2, \dots, A_n$.
2. Suppose that there are $m$ classes, $C_1, C_2, \dots, C_m$. Given a tuple $X$, the classifier will predict that $X$ belongs to the class having the highest posterior probability, conditioned on $X$. That is, $X$ belongs to the class $C_i$ if and only if
$$P(C_i \mid X) > P(C_j \mid X) \quad \text{for } 1 \le j \le m,\ j \ne i.$$
3. By Bayes' theorem,
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}.$$
As $P(X)$ is constant for all classes, only $P(X \mid C_i)\,P(C_i)$ needs to be maximized.
4. The class prior probabilities:
- If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, $P(C_1) = P(C_2) = \cdots = P(C_m)$, and we would therefore maximize $P(X \mid C_i)$.
- Otherwise, we maximize $P(X \mid C_i)\,P(C_i)$. Note that the class prior probabilities may be estimated by $P(C_i) = |C_{i,D}| / |D|$, where $|C_{i,D}|$ is the number of training tuples of class $C_i$ in $D$.
5. How do we compute $P(X \mid C_i)$? To reduce the computation involved in evaluating $P(X \mid C_i)$, the naïve assumption of class-conditional independence is made. This presumes that the attributes' values are conditionally independent of one another, given the class label of the tuple (i.e. that there are no dependence relationships among the attributes). Thus
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i).$$
We can easily estimate the probabilities $P(x_1 \mid C_i), P(x_2 \mid C_i), \dots, P(x_n \mid C_i)$ from the training tuples.
(a) If $A_k$ is categorical, then $P(x_k \mid C_i)$ is the number of tuples of class $C_i$ in $D$ having the value $x_k$ for $A_k$, divided by $|C_{i,D}|$, the number of tuples of class $C_i$ in the data $D$.
If a count is zero, we can use the pseudo-count method to obtain a prior probability. The adjusted estimates with pseudo-counts are given as
$$P(x_j \mid C_i) = \frac{n(x_j) + 1}{n_i + m_j},$$
where $m_j = |\mathrm{dom}(A_j)|$.
(b) If $A_k$ is continuous-valued, we need to do a bit more work: parameter estimates.
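Putting steps 1-5 together for categorical attributes, with the pseudo-count estimate above, a naïve Bayesian classifier can be sketched as follows (the weather-style data and function names are invented for illustration):

```python
from collections import Counter, defaultdict

def train_nb(data):
    """data: list of (attribute-tuple, class). Returns class counts,
    per-class per-attribute value counts, attribute domains, and |D|."""
    class_counts = Counter(c for _, c in data)
    counts = defaultdict(lambda: defaultdict(Counter))  # counts[class][k][value]
    domains = defaultdict(set)                          # dom(A_k), for smoothing
    for x, c in data:
        for k, v in enumerate(x):
            counts[c][k][v] += 1
            domains[k].add(v)
    return class_counts, counts, domains, len(data)

def predict_nb(model, x):
    """Return the class maximizing P(X|C_i) P(C_i) under the naive assumption."""
    class_counts, counts, domains, n = model
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / n  # prior P(C_i) estimated as |C_i,D| / |D|
        for k, v in enumerate(x):
            # pseudo-count estimate: (n(x_k) + 1) / (n_i + |dom(A_k)|)
            score *= (counts[c][k][v] + 1) / (nc + len(domains[k]))
        if score > best_score:
            best, best_score = c, score
    return best

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("rain", "cool"), "yes"),
        (("overcast", "hot"), "yes")]
model = train_nb(data)
print(predict_nb(model, ("rain", "mild")))  # "yes"
```

The pseudo-counts mean that an attribute value never seen with a class (e.g. "sunny" with "yes") contributes a small nonzero factor instead of zeroing out the whole product.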