1. What is an outlier?
a. An outlier is a data object that seems to be created from a different structure, as it significantly deviates from the other data objects. The median of a data set is not significantly impacted by outliers, while the mean is.
2. How do outliers differ from noise?
a. Noise is a random error or variance in a data set, while an outlier is one or two points that deviate from normality. Noise should be removed from the data set during the data cleaning and preparation phase and before the detection of outliers. Finally, outliers can be interesting and helpful to look at, while undetected noise can lead to unwanted, unpredictable, and uninteresting information.
3. What is the difference between outlier detection and novelty
…show more content…
Contextual outliers
c. Collective outliers
5. What is the difference between contextual and collective outliers?
a. Contextual outliers deviate are categorized as outliers in a specifically selected context, while collective outliers are a group of data objects that significantly deviate from the data set as a whole. It is important to note that in collective outliers, each individual data point might not be an outlier, but the group as a whole provides more insight into the data set at hand.
6. Give an example of contextual and collective outliers.
a. Contextual outlier: during a normal week, Joe usually charges around $100 to his credit card. During the week leading up to Christmas, Joe charges over $1000 to his credit card. Out of context, this may look like fraud but because we know people’s spending habits around the holidays it probably isn’t an outlier.
b. Collective outlier: Let’s say Dr. Velkoski teaches three night classes here at DePaul each week. Each class has 25 students enrolled. If we take a look at the distribution of attendance and students who show up to class, we notice that the average weekly attendance for each class is 19 students. The last two weeks, however, attendance has been dropping. Looking at each class individually, we might not notice a trend, but looking at the whole, we notice that the drop in attendance is bringing down the average amount of students in each class as a whole, making it a collective
…show more content…
What is Grubb’s test?
a. Also called the Maximum Norm Residual Test. Grubb’s test compares the z-score of an object that is suspected to be an outlier and the following defined value:
In this case, N is the number of data object and is the value of the t-distribution at a confidence level of α/2N. If the z-score is greater than the defined value, the object can be considered to be an outlier.
10. What is Mahalanobis distance?
a. Mahalanobis distance is a method that can be used to detect multivariate outliers. It calculates the difference from the object in question (o) to the mean vector of the data set (ō). In the formula below, S is the covariance matrix. The outcome MDist(o,ō) is a univariate variable and can be applied to Grubb’s test for outlier detection.
11. What is a chi-squared statistic?
a. The chi-squared statistic is another method that can be employed for multivariate outlier detection. It differs from Mahalanobis’ test because it operates under the assumption that the data set has a normal distribution. It calculates the distance of an object (oi) from the mean of the i-th dimension (Ei). If the chi-squared statistic is large, then it can be said that the object, oi, is an