Receiver operating characteristic (ROC) analysis is one of the most popular evaluation techniques for machine learning models. Having started mainly in the domain of medical diagnosis, ROC has become very popular among machine learning researchers.
It is a very useful technique for visualizing model performance and depicting the trade-off between the true positive rate and the false alarm rate. In the early phase of a machine learning project everyone tends to use "accuracy" as the measure of model performance, but accuracy has a serious issue built into its very definition. Before going deeper, let us become familiar with some terms that we will use throughout the rest of the article. Suppose we want to build a machine learning model for cancer detection.
True positives (TP) are the cancer patients the model correctly predicts as having cancer, false positives (FP) are non-cancer patients classified as having cancer, true negatives (TN) are non-cancer patients correctly classified as not having cancer, and false negatives (FN) are cancer patients classified as not having cancer. These four counts are fundamental to evaluating any model, and they are usually presented in a confusion matrix (also called a contingency table).
*Figure A: Confusion matrix*
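To make these four counts concrete, here is a minimal sketch that tallies them for a handful of hypothetical predictions (the labels below are made up purely for illustration):

```python
# Minimal sketch: counting the four confusion-matrix cells for a
# hypothetical cancer-detection run (1 = cancer, 0 = no cancer).
y_true = [1, 0, 0, 1, 0, 0, 1, 0]   # assumed ground-truth labels
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]   # assumed model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")  # TP=2, FP=1, TN=4, FN=1
```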
Now, coming back to the problem with accuracy. Accuracy is defined as: Accuracy = (TP + TN) / total number of points under test.
In most real-world cases the TN count is very large; for example, the number of people who do not have cancer is far larger than the number of cancer patients, and this imbalance is known in advance. When measuring model performance, doctors are interested in how many actual cancer patients are correctly classified, since that is why the model was built in the first place. But if we look at the definition of accuracy, a high TN count alone drives accuracy up without telling us anything meaningful about how well the model detects cancer. Things get even trickier with multiclass classification problems where the classes are not necessarily balanced. So we need something smarter.
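As a quick illustration, consider a hypothetical screening set with 990 healthy people and 10 cancer patients and a "model" that simply predicts "no cancer" for everyone (all counts below are made up):

```python
# Sketch of why accuracy misleads on imbalanced data: a trivial model
# that predicts "no cancer" for every person.
tp, fn = 0, 10        # all 10 actual cancer patients are missed
tn, fp = 990, 0       # all 990 healthy people are labelled "no cancer"

accuracy = (tp + tn) / (tp + tn + fp + fn)
tp_rate = tp / (tp + fn)              # how many real cancer cases were caught

print(f"accuracy = {accuracy:.2%}")   # 99.00%
print(f"TP rate  = {tp_rate:.2%}")    # 0.00%
```

Despite 99% accuracy, the model catches none of the actual cancer patients, which is exactly the failure mode that accuracy hides.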
ROC Space:
*Figure B: Classification at different thresholds*
Before going into the details of the ROC space, let us understand how a ROC curve is plotted.
We take the following steps. First, we measure the range of values that the predicted probability takes. This is P(Cancer | Features), and we know that if this value is greater than P(Not Cancer | Features) we classify the sample as cancer. For now, though, we take a range of threshold values and compute the TP and FP counts with respect to each threshold (Fig. B). This gives a range of TP and FP values for different thresholds (generally from 0 to 1). We plot them with the TP rate on the Y axis and the FP rate on the X axis. With this basic idea in place, it is time to understand the ROC space and how to judge your model.
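Here is a minimal sketch of this threshold sweep, assuming we already have a score P(Cancer | Features) for each validation sample (the scores and labels below are illustrative only):

```python
import numpy as np
import matplotlib.pyplot as plt

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])   # assumed true labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55,
                   0.5, 0.4, 0.3, 0.2, 0.1])         # assumed P(Cancer | Features)

thresholds = np.linspace(1.0, 0.0, 101)
tpr, fpr = [], []
for t in thresholds:
    pred = (scores >= t).astype(int)                 # classify as cancer above the threshold
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    tpr.append(tp / (tp + fn))
    fpr.append(fp / (fp + tn))

plt.plot(fpr, tpr, marker=".")
plt.plot([0, 1], [0, 1], "k--")                      # the random-guess diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```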
Typically a ROC curve looks like Fig. C, where the leftmost part corresponds to threshold 1 and the rightmost part to threshold 0. The point near (0, 0) is the region where the threshold is very high, so the model hardly classifies any sample as positive; both the TP rate and the FP rate are very low there, which is why the curve takes very small values in the lower left region.
On the other hand, the region near the top right corresponds to threshold 0, so the model classifies every sample as positive, making both the TP and FP rates very high; this is why the curve takes large values in the top right corner. The region we are really interested in is the top left, where model quality actually shows: the closer the curve gets to the top left corner, the better the model. The lower left region is also termed the strong-evidence region and the top right the weak-evidence region.

The dotted line y = x is very interesting here, as it represents a totally random model whose predictions are absolutely random. For a two-class classification problem this means a 50/50 chance of landing in either class: if such a model predicts 50% of the true cancer patients as having cancer, it will also predict 50% of the non-cancer patients as having cancer. What makes the situation worse is that if it achieves a 90% TP rate it will also have a 90% FP rate, and that is not acceptable. So what does the ROC curve of a good model look like? It should rise as steeply as possible towards point D and then move across to the top right corner. In the ROC space, any point to the top left of another point is always better, as it extracts more information from the features and therefore represents a better model. The lower right triangle, in turn, represents models that perform worse than a random one.

We already discussed that for models like naive Bayes or neural networks, which output a probabilistic score for each class along with the classification, we can use the thresholding technique to calculate the TP and FP rates and draw the ROC curve. But what about discrete models that output only a class label and no score? For those cases we represent the model as a single point in ROC space, and its performance depends on the position of that point. There are also techniques to convert a discrete model into a probabilistic one so that a full curve can be drawn rather than just plotting a point: for a decision tree, the fraction of instances of each class in a leaf node can serve as a score, or ensemble techniques with voting can be used to obtain a probabilistic score, after which the usual thresholding applies.
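As a rough sketch of the decision-tree idea, scikit-learn's `DecisionTreeClassifier` exposes the class fraction in each leaf through `predict_proba`, which can serve as the score to sweep a threshold over (the synthetic data here is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy dataset standing in for a cancer-screening problem.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# predict_proba reports the class fraction of training instances in each
# leaf, i.e. a score per sample rather than just a hard label.
scores = tree.predict_proba(X_val)[:, 1]
print(scores[:5])
```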
If we draw the curve for very few test samples (or rather validation samples for which we already know the true labels), it turns out to be a step function; as we increase the validation set size, the curve becomes smoother.
One major advantage of the ROC curve is that it is independent of the class distribution counts: even if one class has a much higher number of samples, the shape of the curve does not change, unlike the precision-recall curve.
Selecting the best Model:
One important measure of slope in the ROC space is given by: m = (P(negative) × Cost(FP)) / (P(positive) × Cost(FN)), where P(negative) and P(positive) are the class priors, Cost(FP) is the cost of a false positive, and Cost(FN) is the cost of a false negative.
All points falling on the same such line have the same expected performance and cost (the cost of misclassifying a positive as a negative, and vice versa); these lines of equal slope are termed iso-performance lines. Models whose ROC curves lie more towards the top left are always better, as they come with a smaller cost penalty. There is also the concept of the ROC convex hull, which covers all other curves below it, as shown in Fig. D. The convex hull helps us eliminate non-optimal models.
*Figure D: ROC and Area Under the Curve*
As in Fig. D, we have four models and four ROC curves drawn. We can see that curves A and C actually form the convex hull, while B and D lie below it, so we can eliminate models B and D since A and C will always outperform them. Now the question arises: when should we use model A and when model C? That depends on the analysis of the problem statement itself and on relating it to the iso-performance lines α and β.
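As a small sketch of how the ROC convex hull can be computed, suppose each model is summarized by a single hypothetical (FP rate, TP rate) point; points that fall below the hull correspond to models that can be eliminated:

```python
def roc_convex_hull(points):
    """Return the upper convex hull of ROC points, including (0,0) and (1,1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Pop the previous point while it lies on or below the new hull edge.
        while len(hull) >= 2:
            (ox, oy), (ax, ay) = hull[-2], hull[-1]
            cross = (ax - ox) * (p[1] - oy) - (ay - oy) * (p[0] - ox)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# Hypothetical operating points of four models.
points = [(0.1, 0.6), (0.2, 0.5), (0.4, 0.9), (0.5, 0.7)]
print(roc_convex_hull(points))
# -> [(0.0, 0.0), (0.1, 0.6), (0.4, 0.9), (1.0, 1.0)]; the dominated points drop out
```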
For example, if the costs of an FP and an FN are equal and, in the validation set, the prior probability of a negative sample is 10 times higher than the prior of a positive sample, then m becomes 10. The model with ROC curve A fits this situation, as its slope is close to 10. Now consider another situation where the cost of a false negative is 10 times that of a false positive and the priors are the same for both classes; in this case m = 1/10 and curve C fits best. This helps in identifying the best model for the problem domain.
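In code, the slope calculation for the two scenarios above is just a ratio (a sketch using the priors and costs quoted in the text):

```python
def iso_slope(p_neg, p_pos, cost_fp, cost_fn):
    # m = (P(negative) * cost of a false positive) / (P(positive) * cost of a false negative)
    return (p_neg * cost_fp) / (p_pos * cost_fn)

# Negatives 10x more common, equal costs -> m = 10 (steep line, curve A fits).
print(iso_slope(p_neg=10, p_pos=1, cost_fp=1, cost_fn=1))   # 10.0

# Equal priors, false negatives 10x more costly -> m = 1/10 (flat line, curve C fits).
print(iso_slope(p_neg=1, p_pos=1, cost_fp=1, cost_fn=10))   # 0.1
```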
Area Under ROC Curve:
As we have discussed, the ROC curve is a 2D graphical representation of a model, but what if we need a single scalar value to summarize its performance? We use the AUC measure for that purpose: the Area Under the ROC Curve. The ROC curve is plotted in a unit square, so the AUC ranges from 0 to 1; ideally it should be above 0.5, as anything below 0.5 is worse than a random prediction.
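Here is a minimal sketch of computing AUC with the trapezoidal rule from a handful of illustrative (FP rate, TP rate) points, such as those produced by the threshold sweep sketched earlier:

```python
# Illustrative ROC points, ordered by increasing FP rate.
fpr = [0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.5, 0.8, 0.9, 1.0]

# Trapezoidal rule: sum the area of each slice between consecutive points.
auc = 0.0
for i in range(1, len(fpr)):
    auc += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0

print(f"AUC = {auc:.3f}")   # 0.790; anything below 0.5 is worse than random
```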
Reference: Tom Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, 2006 (available online 19 December 2005).