Introduction
This post is a continuation of the ROC and AUC Interpretation. Please make sure that you understand that post before reading this one.
In this post, we will implement a ROC AUC Score in Python with O(nlogn) runtime complexity.
You can also access this post on my personal page maitbayev.github.io/posts/roc-auc-implementation/ with better code highlight.
Explanation
Similar to the previous post we have:
A dataset with positive and negative items
An ML model that predicts a probability score from 0 to 1, representing the probability that the input belongs to the positive class
We want to compute the ROC AUC score of our model predictions. The algorithm that we are going to implement is explained more easily with a visualization (press the play button):
A few notes from the animation video:
The ROC score is the sum of the areas of trapezoids formed by two adjacent points on the ROC curve
Some trapezoids have zero area
We process the dataset items in order of their probability scores, from the highest to the lowest
Implementation
Let’s setup our environment:
import numpy as np
np.random.seed(0)
n = 100
target = np.random.randint(0, 2, n)
predicted = np.random.rand(n)
We randomly generated targets and predicted probability scores. Let’s check the result of sklearn.metrics.roc_auc_score
:
import sklearn
sklearn.metrics.roc_auc_score(target, predicted)
np.float64(0.4277597402597403)
Our implementation should have the same score.
Trapezoid Area
First, let’s implement a helper function that finds the area of the trapezoid defined by two points (x0,y0) and (x1,y1).
To achieve this, we can add the area of the rectangle and the area of the right triangle, which is:
We can express the formula with Python code:
def trapezoid_area(p0, p1):
return (p1[0] - p0[0]) * (p0[1] + p1[1]) / 2.0
ROC AUC Score
Now our main implementation:
def roc_auc_score(target, predicted):
n = target.shape[0]
num_positive = np.sum(target == 1)
num_negative = n - num_positive
# argsort in reverse order
order = np.argsort(predicted)[::-1]
last = [0, 0]
num_true_positive = 0
num_false_positive = 0
score = 0
for index in range(n):
# Make sure that the new threshold is unique
if index == 0 or predicted[order[index]] != predicted[order[index - 1]]:
# True positive rate
tpr = num_true_positive / num_positive
# False positive rate
fpr = num_false_positive / num_negative
# New point on the ROC curve
cur = [fpr, tpr]
score += trapezoid_area(last, cur)
last = cur
if target[order[index]] == 1:
num_true_positive += 1
else:
num_false_positive += 1
score += trapezoid_area(last, [1, 1])
return score
Let’s verify the result:
roc_auc_score(target, predicted)
np.float64(0.4277597402597403)
Nice, we got exactly the same result as sklearn
.
It is better explained in the code, but roughly our algorithm is:
Sort items by their predicted scores, from largest to smallest
Process the sorted items one by one in a loop
Form the current point on the ROC curve by the ratios:
(num_false_positive / num_negative, num_true_positive / num_positive)
Add the trapezoid area formed by the previous point and the current one
If the current item is positive, then increase
num_true_positive
by oneIf the current item is negative, then increase
num_false_positive
by one
The End
I hope you enjoyed this post.