Description
Summary
Inconsistent output from predict_proba on an SVC classifier, when compared with the scikit-learn implementation.
Minimal reproducible example
from sklearn.svm import SVC as SklearnSVC
from sklearnex.svm import SVC as SklearnexSVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
# Generate data
X, y = make_classification(n_samples=15000, n_features=50, n_classes=10, n_informative=20, n_redundant=5, n_clusters_per_class=4, class_sep=0.2, flip_y=0.05, random_state=42)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train sklearn SVC
sklearn_svc = SklearnSVC(probability=True, random_state=42)
sklearn_svc.fit(X_train, y_train)
sklearn_proba = sklearn_svc.predict_proba(X_test)
sklearn_proba_max = sklearn_proba.max(1)
# Train sklearnex SVC
sklearnex_svc = SklearnexSVC(probability=True, random_state=42)
sklearnex_svc.fit(X_train, y_train)
sklearnex_proba = sklearnex_svc.predict_proba(X_test)
sklearnex_proba_max = sklearnex_proba.max(1)
# Visualize maximal probabilities per sample
fig, ax = plt.subplots()
# Get the index of the maximum value for each row in both arrays
predicted_class_sklearn = np.argmax(sklearn_proba, axis=1)
predicted_class_sklearnex = np.argmax(sklearnex_proba, axis=1)
mask_same_class_predicted = predicted_class_sklearnex == predicted_class_sklearn
# Scatter point is red if they predicted different classes, black if they agree on predicted class
c = ["black" if agreement else "red" for agreement in mask_same_class_predicted]
ax.scatter(sklearn_proba_max, sklearnex_proba_max, s=5, c=c, alpha=.5)
# Add legend
from matplotlib.lines import Line2D
legend_elements = [
    Line2D([0], [0], marker='o', color='w', markerfacecolor='black', markersize=8, alpha=0.5, label='Same class predicted'),
    Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=8, alpha=0.5, label='Different class predicted'),
]
ax.legend(handles=legend_elements, loc='best', frameon=False)
ax.spines[["right", "top"]].set_visible(False)
ax.set_title("sklearn vs sklearnex max probability per sample")
ax.set_xlabel("sklearn max probability")
ax.set_ylabel("sklearnex max probability")
ax.set_xlim([0,1])
ax.set_ylim([0,1])
Expected outcome
Output from sklearnex.svm.SVC.predict_proba should be consistent with output from sklearn.svm.SVC.predict_proba. Plotting the per-sample maximum probabilities of one against the other would then produce scatter points lying roughly on a straight line from (0,0) to (1,1).
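The same expectation can be checked numerically rather than by eye. Below is a minimal sketch that uses only NumPy. The helper name `summarize_proba_gap` and the keys it returns are my own and are not part of scikit-learn or sklearnex. It takes two `(n_samples, n_classes)` probability arrays and reports how often the predicted classes agree and how far apart the probabilities are.

```python
import numpy as np


def summarize_proba_gap(proba_a, proba_b):
    """Compare two (n_samples, n_classes) probability arrays.

    Hypothetical helper for this report, not a library function.
    """
    proba_a = np.asarray(proba_a, dtype=float)
    proba_b = np.asarray(proba_b, dtype=float)
    if proba_a.shape != proba_b.shape:
        raise ValueError("probability arrays must have the same shape")

    # Per-sample maximum probability, as plotted in the scatter plot
    max_a = proba_a.max(axis=1)
    max_b = proba_b.max(axis=1)

    # True where both arrays pick the same class for a sample
    same_class = proba_a.argmax(axis=1) == proba_b.argmax(axis=1)

    return {
        "class_agreement": float(same_class.mean()),
        "mean_abs_diff_max": float(np.abs(max_a - max_b).mean()),
        "max_abs_diff_any": float(np.abs(proba_a - proba_b).max()),
    }
```

With the arrays from the example above, this would be called as `summarize_proba_gap(sklearn_proba, sklearnex_proba)`. If the two implementations were consistent, I would expect `class_agreement` close to 1 and both difference values close to 0.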
Actual outcome
Output from sklearnex.svm.SVC.predict_proba is very different from the output of sklearn.svm.SVC.predict_proba. The former almost always predicts high maximum probabilities, while the latter's are more evenly distributed between 0 and 1. See plots below:
Environment
Red Hat Enterprise Linux 8.10 (Ootpa)
scikit-learn version: 1.7.2
scikit-learn-intelex version: 2199.9.9
numpy version: 2.3.4
Python version: 3.11.13
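For anyone trying to reproduce this, the package versions above can be gathered with a short script. This is only a sketch using the standard library. `collect_versions` is a hypothetical helper name, and the package names passed to it are the PyPI distribution names.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages):
    """Return a dict of installed versions for the given distribution names."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # Keep going so that a missing package is visible in the report
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    names = ["scikit-learn", "scikit-learn-intelex", "numpy"]
    for package, installed in collect_versions(names).items():
        print(f"{package} version: {installed}")
```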
Wrap up
This is my first issue on this project. Please let me know if I need to provide additional information.