predict_proba results are (wildly) inconsistent between sklearn.svm.SVC and sklearnex.svm.SVC #2909

@thwit

Summary
predict_proba on sklearnex.svm.SVC returns probabilities that are inconsistent with those of the scikit-learn implementation (sklearn.svm.SVC) fitted on the same data with the same parameters.

Minimal reproducible example

from sklearn.svm import SVC as SklearnSVC
from sklearnex.svm import SVC as SklearnexSVC

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

# Generate data
X, y = make_classification(
    n_samples=15000,
    n_features=50,
    n_classes=10,
    n_informative=20,
    n_redundant=5,
    n_clusters_per_class=4,
    class_sep=0.2,
    flip_y=0.05,
    random_state=42,
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train sklearn SVC
sklearn_svc = SklearnSVC(probability=True, random_state=42)
sklearn_svc.fit(X_train, y_train)
sklearn_proba = sklearn_svc.predict_proba(X_test)
sklearn_proba_max = sklearn_proba.max(1)

# Train sklearnex SVC
sklearnex_svc = SklearnexSVC(probability=True, random_state=42)
sklearnex_svc.fit(X_train, y_train)
sklearnex_proba = sklearnex_svc.predict_proba(X_test)
sklearnex_proba_max = sklearnex_proba.max(1)

# Visualize maximal probabilities per sample
fig, ax = plt.subplots()


# Get the index of the maximum value for each row in both arrays
predicted_class_sklearn = np.argmax(sklearn_proba, axis=1)
predicted_class_sklearnex = np.argmax(sklearnex_proba, axis=1)
mask_same_class_predicted = predicted_class_sklearnex == predicted_class_sklearn

# Scatter point is red if they predicted different classes, black if they agree on predicted class
c = ["black" if agreement else "red" for agreement in mask_same_class_predicted]
ax.scatter(sklearn_proba_max, sklearnex_proba_max, s=5, c=c, alpha=.5)

# Add legend
from matplotlib.lines import Line2D
legend_elements = [
    Line2D([0], [0], marker='o', color='w', markerfacecolor='black', markersize=8, alpha=0.5, label='Same class predicted'),
    Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=8, alpha=0.5, label='Different class predicted')
]
ax.legend(handles=legend_elements, loc='best', frameon=False)

ax.spines[["right", "top"]].set_visible(False)
ax.set_title("sklearn vs sklearnex max probability per sample")
ax.set_xlabel("sklearn max probability")
ax.set_ylabel("sklearnex max probability")
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
plt.show()

Expected outcome
Output from sklearnex.svm.SVC.predict_proba should be consistent with output from sklearn.svm.SVC.predict_proba. The plot would then show scatter points lying roughly on a straight line from (0,0) to (1,1).

Actual outcome
Output from sklearnex.svm.SVC.predict_proba is very different from the output of sklearn.svm.SVC.predict_proba. The former almost always predicts high scores, while the latter's predictions are more evenly distributed between 0 and 1. See plots below:

[Images: two plots attached to the original issue]
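To put numbers on the gap rather than relying on the plot alone, a small helper along these lines could be run on the two probability arrays. This is only a sketch: `summarize_proba_gap` is a hypothetical name that belongs to neither library, and the toy arrays merely stand in for `sklearn_proba` and `sklearnex_proba` from the example above.

```python
import numpy as np


def summarize_proba_gap(proba_a, proba_b):
    """Summarize how far apart two (n_samples, n_classes) probability arrays are.

    Hypothetical helper for illustration; not part of sklearn or sklearnex.
    """
    proba_a = np.asarray(proba_a, dtype=float)
    proba_b = np.asarray(proba_b, dtype=float)
    if proba_a.shape != proba_b.shape:
        raise ValueError("probability arrays must have the same shape")

    abs_diff = np.abs(proba_a - proba_b)
    # Per-sample check: do both arrays rank the same class highest?
    same_class = proba_a.argmax(axis=1) == proba_b.argmax(axis=1)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "argmax_agreement": float(same_class.mean()),
        "mean_max_proba_a": float(proba_a.max(axis=1).mean()),
        "mean_max_proba_b": float(proba_b.max(axis=1).mean()),
    }


# Toy arrays standing in for sklearn_proba and sklearnex_proba.
toy_a = np.array([[0.6, 0.4], [0.3, 0.7]])
toy_b = np.array([[0.9, 0.1], [0.8, 0.2]])
summary = summarize_proba_gap(toy_a, toy_b)
print(summary)
```

With the real arrays, the call would be `summarize_proba_gap(sklearn_proba, sklearnex_proba)`. A large difference between the two mean-max values would match the "almost always predicts high scores" observation.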

Environment
Red Hat Enterprise Linux 8.10 (Ootpa)
scikit-learn version: 1.7.2
scikit-learn-intelex version: 2199.9.9
numpy version: 2.3.4
Python version: 3.11.13
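
For anyone trying to reproduce this, the version list above can be collected programmatically. The following is a minimal sketch using only the standard library; `collect_versions` is a hypothetical helper name, and the package names are the distribution names assumed to be used on PyPI. `sklearn.show_versions()` prints a more detailed report for scikit-learn itself.

```python
from importlib import metadata
import platform


def collect_versions(packages=("scikit-learn", "scikit-learn-intelex", "numpy")):
    """Return a dict of installed versions for the given distribution names."""
    versions = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so the report is always complete.
            versions[name] = "not installed"
    return versions


report = collect_versions()
for name, version in report.items():
    print(f"{name}: {version}")
```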

Wrap up
This is my first issue on this project. Please let me know if I need to provide additional information.
