Commit 2f6b1f6

FIX divide by sqrt(2) the median entry in SMOTENC (#1014)
1 parent 111ff73 commit 2f6b1f6

3 files changed: 20 additions and 7 deletions

doc/over_sampling.rst

Lines changed: 5 additions & 5 deletions
@@ -203,11 +203,11 @@ or relying on `dtype` inference if the columns are using the
 >>> print(sorted(Counter(y_resampled).items()))
 [(0, 30), (1, 30)]
 >>> print(X_resampled[-5:])
-[['A' 0.5246469549655818 2]
-['B' -0.3657680728116921 2]
-['B' 0.9344237230779993 2]
-['B' 0.3710891618824609 2]
-['B' 0.3327240726719727 2]]
+[['A' 0.52... 2]
+['B' -0.36... 2]
+['B' 0.93... 2]
+['B' 0.37... 2]
+['B' 0.33... 2]]
 
 Therefore, it can be seen that the samples generated in the first and last
 columns are belonging to the same categories originally presented without any

doc/whats_new/v0.11.rst

Lines changed: 8 additions & 0 deletions
@@ -6,6 +6,14 @@ Version 0.11.1
 Changelog
 ---------
 
+Bug fixes
+.........
+
+- Fix a bug in :class:`~imblearn.over_sampling.SMOTENC` where the entries of the
+  one-hot encoding should be divided by `sqrt(2)` and not `2`, taking into account that
+  they are plugged into an Euclidean distance computation.
+  :pr:`1014` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 
 Version 0.11.0
 ==============

imblearn/over_sampling/_smote/base.py

Lines changed: 7 additions & 2 deletions
@@ -671,13 +671,18 @@ def _fit_resample(self, X, y):
 
         # In the edge case where the median of the std is equal to 0, the 1s
         # entries will be also nullified. In this case, we store the original
-        # categorical encoding which will be later used for inversing the OHE
+        # categorical encoding which will be later used for inverting the OHE
         if math.isclose(self.median_std_, 0):
             self._X_categorical_minority_encoded = _safe_indexing(
                 X_ohe.toarray(), np.flatnonzero(y == class_minority)
             )
 
-        X_ohe.data = np.ones_like(X_ohe.data, dtype=X_ohe.dtype) * self.median_std_ / 2
+        # With one-hot encoding, the median will be repeated twice. We need to divide
+        # by sqrt(2) such that we only have one median value contributing to the
+        # Euclidean distance
+        X_ohe.data = (
+            np.ones_like(X_ohe.data, dtype=X_ohe.dtype) * self.median_std_ / np.sqrt(2)
+        )
         X_encoded = sparse.hstack((X_continuous, X_ohe), format="csr")
 
         X_resampled, y_resampled = super()._fit_resample(X_encoded, y)
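
The arithmetic behind the new scaling can be checked with a small standalone sketch. It is not part of the commit: the helper `ohe_distance_contribution` and the value `median_std = 1.5` are hypothetical, chosen only to illustrate the comment in the hunk above. Two samples that differ in one categorical feature differ in two one-hot entries, so each entry must be scaled by `1 / sqrt(2)` for the pair to contribute exactly one median to the Euclidean distance.

```python
import math


def ohe_distance_contribution(scale: float) -> float:
    """Euclidean distance between two one-hot rows encoding different
    categories, when every 1 entry has been replaced by `scale`."""
    row_a = [scale, 0.0]  # sample in category A
    row_b = [0.0, scale]  # sample in category B
    return math.dist(row_a, row_b)


# Hypothetical stand-in for self.median_std_ (illustration only).
median_std = 1.5

# Scaling before the fix: the two differing entries together contribute
# median_std / sqrt(2), which is less than one median.
old = ohe_distance_contribution(median_std / 2)

# Scaling after the fix: the two differing entries together contribute
# exactly median_std.
new = ohe_distance_contribution(median_std / math.sqrt(2))

print(f"old contribution: {old:.4f}")
print(f"new contribution: {new:.4f}")
```

Under these assumptions, the old scaling understates a categorical mismatch by a factor of `sqrt(2)` relative to the median of the continuous features' standard deviations, which is the discrepancy the changelog entry describes.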
