Commit 87ef4fc

solegalli and glemaitre authored
DOC improve ENN documentation (#1021)
Co-authored-by: Guillaume Lemaitre <[email protected]>
1 parent c3c150d

File tree

1 file changed (+28, -16 lines)


doc/under_sampling.rst

Lines changed: 28 additions & 16 deletions
@@ -255,14 +255,23 @@ majority class is removed, whereas on the right, the entire Tomek's link is remo
 
 .. _edited_nearest_neighbors:
 
-Edited data set using nearest neighbours
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Editing data using nearest neighbours
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-:class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
-"edit" the dataset by removing samples which do not agree "enough" with their
-neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
-under-sampled, the nearest-neighbours are computed and if the selection
-criterion is not fulfilled, the sample is removed::
+Edited nearest neighbours
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The edited nearest neighbours methodology uses K-Nearest Neighbours to identify the
+neighbours of the targeted class samples, and then removes observations if any or most
+of their neighbours are from a different class :cite:`wilson1972asymptotic`.
+
+:class:`EditedNearestNeighbours` carries out the following steps:
+
+1. Train a K-Nearest Neighbours model on the entire dataset.
+2. Find each observation's K closest neighbours (only for the targeted classes).
+3. Remove observations if any or most of their neighbours belong to a different class.
+
+The code below shows this in practice::
 
     >>> sorted(Counter(y).items())
     [(0, 64), (1, 262), (2, 4674)]
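The three steps above amount to a neighbourhood vote. As a rough, self-contained sketch in plain NumPy (the `edited_nearest_neighbours` helper below is hypothetical and for illustration only, not imbalanced-learn's actual implementation):

```python
import numpy as np

def edited_nearest_neighbours(X, y, n_neighbors=3, kind_sel="all", target_class=1):
    """Illustrative ENN sketch (hypothetical helper, not the library code).

    Step 1: "train" on the entire dataset (here, brute-force distances).
    Step 2: find each targeted observation's K closest neighbours.
    Step 3: drop it if any ('all') or most ('mode') neighbours disagree.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = np.ones(len(y), dtype=bool)
    for i in np.flatnonzero(y == target_class):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                       # a sample is not its own neighbour
        nn = np.argsort(dist)[:n_neighbors]    # step 2: K closest neighbours
        agree = int(np.sum(y[nn] == y[i]))
        if kind_sel == "all":
            keep[i] = agree == n_neighbors     # removed if ANY neighbour disagrees
        else:  # kind_sel == "mode"
            keep[i] = agree > n_neighbors / 2  # removed if MOST neighbours disagree
    return X[keep], y[keep]

# A lone class-1 point sitting inside the class-0 cluster is edited out:
X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [0.15]]
y = [0, 0, 0, 1, 1, 1, 1]
X_res, y_res = edited_nearest_neighbours(X, y, n_neighbors=2, target_class=1)
print(y_res.tolist())  # [0, 0, 0, 1, 1, 1]
```

The class-1 point at 0.15 has only class-0 neighbours, so it is removed; the three class-1 points clustered around 5 keep each other as neighbours and survive.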
@@ -272,12 +281,12 @@ criterion is not fulfilled, the sample is removed::
     >>> print(sorted(Counter(y_resampled).items()))
     [(0, 64), (1, 213), (2, 4568)]
 
-Two selection criteria are currently available: (i) the majority (i.e.,
-``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
-nearest-neighbors have to belong to the same class than the sample inspected to
-keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
-conservative than `kind_sel='mode'`, and more samples will be excluded in
-the former strategy than the latest::
+
+To paraphrase step 3, :class:`EditedNearestNeighbours` retains observations from
+the majority class when **most**, or **all**, of their neighbours are from the
+same class. To control this behaviour we set ``kind_sel='mode'`` or
+``kind_sel='all'``, respectively. Hence, ``kind_sel='all'`` is less conservative
+than ``kind_sel='mode'``, resulting in the removal of more samples::
 
     >>> enn = EditedNearestNeighbours(kind_sel="all")
     >>> X_resampled, y_resampled = enn.fit_resample(X, y)
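The two selection criteria reduce to a single comparison on the number of same-class neighbours. A minimal sketch (the `keep_sample` helper below is hypothetical, for illustration only):

```python
import numpy as np

def keep_sample(neighbour_labels, sample_label, kind_sel):
    """Decide whether an inspected sample survives the ENN cleaning.

    'all'  : every neighbour must share the sample's class.
    'mode' : a strict majority of neighbours must share the sample's class.
    """
    labels = np.asarray(neighbour_labels)
    agree = int(np.sum(labels == sample_label))
    if kind_sel == "all":
        return agree == len(labels)
    return agree > len(labels) / 2

# A sample of class 2 whose 3 nearest neighbours have labels [2, 2, 1]:
print(keep_sample([2, 2, 1], 2, kind_sel="all"))   # False: one neighbour disagrees
print(keep_sample([2, 2, 1], 2, kind_sel="mode"))  # True: the majority agrees
```

This is why ``kind_sel='all'`` removes more samples: a single dissenting neighbour is enough to discard the observation, whereas ``'mode'`` tolerates a minority of dissenters.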
@@ -288,9 +297,12 @@ the former strategy than the latest::
     >>> print(sorted(Counter(y_resampled).items()))
     [(0, 64), (1, 234), (2, 4666)]
 
-The parameter ``n_neighbors`` allows to give a classifier subclassed from
-``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
-the decision to keep a given sample or not.
+The parameter ``n_neighbors`` accepts an integer: the number of neighbours to
+examine for each sample. It can also take an estimator subclassed from
+scikit-learn's ``KNeighborsMixin``. When passing such an estimator, note that a
+3-nearest neighbours estimator examines only 2 neighbours during the cleaning,
+because the sample under inspection is returned as its own closest neighbour,
+being part of the samples provided at ``fit``.
 
 Repeated Edited Nearest Neighbours
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
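The self-inclusion effect described for ``n_neighbors`` is easy to reproduce: when the query point belongs to the data the neighbours model was fitted on, it comes back as its own closest neighbour at distance zero. A small self-contained NumPy sketch (the `k_nearest_indices` helper is hypothetical):

```python
import numpy as np

def k_nearest_indices(X, query_idx, k):
    """Return indices of the k nearest rows of X to X[query_idx] (brute force)."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X - X[query_idx], axis=1)
    return np.argsort(dist)[:k].tolist()

X = [[0.0], [0.1], [0.2], [5.0]]
# Asking for 3 neighbours of the first point returns the point itself first,
# at distance zero, so only 2 "real" neighbours remain for the cleaning step.
print(k_nearest_indices(X, query_idx=0, k=3))  # [0, 1, 2]
```

In other words, a K-neighbours estimator fitted on the full dataset effectively yields K - 1 usable neighbours per inspected sample.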
