Commit 87b5761

update user guide enn
1 parent 360a8ee commit 87b5761

1 file changed

+27
-16
lines changed

doc/under_sampling.rst

Lines changed: 27 additions & 16 deletions
@@ -248,14 +248,23 @@ majority class is removed, whereas on the right, the entire Tomek's link is remo
248248

249249
.. _edited_nearest_neighbors:
250250

251-
Edited data set using nearest neighbours
252-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
251+
Editing data set using nearest neighbours
252+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
253253

254-
:class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
255-
"edit" the dataset by removing samples which do not agree "enough" with their
256-
neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
257-
under-sampled, the nearest-neighbours are computed and if the selection
258-
criterion is not fulfilled, the sample is removed::
254+
Edited nearest neighbours
255+
~~~~~~~~~~~~~~~~~~~~~~~~~
256+
257+
The edited nearest neighbours methodology uses KNN to identify the neighbours of the
258+
targeted class samples, and then removes observations if any or most of their
259+
neighbours are from a different class :cite:`wilson1972asymptotic`.
260+
261+
:class:`EditedNearestNeighbours` carries out the following steps:
262+
263+
1. Train a KNN using the entire dataset (typically a 3-KNN).
264+
2. Find each observation's 3 closest neighbours (only for the targeted classes).
265+
3. Remove observations if any or most of their neighbours belong to a different class.
266+
267+
The example below shows its use::
259268

260269
>>> sorted(Counter(y).items())
261270
[(0, 64), (1, 262), (2, 4674)]
@@ -265,12 +274,12 @@ criterion is not fulfilled, the sample is removed::
265274
>>> print(sorted(Counter(y_resampled).items()))
266275
[(0, 64), (1, 213), (2, 4568)]
267276

268-
Two selection criteria are currently available: (i) the majority (i.e.,
269-
``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
270-
nearest-neighbors have to belong to the same class than the sample inspected to
271-
keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
272-
conservative than `kind_sel='mode'`, and more samples will be excluded in
273-
the former strategy than the latest::
277+
278+
To paraphrase step 3, :class:`EditedNearestNeighbours` will retain observations from
279+
the majority class when **most**, or **all**, of their neighbours are from the same class.
280+
To control this behaviour we set ``kind_sel='mode'`` or ``kind_sel='all'``,
281+
respectively. Hence, ``kind_sel='all'`` is less conservative than ``kind_sel='mode'``,
282+
resulting in a removal of more samples::
274283

275284
>>> enn = EditedNearestNeighbours(kind_sel="all")
276285
>>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -281,9 +290,11 @@ the former strategy than the latest::
281290
>>> print(sorted(Counter(y_resampled).items()))
282291
[(0, 64), (1, 234), (2, 4666)]
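
The three steps and the two selection criteria described above can be sketched from scratch with the standard library only. This is a minimal illustration under stated assumptions, not the code used by :class:`EditedNearestNeighbours`: the function name ``toy_enn``, its ``target_classes`` argument, and the toy data are hypothetical, distances are plain Euclidean, and ties in the ``'mode'`` vote are resolved arbitrarily.

```python
from collections import Counter
from math import dist


def toy_enn(X, y, target_classes, n_neighbors=3, kind_sel="all"):
    """Toy edited nearest neighbours: return the indices of the kept samples."""
    kept = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi not in target_classes:
            kept.append(i)  # samples of non-targeted classes are never removed
            continue
        # Neighbours are searched in the entire dataset, excluding the sample itself.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: dist(xi, X[j]),
        )
        neighbour_labels = [y[j] for j in others[:n_neighbors]]
        if kind_sel == "all":
            # Keep only if every neighbour shares the sample's class.
            agrees = all(label == yi for label in neighbour_labels)
        else:
            # "mode": keep if the most common neighbour label is the sample's class.
            agrees = Counter(neighbour_labels).most_common(1)[0][0] == yi
        if agrees:
            kept.append(i)
    return kept


# Hypothetical 1-D data: class 0 is the targeted (majority) class.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.5,), (10.0,), (9.0,), (11.0,), (12.0,)]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1]

keep_all = toy_enn(X, y, target_classes={0}, kind_sel="all")
keep_mode = toy_enn(X, y, target_classes={0}, kind_sel="mode")
print(keep_all)   # [0, 1, 2, 4, 6, 7, 8]
print(keep_mode)  # [0, 1, 2, 3, 4, 6, 7, 8]
```

In this toy data, sample 3 has one neighbour from class 1 and two from class 0, so it is removed with ``kind_sel='all'`` but kept with ``kind_sel='mode'``. Sample 5 is surrounded by class 1 and is removed in both cases.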
283292

284-
The parameter ``n_neighbors`` allows to give a classifier subclassed from
285-
``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
286-
the decision to keep a given sample or not.
293+
The parameter ``n_neighbors`` accepts integers. The integer refers to the number of
294+
neighbours to examine for each sample. It can also take a classifier subclassed from
295+
``KNeighborsMixin`` from scikit-learn. When passing a classifier, note that, if you
296+
pass a 3-KNN classifier, only 2 neighbours will be examined for the cleaning, as the
297+
third sample is the one being examined for exclusion.
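
The off-by-one described above comes from how a nearest-neighbours query behaves on its own training data: each sample is returned as its own closest neighbour. The sketch below shows this with scikit-learn's ``NearestNeighbors`` on hypothetical, well-separated toy points. It only illustrates the mechanism and is not taken from the :class:`EditedNearestNeighbours` source.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Five well-separated 1-D points, chosen so that there are no distance ties.
X = np.array([[0.0], [1.0], [3.0], [7.0], [15.0]])

knn = NearestNeighbors(n_neighbors=3).fit(X)
# Querying the training data itself: column 0 holds each sample's own index.
indices = knn.kneighbors(X, return_distance=False)

own_index = indices[:, 0]
other_neighbours = indices[:, 1:]  # only 2 columns remain for a 3-KNN

print(own_index.tolist())      # [0, 1, 2, 3, 4]
print(other_neighbours.shape)  # (5, 2)
```

Following the paragraph above, an estimator meant to examine 3 neighbours per sample would therefore be configured with ``n_neighbors=4``.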
287298

288299
:class:`RepeatedEditedNearestNeighbours` extends
289300
:class:`EditedNearestNeighbours` by repeating the algorithm multiple times
