@@ -255,14 +255,23 @@ majority class is removed, whereas on the right, the entire Tomek's link is remo
.. _edited_nearest_neighbors:
- Edited data set using nearest neighbours
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Editing data using nearest neighbours
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- :class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
- "edit" the dataset by removing samples which do not agree "enough" with their
- neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
- under-sampled, the nearest-neighbours are computed and if the selection
- criterion is not fulfilled, the sample is removed::
+ Edited nearest neighbours
+ ~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ The edited nearest neighbours methodology uses K-Nearest Neighbours to identify the
+ neighbours of the targeted class samples, and then removes observations if any or most
+ of their neighbours are from a different class :cite:`wilson1972asymptotic`.
+
+ :class:`EditedNearestNeighbours` carries out the following steps:
+
+ 1. Train a K-Nearest Neighbours model on the entire dataset.
+ 2. Find each observation's K closest neighbours (only for the targeted classes).
+ 3. Remove an observation if any or most of its neighbours belong to a different class.
+
+ The code below illustrates this::
>>> sorted(Counter(y).items())
[(0, 64), (1, 262), (2, 4674)]
@@ -272,12 +281,12 @@ criterion is not fulfilled, the sample is removed::
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 213), (2, 4568)]
- Two selection criteria are currently available: (i) the majority (i.e.,
- ``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
- nearest-neighbors have to belong to the same class than the sample inspected to
- keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
- conservative than `kind_sel='mode'`, and more samples will be excluded in
- the former strategy than the latest::
+
+ To paraphrase step 3, :class:`EditedNearestNeighbours` will retain an observation from
+ the targeted classes when **most** or **all** of its neighbours are from the same class.
+ To control this behaviour we set ``kind_sel='mode'`` or ``kind_sel='all'``,
+ respectively. Hence, ``kind_sel='all'`` is less conservative than ``kind_sel='mode'``:
+ it is the stricter criterion for keeping a sample, so it removes more samples::
>>> enn = EditedNearestNeighbours(kind_sel="all")
>>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -288,9 +297,12 @@ the former strategy than the latest::
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 234), (2, 4666)]
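
The difference between the two criteria can be sketched on a toy 1-D dataset. The snippet below is a minimal pure-Python illustration, not the library's implementation: the helper ``edited_nearest_neighbours``, its ``targets`` argument and the data are hypothetical, and absolute difference stands in for the distance metric.

```python
from collections import Counter


def edited_nearest_neighbours(X, y, targets, n_neighbors=3, kind_sel="all"):
    """Toy ENN on 1-D data: return the indices of the samples that are kept."""
    kept = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi not in targets:
            kept.append(i)  # non-targeted classes are never removed
            continue
        # Rank every other sample by distance to sample i (ties broken by index).
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: (abs(X[j] - xi), j))
        labels = [y[j] for j in others[:n_neighbors]]
        if kind_sel == "all":
            keep = all(label == yi for label in labels)
        else:  # "mode": the most common neighbour label must match
            keep = Counter(labels).most_common(1)[0][0] == yi
        if keep:
            kept.append(i)
    return kept


# Hypothetical data: class 0 is the majority (targeted) class.
X = [0.0, 1.0, 2.0, 3.0, 3.4, 4.0, 10.0, 10.5, 11.0]
y = [0, 0, 0, 0, 1, 0, 1, 0, 1]

kept_all = edited_nearest_neighbours(X, y, targets={0}, kind_sel="all")
kept_mode = edited_nearest_neighbours(X, y, targets={0}, kind_sel="mode")
print(kept_all)   # indices kept under the "all" criterion
print(kept_mode)  # indices kept under the "mode" criterion
```

In this sketch every sample kept under ``"all"`` is also kept under ``"mode"``, which is why ``"all"`` removes at least as many samples.
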
- The parameter ``n_neighbors`` allows to give a classifier subclassed from
- ``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
- the decision to keep a given sample or not.
+ The parameter ``n_neighbors`` accepts an integer, which sets the number of
+ neighbours to examine for each sample. It can also take a classifier subclassed from
+ ``KNeighborsMixin`` from scikit-learn. Note that if you pass a 3-Nearest Neighbours
+ classifier, only 2 neighbours are examined for the cleaning: each sample under
+ examination was part of the data provided at ``fit``, so it is returned as one of
+ its own 3 nearest neighbours.
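
The "lost" neighbour can be seen in a small pure-Python sketch. It does not use scikit-learn: ``k_nearest`` is a hypothetical helper over made-up 1-D points, and absolute difference is assumed as the distance.

```python
def k_nearest(points, query, k):
    """Return the indices of the k points closest to `query` (1-D toy data)."""
    order = sorted(range(len(points)),
                   key=lambda j: (abs(points[j] - query), j))
    return order[:k]


fitted = [0.0, 1.0, 2.5, 4.0]

# Querying with a point that was part of the fitted data returns that point
# itself first (distance 0), so k=3 leaves only 2 genuine neighbours.
neighbours = k_nearest(fitted, fitted[1], k=3)
genuine = [j for j in neighbours if j != 1]
print(neighbours, genuine)
```
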
Repeated Edited Nearest Neighbours
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~