@@ -248,14 +248,23 @@ majority class is removed, whereas on the right, the entire Tomek's link is remo

.. _edited_nearest_neighbors:

- Edited data set using nearest neighbours
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Editing data set using nearest neighbours
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- :class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
- "edit" the dataset by removing samples which do not agree "enough" with their
- neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
- under-sampled, the nearest-neighbours are computed and if the selection
- criterion is not fulfilled, the sample is removed::
+ Edited nearest neighbours
+ ~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ The edited nearest neighbours methodology uses KNN to identify the neighbours of the
+ targeted class samples, and then removes observations if any or most of their
+ neighbours are from a different class :cite:`wilson1972asymptotic`.
+
+ :class:`EditedNearestNeighbours` carries out the following steps:
+
+ 1. Train a KNN using the entire dataset (typically a 3-KNN).
+ 2. Find the 3 closest neighbours of each observation (only for the targeted classes).
+ 3. Remove an observation if any or most of its neighbours belong to a different class.
+
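The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper name `edit_nearest_neighbours`, its `target_classes` argument, and the toy data are hypothetical, Euclidean distance is assumed, and only the strict rule (remove when any neighbour disagrees) is shown.

```python
from collections import Counter
from math import dist


def edit_nearest_neighbours(X, y, target_classes, n_neighbors=3):
    """Hypothetical helper: drop targeted samples whose neighbours disagree."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi not in target_classes:
            keep.append(i)  # samples from untargeted classes are never removed
            continue
        # Steps 1-2: rank every other sample by distance, keep the closest k.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: dist(xi, X[j]),
        )
        neighbour_labels = [y[j] for j in others[:n_neighbors]]
        # Step 3 (strict rule): keep only if every neighbour shares the label.
        if all(label == yi for label in neighbour_labels):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: a class-0 cluster near 0, a class-1 cluster near 10, and one
# class-1 sample (0.5,) sitting inside the class-0 cluster.
X = [(0.0,), (0.2,), (0.4,), (0.5,), (10.0,), (10.2,), (10.4,), (10.6,)]
y = [0, 0, 0, 1, 1, 1, 1, 1]

X_res, y_res = edit_nearest_neighbours(X, y, target_classes={1})
print(sorted(Counter(y_res).items()))  # [(0, 3), (1, 4)]
```

Only the class-1 sample surrounded by class-0 points is dropped; class 0 is left untouched because it is not targeted.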
+ Below is an example of the implementation::

>>> sorted(Counter(y).items())
[(0, 64), (1, 262), (2, 4674)]
@@ -265,12 +274,12 @@ criterion is not fulfilled, the sample is removed::
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 213), (2, 4568)]

- Two selection criteria are currently available: (i) the majority (i.e.,
- ``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
- nearest-neighbors have to belong to the same class than the sample inspected to
- keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
- conservative than `kind_sel='mode'`, and more samples will be excluded in
- the former strategy than the latest::
+
+ To paraphrase step 3, :class:`EditedNearestNeighbours` retains an observation from
+ the majority class when **most** or **all** of its neighbours are from the same class.
+ To control this behaviour, set ``kind_sel='mode'`` or ``kind_sel='all'``,
+ respectively. Hence, ``kind_sel='all'`` is less conservative than ``kind_sel='mode'``
+ and removes more samples::

>>> enn = EditedNearestNeighbours(kind_sel="all")
>>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -281,9 +290,11 @@ the former strategy than the latest::
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 234), (2, 4666)]
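The difference between the two selection criteria can be seen on the neighbour labels of a single sample. The sketch below is illustrative only: `keep_sample` is a hypothetical helper, not part of the library, and how the library breaks ties under the majority rule is not covered here.

```python
from collections import Counter


def keep_sample(label, neighbour_labels, kind_sel="all"):
    """Hypothetical helper mirroring the two selection criteria."""
    if kind_sel == "all":
        # Every neighbour must share the sample's class.
        return all(n == label for n in neighbour_labels)
    if kind_sel == "mode":
        # The most frequent neighbour class must match the sample's class.
        most_common_label, _ = Counter(neighbour_labels).most_common(1)[0]
        return most_common_label == label
    raise ValueError(f"unknown kind_sel: {kind_sel!r}")


# Labels of the 3 nearest neighbours of a class-2 sample.
neighbours = [2, 2, 1]
print(keep_sample(2, neighbours, kind_sel="all"))   # False: one neighbour disagrees
print(keep_sample(2, neighbours, kind_sel="mode"))  # True: the majority agrees
```

A sample with a single dissenting neighbour survives under the majority rule but not under the strict rule, which is why the strict rule removes more samples.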

- The parameter ``n_neighbors`` allows to give a classifier subclassed from
- ``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
- the decision to keep a given sample or not.
+ The parameter ``n_neighbors`` accepts an integer, which sets the number of
+ neighbours to examine for each sample. It can also take a classifier subclassed from
+ ``KNeighborsMixin`` from scikit-learn. Note that if you pass a 3-KNN classifier,
+ only 2 neighbours will be examined for the cleaning, as the
+ third sample is the one being examined for exclusion.
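One way to picture why a 3-KNN ends up examining only 2 neighbours: if the neighbour search runs over the same samples it was fitted on, each sample is returned as its own closest match. The snippet below is a plain-Python illustration of that effect, assuming Euclidean distance; `k_nearest_including_self` is a hypothetical helper and does not reproduce the library's internals.

```python
from math import dist


def k_nearest_including_self(X, i, k):
    """Indices of the k samples closest to X[i], with X[i] itself among them."""
    return sorted(range(len(X)), key=lambda j: dist(X[i], X[j]))[:k]


X = [(0.0,), (1.0,), (2.5,), (4.5,)]

hits = k_nearest_including_self(X, i=0, k=3)
# The query sample is at distance 0 from itself, so it occupies one slot.
neighbours = [j for j in hits if j != 0]

print(hits)        # [0, 1, 2]
print(neighbours)  # [1, 2]
```

With `k=3`, one of the three slots is taken by the sample under examination, leaving two genuine neighbours for the cleaning decision.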

:class:`RepeatedEditedNearestNeighbours` extends
:class:`EditedNearestNeighbours` by repeating the algorithm multiple times