@@ -237,14 +237,23 @@ figure illustrates this behaviour.

.. _edited_nearest_neighbors:

- Edited data set using nearest neighbours
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Editing data set using nearest neighbours
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- :class:`EditedNearestNeighbours` applies a nearest-neighbors algorithm and
- "edit" the dataset by removing samples which do not agree "enough" with their
- neighboorhood :cite:`wilson1972asymptotic`. For each sample in the class to be
- under-sampled, the nearest-neighbours are computed and if the selection
- criterion is not fulfilled, the sample is removed::
+ Edited nearest neighbours
+ ~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ The edited nearest neighbours methodology uses KNN to identify the neighbours of the
+ targeted class samples, and then removes observations if any or most of their
+ neighbours are from a different class :cite:`wilson1972asymptotic`.
+
+ :class:`EditedNearestNeighbours` carries out the following steps:
+
+ 1. Train a KNN using the entire dataset (typically a 3-KNN).
+ 2. Find the 3 closest neighbours of each observation (only for the targeted classes).
+ 3. Remove an observation if any or most of its neighbours belong to a different class.
+
+ Below is an example of its use::

  >>> sorted(Counter(y).items())
  [(0, 64), (1, 262), (2, 4674)]
@@ -254,12 +263,12 @@ criterion is not fulfilled, the sample is removed::
  >>> print(sorted(Counter(y_resampled).items()))
  [(0, 64), (1, 213), (2, 4568)]

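The three steps can be sketched in plain Python. This is only a brute-force illustration of the idea, not the code used by :class:`EditedNearestNeighbours`; the function name ``enn_keep_indices`` and the toy data are hypothetical, and the sketch assumes the strict rule in which a single disagreeing neighbour is enough to remove a sample.

```python
from math import dist


def enn_keep_indices(X, y, target_classes, n_neighbors=3):
    """Indices kept by one editing pass using the strict ("all") rule."""
    kept = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi not in target_classes:
            kept.append(i)  # only the targeted classes are edited
            continue
        # Steps 1-2: brute-force search for the closest *other* samples.
        ranked = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: dist(xi, X[j]),
        )
        neighbour_labels = [y[j] for j in ranked[:n_neighbors]]
        # Step 3: drop the sample as soon as one neighbour disagrees.
        if all(label == yi for label in neighbour_labels):
            kept.append(i)
    return kept


# Hypothetical 1-D toy data: class 1 is targeted, class 0 is left untouched.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (3.4,), (6.0,), (6.5,), (5.8,)]
y = [1, 1, 1, 1, 0, 0, 0, 1]
kept = enn_keep_indices(X, y, target_classes={1})
print(kept)
```

In this toy set, samples 2, 3 and 7 each have at least one class-0 sample among their 3 closest neighbours, so they are the ones removed.
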
- Two selection criteria are currently available: (i) the majority (i.e.,
- ``kind_sel='mode'``) or (ii) all (i.e., ``kind_sel='all'``) the
- nearest-neighbors have to belong to the same class than the sample inspected to
- keep it in the dataset. Thus, it implies that `kind_sel='all'` will be less
- conservative than `kind_sel='mode'`, and more samples will be excluded in
- the former strategy than the latest::
+
+ To paraphrase step 3, :class:`EditedNearestNeighbours` retains an observation from
+ a targeted class when **most** or **all** of its neighbours are from the same class.
+ To control this behaviour, set ``kind_sel='mode'`` or ``kind_sel='all'``,
+ respectively. Hence, ``kind_sel='all'`` is less conservative than ``kind_sel='mode'``
+ and removes more samples::

  >>> enn = EditedNearestNeighbours(kind_sel="all")
  >>> X_resampled, y_resampled = enn.fit_resample(X, y)
@@ -270,9 +279,11 @@ the former strategy than the latest::
  >>> print(sorted(Counter(y_resampled).items()))
  [(0, 64), (1, 234), (2, 4666)]

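The difference between the two criteria can be seen on a single neighbourhood. The helper ``keep_sample`` below is a hypothetical illustration, not part of imbalanced-learn, and its tie-breaking for ``"mode"`` (first label encountered wins) is an assumption that may differ from the library.

```python
from collections import Counter


def keep_sample(label, neighbour_labels, kind_sel="all"):
    """Decide whether a sample survives, given its neighbours' labels."""
    if kind_sel == "all":
        # Every neighbour must share the sample's class.
        return all(n == label for n in neighbour_labels)
    if kind_sel == "mode":
        # The most frequent neighbour class must be the sample's class.
        most_common_label, _ = Counter(neighbour_labels).most_common(1)[0]
        return most_common_label == label
    raise ValueError(f"unknown kind_sel: {kind_sel!r}")


# A class-2 sample whose 3 neighbours are two class-2 and one class-1 sample.
neighbour_labels = [2, 2, 1]
print(keep_sample(2, neighbour_labels, kind_sel="mode"))  # True: kept
print(keep_sample(2, neighbour_labels, kind_sel="all"))   # False: removed
```

Any sample kept under ``"all"`` is also kept under ``"mode"``, which is why ``"all"`` never removes fewer samples.
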
- The parameter ``n_neighbors`` allows to give a classifier subclassed from
- ``KNeighborsMixin`` from scikit-learn to find the nearest neighbors and make
- the decision to keep a given sample or not.
+ The parameter ``n_neighbors`` accepts an integer, which is the number of
+ neighbours to examine for each sample. It can also take a classifier subclassed from
+ ``KNeighborsMixin`` from scikit-learn. Note that if you pass a 3-KNN classifier,
+ only 2 neighbours will be examined for the cleaning, because the third sample is
+ the one being examined for exclusion.

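The reason a 3-KNN leaves only 2 neighbours to examine is that a sample is always its own closest point when the search runs over the data it belongs to. The brute-force ``kneighbors`` function below is a hypothetical stand-in written for this illustration; it is not scikit-learn's method of the same name.

```python
from math import dist


def kneighbors(X, query, k):
    """Indices of the k points in X closest to `query` (brute force)."""
    return sorted(range(len(X)), key=lambda j: dist(query, X[j]))[:k]


# Hypothetical 1-D toy data.
X = [(0.0,), (1.0,), (2.5,), (4.5,)]
i = 1
found = kneighbors(X, X[i], k=3)        # the sample itself comes first
others = [j for j in found if j != i]   # what is left to vote on
print(found, others)
```

With ``k=3`` the query returns the sample plus two other points, so only those two can take part in the decision.
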
:class:`RepeatedEditedNearestNeighbours` extends
:class:`EditedNearestNeighbours` by repeating the algorithm multiple times