@@ -204,38 +204,49 @@ affected by noise due to the first step sample selection.
Cleaning under-sampling techniques
----------------------------------

- Cleaning under-sampling techniques do not allow to specify the number of
- samples to have in each class. In fact, each algorithm implement an heuristic
- which will clean the dataset.
+ Cleaning under-sampling methods "clean" the feature space by removing
+ either "noisy" observations or observations that are "too easy to classify", depending
+ on the method. The final number of observations in each targeted class varies with the
+ cleaning method and cannot be specified by the user.

.. _tomek_links:

Tomek's links
^^^^^^^^^^^^^

- :class:`TomekLinks` detects the so-called Tomek's links :cite:`tomek1976two`. A
- Tomek's link between two samples of different class :math:`x` and :math:`y` is
- defined such that for any sample :math:`z`:
+ A Tomek's link exists when two samples from different classes are each other's
+ nearest neighbors.
+
+ Mathematically, two samples :math:`x` and :math:`y` from different classes form a
+ Tomek's link if, for any other sample :math:`z`:

.. math::

    d(x, y) < d(x, z) \text{ and } d(x, y) < d(y, z)

- where :math:`d(.)` is the distance between the two samples. In some other
- words, a Tomek's link exist if the two samples are the nearest neighbors of
- each other. In the figure below, a Tomek's link is illustrated by highlighting
- the samples of interest in green.
+ where :math:`d(.)` is the distance between two samples.
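The definition above can be sketched in a few lines of plain Python. This is only an illustration of the mutual-nearest-neighbor condition, not the implementation used by :class:`TomekLinks`; the helper names ``nearest_neighbor`` and ``find_tomek_links`` and the toy data are assumptions made for this example, and the Euclidean distance is assumed for :math:`d`.

```python
import math


def nearest_neighbor(i, X):
    """Return the index of the sample closest to X[i], excluding i itself.

    Hypothetical helper: ties are resolved by taking the lowest index,
    which is a simplification of the strict inequality in the definition.
    """
    others = (j for j in range(len(X)) if j != i)
    return min(others, key=lambda j: math.dist(X[i], X[j]))


def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form a Tomek's link.

    A pair qualifies when the two samples have different labels and
    each one is the nearest neighbor of the other.
    """
    links = []
    for i in range(len(X)):
        j = nearest_neighbor(i, X)
        if i < j and y[i] != y[j] and nearest_neighbor(j, X) == i:
            links.append((i, j))
    return links


# Toy data: samples 0 and 1 are very close and belong to different classes.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.2), (3.0, 3.0)]
y = [0, 1, 0, 0, 1]

tomek_links = find_tomek_links(X, y)
print(tomek_links)
```

Sample 4 is not part of a link even though its nearest neighbor (sample 3) belongs to the other class, because sample 3 is itself closer to sample 2: the relation has to hold in both directions.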
+
+ :class:`TomekLinks` detects and removes Tomek's links :cite:`tomek1976two`. The
+ underlying idea is that the samples forming a Tomek's link are noisy or hard to
+ classify and would not help the algorithm find a suitable discrimination boundary.
+
+ In the following figure, a Tomek's link between an observation of class :math:`+` and
+ an observation of class :math:`-` is highlighted in green:

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png
   :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
   :scale: 60
   :align: center

- The parameter ``sampling_strategy`` control which sample of the link will be
- removed. For instance, the default (i.e., ``sampling_strategy='auto'``) will
- remove the sample from the majority class. Both samples from the majority and
- minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
- figure illustrates this behaviour.
+ When :class:`TomekLinks` finds a Tomek's link, it can remove either the sample from
+ the majority class only or both samples. The parameter ``sampling_strategy`` controls
+ which samples of the link are removed. By default (i.e.,
+ ``sampling_strategy='auto'``), only the sample from the majority class is removed.
+ Setting ``sampling_strategy`` to ``'all'`` removes both samples, that is, the one from
+ the majority class and the one from the minority class.
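The two removal modes can be mimicked with a small, self-contained sketch. It assumes a binary problem and the Euclidean distance, and it is not the library's code: ``tomek_link_pairs`` and ``clean_tomek_links`` are hypothetical helpers written for this illustration, and their ``sampling_strategy`` argument only imitates the two values discussed above.

```python
import math
from collections import Counter


def tomek_link_pairs(X, y):
    """Hypothetical helper: mutual nearest neighbors with different labels."""
    def nn(i):
        others = (j for j in range(len(X)) if j != i)
        return min(others, key=lambda j: math.dist(X[i], X[j]))

    pairs = []
    for i in range(len(X)):
        j = nn(i)
        if i < j and y[i] != y[j] and nn(j) == i:
            pairs.append((i, j))
    return pairs


def clean_tomek_links(X, y, sampling_strategy="auto"):
    """Drop samples that belong to a Tomek's link (binary case only).

    "auto": drop only the majority-class sample of each link.
    "all":  drop both samples of each link.
    """
    majority = Counter(y).most_common(1)[0][0]
    drop = set()
    for pair in tomek_link_pairs(X, y):
        for k in pair:
            if sampling_strategy == "all" or y[k] == majority:
                drop.add(k)
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]


# Toy data: class 0 is the majority; samples 0 and 1 form the only link.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.2), (3.0, 3.0)]
y = [0, 1, 0, 0, 1]

X_auto, y_auto = clean_tomek_links(X, y, sampling_strategy="auto")
X_all, y_all = clean_tomek_links(X, y, sampling_strategy="all")
print(len(y_auto), len(y_all))
```

With the library itself, the corresponding call would be along the lines of ``TomekLinks(sampling_strategy='all').fit_resample(X, y)``.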
+
+ The following figure illustrates this behaviour: on the left, only the sample from the
+ majority class is removed, whereas on the right, the entire Tomek's link is removed.

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png
   :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html