@@ -197,38 +197,49 @@ affected by noise due to the first step sample selection.
Cleaning under-sampling techniques
----------------------------------

- Cleaning under-sampling techniques do not allow to specify the number of
- samples to have in each class. In fact, each algorithm implement an heuristic
- which will clean the dataset.
+ Cleaning under-sampling methods "clean" the feature space by removing
+ observations that are either "noisy" or "too easy to classify", depending on the
+ method. The final number of observations in each targeted class varies with the
+ cleaning method and can't be specified by the user.

.. _tomek_links:

Tomek's links
^^^^^^^^^^^^^

- :class:`TomekLinks` detects the so-called Tomek's links :cite:`tomek1976two`. A
- Tomek's link between two samples of different class :math:`x` and :math:`y` is
- defined such that for any sample :math:`z`:
+ A Tomek's link exists when two samples from different classes are each other's
+ nearest neighbors.
+
+ Mathematically, a Tomek's link between two samples from different classes, :math:`x`
+ and :math:`y`, is defined such that for any other sample :math:`z`:

.. math::

   d(x, y) < d(x, z) \text{ and } d(x, y) < d(y, z)

- where :math:`d(.)` is the distance between the two samples. In some other
- words, a Tomek's link exist if the two samples are the nearest neighbors of
- each other. In the figure below, a Tomek's link is illustrated by highlighting
- the samples of interest in green.
+ where :math:`d(.)` is the distance between the two samples.
+
+ :class:`TomekLinks` detects and removes Tomek's links :cite:`tomek1976two`. The
+ underlying idea is that the samples in a Tomek's link are noisy or hard-to-classify
+ observations that would not help the algorithm find a suitable discrimination boundary.
+
+ In the following figure, a Tomek's link between an observation of class :math:`+` and
+ an observation of class :math:`-` is highlighted in green:

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png
   :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
   :scale: 60
   :align: center
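
The mutual-nearest-neighbour definition of a Tomek's link can be sketched in a few lines of plain Python. This is only an illustration of the definition, not the code used by ``TomekLinks``: the helper names ``nearest_neighbor`` and ``find_tomek_links``, the toy data, and the choice of Euclidean distance are all assumptions made for the example.

```python
from math import dist


def nearest_neighbor(X, i):
    """Return the index of the sample closest to X[i], excluding i itself."""
    others = (j for j in range(len(X)) if j != i)
    return min(others, key=lambda j: dist(X[i], X[j]))


def find_tomek_links(X, y):
    """Return pairs (i, j), i < j, that are mutual nearest neighbors
    and carry different class labels."""
    nn = [nearest_neighbor(X, i) for i in range(len(X))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


# Toy data (assumed for illustration): two "-" samples on the left, one "-"
# sample sitting right next to a "+" sample, and two "+" samples on the right.
X = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.2, 0.0), (6.0, 0.0), (6.5, 0.0)]
y = ["-", "-", "-", "+", "+", "+"]

links = find_tomek_links(X, y)
print(links)
```

Samples 0 and 1 are also mutual nearest neighbors, as are samples 4 and 5, but each of those pairs shares a class label, so only the pair formed by samples 2 and 3 qualifies as a Tomek's link.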

- The parameter ``sampling_strategy`` control which sample of the link will be
- removed. For instance, the default (i.e., ``sampling_strategy='auto'``) will
- remove the sample from the majority class. Both samples from the majority and
- minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
- figure illustrates this behaviour.
+ When :class:`TomekLinks` finds a Tomek's link, it can remove either the sample from the
+ majority class or both samples. The parameter ``sampling_strategy`` controls which
+ samples of the link are removed. By default (i.e., ``sampling_strategy='auto'``), only
+ the sample from the majority class is removed. Both samples, the one from the majority
+ class and the one from the minority class, can be removed by setting
+ ``sampling_strategy`` to ``'all'``.
+
+ The following figure illustrates this behaviour: on the left, only the sample from the
+ majority class is removed, whereas on the right, the entire Tomek's link is removed.

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png
   :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
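
The two removal behaviours can also be mimicked on a toy example. The sketch below is a plain-Python illustration under stated assumptions, not the library's implementation: ``remove_tomek_links`` is a hypothetical helper, the labels and the link are written by hand, and the majority class is simply taken to be the most frequent label.

```python
from collections import Counter


def remove_tomek_links(y, links, strategy="auto"):
    """Return the sorted indices kept after cleaning the given Tomek's links.

    strategy="auto": drop only the link member from the majority class.
    strategy="all":  drop both members of every link.
    """
    majority = Counter(y).most_common(1)[0][0]
    to_drop = set()
    for i, j in links:
        for k in (i, j):
            if strategy == "all" or y[k] == majority:
                to_drop.add(k)
    return [k for k in range(len(y)) if k not in to_drop]


# Assumed toy labels: "-" is the majority class; samples 3 and 4 form a link.
y = ["-", "-", "-", "-", "+", "+"]
links = [(3, 4)]

kept_auto = remove_tomek_links(y, links, strategy="auto")
kept_all = remove_tomek_links(y, links, strategy="all")
print(kept_auto)
print(kept_all)
```

With imbalanced-learn installed, the corresponding calls would be ``TomekLinks().fit_resample(X, y)`` for the default behaviour and ``TomekLinks(sampling_strategy='all').fit_resample(X, y)`` to remove both samples of each link, with ``TomekLinks`` imported from ``imblearn.under_sampling``.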