Commit 360a8ee

committed
update tomeklinks docs
1 parent a8e44ae commit 360a8ee

1 file changed

+26
-15
lines changed

doc/under_sampling.rst

Lines changed: 26 additions & 15 deletions
@@ -197,38 +197,49 @@ affected by noise due to the first step sample selection.
197197
Cleaning under-sampling techniques
198198
----------------------------------
199199

200-
Cleaning under-sampling techniques do not allow to specify the number of
201-
samples to have in each class. In fact, each algorithm implement an heuristic
202-
which will clean the dataset.
200+
Cleaning under-sampling methods "clean" the feature space by removing
201+
either "noisy" observations or observations that are "too easy to classify", depending on the
202+
method. The final number of observations in each targeted class varies with the
203+
cleaning method and can't be specified by the user.
203204

204205
.. _tomek_links:
205206

206207
Tomek's links
207208
^^^^^^^^^^^^^
208209

209-
:class:`TomekLinks` detects the so-called Tomek's links :cite:`tomek1976two`. A
210-
Tomek's link between two samples of different class :math:`x` and :math:`y` is
211-
defined such that for any sample :math:`z`:
210+
A Tomek's link exists when two samples from different classes are the nearest neighbors of
211+
each other.
212+
213+
Mathematically, a Tomek's link between two samples from different classes :math:`x`
214+
and :math:`y` is defined such that for any sample :math:`z`:
212215

213216
.. math::
214217
215218
d(x, y) < d(x, z) \text{ and } d(x, y) < d(y, z)
216219
217-
where :math:`d(.)` is the distance between the two samples. In some other
218-
words, a Tomek's link exist if the two samples are the nearest neighbors of
219-
each other. In the figure below, a Tomek's link is illustrated by highlighting
220-
the samples of interest in green.
220+
where :math:`d(.)` is the distance between the two samples.
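
The definition above can be checked directly on a small dataset. The following is
a minimal sketch, not the implementation used by :class:`TomekLinks`: it assumes a
Euclidean distance, the helper name ``tomek_links`` and the toy data are
hypothetical, and ties between equally distant neighbors are resolved towards the
lowest index.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def tomek_links(X, y):
    """Return index pairs (i, j), with i < j, that form a Tomek's link."""

    def nearest(i):
        # Index of the sample closest to X[i], excluding X[i] itself.
        return min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: dist(X[i], X[k]),
        )

    nn = [nearest(i) for i in range(len(X))]
    # A link requires mutual nearest neighbors with different class labels.
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


# Toy data (illustrative only): samples 2 and 3 sit next to each other
# but carry different labels.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.2), (3.0, 3.0), (3.3, 3.0)]
y = [0, 0, 0, 1, 1, 1]

links = tomek_links(X, y)
print(links)
```

Samples 0 and 1, and samples 4 and 5, are also mutual nearest neighbors, but they
share a class and therefore do not form a Tomek's link.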
221+
222+
:class:`TomekLinks` detects and removes Tomek's links :cite:`tomek1976two`. The
223+
underlying idea is that Tomek's links are noisy or hard-to-classify observations and
224+
would not help the algorithm find a suitable discrimination boundary.
225+
226+
In the following figure, a Tomek's link between an observation of class :math:`+` and
227+
class :math:`-` is highlighted in green:
221228

222229
.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png
223230
:target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
224231
:scale: 60
225232
:align: center
226233

227-
The parameter ``sampling_strategy`` control which sample of the link will be
228-
removed. For instance, the default (i.e., ``sampling_strategy='auto'``) will
229-
remove the sample from the majority class. Both samples from the majority and
230-
minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
231-
figure illustrates this behaviour.
234+
When :class:`TomekLinks` finds a Tomek's link, it can remove either the sample of the
235+
majority class or both samples. The parameter ``sampling_strategy`` controls which samples
236+
from the link will be removed. By default (i.e., ``sampling_strategy='auto'``), it will
237+
remove the sample from the majority class. Both samples, that is, the one from the majority
238+
and the one from the minority class, can be removed by setting ``sampling_strategy`` to
239+
``'all'``.
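
The two behaviours described above can be sketched in plain Python. This is an
illustration under stated assumptions, not the code of :class:`TomekLinks`: the
helper ``drop_tomek_links`` and the toy data are hypothetical, the problem is
binary, and the distance is Euclidean.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def drop_tomek_links(X, y, strategy="auto"):
    """Return the indices kept after cleaning (hypothetical helper)."""
    n = len(X)
    # Nearest neighbor of every sample, excluding the sample itself.
    nn = [
        min((k for k in range(n) if k != i), key=lambda k: dist(X[i], X[k]))
        for i in range(n)
    ]
    # A sample is part of a Tomek's link when it and its nearest neighbor
    # are mutual nearest neighbors with different labels.
    in_link = [nn[nn[i]] == i and y[i] != y[nn[i]] for i in range(n)]

    majority = Counter(y).most_common(1)[0][0]
    # 'all' targets every class; anything else is treated like 'auto' here,
    # which in this binary sketch targets the majority class only.
    targeted = set(y) if strategy == "all" else {majority}
    return [i for i in range(n) if not (in_link[i] and y[i] in targeted)]


# Toy data (illustrative only): class 0 is the majority; samples 2 and 3
# form the only Tomek's link.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.2), (3.0, 3.0)]
y = [0, 0, 0, 1, 1]

kept_auto = drop_tomek_links(X, y, strategy="auto")
kept_all = drop_tomek_links(X, y, strategy="all")
print(kept_auto, kept_all)
```

With ``strategy="auto"`` only sample 2, the majority-class end of the link, is
dropped; with ``strategy="all"`` samples 2 and 3 are both dropped.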
240+
241+
The following figure illustrates this behaviour: on the left, only the sample from the
242+
majority class is removed, whereas on the right, the entire Tomek's link is removed.
232243

233244
.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png
234245
:target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
