Commit c3c150d

solegalli and glemaitre authored
DOC improve TomekLinks documentation (#1020)
Co-authored-by: Guillaume Lemaitre <[email protected]>
1 parent ec27259 commit c3c150d

1 file changed: +26 -15 lines changed
doc/under_sampling.rst

Lines changed: 26 additions & 15 deletions
@@ -204,38 +204,49 @@ affected by noise due to the first step sample selection.
 Cleaning under-sampling techniques
 ----------------------------------
 
-Cleaning under-sampling techniques do not allow to specify the number of
-samples to have in each class. In fact, each algorithm implement an heuristic
-which will clean the dataset.
+Cleaning under-sampling methods "clean" the feature space by removing
+either "noisy" observations or observations that are "too easy to classify", depending
+on the method. The final number of observations in each targeted class varies with the
+cleaning method and cannot be specified by the user.
 
 .. _tomek_links:
 
 Tomek's links
 ^^^^^^^^^^^^^
 
-:class:`TomekLinks` detects the so-called Tomek's links :cite:`tomek1976two`. A
-Tomek's link between two samples of different class :math:`x` and :math:`y` is
-defined such that for any sample :math:`z`:
+A Tomek's link exists when two samples from different classes are closest neighbors to
+each other.
+
+Mathematically, a Tomek's link between two samples from different classes :math:`x`
+and :math:`y` is defined such that for any sample :math:`z`:
 
 .. math::
 
    d(x, y) < d(x, z) \text{ and } d(x, y) < d(y, z)
 
-where :math:`d(.)` is the distance between the two samples. In some other
-words, a Tomek's link exist if the two samples are the nearest neighbors of
-each other. In the figure below, a Tomek's link is illustrated by highlighting
-the samples of interest in green.
+where :math:`d(.)` is the distance between the two samples.
+
+:class:`TomekLinks` detects and removes Tomek's links :cite:`tomek1976two`. The
+underlying idea is that Tomek's links are noisy or hard to classify observations and
+would not help the algorithm find a suitable discrimination boundary.
+
+In the following figure, a Tomek's link between an observation of class :math:`+` and
+class :math:`-` is highlighted in green:
 
 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_001.png
    :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
    :scale: 60
    :align: center
 
-The parameter ``sampling_strategy`` control which sample of the link will be
-removed. For instance, the default (i.e., ``sampling_strategy='auto'``) will
-remove the sample from the majority class. Both samples from the majority and
-minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
-figure illustrates this behaviour.
+When :class:`TomekLinks` finds a Tomek's link, it can either remove the sample of the
+majority class, or both. The parameter ``sampling_strategy`` controls which samples
+from the link will be removed. By default (i.e., ``sampling_strategy='auto'``), it will
+remove the sample from the majority class. Both samples, that is that from the majority
+and the one from the minority class, can be removed by setting ``sampling_strategy`` to
+``'all'``.
+
+The following figure illustrates this behaviour: on the left, only the sample from the
+majority class is removed, whereas on the right, the entire Tomek's link is removed.
 
 .. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png
    :target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
