DOC improve the documentation of CNN and OSS #1025

Merged · 8 commits · Jul 11, 2023
60 changes: 43 additions & 17 deletions doc/under_sampling.rst

.. _condensed_nearest_neighbors:

Condensed nearest neighbors
^^^^^^^^^^^^^^^^^^^^^^^^^^^

:class:`CondensedNearestNeighbour` uses a 1 nearest neighbor rule to
iteratively decide if a sample should be removed
:cite:`hart1968condensed`. The algorithm runs as follows:

1. Get all minority samples in a set :math:`C`.
2. Add a sample from the targeted class (the class to be under-sampled) to
   :math:`C` and all other samples of this class in a set :math:`S`.
3. Train a 1-Nearest Neighbors classifier on :math:`C`.
4. Go through the set :math:`S`, sample by sample, and classify each one
   using the 1 nearest neighbor rule trained in 3.
5. If the sample is misclassified, add it to :math:`C`.
6. Repeat steps 3 to 5 until all observations in :math:`S` have been examined.

The final dataset is :math:`C`, containing all observations from the minority
class and those from the majority that were misclassified by the successive
1-Nearest Neighbors classifiers (a sketch of this loop is given below).
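
A minimal sketch of this loop, for illustration only (it is not the
imbalanced-learn implementation), assuming a binary problem where ``0`` is
the minority class and ``1`` the class to be under-sampled::

  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier

  def condensed_nn(X, y, minority=0, majority=1, seed=0):
      rng = np.random.default_rng(seed)
      # Step 1: C starts with all minority samples.
      C = list(np.flatnonzero(y == minority))
      # Step 2: move one random majority sample into C; the rest form S.
      majority_idx = np.flatnonzero(y == majority)
      first = rng.choice(majority_idx)
      C.append(first)
      S = [i for i in majority_idx if i != first]
      # Step 3: train a 1-nearest neighbor classifier on the current C.
      knn = KNeighborsClassifier(n_neighbors=1).fit(X[C], y[C])
      for i in S:
          # Step 4: classify the next sample of S with the current 1-NN.
          if knn.predict(X[i:i + 1])[0] != y[i]:
              # Steps 5-6: absorb the misclassified sample and retrain.
              C.append(i)
              knn = KNeighborsClassifier(n_neighbors=1).fit(X[C], y[C])
      return X[C], y[C]

Because the classifier is retrained every time :math:`C` grows, the outcome
depends on the order in which the samples of :math:`S` are examined.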

The :class:`CondensedNearestNeighbour` can be used in the following manner::

  >>> from imblearn.under_sampling import CondensedNearestNeighbour
  >>> cnn = CondensedNearestNeighbour(random_state=0)
  >>> X_resampled, y_resampled = cnn.fit_resample(X, y)
  >>> print(sorted(Counter(y_resampled).items()))
  [(0, 64), (1, 24), (2, 115)]

:class:`CondensedNearestNeighbour` is sensitive to noise and may add noisy
samples (see the figure further below).

One Sided Selection
~~~~~~~~~~~~~~~~~~~

In an attempt to remove the noisy observations introduced by
:class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
will first find the observations that are hard to classify, and then will use
:class:`TomekLinks` to remove noisy samples :cite:`hart1968condensed`.
:class:`OneSidedSelection` runs as follows:

1. Get all minority samples in a set :math:`C`.
2. Add a sample from the targeted class (the class to be under-sampled) to
   :math:`C` and all other samples of this class in a set :math:`S`.
3. Train a 1-Nearest Neighbors classifier on :math:`C`.
4. Using the 1 nearest neighbor rule trained in 3, classify all samples in
   set :math:`S`.
5. Add all misclassified samples to :math:`C`.
6. Remove from :math:`C` the majority samples that are part of a Tomek link.

The final dataset is :math:`C`, containing all observations from the minority
class, the majority observations added at random as seeds, and those from the
majority that were misclassified by the 1-Nearest Neighbors classifier, minus
any majority samples removed as part of a Tomek link.

Note that, unlike :class:`CondensedNearestNeighbour`, :class:`OneSidedSelection`
does not retrain a 1-Nearest Neighbors classifier after each misclassified
sample. It uses the 1-Nearest Neighbors from step 3 to classify all majority
samples of :math:`S` in a single pass, as the sketch below illustrates.
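
A minimal sketch of the whole procedure, again for illustration only and not
the imbalanced-learn implementation; the binary-label convention is the same
as above, and the hypothetical ``n_seeds`` argument plays the role of the
``n_seeds_S`` parameter described below::

  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

  def one_sided_selection(X, y, minority=0, majority=1, n_seeds=1, seed=0):
      rng = np.random.default_rng(seed)
      # Steps 1-2: C = all minority samples plus random majority seeds;
      # the remaining majority samples form S.
      minority_idx = np.flatnonzero(y == minority)
      majority_idx = np.flatnonzero(y == majority)
      seeds = rng.choice(majority_idx, size=n_seeds, replace=False)
      S = np.setdiff1d(majority_idx, seeds)
      C = np.concatenate([minority_idx, seeds])
      # Steps 3-5: a single 1-NN, trained once, classifies all of S;
      # the misclassified samples join C.
      knn = KNeighborsClassifier(n_neighbors=1).fit(X[C], y[C])
      C = np.concatenate([C, S[knn.predict(X[S]) != y[S]]])
      # Step 6: drop the majority members of Tomek links, i.e. pairs of
      # mutual nearest neighbors in C that belong to opposite classes.
      nearest = NearestNeighbors(n_neighbors=2).fit(X[C]).kneighbors(
          X[C], return_distance=False)[:, 1]
      keep = [
          idx for pos, idx in enumerate(C)
          if y[idx] == minority
          or not (nearest[nearest[pos]] == pos and y[C[nearest[pos]]] != y[idx])
      ]
      return X[keep], y[keep]

Because the classifier is trained only once, the result does not depend on
the order in which the samples of :math:`S` are examined.
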
The class can be used as::

  >>> from imblearn.under_sampling import OneSidedSelection
  >>> oss = OneSidedSelection(random_state=0)
  >>> X_resampled, y_resampled = oss.fit_resample(X, y)
  >>> print(sorted(Counter(y_resampled).items()))
  [(0, 64), (1, 174), (2, 4404)]

Our implementation offers the possibility to set the number of observations
put at random in the set :math:`C` through the parameter ``n_seeds_S``.
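
For instance, a hypothetical call starting the selection from 10 random
majority seeds (an arbitrary value, for illustration) would read::

  >>> oss = OneSidedSelection(random_state=0, n_seeds_S=10)
  >>> X_resampled, y_resampled = oss.fit_resample(X, y)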

:class:`NeighbourhoodCleaningRule` will focus on cleaning the data rather than
condensing them :cite:`laurikkala2001improving`. Therefore, it will use the