DOC: Note multiset-like behaviour of Index.union for indexes with duplicates #56137

@wence-

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

pandas.Index.union

Documentation problem

An index with non-unique entries can be modelled as a multiset, that is, a pair $\mathcal{A} := (A, m)$ where $A$ is the set of unique entries (the carrier set) and $m : A \to \mathbb{Z}^+$ gives the multiplicity of each entry. As a result of fixing #31326, it was decided to treat the Index.union operation as multiset union: the carrier set of the result is the union of the two carrier sets, and the multiplicity of each entry is the max of its multiplicities in the two inputs (extending each $m$ by zero for values outside its domain).
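
To make the rule concrete, here is a minimal sketch of multiset union using only the standard library's `collections.Counter`. It models the behaviour described above rather than calling pandas, and the helper name `multiset_union` is hypothetical, chosen only for this illustration.

```python
from collections import Counter


def multiset_union(left, right):
    """Hypothetical helper: multiset union of two sequences.

    Each entry's multiplicity in the result is the max of its
    multiplicities in the inputs.
    """
    # Counter's | operator takes the elementwise maximum of the counts;
    # an entry missing from one side counts as zero there.
    merged = Counter(left) | Counter(right)
    # elements() repeats each entry according to its count.
    return sorted(merged.elements())


left = [1, 1, 2]
right = [1, 2, 2, 3]
union_result = multiset_union(left, right)
print(union_result)  # [1, 1, 2, 2, 3]
```

Under the behaviour described in this issue, I would expect `pd.Index(left).union(pd.Index(right))` to contain the same entries with the same multiplicities, although that is not checked here.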

In contrast, all other setops treat indexes with duplicate entries as their carrier sets, so their results contain no repeated entries. Compare this with, for example, the difference of two multisets, which subtracts the multiplicities (so there can still be repeated entries).
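
The gap between the two interpretations can be sketched with the standard library alone. Both helper names below are hypothetical, and the code models the two semantics rather than calling pandas: one function subtracts multiplicities, the other first reduces each input to its carrier set.

```python
from collections import Counter


def multiset_difference(left, right):
    """Hypothetical helper: subtract multiplicities, dropping entries at or below zero."""
    # Counter subtraction keeps only entries whose resulting count is positive.
    return sorted((Counter(left) - Counter(right)).elements())


def carrier_set_difference(left, right):
    """Hypothetical helper: ignore multiplicities and subtract the carrier sets."""
    return sorted(set(left) - set(right))


left = [1, 1, 1, 2, 2]
right = [1, 2, 2]

multiset_result = multiset_difference(left, right)
carrier_result = carrier_set_difference(left, right)
print(multiset_result)  # [1, 1]
print(carrier_result)   # []
```

With these inputs the multiset difference keeps two copies of `1`, while the carrier-set difference is empty. The second result corresponds to the carrier-set behaviour this issue attributes to the setops other than union.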

I suppose it is far too late to change Index.union to also uniquify its result, but it would be useful to document this somewhere.

Suggested fix for documentation

Add some mention of the multiset behaviour of Index.union

Labels: Docs, setops (union, intersection, difference, symmetric_difference)
