DOC: `value_counts` description doesn't match code logic

This issue was created on 2022-09-19.

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

dev docs

last release docs

Documentation problem

value_counts utility incorrectly claims counting "unique rows", respectively "unique combinations".

In reality, it only performs a groupby followed by size; there is no nunique step:

pandas/pandas/core/frame.py

Lines 6468 to 6592 in ca60aab

def value_counts(
    self,
    subset: Sequence[Hashable] | None = None,
    normalize: bool = False,
    sort: bool = True,
    ascending: bool = False,
    dropna: bool = True,
):
    """
    Return a Series containing counts of unique rows in the DataFrame.

    .. versionadded:: 1.1.0

    Parameters
    ----------
    subset : list-like, optional
        Columns to use when counting unique combinations.
    normalize : bool, default False
        Return proportions rather than frequencies.
    sort : bool, default True
        Sort by frequencies.
    ascending : bool, default False
        Sort in ascending order.
    dropna : bool, default True
        Don’t include counts of rows that contain NA values.

        .. versionadded:: 1.3.0

    Returns
    -------
    Series

    See Also
    --------
    Series.value_counts: Equivalent method on Series.

    Notes
    -----
    The returned Series will have a MultiIndex with one level per input
    column. By default, rows that contain any NA values are omitted from
    the result. By default, the resulting Series will be in descending
    order so that the first element is the most frequently-occurring row.

    Examples
    --------
    >>> df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
    ...                    'num_wings': [2, 0, 0, 0]},
    ...                   index=['falcon', 'dog', 'cat', 'ant'])
    >>> df
            num_legs  num_wings
    falcon         2          2
    dog            4          0
    cat            4          0
    ant            6          0

    >>> df.value_counts()
    num_legs  num_wings
    4         0            2
    2         2            1
    6         0            1
    dtype: int64

    >>> df.value_counts(sort=False)
    num_legs  num_wings
    2         2            1
    4         0            2
    6         0            1
    dtype: int64

    >>> df.value_counts(ascending=True)
    num_legs  num_wings
    2         2            1
    6         0            1
    4         0            2
    dtype: int64

    >>> df.value_counts(normalize=True)
    num_legs  num_wings
    4         0            0.50
    2         2            0.25
    6         0            0.25
    dtype: float64

    With `dropna` set to `False` we can also count rows with NA values.

    >>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
    ...                    'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
    >>> df
      first_name middle_name
    0       John       Smith
    1       Anne        <NA>
    2       John        <NA>
    3       Beth      Louise

    >>> df.value_counts()
    first_name  middle_name
    Beth        Louise         1
    John        Smith          1
    dtype: int64

    >>> df.value_counts(dropna=False)
    first_name  middle_name
    Anne        NaN            1
    Beth        Louise         1
    John        Smith          1
                NaN            1
    dtype: int64
    """
    if subset is None:
        subset = self.columns.tolist()

    counts = self.groupby(subset, dropna=dropna).grouper.size()

    if sort:
        counts = counts.sort_values(ascending=ascending)
    if normalize:
        counts /= counts.sum()

    # Force MultiIndex for single column
    if len(subset) == 1:
        counts.index = MultiIndex.from_arrays(
            [counts.index], names=[counts.index.name]
        )

    return counts

Suggested fix for documentation

Align the docstring with the code logic.

I suggest not mentioning "unique rows" or "unique combinations", and perhaps mentioning the equivalence to groupby + size with optional normalization.

I will be happy to submit a PR for that, or to share my two cents in the discussion.
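
As a rough illustration of the equivalence to groupby + size mentioned above, the sketch below compares `DataFrame.value_counts` with an explicit groupby followed by size on the docstring's example frame. It assumes a reasonably recent pandas (1.1 or later). Results are compared via `to_dict()` so that differences in the name or ordering of the returned Series across pandas versions do not matter.

```python
import pandas as pd

df = pd.DataFrame(
    {"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]},
    index=["falcon", "dog", "cat", "ant"],
)

# What value_counts reports: how often each distinct
# (num_legs, num_wings) combination occurs.
counts = df.value_counts(sort=False)

# The same numbers via an explicit groupby followed by size.
sizes = df.groupby(["num_legs", "num_wings"]).size()
assert counts.to_dict() == sizes.to_dict()

# normalize=True divides each count by the total number of counted rows.
proportions = df.value_counts(normalize=True, sort=False)
assert proportions.to_dict() == (sizes / sizes.sum()).to_dict()
```

Neither call computes a distinct count; both return one entry per distinct combination together with its frequency.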

WooilKim wrote this answer on 2022-09-19

Let me work on this issue.

maciejskorski wrote this answer on 2022-09-19

Cool. I am reporting this as I have seen quite a few people confused by the behaviour. Let me know if I can be of any help (review, discuss).

WooilKim wrote this answer on 2022-09-19

@maciejskorski
Thanks.
I'll share the progress soon.

phofl wrote this answer on 2022-09-19

I don't understand the issue here. Why do you think that value_counts does not count unique rows? That this is done via groupby is an implementation detail that should not be visible in the docs.

rhshadrach wrote this answer on 2022-09-19

Thanks for the report! I think we should stick to a description that doesn't rely on other parts of pandas (e.g. groupby) as much as possible. Otherwise it may be difficult for new users to understand what this method should do.

"value_counts utility incorrectly claims counting "unique rows", respectively "unique combinations"."

This is not how I read the docstring, but I can see that it can be interpreted that way! To me, "Return a Series containing counts of unique rows in the DataFrame." means "For each unique row in the DataFrame, report the number of times it occurs".

What about something like:

Return a Series containing the number of times each unique row occurs in the DataFrame.
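
To make the two readings concrete, here is a small sketch (using the docstring's example data, without the index labels) that contrasts the per-row occurrence counts with a "count distinct". `drop_duplicates` is used here only as one possible way to count distinct rows; it is not part of what `value_counts` does.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]})

# Reading 1: for each unique row, the number of times it occurs.
# This is what value_counts returns (one entry per distinct row).
occurrences = df.value_counts()

# Reading 2: "count distinct", i.e. a single number of distinct rows.
n_distinct = len(df.drop_duplicates())

# The occurrence counts add up to the number of rows in the frame,
# while the number of entries equals the distinct-row count.
assert occurrences.sum() == len(df)
assert len(occurrences) == n_distinct
```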

maciejskorski wrote this answer on 2022-09-19

Better, to my taste! This wording more clearly separates “unique row” from “occurrences”.
The original was too close to “unique occurrences”, aka “count distinct”.

Maybe even change "unique row" to "unique row value", to emphasize that row values (tuples) are counted rather than the rows themselves (a row can be seen as unique merely by its index). Also, row value is a well-known term (from SQL standard).

rhshadrach wrote this answer on 2022-09-20

"Also, row value is a well-known term (from SQL standard)."

But I don't believe it is used much in pandas - the two occurrences I see in the docs both explain the term. It seems better to me to indicate the index is ignored rather than using this terminology.
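
A minimal sketch of the "index is ignored" point, assuming a small hypothetical frame in which two differently labelled rows hold the same values:

```python
import pandas as pd

# Two rows with different index labels but identical column values.
df = pd.DataFrame(
    {"num_legs": [4, 4], "num_wings": [0, 0]},
    index=["dog", "cat"],
)

counts = df.value_counts()

# Both rows collapse into a single entry keyed by the values (4, 0);
# the labels "dog" and "cat" do not appear anywhere in the result.
assert counts.to_dict() == {(4, 0): 2}
assert list(counts.index.names) == ["num_legs", "num_wings"]
```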

maciejskorski wrote this answer on 2022-09-20

I don't insist. I acknowledge that pandas is used by people with different backgrounds.

So please read my request as asking to decouple “unique” from “count”; the wording proposed above would do the job.

Sorry for the verbose communication, but I come from a research background where the choice of words means a lot :-)

More Details About Repo
Owner Name pandas-dev
Repo Name pandas
Full Name pandas-dev/pandas
Language Python
Created Date 2010-08-24
Updated Date 2022-09-29
Star Count 35374
Watcher Count 1122
Fork Count 15034
Issue Count 3579