PERF: Series.value_counts() in aggregation function is slower than collections.Counter()

This issue was created on 2022-11-16.

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Series.value_counts() is slower than collections.Counter(), especially when it counts a small number of elements.

Case 1: value_counts() on small iterables

The plot from simulation 1 (see footnote 1) compares the speed of Series.value_counts() and collections.Counter(). I measured the time each takes to count the values of an iterable of size $N$, with $N$ ranging from $10^1$ to $10^7$. Series.value_counts() is the slower of the two when $N < 10^3$.

Source code for simulation 1:

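from collections import Counter

import numpy as np
import pandas as pd


# `timing` is a timing decorator whose definition is not included in this snippet.
@timing
def value_count_collections_x(n, sort=False):
    x = np.random.randint(0, 1000, (n))
    vc = Counter(x)
    if sort:
        # Order by count, descending, to match value_counts(sort=True).
        vc = dict(sorted(vc.items(), key=lambda item: item[1], reverse=True))
    return vc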


@timing
def value_count_series_x(n, sort=False):
    x = np.random.randint(0, 1000, (n))
    vc = pd.Series(x).value_counts(sort=sort)
    return vc


for _ in range(5):
    for n in [10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000]:
        value_count_collections_x(n)
        value_count_series_x(n)
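
The `@timing` decorator used in the snippet is not defined in the issue text. As a minimal sketch, assuming it only reports the wall-clock time of each call, a stand-in could look like the following. The name `timing` comes from the issue; the body and the output format are assumptions, not the author's implementation.

```python
import functools
import time


def timing(func):
    """Hypothetical stand-in: print the wall-clock time of each call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # Report the function name, its positional arguments, and the elapsed time.
        print(f"{func.__name__}{args}: {elapsed:.6f} s")
        return result

    return wrapper
```

Because each decorated function also generates its random input inside the timed region, the reported times include the cost of np.random.randint as well as the counting itself.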

Case 2: calling value_counts() repeatedly in aggregation functions

The same slowdown appears when value_counts() is called inside an aggregation function. I ran a second simulation (see footnote 2) that compares the execution time of Series.value_counts() and collections.Counter() when each is used as the counting method in a groupby aggregation. There, Series.value_counts() is 5 times slower than collections.Counter().

Source code for simulation 2:

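from collections import Counter

import numpy as np
import pandas as pd


# `timing` is a timing decorator whose definition is not included in this snippet.
@timing
def collection_counter(n, sort=False):
    def agg_top20(x):
        # Join the 20 most frequent values of the group into one string.
        counter = Counter(x)
        top20 = dict(counter.most_common(20)).keys()
        return " ".join(map(str, top20))

    # n rows x 100 columns, stacked to long form: each level_0 group holds 100 values.
    df = pd.DataFrame(
        pd.DataFrame(np.random.randint(0, 1000, (n, 100))).stack().reset_index()
    )
    return df.groupby("level_0")[0].agg(agg_top20)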


@timing
def series_value_count(n, sort=False):
    def agg_top20(x):
        counter = x.value_counts()
        top20 = counter.iloc[:20].index
        return " ".join(map(str, top20))

    df = pd.DataFrame(
        pd.DataFrame(np.random.randint(0, 1000, (n, 100))).stack().reset_index()
    )
    return df.groupby("level_0")[0].agg(agg_top20)


for _ in range(5):
    for n in [1000, 2_000, 5_000, 10_000]:
        collection_counter(n)
        series_value_count(n)
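
Each group passed to agg_top20 holds 100 values (one row of the n × 100 array), so Case 2 repeats the small-input regime of Case 1 once per group. A rough way to look at the per-call cost in isolation, assuming a single 100-element group and the standard-library timeit module, is sketched below. The variable names and the repeat count are illustrative choices, and the numbers will vary by machine and pandas version.

```python
import timeit
from collections import Counter

import numpy as np
import pandas as pd

# One group as seen by agg_top20: 100 integers in [0, 1000).
rng = np.random.default_rng(0)
group = pd.Series(rng.integers(0, 1000, size=100))

repeats = 1000
# Time only the counting step, without groupby or data generation.
t_counter = timeit.timeit(lambda: Counter(group).most_common(20), number=repeats)
t_series = timeit.timeit(lambda: group.value_counts().iloc[:20], number=repeats)

print(f"Counter:      {t_counter / repeats * 1e6:.1f} us per call")
print(f"value_counts: {t_series / repeats * 1e6:.1f} us per call")
```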

Is this the expected performance for value_counts()? If not, a performance improvement should be considered.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 91111fd
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-52-generic
Version : #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : 0.8.3
fsspec : 2022.10.0
gcsfs : None
matplotlib : 3.6.0
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Prior Performance

No response

Footnotes

  1. https://gist.github.com/bilzard/558f07a7d13f62b4222a92446704cbe5

  2. https://gist.github.com/bilzard/08eb013b7a65e35e7bc8f0768c4bd147

mroeschke wrote this answer on 2022-11-16

> Is this an expected performance for value_counts()? If it isn't performance improvement should be cared.

Possibly. pandas.Series and collections.Counter are different objects at the end of the day. While each offers value-counting functionality, Series.value_counts also supports multiple counting options, NaN handling, etc. Given that difference in scope, I expect some difference in performance between the two.
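
To make the difference in scope concrete, here is a small sketch of how the two treat missing values and normalization. The input array is made up for illustration; `dropna` and `normalize` are existing parameters of Series.value_counts.

```python
from collections import Counter

import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, np.nan])

# Counter keys on hash and equality. NaN != NaN, and each array element is
# yielded as a distinct scalar object, so the two NaNs get separate keys.
counter = Counter(values)

s = pd.Series(values)
vc_default = s.value_counts()                # NaN dropped by default (dropna=True)
vc_with_nan = s.value_counts(dropna=False)   # both NaNs grouped into one entry
vc_share = s.value_counts(normalize=True)    # relative frequencies instead of counts

print(len(counter), len(vc_default), len(vc_with_nan))
```

This extra handling is part of what a value_counts() call does on every invocation, whereas Counter only performs dictionary updates.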

Thanks for the analysis, but I don't think this is entirely actionable as of now unless you also have profiling results for Series.value_counts and suggestions of bottlenecks where performance could be improved.
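
One way to gather such profiling results is sketched below, assuming the standard-library cProfile and pstats modules and a 100-element Series as the small-input case from the issue. The loop count and the number of rows printed are arbitrary choices, and the output depends on the installed pandas version.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Small input, where the issue reports value_counts() to be slower.
s = pd.Series(np.random.randint(0, 1000, 100))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    s.value_counts()
profiler.disable()

# Collect the 15 entries with the largest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(15)
report = buffer.getvalue()
print(report)
```

Entries that sit high in the cumulative-time ranking but are not part of the hashing step itself (for example object construction or validation) would be candidates to report as bottlenecks.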

bilzard wrote this answer on 2022-11-17
