ENH: Safety net for operations not compatible with a given dtype.

This issue has been created since 2022-09-20.

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could have a safety net that prevents me from performing incorrect operations caused by dtype issues or incompatibilities.

Example:

In [1]: import pandas as pd
In [2]: s = pd.Series([1,3,5,129]).astype("uint8")
   ...: s + s

Out[2]: 
0     2
1     6
2    10
3     2
dtype: uint8

The expected value of 129 + 129 is 258, but the result is 2.
This is because the maximum value of a "uint8" is 255, and 255 + 1 wraps around to zero, so 258 is stored as 258 - 256 = 2.
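A minimal sketch of the wrap-around described above, assuming NumPy-backed integer dtypes. It redoes the addition in a wider dtype to show that the uint8 result is the exact result modulo 256, and uses `np.iinfo` to flag the rows that left the uint8 range. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, 129], dtype="uint8")

# uint8 holds 0..255, so the result is silently reduced modulo 256
wrapped = s + s

# The same addition in a wider dtype gives the mathematically expected values
exact = s.astype("int64") + s.astype("int64")

# The wrapped result is the exact result modulo 256
assert (wrapped.astype("int64") == exact % 256).all()

# Flag the rows whose exact result does not fit into uint8
info = np.iinfo(np.uint8)
overflowed = (exact < info.min) | (exact > info.max)
print(overflowed.tolist())  # [False, False, False, True]
```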

Another kind of misleading behaviour occurs when astype is used carelessly:

In [1]: import pandas as pd
In [2]: s = pd.Series([1,-1,10**5]).astype("int8")
In [3]: s

Out[3]: 
0     1
1    -1
2   -96
dtype: int8

Feature Description

A solution could be to check that an operation is compatible with the dtypes of the columns involved.
Here is an example of what I mean for the + operation.

    @unpack_zerodim_and_defer("__add__")
    def __add__(self, other):
        check_dtype_operation_add(self, other)       # The check function checking that an operation is ok
        return self._arith_method(other, operator.add)

A mock of the check function, covering only an operation between a uint8 Series and a number:

import logging
import numbers

def check_dtype_operation_add(self, other):
    # Mock: only handles a uint8 Series combined with a plain number
    if self.dtype == "uint8" and isinstance(other, numbers.Number):
        logging.warning("warning: you are performing an addition between a uint8 and a number,"
                        "this could result in an overflow if the operation leads to number greater than 255"
        )
        # int() turns the maximum into a Python int, so the check itself cannot wrap around
        if int(self.max()) + other > 255:
            raise Exception(f"adding {self.dtype} with a number which had lead to a memory overflow, please upcast your series dtype")
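The mock above hard-codes the uint8 limit of 255. As a rough sketch of how the same check might generalize to any NumPy integer dtype, the limits can be read from `np.iinfo`. The function name `would_overflow_add` is hypothetical and not part of pandas; the sketch only covers a Series combined with an integer scalar.

```python
import numbers

import numpy as np
import pandas as pd


def would_overflow_add(s: pd.Series, other) -> bool:
    """Hypothetical check: would s + other leave the range of s.dtype?"""
    # Only NumPy integer dtypes and integer scalars are covered by this sketch
    if not isinstance(s.dtype, np.dtype) or s.dtype.kind not in "iu":
        return False
    if not isinstance(other, numbers.Integral) or s.empty:
        return False
    info = np.iinfo(s.dtype)
    # Work with Python ints, which cannot overflow
    highest = int(s.max()) + int(other)
    lowest = int(s.min()) + int(other)
    return highest > info.max or lowest < info.min


s = pd.Series([1, 2, 200], dtype="uint8")
print(would_overflow_add(s, 200))  # True: 200 + 200 exceeds 255
print(would_overflow_add(s, 55))   # False: 200 + 55 is exactly 255
```

Only the minimum and maximum are inspected, so the cost is two reductions over the Series rather than a full element-wise comparison.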

The final behaviour is this one:

  • print a warning (only needed for small dtypes such as int8, uint8, Int8, float16, etc.)
  • raise an exception if the operation leads to an overflow

In [1]: import pandas as pd

In [2]: s = pd.Series([1,2,200]).astype("uint8")

In [3]: s+200
WARNING:root:warning: your are performing an addition between a uint8 and a number,this could result in an overflow if the operation leads to number greater than 255
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In [3], line 1
----> 1 s+200

File /workspaces/pandas/pandas/core/ops/common.py:73, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     69             return NotImplemented
     71 other = item_from_zerodim(other)
---> 73 return method(self, other)

File /workspaces/pandas/pandas/core/arraylike.py:121, in OpsMixin.__add__(self, other)
    119 @unpack_zerodim_and_defer("__add__")
    120 def __add__(self, other):
--> 121     check_dtype_operation_add(self, other)
    122     return self._arith_method(other, operator.add)

File /workspaces/pandas/pandas/core/arraylike.py:43, in check_dtype_operation_add(self, other)
     39 logging.warning("warning: your are performing an addition between a uint8 and a number,"
     40                 "this could result in an overflow if the operation leads to number greater than 255"
     41 )
     42 if self.max() + other > 255:
---> 43     raise Exception(f"adding {self.dtype} with a number which had lead to a memory overflow, please upcast your series dtype")
     44 else:
     45     pass

Exception: adding uint8 with a number which had lead to a memory overflow, please upcast your series dtype

Concerning .astype, in the case of a downcast we would just have to make sure that s.max() and s.min() fit within the range of the smaller dtype.
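The range check described above could look roughly like the following sketch. The helper name `checked_astype` is hypothetical and not part of pandas, and the sketch assumes an integer source Series and a NumPy integer target dtype.

```python
import numpy as np
import pandas as pd


def checked_astype(s: pd.Series, dtype) -> pd.Series:
    """Hypothetical helper: cast to an integer dtype only if every value fits."""
    target = np.dtype(dtype)
    if target.kind in "iu" and len(s) > 0:
        info = np.iinfo(target)
        lowest, highest = s.min(), s.max()
        # Refuse the cast instead of letting the values wrap around
        if lowest < info.min or highest > info.max:
            raise OverflowError(
                f"values in [{lowest}, {highest}] do not fit into {target} "
                f"(range [{info.min}, {info.max}])"
            )
    return s.astype(target)


print(checked_astype(pd.Series([1, -1, 100]), "int8").dtype)  # int8

try:
    checked_astype(pd.Series([1, -1, 10**5]), "int8")
except OverflowError as exc:
    print("refused:", exc)
```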

I guess this would slightly impact performance; an option to deactivate this behaviour (e.g. pd.options.dtype.dtype_checks = False) could solve the issue for users who want maximum performance.

Alternative Solutions

Another solution would be to implicitly upcast the dtype when needed. I would actually prefer that solution; it could be implemented as an option, e.g. pd.options.dtype.automatic_casting = True

In [1]: import pandas as pd
In [2]: s = pd.Series([1,3,5,129]).astype("uint8")
   ...: s + s

Out[2]: 
0      2
1      6
2     10
3    258
dtype: uint16

In [1]: import pandas as pd
In [2]: s = pd.Series([1,3,5,129, 3.2]).astype("uint8")
   ...: s

Out[2]:
0      1.0
1      3.0
2      5.0
3    129.0
4      3.2
dtype: float64
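As a rough illustration of the implicit upcast proposed for s + s, the addition can be done in int64 and the result then stored in the smallest NumPy integer dtype that holds it. The helper name `add_with_upcast` is hypothetical, and the sketch assumes non-empty integer Series whose sums fit into int64.

```python
import numpy as np
import pandas as pd


def add_with_upcast(a: pd.Series, b: pd.Series) -> pd.Series:
    """Hypothetical helper: add two integer Series without wrap-around."""
    # Compute in int64 so that small dtypes cannot overflow
    exact = a.astype("int64") + b.astype("int64")
    # Find a dtype able to hold both the smallest and the largest result
    target = np.result_type(
        np.min_scalar_type(int(exact.min())),
        np.min_scalar_type(int(exact.max())),
    )
    return exact.astype(target)


s = pd.Series([1, 3, 5, 129], dtype="uint8")
result = add_with_upcast(s, s)
print(result.tolist())  # [2, 6, 10, 258]
print(result.dtype)     # uint16
```

The temporary int64 copy costs extra memory, which is the trade-off of this approach compared with checking the range before the operation.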

One could also imagine an option that minimizes memory use, e.g. pd.options.dtype.automatic_casting = "aggressive"

In [1]: import pandas as pd
In [2]: s = pd.Series([1,3,5,129, 3.2]).astype("uint8")
   ...: s

Out[2]:
0      1.000000
1      3.000000
2      5.000000
3    129.000000
4      3.199219
dtype: float16

Additional Context

My experience & why I think something should be done

Although I have known about this pandas behavior since 2014, I remember being fairly surprised the first time it happened to me.
I was working on a 2000-column dataframe with more than 2000 lines of code; my downstream dataframe looked strange, and I spent a great deal of time tracking down the issue.

I am certain that users trying to save a little memory with smaller dtypes have faced the same issue (silently or not), and I am pretty sure that production code is running with bugs caused by it. Moreover, the documentation lacks warnings about it.

Documentation concerning the subject:

I looked at several places to find a warning about this pandas behaviour, and I did not find anything.

API Reference:

It is the same for the Series part of the api reference.

User guide

The dtypes section of the basics guide is fairly complete, but apart from one example that explicitly shows the issue, there is no warning in the whole guide about operations or downcasting that could lead to unintended behaviour.

Functions docstring

I was not able to find any warning in the function docstring about this issue.

phofl wrote this answer on 2022-09-20

Hi, thanks for your report.

pandas does not implement these aggregation methods itself. We are falling back to numpy where this behavior is inherited from:

import numpy as np

na = np.array([129], dtype="uint8")

na + na

returns

[2]

adrienpacifico wrote this answer on 2022-09-20

Hi, thanks for your answer.
As pandas is higher level than numpy, I think it would make sense to protect (or warn) users from operations that would lead to unexpected behavior.

I hope I have shown that pandas could have these protections with quite small modifications to the code base. The fact that this behavior is inherited from numpy does not mean that pandas cannot deal with it, right?

@phofl should I consider that the pandas library is not open to such contributions and will never implement protection against those behaviours?

In any case, would documenting the issue more thoroughly, with warnings in the user guide, API reference, and function docstrings, be an improvement to the current documentation?

mroeschke wrote this answer on 2022-09-21

Improving the docstrings would probably be the better course of action here. Generally, pandas tries to align with numpy semantics unless there are unsupported behaviors in numpy that pandas requires. Deviating from numpy semantics would be too difficult to maintain if numpy decides to change its behaviors.

Side note: the new pyarrow dtypes in 1.5 will raise on overflow errors for addition

In [2]: ser = pd.Series([129], dtype="uint8[pyarrow]")

In [3]: ser
Out[3]:
0   129
dtype: uint8[pyarrow]

In [4]: ser + ser
ArrowInvalid: overflow