BUG: merge_asof unrecoverably loses precision, yet does not support nullable integer columns.

This issue was created on 2022-09-22.

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df_a = pd.DataFrame({
    'a':[0, 1549049940000000000],
    'v2': ['a1', 'a2']
})
df_b = pd.DataFrame({
    'b': [1549049937688215000],
    'v': ['b1']
})
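# The unmatched first row makes 'b' float64 in the merge result,
# so precision is already lost before the cast to Int64.
pd.merge_asof(df_a, df_b, left_on=['a'], right_on=['b']).astype({'b': pd.Int64Dtype()})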

Issue Description

The join result incorrectly ends up with b = 1549049937688215040, rather than the value provided. Note the last 2 digits being 40 rather than 00 like in the input value.

This is because merge_asof casts the column to float64 so that it can represent a "null" value in the row that has no match. A float64 cannot represent every integer above 2**53 exactly, so the cast loses precision and produces incorrect values.
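The precision loss can be reproduced without pandas at all. The following is a minimal sketch using only the Python standard library (the variable names are illustrative, not taken from pandas); it shows that the value from the example does not survive a round trip through a 64-bit float.

```python
import math

original = 1549049937688215000

# Convert to a 64-bit float and back, which is what happens to the
# integer column once a missing value has to be stored alongside it.
as_float = float(original)
round_tripped = int(as_float)

print(round_tripped)             # 1549049937688215040
print(round_tripped - original)  # 40

# Spacing between adjacent float64 values at this magnitude.
print(math.ulp(as_float))        # 256.0
```

At this magnitude neighbouring floats are 256 apart, so the input is rounded to the nearest representable value, which happens to be 40 higher.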

If you remove the first row (where a = 0) in df_a, it returns the correct result as there are no rows that need to be "null".

Sadly you cannot do a merge_asof on nullable types (pd.Int64Dtype() et al) to fix this.

Expected Behavior

import pandas as pd
df_a = pd.DataFrame({
    'a':[0, 1549049940000000000],
    'v2': ['a1', 'a2']
})
df_b = pd.DataFrame({
    'b': [1549049937688215000],
    'v': ['b1']
})
df_a['a_copy'] = df_a['a'].astype(pd.Int64Dtype())
df_b['b_copy'] = df_b['b'].astype(pd.Int64Dtype())

pd.merge_asof(df_a, df_b, left_on=['a'], right_on=['b']).drop(columns=['a', 'b']).rename(columns={'a_copy': 'a', 'b_copy': 'b'})

This returns the correct result, but it is not very nice to use.
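The same workaround could be wrapped in a small helper. This is only a sketch: `merge_asof_exact_right_key` and the temporary column name are hypothetical and not part of pandas, and it assumes the right key is a plain int64 column, as in the example above.

```python
import pandas as pd


def merge_asof_exact_right_key(left, right, left_on, right_on, **kwargs):
    """Hypothetical helper: merge_asof that returns the right key as exact Int64."""
    tmp = "__right_key_exact__"  # hypothetical temporary column name
    right = right.copy()
    # Carry a nullable-integer copy of the key through the merge as a
    # regular (non-key) column, so it is never cast to float.
    right[tmp] = right[right_on].astype("Int64")
    merged = pd.merge_asof(left, right, left_on=left_on, right_on=right_on, **kwargs)
    # Replace the float key column with the exact nullable copy.
    merged[right_on] = merged.pop(tmp)
    return merged


df_a = pd.DataFrame({"a": [0, 1549049940000000000], "v2": ["a1", "a2"]})
df_b = pd.DataFrame({"b": [1549049937688215000], "v": ["b1"]})

result = merge_asof_exact_right_key(df_a, df_b, left_on="a", right_on="b")
print(result)
```

The merge itself still runs on the numpy int64 keys; only the column handed back to the caller is swapped for the nullable copy.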

Installed Versions

1.5.0

phofl wrote this answer on 2022-09-22

Hi, thanks for your report. We are not casting from numpy dtypes to nullable dtypes just because there are NaNs introduced. You have to be explicit there. You can use convert_dtypes to generalise this.

Audrius-GR wrote this answer on 2022-09-22

Sure, but "We are not casting from numpy dtypes to nullable dtypes just because there are NaNs introduced" is exactly what causes the integers to be converted to floats and lose precision unrecoverably.

Why not convert to pandas nullable types instead and not lose precision?

As I said, there are workarounds. As you pointed out, convert_dtypes is one of them (except you still have to use numpy columns for the merge keys, as it does not support pandas nullable types).

But I feel this is "a trap by default": a user passes in the integer 1549049937688215000 and gets back a float that cannot be converted back to the original value, as casting it to int yields 1549049937688215040.

phofl wrote this answer on 2022-09-22

There is a lot of discussion about this on the issue tracker. For now, it is a deliberate decision not to do that.

Audrius-GR wrote this answer on 2022-09-22

Except that the decision is not consistent.

If there is no "null" row, I get int64s back with the correct values that don't lose precision.
If there is a "null" row, I get floats that have lost precision.
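The inconsistency described above can be observed by comparing the dtype of the result with and without an unmatched row. This is a sketch reusing the values from the report (the frame names are illustrative); the float64 result for the unmatched case is the behaviour reported in this thread for pandas 1.5.0.

```python
import pandas as pd

right = pd.DataFrame({"b": [1549049937688215000]})

# Left frame whose first row (a = 0) has no earlier match in `right`.
left_with_gap = pd.DataFrame({"a": [0, 1549049940000000000]})
# Left frame in which every row finds a match.
left_no_gap = pd.DataFrame({"a": [1549049940000000000]})

with_gap = pd.merge_asof(left_with_gap, right, left_on="a", right_on="b")
no_gap = pd.merge_asof(left_no_gap, right, left_on="a", right_on="b")

# Per the report: float64 when a row is unmatched, int64 otherwise.
print(with_gap["b"].dtype)
print(no_gap["b"].dtype)
```

The same input value therefore comes back either exact or rounded, depending only on whether some other row happened to be unmatched.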

Audrius-GR wrote this answer on 2022-09-22

and doing:

df_a['a'] = df_a['a'].astype({'a': pd.Int64Dtype()})
df_b['b'] = df_b['b'].astype({'b': pd.Int64Dtype()})

before the merge does not work, and ends up with:

  File "\lib\site-packages\pandas\core\reshape\merge.py", line 1993, in _get_join_indexers
    return func(left_values, right_values, self.allow_exact_matches, tolerance)
  File "pandas\_libs\join.pyx", line 868, in pandas._libs.join.__pyx_fused_cpdef
TypeError: No matching signature found

phofl wrote this answer on 2022-09-22

This is the numpy casting behavior, not pandas.

phofl wrote this answer on 2022-09-22

Merge is not yet fully supported for nullable dtypes. There are open issues about that.
