BUG: bitmasks not supported in interchange/from_dataframe.py

This issue was created on 2022-11-24.

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Sorry, this is not easily reproducible, as the dataframe interchange protocol for pyarrow is still a work in progress, but I think the error is quite clear:

>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3, None]})

>>> exchange_df = table.__dataframe__()

>>> from pandas.core.interchange.from_dataframe import from_dataframe
>>> from_dataframe(exchange_df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 53, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 74, in _from_dataframe
    pandas_df = protocol_df_chunk_to_pandas(chunk)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 122, in protocol_df_chunk_to_pandas
    columns[name], buf = primitive_column_to_ndarray(col)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 160, in primitive_column_to_ndarray
    data = set_nulls(data, col, buffers["validity"])
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 504, in set_nulls
    null_pos = buffer_to_ndarray(valid_buff, valid_dtype, col.offset, col.size)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 395, in buffer_to_ndarray
    raise NotImplementedError(f"Conversion for {dtype} is not yet supported.")
NotImplementedError: Conversion for (<DtypeKind.BOOL: 20>, 1, 'b', '=') is not yet supported.

Issue Description

I am currently working on implementing the dataframe interchange protocol for pyarrow.Table in the Apache Arrow project (apache/arrow#14613).

I am using the pandas implementation to test that the produced __dataframe__ object can be consumed correctly.

When consuming a pyarrow.Table with missing values I get a NotImplementedError. The bitmasks that PyArrow uses to represent nulls in a given column cannot be converted.
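For readers unfamiliar with bit-packed validity masks, here is a minimal pure-Python sketch of the layout involved. It assumes the Arrow convention of one bit per value, least-significant bit first within each byte, with 1 meaning "valid". The helper name `bitmask_to_bools` is hypothetical and is not the pandas function; it only illustrates the idea.

```python
def bitmask_to_bools(bitmask: bytes, length: int, offset: int = 0) -> list:
    """Unpack `length` validity bits starting at bit `offset`.

    Hypothetical helper for illustration: bits are read LSB-first
    within each byte, and a set bit means the value is valid.
    """
    return [
        bool((bitmask[(offset + i) // 8] >> ((offset + i) % 8)) & 1)
        for i in range(length)
    ]


# Column [1, 2, 3, None]: the first three values are valid, the fourth is null,
# so the validity byte is 0b00000111.
validity = bitmask_to_bools(bytes([0b00000111]), length=4)
print(validity)  # [True, True, True, False]
```

The `offset` argument matters for sliced columns, where the first value does not start at bit 0 of the buffer.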

But if I look at the code in from_dataframe.py:

    if bit_width == 1:
        assert length is not None, "`length` must be specified for a bit-mask buffer."
        arr = np.ctypeslib.as_array(data_pointer, shape=(buffer.bufsize,))
        return bitmask_to_bool_ndarray(arr, length, first_byte_offset=offset % 8)
    else:
        return np.ctypeslib.as_array(
            data_pointer, shape=(buffer.bufsize // (bit_width // 8),)
        )
def bitmask_to_bool_ndarray(

I would think this is not intentional, and that the DtypeKind.BOOL entry of _NP_DTYPES should also include 1: bool, because the lookup below returns None for a bit width of 1:

column_dtype = _NP_DTYPES.get(kind, {}).get(bit_width, None)

_NP_DTYPES: dict[DtypeKind, dict[int, Any]] = {
    DtypeKind.INT: {8: np.int8, 16: np.int16, 32: np.int32, 64: np.int64},
    DtypeKind.UINT: {8: np.uint8, 16: np.uint16, 32: np.uint32, 64: np.uint64},
    DtypeKind.FLOAT: {32: np.float32, 64: np.float64},
    DtypeKind.BOOL: {8: bool},
}
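
To make the failing lookup concrete, here is a small standalone sketch. It uses a stand-in `DtypeKind` enum containing only `BOOL` (value 20, as shown in the traceback) and two hypothetical tables, `current` and `proposed`, that mirror only the `BOOL` row of `_NP_DTYPES`. It illustrates the suggestion and is not a patch to pandas.

```python
from enum import IntEnum


class DtypeKind(IntEnum):
    """Stand-in for the protocol enum; only the member needed here."""
    BOOL = 20


# BOOL row as it appears in the issue, and with the suggested 1-bit entry added.
current = {DtypeKind.BOOL: {8: bool}}
proposed = {DtypeKind.BOOL: {1: bool, 8: bool}}


def lookup(table, kind, bit_width):
    # Same two-level .get() pattern as the column_dtype lookup quoted above.
    return table.get(kind, {}).get(bit_width, None)


print(lookup(current, DtypeKind.BOOL, 1))   # None, which leads to NotImplementedError
print(lookup(proposed, DtypeKind.BOOL, 1))  # <class 'bool'>
```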

Expected Behavior

The bitmask should be convertible to an ndarray by the current pandas implementation of the dataframe interchange protocol, so that the code below also works when the table contains missing values:

>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3, 4]})

>>> exchange_df = table.__dataframe__()
>>> exchange_df._df
pyarrow.Table
a: int64
----
a: [[1,2,3,4]]

>>> from pandas.core.interchange.from_dataframe import from_dataframe
>>> from_dataframe(exchange_df)
   a
0  1
1  2
2  3
3  4

Installed Versions

INSTALLED VERSIONS

commit : 87cfe4e
python : 3.9.14.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:46 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.5.0
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.28
pytest : 7.1.3
hypothesis : 6.39.4
sphinx : 4.3.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : 2022.02.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0.dev117+geeca8a4e3.d20221122
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : None
xlwt : None
zstandard : None
tzdata : None

AlenkaF wrote this answer on 2022-11-24
mroeschke wrote this answer on 2022-11-29

Thanks for the report @AlenkaF!

As an aside related to converting a pa.Table to a pd.DataFrame, it would be very interesting to keep the pyarrow types in a pandas DataFrame, since pandas supports a built-in pd.ArrowExtensionArray. Of course, more would need to be refactored on the pandas exchange side to construct a pd.DataFrame using pd.ArrowExtensionArray instead of numpy arrays.

jorisvandenbossche wrote this answer on 2022-11-29

it would be very interesting to keep the pyarrow types in a pandas DataFrame

I think that is something that is fully controlled on the pandas side? PyArrow will just expose the buffers, and it's up to pandas to decide how to reassemble those in arrays (except for boolean dtype, where the spec currently requires a byte array, so that's not possible zero-copy for pyarrow. But that is also something that maybe should be discussed on the Data APIs side to include that in the spec?)
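
As a rough illustration of why a bit-packed boolean column cannot be handed over zero-copy when a byte array is required, the sketch below builds both layouts for the same four values. The helper `pack_bits` is hypothetical and assumes LSB-first bit order; it is only meant to show that the two buffers differ in size and content.

```python
def pack_bits(values):
    """Pack booleans into bytes, one bit per value, LSB-first (hypothetical helper)."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


values = [True, False, True, True]

byte_per_value = bytes(values)  # one byte per boolean
bit_packed = pack_bits(values)  # one bit per boolean

print(len(byte_per_value), len(bit_packed))  # 4 1
```

Because the buffers are laid out differently, going from one to the other always involves a copy.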

mroeschke wrote this answer on 2022-11-29

I think that is something that is fully controlled on the pandas side?

Yeah, agreed. I guess that during consumption of the interchange object by pandas there could be a "mode" that can be configured to say whether to consume the buffers as Arrow objects or NumPy objects.
