BUG: categorical_column_to_series() should not accept only PandasColumn

This issue was created on 2022-11-24.

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pyarrow as pa
import pandas as pd

arr = ["Mon", "Tue", "Mon", "Wed", "Mon", "Thu", "Fri", "Sat", "Sun"]
table = pa.table(
    {"weekday": pa.array(arr).dictionary_encode()}
)
exchange_df = table.__dataframe__()

from pandas.core.interchange.from_dataframe import from_dataframe
from_dataframe(exchange_df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 53, in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy))
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 74, in _from_dataframe
    pandas_df = protocol_df_chunk_to_pandas(chunk)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 124, in protocol_df_chunk_to_pandas
    columns[name], buf = categorical_column_to_series(col)
  File "/Users/alenkafrim/repos/pyarrow-dev-9/lib/python3.9/site-packages/pandas/core/interchange/from_dataframe.py", line 185, in categorical_column_to_series
    assert isinstance(cat_column, PandasColumn), "categories must be a PandasColumn"
AssertionError: categories must be a PandasColumn

Issue Description

I am currently working on implementing the dataframe interchange protocol for pyarrow.Table in the Apache Arrow project (apache/arrow#14613).

I am using the pandas implementation to test that the produced __dataframe__ object can be consumed correctly.

When consuming a pyarrow.Table with a categorical column, I get an error from pandas saying that the categories must be a PandasColumn rather than a general __dataframe__ column as defined by the interchange protocol. There is a check on line 185 for a PandasColumn instance:

def categorical_column_to_series(col: Column) -> tuple[pd.Series, Any]:
    """
    Convert a column holding categorical data to a pandas Series.

    Parameters
    ----------
    col : Column

    Returns
    -------
    tuple
        Tuple of pd.Series holding the data and the memory owner object
        that keeps the memory alive.
    """
    categorical = col.describe_categorical

    if not categorical["is_dictionary"]:
        raise NotImplementedError("Non-dictionary categoricals not supported yet")

    cat_column = categorical["categories"]
    # for mypy/pyright
    assert isinstance(cat_column, PandasColumn), "categories must be a PandasColumn"

Expected Behavior

The categorical_column_to_series() function should accept any column that conforms to the dataframe interchange protocol as the categories of a categorical column, not only a PandasColumn.
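
To illustrate the expected behavior, here is a minimal sketch of a duck-typed check that could stand in for the isinstance(cat_column, PandasColumn) assertion. It is not the pandas implementation or a proposed patch. The helper looks_like_protocol_column, the PROTOCOL_COLUMN_MEMBERS tuple, and the ForeignColumn class are hypothetical names made up for this example; only the member names being checked come from the interchange protocol's Column specification.

```python
from typing import Any

# Members the dataframe interchange protocol specifies for a Column.
# Checking for them is a duck-typing stand-in for
# isinstance(cat_column, PandasColumn). (Assumption: this subset is enough
# for illustration; the protocol defines a few more members.)
PROTOCOL_COLUMN_MEMBERS = (
    "size",
    "offset",
    "dtype",
    "describe_categorical",
    "describe_null",
    "null_count",
    "get_buffers",
)


def looks_like_protocol_column(obj: Any) -> bool:
    """Hypothetical helper: True if obj exposes the protocol's Column members."""
    return all(hasattr(obj, name) for name in PROTOCOL_COLUMN_MEMBERS)


class ForeignColumn:
    """Toy stand-in for a non-pandas column, e.g. one produced by pyarrow."""

    offset = 0
    dtype = None
    describe_categorical = None
    describe_null = None
    null_count = 0

    def size(self) -> int:
        return 0

    def get_buffers(self) -> dict:
        return {}


# A protocol-shaped object passes even though it is not a PandasColumn,
# while an arbitrary object is still rejected.
accepted = looks_like_protocol_column(ForeignColumn())
rejected = looks_like_protocol_column(object())
```

With a check of this kind, a categories column coming from another library would pass as long as it exposes the protocol's members, which is the behavior requested here.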

Installed Versions

INSTALLED VERSIONS

commit : 87cfe4e
python : 3.9.14.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:46 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.5.0
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.28
pytest : 7.1.3
hypothesis : 6.39.4
sphinx : 4.3.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.02.0
gcsfs : 2022.02.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0.dev117+geeca8a4e3.d20221122
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : None
xlwt : None
zstandard : None
tzdata : None

AlenkaF wrote this answer on 2022-11-24
ronwho wrote this answer on 2022-11-29

take

ronwho wrote this answer on 2022-11-30

Hello @AlenkaF ,

I am fairly new to open source contribution, and I am trying to take this task for a school project.

I wanted to reproduce your error; however, I see that to do so, I must use your version of pyarrow.

I tried a couple of things to pip install your pyarrow:

  1. pip3 install git+https://github.com/AlenkaF/[email protected]#subdirectory=python
  2. I also tried cloning your fork and then running pip install . in the python directory.

I always run into the error:

"Could not find a package configuration file provided by "Arrow" with any of
        the following names:
          ArrowConfig.cmake
          arrow-config.cmake"

I tried looking into the pyarrow documentation for building the C++ library on macOS; however, I still could not figure out how to get past the error.

I have a feeling it should not be this difficult to do; perhaps I am missing something? Please let me know :)

Thanks,
Ron

AlenkaF wrote this answer on 2022-12-01

Hi @ronwho ,

thank you for working on this issue!

You are correct. To be able to reproduce the error you will have to work from the branch in my Apache Arrow fork.
I will send you a link to a new branch I will create today (need to change)!

But first you will need to build PyArrow from source, following the Python Development section in our documentation.

You will need to build Arrow C++ and then PyArrow. I suggest doing this on the Apache Arrow master branch first; after that you can check out my branch with the dataframe interchange protocol code.

Hope that makes sense. Feel free to ping me for questions in case you run into any difficulties!

ronwho wrote this answer on 2022-12-01

Hey @AlenkaF,

Thanks for your response, it has really helped me!

After carefully following the Python development guide I was successfully able to build and pip install the package in main. After checking out the ARROW-18152 branch I noticed there was a compile error while compiling Arrow C++, but I saw that you created a new branch "ARROW-18152-second." I tried building that and it built and pip installed successfully!

However, for some reason, when testing the code in Python I am getting an import error from pyarrow.

File "/Users/ron/Projects/open-source/testing-pandas/test-case.py", line 1, in <module>
    import pyarrow as pa
  File "/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib

ImportError: dlopen(/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-darwin.so, 0x0002): Library not loaded: '@rpath/libarrow_dataset.1000.dylib'

  Referenced from: '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-darwin.so'

  Reason: tried: '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/libarrow_dataset.1000.dylib' (no such file), '/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/libarrow_dataset.1000.dylib' (no such file), '/opt/homebrew/lib/libarrow_dataset.1000.dylib' (no such file), '/opt/homebrew/lib/libarrow_dataset.1000.dylib' (no such file), '/usr/local/lib/libarrow_dataset.1000.dylib' (no such file), '/usr/lib/libarrow_dataset.1000.dylib' (no such file)

I also tested by building the main branch, but I run into the same error. I also tried git clean, removing the build folder, and rebuilding. I think I must be doing something wrong. Have you ever encountered an error like this? Please let me know!

Thanks so much again!

AlenkaF wrote this answer on 2022-12-01

Great, you found the new branch! 👍

Can you check whether you built Arrow C++ with -DARROW_DATASET=ON? Can you also check whether you can find libarrow_dataset.* in the folder where the libraries are installed (/Users/ron/Projects/open-source/pyarrow-dev/lib/python3.10/site-packages/pyarrow/)?

From the error I would think that the Arrow C++ dataset package is not installed.
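
The second check can also be scripted. Below is a minimal sketch using only the Python standard library. The helper name find_arrow_dataset_libs is hypothetical, and the demonstration runs against a temporary directory rather than the real install path, so substitute the site-packages/pyarrow folder mentioned above when trying it.

```python
import tempfile
from pathlib import Path


def find_arrow_dataset_libs(lib_dir: str) -> list[str]:
    """Return the names of libarrow_dataset.* files directly inside lib_dir."""
    return sorted(p.name for p in Path(lib_dir).glob("libarrow_dataset.*"))


# Demonstration on a throwaway directory (an assumption for this example,
# not the real install location).
with tempfile.TemporaryDirectory() as tmp:
    # Nothing installed yet: the list is empty.
    missing = find_arrow_dataset_libs(tmp)

    # Simulate an installed dataset library and look again.
    (Path(tmp) / "libarrow_dataset.1000.dylib").touch()
    found = find_arrow_dataset_libs(tmp)
```

An empty list for the real folder would be consistent with the dataset component not having been built or installed.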
