BUG: Inconsistent behaviour reading .tar.gz files for 1.5.0

This issue was created on 2022-09-22.

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df_test = pd.DataFrame([[1,1],[2,2]])
df_test.columns = ["c1", "c2"]

df_test.to_csv("./test.csv.tar.gz", index=False)

Issue Description

Writing the file with the code above under pandas == 1.5.0 and then reading it under an earlier version (<= 1.4.4) via

pd.read_csv("./test_old.csv.tar.gz")

produces

   test.csv.tar.gz   c2
0              1.0  1.0
1              2.0  2.0
2              NaN  NaN

Writing the file with the code above under pandas <= 1.4.4 and then reading it under pandas == 1.5.0 raises the following ReadError:

~/.conda/envs/default/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    209                 else:
    210                     kwargs[new_arg_name] = new_arg_value
--> 211             return func(*args, **kwargs)
    212 
    213         return cast(F, wrapper)

~/.conda/envs/default/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    315                     stacklevel=find_stack_level(inspect.currentframe()),
    316                 )
--> 317             return func(*args, **kwargs)
    318 
    319         return wrapper

~/.conda/envs/default/lib/python3.9/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    948     kwds.update(kwds_defaults)
    949 
--> 950     return _read(filepath_or_buffer, kwds)
    951 
    952 

~/.conda/envs/default/lib/python3.9/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    603 
    604     # Create the parser.
--> 605     parser = TextFileReader(filepath_or_buffer, **kwds)
    606 
    607     if chunksize or iterator:

~/.conda/envs/default/lib/python3.9/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
   1440 
   1441         self.handles: IOHandles | None = None
-> 1442         self._engine = self._make_engine(f, self.engine)
   1443 
   1444     def close(self) -> None:

~/.conda/envs/default/lib/python3.9/site-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine)
   1727                 is_text = False
   1728                 mode = "rb"
-> 1729             self.handles = get_handle(
   1730                 f,
   1731                 mode,

~/.conda/envs/default/lib/python3.9/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    798             compression_args.setdefault("mode", ioargs.mode)
    799             if isinstance(handle, str):
--> 800                 handle = _BytesTarFile(name=handle, **compression_args)
    801             else:
    802                 # error: Argument "fileobj" to "_BytesTarFile" has incompatible

~/.conda/envs/default/lib/python3.9/site-packages/pandas/io/common.py in __init__(self, name, mode, fileobj, archive_name, **kwargs)
    965         # type "Union[ReadBuffer[bytes], WriteBuffer[bytes], None]"; expected
    966         # "Optional[IO[bytes]]"
--> 967         self.buffer = tarfile.TarFile.open(
    968             name=name,
    969             mode=self.extend_mode(mode),

~/.conda/envs/default/lib/python3.9/tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1614                         fileobj.seek(saved_pos)
   1615                     continue
-> 1616             raise ReadError("file could not be opened successfully")
   1617 
   1618         elif ":" in mode:

ReadError: file could not be opened successfully

Expected Behavior

The table should be read correctly as

   c1  c2
0   1   1
1   2   2

Installed Versions

INSTALLED VERSIONS

commit : ca60aab
python : 3.9.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.290-217.505.amzn2.x86_64
Version : #1 SMP Wed Aug 10 09:52:16 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.4 # 1.5.0, 1.1.5 also used for testing
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 60.9.3
pip : 21.2.4
Cython : 0.29.28
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.31.1
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.3.0
gcsfs : None
markupsafe : 2.1.0
matplotlib : 3.5.2
numba : 0.55.2
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

MarcoGorelli wrote this answer on 2022-09-22

Thanks for the report - from bisecting, this was caused by #44787

I don't know anything about tarfile, but from reading there, it looks like the previous behaviour in 1.4.4 was incorrect, as it would produce invalid tar files - in which case, it looks like this is fine?

cc @Skn0tt

wenh06 wrote this answer on 2022-09-22

I might have found the problem. In previous versions, saving to a .tar.gz path actually produced only a gzipped text file, not a tar archive. Only from version 1.5.0 onward does one really get a .tar.gz file.
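
For anyone who wants to check which kind of file they have, here is a minimal sketch using only the standard library. It is not pandas code: the helper names (write_gzip_only, write_real_tar_gz, classify) and the member name "test.csv" are made up for illustration, and write_gzip_only only imitates what the older versions are described as producing.

```python
import gzip
import io
import tarfile

CSV_BYTES = b"c1,c2\n1,1\n2,2\n"


def write_gzip_only(path):
    """Imitate the pre-1.5.0 result described above: gzipped CSV text, no tar container."""
    with gzip.open(path, "wb") as fh:
        fh.write(CSV_BYTES)


def write_real_tar_gz(path, member_name="test.csv"):
    """Write a genuine gzip-compressed tar archive holding one CSV member."""
    info = tarfile.TarInfo(name=member_name)  # member name is a hypothetical choice
    info.size = len(CSV_BYTES)
    with tarfile.open(path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(CSV_BYTES))


def classify(path):
    """Return 'tar', 'gzip', or 'other' based on file content, ignoring the extension."""
    # tarfile.is_tarfile tries plain and compressed tar formats in turn.
    if tarfile.is_tarfile(path):
        return "tar"
    try:
        # Reading one byte forces gzip to validate the header.
        with gzip.open(path, "rb") as fh:
            fh.read(1)
        return "gzip"
    except OSError:
        return "other"


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        old_style = os.path.join(tmp, "old.csv.tar.gz")
        new_style = os.path.join(tmp, "new.csv.tar.gz")
        write_gzip_only(old_style)
        write_real_tar_gz(new_style)
        print(classify(old_style), classify(new_style))
```

If classify reports "gzip" for a file named .tar.gz, then under pandas 1.5.0 passing compression="gzip" explicitly to pd.read_csv should avoid the extension-based tar inference; that is an assumption based on the documented compression parameter and was not tested in this thread.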

MarcoGorelli wrote this answer on 2022-09-22

Thanks for checking - let's close then, it doesn't look like anything needs doing.

