StataReader processes whole file before reading in chunks

This issue was created on 2022-09-21.

I've noticed that when reading large Stata files with the chunksize parameter, the time it takes to create the StataReader object depends on the size of the file. This is surprising, since all of the metadata it needs is contained in the file header, so construction should take roughly the same time regardless of the total file size.

I took a look at the code, and the culprit seems to be the line below that reads the entire file into a BytesIO object before parsing the header. I'm not entirely sure what this accomplishes. Ideally, it would be possible to create the StataReader object after processing just the header portion of the file.

pandas/pandas/io/stata.py

Lines 1167 to 1175 in 71fc89c

with get_handle(
    path_or_buf,
    "rb",
    storage_options=storage_options,
    is_text=False,
    compression=compression,
) as handles:
    # Copy to BytesIO, and ensure no encoding
    self.path_or_buf = BytesIO(handles.handle.read())

twoertwein wrote this answer on 2022-09-22

Is that a behavior change you have noticed since 1.5, or did it also exist in previous versions? I think these particular lines of code have been around since 1.3, but even before that (I think) it had similar logic.

I think the issue is that some IO-like objects are not seekable, but read_stata does a lot of seeking internally (some of the compression IO objects don't support seeking). We might be able to change the above line so that it reads the whole file only if the handle isn't seekable.
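The idea of copying only when the handle cannot seek can be illustrated with the standard library alone. This is a minimal sketch, not pandas code: `NonSeekableStream` and `ensure_seekable` are hypothetical names invented for the illustration, and the stream class merely stands in for something like a pipe or a compression wrapper without seek support.

```python
import io


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for a stream that cannot seek (e.g. a pipe)."""

    def __init__(self, data: bytes) -> None:
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        # Delegate the actual reading to the in-memory buffer.
        return self._inner.readinto(buffer)


def ensure_seekable(handle):
    """Return (stream, copied).

    The handle is returned unchanged if it reports itself as seekable;
    otherwise its full contents are copied into a BytesIO.
    """
    seekable = getattr(handle, "seekable", None)
    if callable(seekable) and seekable():
        return handle, False
    return io.BytesIO(handle.read()), True


payload = b"header" + b"x" * 1000

# A seekable handle is passed through without being read.
direct, copied_direct = ensure_seekable(io.BytesIO(payload))

# A non-seekable handle is read fully so that later seeks work.
buffered, copied_buffered = ensure_seekable(NonSeekableStream(payload))
```

Returning the `copied` flag lets the caller know whether it now owns an extra BytesIO that has to be closed separately from the original handle.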

sterlinm wrote this answer on 2022-09-24

I don't think it changed in 1.5; I had already noticed it with 1.4. I didn't look back to see when it was introduced or whether it has always been there.

I see the uses of seek in parsing the header but it seems like it should be possible to avoid that.

EDIT: Commented too soon. I think the suggestion to skip that when the file is seekable is simpler.

twoertwein wrote this answer on 2022-09-24

Feel free to open a PR!

I think the main change is

self.handles = get_handle(...)
# seekable is a method, so it has to be called
if hasattr(self.handles.handle, "seekable") and self.handles.handle.seekable():
    self.path_or_buf = self.handles.handle
else:
    # Not seekable: copy everything into memory, then close the handle
    with self.handles:
        self.path_or_buf = BytesIO(self.handles.handle.read())

# and then appropriate code to close self.handles (and self.path_or_buf in case of BytesIO)

sterlinm wrote this answer on 2022-09-24

Feel free to open a PR!

I'll give it a shot over the weekend. Thanks!

More Details About Repo
Owner Name pandas-dev
Repo Name pandas
Full Name pandas-dev/pandas
Language Python
Created Date 2010-08-24
Updated Date 2022-09-29
Star Count 35374
Watcher Count 1122
Fork Count 15034
Issue Count 3579
