Potential memory leak in the TensorFlow Swin model on Kaggle!

This issue was created on 2022-07-30.

System Info

Info:

Framework: TensorFlow 2 (Keras)
Version: 2.6
OS: Kaggle

Who can help?

Swin Model Card @amyeroberts
TensorFlow: @Rocketknight1
Vision: @NielsRogge, @sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

In a recent Kaggle competition (hosted by Google), I tried to use the pretrained TF Swin Transformer model from Hugging Face, but even with the base model I consistently received an out-of-memory (OOM) error. Below is the submission status with a base_tf_swin model.

(image: screenshot of the Kaggle submission status)

Some notes:

  • Other frameworks like PyTorch work fine here.
  • Other than this model, much larger models like tf_convnext_xlarge are able to run without OOM.

So I'm assuming there might be a memory leak in the tf_swin implementation. Below is the code I use to build the complete model.

id = "microsoft/swin-base-patch4-window7-224-in22k"

from transformers import AutoFeatureExtractor, TFSwinModel
feature_extractor = AutoFeatureExtractor.from_pretrained(id)
inputs = keras.Input(shape=(None, None, 3), dtype='uint8')
mode_inputs = tf.cast(inputs, tf.float32)

mode_inputs = keras.layers.Resizing(*INPUT_SHAPE)(mode_inputs)
mode_inputs = keras.layers.Rescaling(scale=1.0 / 255)(mode_inputs)
mode_inputs = keras.layers.Normalization(
    mean=feature_extractor.image_mean,
    variance=[x ** 2 for x in feature_extractor.image_std ],
    axis=3
)(mode_inputs)
mode_inputs = keras.layers.Permute(dims=(3, 1, 2))(mode_inputs)

tf_huggingface_module = TFSwinModel.from_pretrained(id)
tf_huggingface_module.trainable = False
logits = tf_huggingface_module(mode_inputs)
adv_logits = keras.Dense(64)(logits.pooler_output)

outputs = keras.layers.Lambda(
    lambda x: tf.math.l2_normalize(x, axis=-1), name='embedding_norm'
)(adv_logits)

tf_huggingface_classifier = keras.Model(inputs, outputs)
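
For completeness, here is a quick sanity check of the graph above (a sketch only; the dummy input size is arbitrary and INPUT_SHAPE is assumed to be defined as in the notebook):

import numpy as np

# Feed one random uint8 image through the model and check the embedding shape.
dummy = np.random.randint(0, 256, size=(1, 300, 400, 3), dtype=np.uint8)
embedding = tf_huggingface_classifier.predict(dummy)
print(embedding.shape)                           # expected: (1, 64)
print(tf_huggingface_classifier.count_params())  # backbone + head parameters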

Expected behavior

It should work like the other models. To reproduce the issue exactly (in the worst case), you may need to run it on the Kaggle platform. The Kaggle submission status (as shown in the screenshot above) is not very descriptive other than just showing the submission status :(. Mainly, I'd like to know what could be causing this and any possible solution.

amyeroberts wrote this answer on 2022-08-04

Hi @innat, thanks for flagging this!

In order to help figure out what's causing the problem and possible solutions, could you please answer the following questions:

  • Could you give the version of transformers you're using and any other relevant packages?
  • Does the notebook run successfully before entering it as a submission? If not, what line of code causes the failure?
  • Could you give details on the checkpoint used for convnext? Can you confirm the convnext model works with the exact same pipeline?
  • When you said other frameworks work fine - can you confirm that you were able to use the equivalent Swin PyTorch model on the same swin checkpoint?

What would help most and answer all of these would be a saved kaggle notebook that you could share.

innat wrote this answer on 2022-08-04

Hello @amyeroberts, thanks for checking. To answer your questions:

Could you give the version of transformers you're using and any other relevant packages?

  1. It can be done; we will share a notebook file. In short:
tf.__version__, tfa.__version__, transformers.__version__
('2.6.4', '0.14.0', '4.22.0.dev0')

Does the notebook run successfully before entering it as a submission? If not, what line of code causes the failure?

  1. I have hardly used Hugging Face vision models before; this is my first look at these vision models, for the current ongoing Kaggle competition.

Could you give details on the checkpoint used for convnext? Can you confirm the convnext model works with the exact same pipeline?

  1. Regarding the ConvNeXt checkpoint, yes, I can give you the exact file and reproducible code (see the sketch at the end of this comment). And I can confirm that the Hugging Face ConvNeXt (the larger one) runs fine, whereas tiny Swin gives OOM.

When you said other frameworks work fine - can you confirm that you were able to use the equivalent Swin PyTorch model on the same swin checkpoint?

  1. I should have elaborated more. I'm not primarily a PyTorch user; other practitioners reported that the Swin PyTorch model works fine.

What would help most and answer all of these would be a saved kaggle notebook that you could share.

Notebook Files

It contains the TensorFlow ConvNeXt and Swin model pipelines and the relevant package versions. The modeling strategy, saving, and submission process follow the competition rules. The evaluation page also describes how both frameworks are evaluated and the expected modeling approach. Hope it helps.
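
To make the comparison concrete, here is a sketch of how the same graph is rebuilt around the ConvNeXt checkpoint (only the checkpoint id and the backbone class change; `inputs` and the preprocessed `mode_inputs` from the Swin snippet above are reused, and the variable names here are illustrative):

from transformers import AutoFeatureExtractor, TFConvNextModel

convnext_id = "facebook/convnext-large-224-22k-1k"
convnext_extractor = AutoFeatureExtractor.from_pretrained(convnext_id)

# Reuse the uint8 input and preprocessing pipeline from the Swin snippet,
# swapping in ConvNeXt's mean/std if they differ from Swin's.
tf_convnext_module = TFConvNextModel.from_pretrained(convnext_id)
tf_convnext_module.trainable = False
logits = tf_convnext_module(mode_inputs)
adv_logits = keras.layers.Dense(64)(logits.pooler_output)
outputs = keras.layers.Lambda(
    lambda x: tf.math.l2_normalize(x, axis=-1), name='embedding_norm'
)(adv_logits)
tf_huggingface_classifier_convnext = keras.Model(inputs, outputs)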

amyeroberts wrote this answer on 2022-08-04

Hi @innat, thank you for all of your detailed responses and for sharing the notebook.

I ran the notebook on Kaggle and was able to save out the model with the checkpoint you used in your first example: "microsoft/swin-base-patch4-window7-224-in22k"

The notebook is here: https://www.kaggle.com/code/aeroberts4444/test-swin-saving/notebook

Are you able to run the notebook you shared on kaggle? Or do you still hit the OOM?

innat wrote this answer on 2022-08-04

@amyeroberts Thanks for running the code.

Yes, if you run the code that I shared, you won't see any OOM effect instantly. As I said, I tried to submit two models from Hugging Face ("microsoft/swin-tiny-patch4-window7-224" and "facebook/convnext-large-224-22k-1k") to this competition.

The ConvNeXt model is comparatively much larger than tiny Swin, but at inference time the submission always exceeds the allowed compute resources for tiny Swin while working fine for the large ConvNeXt model. That's why I have a weak assumption that there may be some issue with the Swin implementation. Also, I later realized that PyTorch practitioners use the timm version of the Swin model, not the Hugging Face one, and no OOM issue has been reported with that.

This competition is unique (no training or test data is provided), so it might be hard to debug the root cause. Please let me know if it's out of scope to address such an issue.

amyeroberts wrote this answer on 2022-08-05

Hi @innat, thanks for clarifying. It's certainly a problem if there's a memory leak, and one we'd want to address. I'm going to continue to look into this. As you said, because of the nature of Kaggle and the competition, it can be hard to debug. As such, it might take some time before I manage to figure out if there's a problem, what it is, and how to solve it.

innat wrote this answer on 2022-08-05

@amyeroberts Thanks for your cordial support. I also informed the competition host (a Googler), HERE, but there has been no response yet.

cc @kfrancischen

amyeroberts wrote this answer on 2022-08-17

Hi @innat. As mentioned above, it's quite hard to debug without knowing what's happening during submission and without logs from the Kaggle notebook. My current best guess is that it's due to the size of the saved Swin model.

Using your script to create and save out a model, I looked at the sizes across different checkpoints:

"microsoft/resnet-50"                              # 23,561,152 params
"google/vit-base-patch16-224-in21k"                # 86,389,248 params
"microsoft/swin-base-patch4-window7-224-in22k"     # 86,743,224 params
"microsoft/swin-tiny-patch4-window7-224"           # 27,519,354 params
"facebook/convnext-large-224-22k-1k"               # 196,230,336 params
tf_hf_classifier_convnext_large_224_22k_1k:
total 25712
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:13 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 13:13 assets
-rw-r--r--   1 amyroberts  staff   510K 10 Aug 13:13 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    12M 10 Aug 13:13 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 13:13 variables

tf_hf_classifier_resnet_50:
total 12048
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 12:51 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:51 assets
-rw-r--r--   1 amyroberts  staff   488K 10 Aug 12:51 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff   5.4M 10 Aug 12:51 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:51 variables

tf_hf_classifier_swin_base_patch4_window7_224_in22k:
total 179216
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:00 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:59 assets
-rw-r--r--   1 amyroberts  staff   7.4M 10 Aug 13:00 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    80M 10 Aug 13:00 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:59 variables

tf_hf_classifier_swin_tiny_patch4_window7_224:
total 83944
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:09 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 13:09 assets
-rw-r--r--   1 amyroberts  staff   474K 10 Aug 13:09 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    41M 10 Aug 13:09 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 13:09 variables

tf_hf_classifier_vit_base_patch16_224_in21k:
total 21328
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 12:53 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:53 assets
-rw-r--r--   1 amyroberts  staff   162K 10 Aug 12:53 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    10M 10 Aug 12:53 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:53 variables

I haven't dug much into why the model is so much larger. A cursory glance at the model graphs didn't reveal anything particularly surprising.
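
For reference, the parameter counts above come from calling count_params() on each built classifier, and the on-disk sizes can be checked with a small script like this sketch (the directory-name pattern matches the listing above; this is illustrative, not the exact script used):

import os

def dir_size_mb(path):
    # Sum every file under a SavedModel directory and report the size in MB.
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e6

for d in sorted(os.listdir(".")):
    if d.startswith("tf_hf_classifier_") and os.path.isdir(d):
        print(f"{d}: {dir_size_mb(d):.1f} MB")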

ydshieh wrote this answer on 2022-08-18

Randomly jumping in this thread :-)

  • Are you able to reproduce this issue on a machine with similar specs to the Kaggle machines?
  • One way to narrow down the root cause is to gradually remove some parts of the code.
  • From the provided notebook, we can't draw any conclusion about a memory leak. A memory leak refers to memory usage increasing across repeated calls to the same block of code.
  • Suggestion: try to see whether this issue occurs during model saving, or whether memory usage increases during inference (see the sketch below).
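
For example, a minimal sketch of that kind of check, watching process memory across repeated predict() calls on the model from the snippet above (psutil and the dummy input are assumptions here, not part of the competition pipeline):

import os
import numpy as np
import psutil

process = psutil.Process(os.getpid())
dummy = np.random.randint(0, 256, size=(1, 224, 224, 3), dtype=np.uint8)

# If RSS keeps growing call after call, that suggests a leak; if it plateaus,
# the OOM is more likely peak usage (model size / activations) than a leak.
for step in range(50):
    _ = tf_huggingface_classifier.predict(dummy)
    print(f"step {step:02d}: RSS = {process.memory_info().rss / 1e6:.0f} MB")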

innat wrote this answer on 2022-08-18

@amyeroberts
Thanks for checking. I'll quickly check the sizes of these models in their torch versions.

@kfrancischen
Your feedback would be much appreciated here. (more info)

ydshieh wrote this answer on 2022-08-18

I would suggest debugging this in a VM outside Kaggle, though. I remember there are limited GPU/TPU hours per week on Kaggle. Don't waste your quota :-)

