Framework: TensorFlow 2 (Keras)
Version: 2.6
OS: Kaggle
In a recent Kaggle competition (hosted by Google), I tried to use the pretrained tf swin transformer model from Hugging Face, but even with the base model I consistently received an out-of-memory (OOM) error. Below is the submission status with a tf swin model. tf_convnext_xlarge is able to run without OOM.
So I'm assuming there might be a memory leak in the tf_swin implementation. Below is the code I use to build the complete model.
import tensorflow as tf
from tensorflow import keras
from transformers import AutoFeatureExtractor, TFSwinModel

# INPUT_SHAPE (height, width) is defined elsewhere in the notebook.
model_id = "microsoft/swin-base-patch4-window7-224-in22k"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# Preprocessing is part of the model: raw uint8 images in, embeddings out.
inputs = keras.Input(shape=(None, None, 3), dtype='uint8')
mode_inputs = tf.cast(inputs, tf.float32)
mode_inputs = keras.layers.Resizing(*INPUT_SHAPE)(mode_inputs)
mode_inputs = keras.layers.Rescaling(scale=1.0 / 255)(mode_inputs)
mode_inputs = keras.layers.Normalization(
    mean=feature_extractor.image_mean,
    variance=[x ** 2 for x in feature_extractor.image_std],
    axis=3
)(mode_inputs)
# The Hugging Face model expects channels-first input.
mode_inputs = keras.layers.Permute(dims=(3, 1, 2))(mode_inputs)

tf_huggingface_module = TFSwinModel.from_pretrained(model_id)
tf_huggingface_module.trainable = False

logits = tf_huggingface_module(mode_inputs)
adv_logits = keras.layers.Dense(64)(logits.pooler_output)
outputs = keras.layers.Lambda(
    lambda x: tf.math.l2_normalize(x, axis=-1), name='embedding_norm'
)(adv_logits)
tf_huggingface_classifier = keras.Model(inputs, outputs)
It should work like the other models. To reproduce the issue exactly, you may (in the worst case) need to run it on the Kaggle platform. The Kaggle submission status (as shown in the diagram above) is not very descriptive; it only shows the submission status :(. Mainly, I'd like to know what could be causing this and any possible solution.
Hi @innat, thanks for flagging this!
In order to help figure out what's causing the problem and possible solutions, could you please answer the following questions:
What would help most and answer all of these would be a saved kaggle notebook that you could share.
Hello @amyeroberts; thanks for checking. To answer your questions:
Could you give the version of transformers you're using and any other relevant packages?
tf.__version__, tfa.__version__, transformers.__version__
('2.6.4', '0.14.0', '4.22.0.dev0')
Does the notebook run successfully before entering it as a submission? If not, what line of code causes the failure?
Could you give details on the checkpoint used for convnext? Can you confirm the convnext model works with the exact same pipeline?
When you said other frameworks work fine - can you confirm that you were able to use the equivalent Swin PyTorch model on the same swin checkpoint?
What would help most and answer all of these would be a saved kaggle notebook that you could share.
It contains the TensorFlow ConvNeXt and Swin model pipelines and the relevant package versions. The modeling strategy, saving, and submission process follow the competition rules. The evaluation page also describes how they evaluate both frameworks and the expected modeling approach. Hope it helps.
Hi @innat, thank you for all of your detailed responses and for sharing the notebook.
I ran the notebook in kaggle and was able to save out the model with the checkpoint you used in your first example:
The notebook is here: https://www.kaggle.com/code/aeroberts4444/test-swin-saving/notebook
Are you able to run the notebook you shared on kaggle? Or do you still hit the OOM?
@amyeroberts Thanks for running the code.
Yes, if you run the code that I shared, you won't see any OOM effect right away. As I said, I tried to submit two models from Hugging Face (a tiny Swin model and "facebook/convnext-large-224-22k-1k") to this competition.
The ConvNeXt model is much larger than tiny Swin, but at inference time the submission always exceeds the allowed compute resources for tiny Swin while working fine for the large ConvNeXt model. That's why I have a weak suspicion that there may be some issue with the Swin implementation. Also, I later realized that PyTorch practitioners use the timm version of the Swin model, not the one from huggingface, and no OOM issue has been found with that.
This competition is unique (no training or test data is provided), so it might be hard to debug the root cause. Please let me know if it's out of scope to address such an issue.
Hi @innat, thanks for clarifying. It's certainly a problem if there's a memory leak and one we'd want to address. I'm going to continue to look into this. As you said, because of the nature of kaggle and the competition it can be hard to debug. As such, it might take some time before I manage to figure out if there's a problem, what it is and how to solve it.
Hi @innat. As mentioned above, it's quite hard to debug without knowing what's happening during submission and without logs from the kaggle notebook. My current best guess is that it's due to the size of the saved Swin model.
Using your script to create and save out a model, I looked at the sizes across different checkpoints:
"microsoft/resnet-50"                           # 23,561,152 params
"google/vit-base-patch16-224-in21k"             # 86,389,248 params
"microsoft/swin-base-patch4-window7-224-in22k"  # 86,743,224 params
"microsoft/swin-tiny-patch4-window7-224"        # 27,519,354 params
"facebook/convnext-large-224-22k-1k"            # 196,230,336 params
tf_hf_classifier_convnext_large_224_22k_1k:
total 25712
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:13 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 13:13 assets
-rw-r--r--   1 amyroberts  staff   510K 10 Aug 13:13 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    12M 10 Aug 13:13 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 13:13 variables

tf_hf_classifier_resnet_50:
total 12048
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 12:51 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:51 assets
-rw-r--r--   1 amyroberts  staff   488K 10 Aug 12:51 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff   5.4M 10 Aug 12:51 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:51 variables

tf_hf_classifier_swin_base_patch4_window7_224_in22k:
total 179216
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:00 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:59 assets
-rw-r--r--   1 amyroberts  staff   7.4M 10 Aug 13:00 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    80M 10 Aug 13:00 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:59 variables

tf_hf_classifier_swin_tiny_patch4_window7_224:
total 83944
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 13:09 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 13:09 assets
-rw-r--r--   1 amyroberts  staff   474K 10 Aug 13:09 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    41M 10 Aug 13:09 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 13:09 variables

tf_hf_classifier_vit_base_patch16_224_in21k:
total 21328
drwxr-xr-x   6 amyroberts  staff   192B 10 Aug 12:53 .
drwxr-xr-x  24 amyroberts  staff   768B 10 Aug 13:13 ..
drwxr-xr-x   2 amyroberts  staff    64B 10 Aug 12:53 assets
-rw-r--r--   1 amyroberts  staff   162K 10 Aug 12:53 keras_metadata.pb
-rw-r--r--   1 amyroberts  staff    10M 10 Aug 12:53 saved_model.pb
drwxr-xr-x   4 amyroberts  staff   128B 10 Aug 12:53 variables
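For anyone who wants to reproduce a comparison like the listing above without `ls`, here is a minimal sketch using only the Python standard library. The helper name `saved_model_sizes` is hypothetical, and the glob pattern assumes export directories named like the ones in the listing; adjust both to your own setup.

```python
from pathlib import Path


def saved_model_sizes(export_dir):
    """Return sizes in bytes of the main parts of a Keras SavedModel directory."""
    root = Path(export_dir)
    sizes = {}
    # The serialized graph and the Keras metadata are single files.
    for name in ("saved_model.pb", "keras_metadata.pb"):
        path = root / name
        sizes[name] = path.stat().st_size if path.is_file() else 0
    # The weights live in variables/, possibly split across several files.
    variables = root / "variables"
    sizes["variables"] = (
        sum(p.stat().st_size for p in variables.rglob("*") if p.is_file())
        if variables.is_dir()
        else 0
    )
    return sizes


if __name__ == "__main__":
    # Assumed naming scheme, taken from the directory names in the listing.
    for export_dir in sorted(Path(".").glob("tf_hf_classifier_*")):
        sizes = saved_model_sizes(export_dir)
        print(export_dir.name)
        for part, n_bytes in sizes.items():
            print(f"  {part:20s} {n_bytes / 1e6:8.1f} MB")
```

Reporting `variables` separately from `saved_model.pb` helps tell apart a model that is large because of its weights from one that is large because of its serialized graph.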
I haven't dug much into why the model is so much larger. A cursory glance at the model graphs didn't reveal anything particularly surprising.
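To put "so much larger" into numbers, here is a rough sketch that divides each `saved_model.pb` size by the parameter count, using only the figures quoted in this thread. The sizes are the rounded values printed by `ls`, so the ratios are approximate, and the helper name is hypothetical.

```python
# (saved_model.pb size in MB as printed by ls, parameter count), from this thread.
models = {
    "microsoft/resnet-50": (5.4, 23_561_152),
    "google/vit-base-patch16-224-in21k": (10.0, 86_389_248),
    "microsoft/swin-base-patch4-window7-224-in22k": (80.0, 86_743_224),
    "microsoft/swin-tiny-patch4-window7-224": (41.0, 27_519_354),
    "facebook/convnext-large-224-22k-1k": (12.0, 196_230_336),
}


def graph_mb_per_million_params(size_mb, n_params):
    """Approximate MB of serialized graph per million parameters."""
    return size_mb / (n_params / 1e6)


ratios = {
    name: graph_mb_per_million_params(size_mb, n_params)
    for name, (size_mb, n_params) in models.items()
}

# Print the checkpoints from the largest ratio to the smallest.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:48s} {ratio:6.3f} MB of graph per million params")
```

Both Swin checkpoints end up well above the other three on this ratio, which matches the observation that the graph file, rather than the parameter count, is what stands out.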
Randomly jumping in this thread :-)