fix(stream_io): Finalize temporary files correctly on __exit__
#72
The temporary file finalizer's check that the file is not already closed was causing the upload to be skipped if the stream was used as a context manager without an explicit call to .close(), because CPython's implementation of .__exit__() for NamedTemporaryFile objects closes the underlying file before calling the wrapper's own .close() method. After changing to `weakref.finalize` in commit 3fd7f1e, uploading temporary files became inherently idempotent, so the check that the file is not already closed is no longer necessary anyway. This change deletes that check. Furthermore, changes to the NamedTemporaryFile implementation with the addition of its delete_on_close parameter in Python 3.12 (python/cpython#97015) make its .__exit__() method no longer call its .close() method at all, so this change also embeds the finalizer as a hook in a wrapper around both .close() and .__exit__() separately.
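The skipped upload can be reproduced in miniature. This is only a sketch under assumptions: `guarded_upload` and the `uploads` list are hypothetical stand-ins rather than tensorizer's code, `io.BytesIO` substitutes for the temporary file, and the pre-3.12 `.__exit__()` ordering is simulated by closing the file before invoking the hook.

```python
import io

uploads = []


def guarded_upload(file, name):
    """Hypothetical pre-fix finalizer: skip the upload if the file is already closed."""
    if file.closed:
        return  # The guard meant to prevent a double upload
    file.close()
    uploads.append(name)  # Stand-in for the real upload


# Explicit close: the guard sees an open file, so the upload runs.
explicit = io.BytesIO()
guarded_upload(explicit, "explicit")

# Simulated pre-3.12 __exit__ ordering: the underlying file is closed first,
# and only afterwards is the close hook invoked, so the guard skips the upload.
implicit = io.BytesIO()
implicit.close()
guarded_upload(implicit, "implicit")

print(uploads)
```

Only the explicitly closed stream ends up in `uploads`, which mirrors the behaviour this change fixes.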
LGTM. Just so I'm clear, was it actually skipping `_temp_file_closer` entirely? Also not sure why it's a good idea for Python 3.12+ to not be calling `NamedTemporaryFile.close()` during `.__exit__()`. Not sure what prompted them to give it a parameter to do this instead.

Glad `weakref.finalize` is a handy workaround for this :)

Is `close()` still called twice if `open_stream` is used as a context manager? I get that that's fine now that we're using an idempotent closing function, but why is calling `close()` twice necessary?
It was still calling
So the reason it doesn't call
Yep,
The stream is closed twice when you do this:

    with open_stream(...) as file:
        ...  # Write some stuff
        file.close()  # First close
    # file.__exit__() is triggered upon leaving the block, second close

It's normally not necessary to write this (though because of idempotence guarantees, it is perfectly safe to do so), and people should be doing either:

    # No context manager
    file = open_stream(...)
    # Write some stuff
    file.close()

Or:

    # No explicit .close()
    with open_stream(...) as file:
        ...  # Write some stuff

But the latter option was broken until now. However, the double close case can happen indirectly with tensorizer if you write:

    with open_stream(...) as file:
        serializer = TensorSerializer(file)
        serializer.write_module(...)
        serializer.close()  # Internally calls `file.close()` in addition to some other finalizations

Where the serializer calls `file.close()` itself so that streams passed in like this still get closed:

    serializer = TensorSerializer(
        open_stream(
            path_uri=...,
            mode="wb",
            s3_access_key_id=...,
            s3_secret_access_key=...,
            s3_endpoint=...,
        )
    )

Since those cannot otherwise be closed by the calling code.
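As a rough illustration of why the double close is harmless, the sketch below runs the three usage patterns above against a toy stream. `ToyStream` is a hypothetical stand-in for whatever `open_stream()` returns, and appending to a list stands in for the upload; the only real API relied on is `weakref.finalize`, whose callback runs at most once.

```python
import weakref


class ToyStream:
    """Hypothetical stand-in for a stream whose close triggers an upload."""

    def __init__(self, name, uploads):
        # The finalizer's callback runs at most once, however often it is called.
        self._finalizer = weakref.finalize(self, uploads.append, name)

    def close(self):
        self._finalizer()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False


uploads = []

# No context manager, explicit close
stream = ToyStream("explicit", uploads)
stream.close()

# Context manager, no explicit close
with ToyStream("with-only", uploads):
    pass

# Context manager plus explicit close (closed twice)
with ToyStream("both", uploads) as stream:
    stream.close()

print(uploads)
```

Each stream contributes exactly one entry to `uploads`, including the one that was closed twice.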
Standalone Object Storage Uploads & `with`
The following usage had been broken: when `stream_io.open_stream()` was used as a context manager without an explicit call to `.close()`, nothing was uploaded to object storage. With an explicit call to `.close()`, in a context manager or not, it worked correctly.

The reason for this is that the finalizer for a temporary file backing a stream upload first checked that the file was not already closed before uploading, to avoid uploading twice if `.close()` was called twice (e.g. when `.close()` is called explicitly inside a `with` block, and then implicitly at the end of the context). However, CPython's implementation of `NamedTemporaryFile.__exit__()` prior to Python 3.12 actually first closes the underlying file, and then calls the wrapper's `NamedTemporaryFile.close()` method, which caused the upload to be skipped entirely if the file closing was triggered only by a context ending without an explicit call to `.close()`.

This bug generally did not affect `TensorSerializer` use cases, since `TensorSerializer` objects explicitly `.close()` their output files either immediately after serialization when calling `TensorSerializer.close()`, or when garbage collected, within `TensorSerializer.__del__()`.

After the change to using `weakref.finalize()` to trigger uploads in commit 3fd7f1e, the upload callback became inherently idempotent, so the check that the file is not already closed is no longer necessary. Thus, this change removes that check.

Additionally, changes to the `NamedTemporaryFile` implementation in CPython with the addition of the `delete_on_close` parameter in Python 3.12 (python/cpython#97015) had the side effect that a `NamedTemporaryFile`'s `.__exit__()` method no longer calls its `.close()` method at all. Previously, the upload trigger was attached only to `.close()`, so this change also switches the finalizer to now be embedded as a hook in wrappers around both `.close()` and `.__exit__()` separately.

With both of these changes, using `open_stream()` as a context manager without an explicit call to `.close()` now works correctly in Python versions 3.8 through 3.12.
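One way to picture the hook on both `.close()` and `.__exit__()` is a small proxy class. This is a sketch under assumptions, not the wrapper this PR actually adds: `HookedTempFile` and `fake_upload` are hypothetical names, and the temporary file is created with `delete=False` so that it still exists on disk when the callback reads it.

```python
import os
import tempfile
import weakref

uploaded = []


def fake_upload(path):
    """Hypothetical stand-in for an object storage upload."""
    with open(path, "rb") as f:
        uploaded.append(f.read())
    os.unlink(path)


class HookedTempFile:
    """Hypothetical proxy that runs a finalizer from both close() and __exit__()."""

    def __init__(self, temp_file, on_close):
        self._file = temp_file
        # Idempotent: the callback runs at most once.
        self._finalizer = weakref.finalize(self, on_close, temp_file.name)

    def __getattr__(self, name):
        # Delegate everything else (write, read, ...) to the wrapped file.
        return getattr(self._file, name)

    def __enter__(self):
        self._file.__enter__()
        return self

    def close(self):
        self._file.close()
        self._finalizer()

    def __exit__(self, exc_type, exc, tb):
        # Does not rely on the wrapped file's __exit__ calling close().
        result = self._file.__exit__(exc_type, exc, tb)
        self._finalizer()
        return result


# Context manager without an explicit close
with HookedTempFile(tempfile.NamedTemporaryFile(mode="wb", delete=False), fake_upload) as f:
    f.write(b"context only")

# Explicit close, repeated; the second call is a no-op
g = HookedTempFile(tempfile.NamedTemporaryFile(mode="wb", delete=False), fake_upload)
g.write(b"explicit close")
g.close()
g.close()

print(uploaded)
```

Because the finalizer is invoked from both methods, the sketch does not depend on whether the interpreter's `NamedTemporaryFile.__exit__()` calls `.close()`.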