Skip to content

fix(stream_io): Finalize temporary files correctly on __exit__ #72

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 30, 2024

Conversation

Eta0
Copy link
Contributor

@Eta0 Eta0 commented Jan 30, 2024

Standalone Object Storage Uploads & with

The following code using stream_io.open_stream() as a context manager had been broken, and didn't upload anything to object storage:

from tensorizer.stream_io import open_stream

with open_stream("s3://my-bucket/file.txt", "wb") as file:
    file.write(b"Content")
    # Close implicitly

Whereas with an explicit call to .close(), in a context manager or not, it worked correctly:

with open_stream("s3://my-bucket/file.txt", "wb") as file:
    file.write(b"Content")
    file.close()

The reason for this is that the finalizer for a temporary file backing a stream upload first checked that the file was not already closed before uploading, to avoid uploading twice if .close() was called twice (e.g. in the second snippet: .close() is called explicitly, and then implicitly at the end of the context). However, CPython's implementation of NamedTemporaryFile.__exit__() prior to Python 3.12 actually first closes the underlying file, and then calls the wrapper's NamedTemporaryFile.close() method, which caused the upload to be skipped entirely if the file closing was triggered only by a context ending without an explicit call to .close().

This bug generally did not affect TensorSerializer use cases, since TensorSerializer objects explicitly .close() their output files either immediately after serialization when calling TensorSerializer.close(), or when garbage collected, within TensorSerializer.__del__().

After the change to using weakref.finalize() to trigger uploads in commit 3fd7f1e, the upload callback became inherently idempotent, so the check that the file is not already closed is no longer necessary anyway. Thus, this change removes the check that the file is not already closed.

Additionally, changes to the NamedTemporaryFile implementation in CPython with the addition of the delete_on_close parameter in Python 3.12 (python/cpython#97015) had the side effect that a NamedTemporaryFile's .__exit__() method no longer calls its .close() method at all. Previously, the upload trigger was attached only to .close(), so this change also switches the finalizer to now be embedded as a hook in wrappers around both .close() and .__exit__() separately.

With both of these changes, the first example code snippet now works correctly in Python versions 3.8 through 3.12.

Eta0 added 2 commits January 29, 2024 19:51
The temporary file finalizer's check that the file is not already closed
was causing the upload to be skipped if the stream was used as a context
manager without an explicit call to .close(), because CPython's
implementation of .__exit__() for NamedTemporaryFile objects closes
the underlying file before calling the wrapper's own .close() method.
After changing to `weakref.finalize` in commit 3fd7f1e, uploading
temporary files became inherently idempotent, so the check that the file
is not already closed is no longer necessary anyway.
This change deletes that check.

Furthermore, changes to the NamedTemporaryFile implementation with the
addition of its delete_on_close parameter in Python 3.12
(python/cpython#97015) make its .__exit__() method no longer call
its .close() method at all, so this change also embeds the finalizer
as a hook in a wrapper around both .close() and .__exit__() separately.
@Eta0 Eta0 added the bug Something isn't working label Jan 30, 2024
@Eta0 Eta0 requested review from wbrown and sangstar January 30, 2024 02:28
@Eta0 Eta0 self-assigned this Jan 30, 2024
@Eta0 Eta0 changed the title Eta/stream upload context fix(stream_io): Finalize temporary files correctly on __exit__ Jan 30, 2024
Copy link
Contributor

@sangstar sangstar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just so I'm clear, was it actually skipping _temp_file_closer entirely? Also not sure why it's a good idea for Python 3.12+ to not be calling NamedTemporaryFile.close() during .__exit__(). Not sure what prompted them to give it a parameter to do this instead.

Glad weakref.finalize is a handy workaround for this :)

Is close() still called twice if open_stream is used as a context manager? I get that that's fine now that we're using an idempotent closing function, but why is calling close() twice necessary?

Copy link
Collaborator

@wbrown wbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch here.

@sangstar @Eta0 We should not be developing or testing on Python 3.12 as a general practice. 3.8/3.10 is what we should be targeting, as that's what Ubuntu 22.04 LTS has.

Running Python later than 3.10 will cause problems with some ML stuff.

@Eta0
Copy link
Contributor Author

Eta0 commented Jan 30, 2024

@sangstar

LGTM. Just so I'm clear, was it actually skipping _temp_file_closer entirely?

It was still calling _temp_file_closer(), but short-circuiting due to some logic in it and thus skipping the part of the function that would actually do anything.

Also not sure why it's a good idea for Python 3.12+ to not be calling NamedTemporaryFile.close() during .__exit__(). Not sure what prompted them to give it a parameter to do this instead.

NamedTemporaryFile in Python is a wrapper around a regular file object that holds an extra _closer functor-thing that closes and deletes the file (optionally, based on the delete parameter). In Python 3.12, they separated the delete parameter's functionality into delete: bool, delete_on_close: bool. The latter controls if .close() triggers a deletion, or if only .__exit__() will trigger a deletion. Allowing the file to close safely before deletion allows for more stuff like our use case, where we close the temporary file after writing, perform an action on it, and then delete it as a separate step, but it requires the file to be used as a context manager to clean itself up properly.

So the reason it doesn't call close() anymore is because the logic was shifted from .__exit__() calling .close() calling ._closer.close() into .__exit__() calling ._closer.cleanup() (new) and .close() calling ._closer.close() which optionally calls ._closer.cleanup(), depending on the value of delete_on_close:

              Before                             After
┌──────────────────────────────┐   ┌──────────────────────────────┐
│      _TemporaryFileWrapper   │   │      _TemporaryFileWrapper   │
│                              │   │                              │
│ .__exit__()────────►.close() │   │ .__exit__()         .close() │
│                            │ │   │ │                          │ │
└────────────────────────────┼─┘   └─┼──────────────────────────┼─┘
                             │       │                          │
┌────────────────────────────┼─┐   ┌─┼──────────────────────────┼─┐
│      _TemporaryFileCloser  │ │   │ │    _TemporaryFileCloser  │ │
│                            │ │   │ │                          │ │
│                            │ │   │ ▼         (optional)       ▼ │
│            .close()◄───────┘ │   │ .cleanup()◄── ─── ──.close() │
│        (close & delete)      │   │(close & delete)     (close)  │
└──────────────────────────────┘   └──────────────────────────────┘

Glad weakref.finalize is a handy workaround for this :)

Yep, weakref.finalize() makes its callback idempotent in addition to almost-guaranteeing (outside of cases like SIGKILL) that it runs before the interpreter shuts down. Perfect for our use case.

Is close() still called twice if open_stream is used as a context manager? I get that that's fine now that we're using an idempotent closing function, but why is calling close() twice necessary?

The stream is closed twice when you do this:

with open_stream(...) as file:
    ...  # Write some stuff
    file.close()  # First close
# file.__exit__() is triggered upon leaving the block, second close

It's normally not necessary to write this—though because of idempotence guarantees, it is perfectly safe to do so—and people should be doing either:

# No context manager
file = open_stream(...)
# Write some stuff
file.close()

Or:

# No explicit .close()
with open_stream(...) as file:
    ...  # Write some stuff

But the latter option was broken until now. However, the double close case can happen indirectly with tensorizer if you write:

with open_stream(...) as file:
    serializer = TensorSerializer(file)
    serializer.write_module(...)
    serializer.close()  # Internally calls `file.close()` in addition to some other finalizations

Where the serializer calls file.close() to support streams with no external reference, like a file opened automatically by passing a path to TensorSerializer, or this pattern:

serializer = TensorSerializer(
    open_stream(
        path_uri=...,
        mode="wb",
        s3_access_key_id=...,
        s3_secret_access_key=...,
        s3_endpoint=...,
    )
)

Since those cannot otherwise be closed by the calling code.

@wbrown wbrown merged commit 1f1d0f6 into main Jan 30, 2024
@Eta0 Eta0 deleted the eta/stream-upload-context branch January 30, 2024 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants