This repository was archived by the owner on Jul 10, 2025. It is now read-only.

RFC: Introducing Transactions extension to Modular Filesystems #245

Merged
merged 7 commits into from
Jul 15, 2020

Conversation

samikama
Contributor

@samikama samikama commented May 13, 2020

cc @samikama @mihaimaruseac

Transactional File Systems Support

Status Accepted
RFC # 245
Author(s) Sami Kama ([email protected])
Sponsor Mihai Maruseac ([email protected])
Updated 2020-05-05

Objective

The aim of this RFC is to extend filesystem access support to persistent storage systems that
provide transactional access and eventual consistency.

@vnghia

vnghia commented May 14, 2020

Hello, I am currently implementing the s3 and gcs filesystem plugins for TensorFlow. Currently, we use the approach of "prefetching the data to the local system and then uploading it after processing". I have just skimmed through your RFC, but I think this may increase the number of requests made to the cloud server? We chose prefetching to minimize the requests made to the cloud server.

@samikama
Contributor Author

samikama commented May 14, 2020

Hi,
Thanks for the first comment :). The prefetching approach has several limitations. First, it delays the start of processing because you have to wait for the data to arrive at the host. Second, there are size limitations: some domains have hundreds of TB of data, and it would be unfeasible, if not impossible, to prefetch it all to the host. Likewise, when uploading, it is possible to run out of disk space before processing finishes. Even though some of these issues can be mitigated by provisioning larger resources, such an approach is likely to increase costs and place an unnecessary burden on users.

@vnghia

vnghia commented May 14, 2020

I agree that it has some limitations; however, as far as I know, cloud filesystems like s3 and gcs do not allow objects to be modified. Once we upload something to the cloud server, the object becomes immutable. So in order to implement WritableFile, we first write to a temporary file and upload it to the cloud when the user calls Sync. A similar thing holds for AppendableFile: we have to prefetch the file to a temporary local file, append to it, and overwrite the remote object when uploading to the server.

@samikama
Contributor Author

Of course there are workarounds for such issues, but they are not necessarily optimal solutions for all cases. Transactions allow users to express intent. For example, consider writing to a file and then conditionally appending to it. With the mechanism you describe, the file will be uploaded to the cloud when it is closed after write(). Then, if the append branch is taken, it will be downloaded, appended to, and uploaded again. With transactions, however, the user can close the transaction after the append() decision and operation have finished, which hints to the file system to upload only once the transaction is done, saving time, storage, and bandwidth. If you keep the file on the local system until the process is finished, there is a chance that it fails down the road, losing all the work that went into producing those files. Also, in the inference case, TF might be producing files that are being uploaded for consumption by other processes/services, so keeping the file until the end of the process is a no-go under those circumstances. Finally, there are non-classical persistent storage systems other than gcs or s3 that may benefit from knowing the user's intent.
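To illustrate the intent (all names here are hypothetical illustrations, not the proposed API): a filesystem that defers uploads while a transaction is open performs a single upload on commit, whereas an eager implementation uploads on every close/sync.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch, not the proposed TF API: a cloud filesystem that
// uploads eagerly on every write/append unless a transaction is open,
// in which case the upload is deferred until EndTransaction().
class MockCloudFs {
 public:
  void Write(const std::string& data) { Touch(data); }
  void Append(const std::string& data) { Touch(data); }
  void StartTransaction() { in_txn_ = true; }
  void EndTransaction() {
    // Commit: a single upload covers all deferred modifications.
    if (in_txn_ && dirty_) { ++uploads_; dirty_ = false; }
    in_txn_ = false;
  }
  int uploads() const { return uploads_; }

 private:
  void Touch(const std::string& data) {
    buffer_ += data;
    if (in_txn_) dirty_ = true;  // defer the upload until commit
    else ++uploads_;             // eager upload per operation
  }
  std::string buffer_;
  bool in_txn_ = false;
  bool dirty_ = false;
  int uploads_ = 0;
};
```

With the eager scheme, write() followed by append() costs two uploads; inside a transaction, the same sequence costs one.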

@vnghia

vnghia commented May 14, 2020

You can take a look at what I am doing here: GCS plugin, S3 plugin.

Basically, I keep a std::shared_ptr<std::fstream> in each struct.

```cpp
namespace tf_random_access_file {

typedef struct S3File {
  const char* bucket;
  const char* object;
  const std::shared_ptr<Aws::Transfer::TransferManager>& transfer_manager;
  // FstreamWithName is a child class of `std::fstream`
  // with a member variable `name` generated randomly,
  // something like tensorflow_tmp_abcdef_tmp_s3_filesystem,
  // to prevent it from being used by another process;
  // the file gets deleted when it goes out of scope.
  std::shared_ptr<FstreamWithName> temp_file;
} S3File;

}  // namespace tf_random_access_file
```

> For example, consider writing to a file and then conditionally appending to it. With the mechanism you describe, the file will be uploaded to the cloud when it is closed after write(). Then, if the append branch is taken, it will be downloaded, appended to, and uploaded again.

> If you keep the file on the local system until the process is finished, there is a chance that it fails down the road, losing all the work that went into producing those files. Also, in the inference case, TF might be producing files that are being uploaded for consumption by other processes/services, so keeping the file until the end of the process is a no-go under those circumstances.

It is true that I keep a temporary file until the user explicitly calls Cleanup() (Sync() and Flush() won't close the std::fstream; they just upload the file to the server).

> With transactions, however, the user can close the transaction after the append() decision and operation have finished, which hints to the file system to upload once the transaction is done, saving time, storage, and bandwidth.

So you mean transactions act like a signal to tell the cloud filesystem that everything is done, so the server can close the connection, port, and the connection stream, or something like this? As far as I know, gcs and s3 use curl HTTP GET/POST internally, and they do not work like this. If the transaction acts like a signal for TensorFlow to upload, I have implemented it in a way that Append() appends to a local temporary file, and the file is only uploaded to the cloud when Sync(), Flush() or Close() is called, so is there no need for a transaction?

> Also, there are non-classical persistent storage systems other than gcs or s3 that may benefit from knowing the user's intent.

Currently, TensorFlow only supports 3 non-classical persistent storage systems: hdfs, s3 and gcs. If we want to actually use this Transaction extension, we may have to add another filesystem. (I haven't touched hdfs yet, but I think it won't support this mechanism.)

@samikama samikama changed the title Introducing Transactions extension to Modular Filesystems RFC [RFC] Introducing Transactions extension to Modular Filesystems RFC May 15, 2020
```cpp
class Filesystem {
  // Transaction Token API extensions
  virtual Status GetTransactionTokenForFile(const string& file_name, std::unique_ptr<TransactionToken>* token) = 0;
  virtual Status StartTransaction(const string& transaction_name, std::unique_ptr<TransactionToken>* token) = 0;
  // ...
};
```
Contributor

std::unique_ptr<TransactionToken>* is a weird type (a pointer to a unique pointer to TransactionToken).

I think the signature should just be TransactionToken*, and we should clearly separate methods that can create/destroy a TransactionToken from ones that just need a reference to it.

Contributor Author

Hi @alextp, thanks for the comments. I made them unique_ptr* to be in line with existing implementations. I don't have any special preference for it; I will change it to TransactionToken*.


The transaction token will be owned by the filesystem, and use of it after `EndTransaction` will be an invalid operation.

File classes can be modified to keep a TransactionToken, assigned by the filesystem at construction using the given scope, or the default scope if none is given. Filesystems may ignore it if a transaction at that level doesn't make sense.
Contributor

I don't think saying that filesystems may ignore transaction tokens is very useful, as this makes it impossible for code to work correctly on FSs with and without transaction tokens.

Contributor Author

I am not completely sure what you meant here. I believe not all file systems will make use of transactions. For example, for a RAM-based file system, most if not all transactions are meaningless. Such a file system can return dummy transaction tokens and ignore the transaction token pointers it is given. So all file systems will accept tokens, but not necessarily create or act on different transactions. I will try to add some examples to clarify this.

Contributor

If a file system is guaranteed to have no other users, then transactions are meaningless; otherwise, transactions are meaningful.

Like, even on my local disk, if I write a model checkpoint with N files and another process goes and deletes file 1 while I'm writing file N, I have an invalid checkpoint. Similarly, if I start writing a checkpoint with N files and another process tries to read from my local FS and sees the half-written checkpoint, it'll fail. Creating a no-op transaction token for my local filesystem would then let these issues go through, as opposed to actually doing the atomic writes we would do in networked filesystems if we do have transactions.

I think my question was more about the intent of the token: is it "please try to do a transaction but don't bother if you can't", or is it "I want to be able to rely on these things being atomic, and please tell me if something failed"?

Contributor Author

It would be really hard to implement the things you describe across processes, with or without transactions, since it would require some form of synchronization between any potential reader/writer combination. Since that is not possible, it is not the intent here. The purpose of transactions is to tell the filesystem that a set of operations are complete together. In your example, the transaction will cover the N file writes. For example, if another reader tries to read a file generated in a transaction before the transaction is finalized, the FS can block that thread. So transactions are hints to the file system saying "these N operations are meaningful together".
If the underlying persistent storage doesn't have atomic operation support, the only solution is to implement a blocking mechanism. I would say it is closer to the first interpretation than the second.

For the posix file example you gave, if multiple processes are using the same posix fs plugin, it may be possible to use lock files to mimic IPC. But if the consumer of the checkpoints is, for example, a copy/move script, not much can be done.

The existing filesystem C++ API can easily be expanded by the addition of three methods, an opaque structure, and possibly a helper class to support transactions.

```cpp
struct TransactionToken {
  // Opaque, filesystem-specific contents.
};
```
Contributor

One alternative, less intrusive API modification: instead of optionally passing transaction tokens to all FS methods, we can have a variant of GetFileSystemForFile, something like StartTransactionForFile, which instead returns a FileSystem that, when released, commits the transaction. Implicitly, all operations made on this file system are then part of the transaction until it is committed.

WDYT?
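A minimal sketch of this variant (BaseFs, TransactionalFs, and StartTransactionForFile are hypothetical names, not the actual TF API), assuming the returned filesystem commits when it is released:

```cpp
#include <cassert>
#include <memory>

// Hypothetical base filesystem that can commit a transaction.
struct BaseFs {
  int committed = 0;
  void Commit() { ++committed; }
};

class TransactionalFs {
 public:
  explicit TransactionalFs(BaseFs* base) : base_(base) {}
  ~TransactionalFs() { base_->Commit(); }  // commit when the wrapper is released
  // Every operation through this object is implicitly in the transaction.
  void WriteFile(const char* /*path*/) {}

 private:
  BaseFs* base_;
};

// Hypothetical analogue of a StartTransactionForFile-style entry point.
std::unique_ptr<TransactionalFs> StartTransactionForFile(
    BaseFs* base, const char* /*file_name*/) {
  return std::make_unique<TransactionalFs>(base);
}
```

The commit is tied to the wrapper's lifetime, so callers never handle a token explicitly.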

Member

@alextp +1 for non-intrusive APIs

Contributor Author

If I understood correctly, that is quite a brilliant idea. You are talking about binding the token to the FS object, at which point all FS operations done on that object belong to the same transaction. But since most existing code uses the Env class for file operations, which forwards them to the respective FS, it would require more changes in the code base: all code using Env for file system interaction would need to be changed to use the FS directly. If this is acceptable, I like your idea better.
In the C API, things can get a bit messier though, since it would mean constructing all TF_*Ops structures and a lot of gymnastics to get it right, unless we require that the C API still take tokens as proposed in this RFC while the C++ wrappers incorporate the logic you brought up. But that would mean no direct C API interaction with file systems in framework code.

Contributor

That's a good point; the naive, less intrusive approach I suggested won't work.

I don't have a rejoinder now, but I'll try to think of other non-intrusive or less intrusive approaches.

Contributor

Just a note: Core TF would never directly call the filesystem C API outside of the C++ wrapper.

Contributor

We can block the operation that attaches a file to a token if the file is already attached to another token. This ensures blocking behavior and delays overwrites.

Maybe this is actually the better implementation for concurrent transactions on the same file.

Contributor Author

In that case, should the filesystem check whether the attached object is a directory or a file? What happens in the case of directories? Also, we have to make sure that appending to a file uses the same token if done in another method. This may require a unique, filesystem-based token key to be used when starting a transaction, which would return the existing token if there is still an ongoing transaction with the same key, or a new transaction if there isn't.

I am OK with this approach, but all of this will start adding quite complicated layers of bookkeeping to the filesystems, which may have an impact on certain operations or filesystems.

Contributor Author

Do we prefer keeping the API unchanged at the cost of complicating the implementations?

Contributor

Adding the new parameter is intrusive and requires more changes than adding a declarative API to create a token and then attach objects to it. The modular filesystem design does not allow subclassing and only creates one operations table per URI scheme, so core TF won't be able to pick between a subclass that has tokens and one that ignores them.

Contributor Author

But if we relax subclassing of the modular filesystem, or just make it something like a base subsystem and let users always interact through the wrapped filesystem, it won't be an issue, right? The op table can contain the API with the tokens, but the wrapper may keep them optional. There will still be one op table per URI and one filesystem. Users will access it through a filesystem pointer instead of Env::Default(), and that would probably be the only difference. This would be done purely on the FW side, and plugins would only implement the API.


For the newly proposed filesystem plugin mechanism, two possible approaches exist. For the `TF_RandomAccessFile`, `TF_WritableFile`, and `TF_ReadOnlyMemoryRegion` structures,

- Opaque pointers stay as is, so no changes are needed in the structures. Each filesystem then attaches tokens to its own internal structures pointed to by `void*`.
Contributor

I prefer keeping the opaque pointers.


Transactions can also help with some filesystem access inconsistencies that can happen while reading and writing checkpoints. For example, while one thread is reading files from a directory, another may be modifying the underlying files. This could lead to the reader seeing an inconsistent or corrupt set of files. With transactions, each thread can have a different transaction token, and the underlying file system can choose to postpone modification of the files by redirecting writes to a temporary location and then moving them into place when the transaction ends.
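The postpone-and-move behavior described above could look like the following sketch (hypothetical names; not the TF implementation): writes land in a staging area while the transaction is open, and all files become visible together on commit.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative sketch only: while a transaction is open, writes are
// redirected to a staging area; EndTransaction() moves them into place,
// so readers never observe a partially written set of files.
class StagingFs {
 public:
  void Write(const std::string& name, const std::string& data) {
    staged_[name] = data;  // redirected to a temporary location
  }
  void EndTransaction() {
    for (const auto& kv : staged_) visible_[kv.first] = kv.second;
    staged_.clear();  // all files become visible together
  }
  bool Visible(const std::string& name) const {
    return visible_.count(name) != 0;
  }

 private:
  std::map<std::string, std::string> staged_;
  std::map<std::string, std::string> visible_;
};
```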

## User Benefit
Contributor

One question I have is how we expect FSs with transactions to coexist with FSs without transactions.

Do we expect only user-written python code to be made aware of transactions, or do we also expect opkernel C++ code to be transaction-aware? I ask because we do a lot of IO inside opkernels today (summary writing, checkpointing, data reading, etc.), and I'd like to understand whether we plan to modify all of this to use transactions, and how.

Member

The C++/C code change will also impact the existing external file system plugins that live outside of the TensorFlow core repo. The tensorflow/io repo has several: azfs (Microsoft Azure), Apache Ignite, OSS, and HTTP. File system plugins also exist in many other places outside the tensorflow org.

Wondering if there is a non-intrusive method that would not break implementations in many places. cc @mihaimaruseac

Contributor Author

Since this proposal is an extension to the Modular File Systems RFC, it is expected that all filesystems use the API; however, as I mentioned above, this may be as little as adding an extra pointer to their methods and ignoring the passed-in value. Filesystems should not be forced to act on transactions and are allowed to ignore them or use a default transaction.

@yongtang Since all file systems are expected to migrate to the plugin API, I don't think an API change is avoidable, and transaction tokens can easily be included at that time.

The second point, about making the C++ code transaction-aware, is unfortunately more intrusive. I can think of two options there:

  • Transactions are initiated before the ops, so that the ops can call GetTransactionTokenForFile (or the equivalent from your proposal). This probably needs to be achieved by adding transaction begin/end ops to the graph, possibly by the python layer that inserts the C++ ops.

  • Transaction tokens are made inputs and outputs of these C++ ops, created by upstream begin-transaction ops and finalized by downstream end-transaction ops.

@alextp A quick search didn't turn up many ops; are you aware of any other ops doing file I/O besides save_restore_v2_ops?

Member

@samikama ReadFile and WriteFile are some of the C++ kernel ops, mostly in core/ops/io_ops.cc. But tf.data also has quite a few ops that touch the file system API implicitly. As those C++ kernel op APIs are mostly frozen, the input args of the ops may not be easily changeable (version compatibility).

@alextp
Contributor

alextp commented May 20, 2020 via email

@samikama
Contributor Author

Hi all,

I updated the proposal, incorporating some of the changes discussed above and the approaches from @mihaimaruseac and @alextp, with a trivial example of how it would be modified, for further discussion.

Contributor

@mihaimaruseac mihaimaruseac left a comment

For the stateful tokens, the existing API methods should not have the token in the implementation

@ematejska ematejska changed the title [RFC] Introducing Transactions extension to Modular Filesystems RFC RFC: Introducing Transactions extension to Modular Filesystems Jun 9, 2020
@ematejska ematejska added the RFC: Proposed RFC Design Document label Jun 9, 2020
@mihaimaruseac
Contributor

Notes from the design meeting:

1. Three proposals discussed:
   1. Original: the user creates a token and carries it around.
      - Could this break backwards compatibility? The demo implementation says no (there are default empty values for the token argument).
      - User-unfriendly; requires changes to the op graph (new ops, new inputs).
   2. WrappedFileSystem:
      - Returned transparently to the user; hides the token inside the class implementation.
      - User-friendly, but still needs changes to the op graph.
   3. Stateful tokens:
      - Least API change; ABI-stable change.
      - Less flexible; might require user code changes.
2. Could we mix some of the approaches to guarantee user friendliness, fewer changes, and ABI stability?
   1. WrappedFileSystem in TF core, stateful tokens in the API between TF core and the filesystem plugins.
   2. These two approaches counter the cons of the original proposal; we don't lose anything from the original proposal by doing this.
3. The goal is to make additions to the API, not break it. We can insert compatibility shims where needed.

Conclusions:

1. Amend the proposal with the mix of the last two proposals, then get a quick approval on the PR.
2. We need a migration schedule to ensure this lands before TF fully converts to modular filesystems.
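The stateful-token option at the TF-core/plugin boundary can be sketched as follows (hypothetical names, not the actual API): the token lives inside the filesystem object, so existing method signatures, and hence the plugin ABI, stay unchanged.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the "stateful token" option discussed above.
struct TransactionToken {
  std::string id;
};

class StatefulFs {
 public:
  // The token is set once on the filesystem object...
  void SetTransaction(TransactionToken token) { token_ = token; }
  // ...and existing signatures pick up the active token implicitly.
  // (Returns a tagged path here purely to make the effect observable.)
  std::string WriteFile(const std::string& path) {
    return token_.id.empty() ? path : token_.id + ":" + path;
  }

 private:
  TransactionToken token_;
};
```

A core-side WrappedFileSystem would then manage when SetTransaction is called, keeping the user-facing API friendly while the plugin boundary stays ABI-stable.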

Contributor

@mihaimaruseac mihaimaruseac left a comment

LGTM.

Let's add an example of how the entire checkpointing operation would be transformed (all relevant layers), and then we can merge this.

@ematejska

Looks like the example was added. @samikama, are we ready for the final LGTM from @mihaimaruseac and @alextp?

Contributor

@mihaimaruseac mihaimaruseac left a comment

LGTM. Thank you for the example

@alextp
Contributor

alextp commented Jul 15, 2020

LGTM

@samikama
Contributor Author

Thanks all for your feedback and help on this. I will start opening PRs once @ematejska finalizes the procedure.

@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Jul 15, 2020
@ematejska ematejska merged commit 5f3e443 into tensorflow:master Jul 15, 2020
@ematejska

It is all finalized :)

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Apr 27, 2021
PiperOrigin-RevId: 370587801
Change-Id: I04f1b2e30be7ef010a4445aaf657ed0de70e3af6
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Apr 27, 2021
PiperOrigin-RevId: 370681107
Change-Id: I06d48c974357b494f875b6f1247dd4032119e9f9