This repository was archived by the owner on Jul 10, 2025. It is now read-only.

RFC: Introducing Transactions extension to Modular Filesystems #245

Merged
merged 7 commits into from
Jul 15, 2020

Conversation

samikama
Contributor

@samikama samikama commented May 13, 2020

cc @samikama @mihaimaruseac

Transactional File Systems Support

Status Accepted
RFC # 245
Author(s) Sami Kama ([email protected])
Sponsor Mihai Maruseac ([email protected])
Updated 2020-05-05

Objective

The aim of this RFC is to extend filesystem access support to persistent storage systems that
provide transactional access and eventual consistency.

@vnghia

vnghia commented May 14, 2020

Hello, I am currently implementing the s3 and gcs filesystem plugins for TensorFlow. Currently, we use the approach of "prefetching the data to the local system and then uploading it after processing". I have just skimmed through your RFC, but I think this may increase the number of requests made to the cloud server? We chose prefetching to minimize the requests made to the cloud server.

@samikama
Contributor Author

samikama commented May 14, 2020

Hi,
Thanks for the first comment :). The prefetching approach has several limitations. First, it delays the start of processing because you have to wait for the data to arrive at the host. Second, there are size limitations: some domains have hundreds of TB of data, and it would be unfeasible, if not impossible, to prefetch it all to the host. Likewise, when uploading, it is possible to run out of disk space before processing finishes. Even though some of these issues can be mitigated by provisioning larger resources, such an approach is likely to increase costs and place an unnecessary burden on users.

@vnghia

vnghia commented May 14, 2020

I agree that it has some limitations; however, as far as I know, cloud filesystems like s3 and gcs do not allow objects to be modified. Once we upload something to the cloud server, the object becomes immutable. So in order to implement WritableFile, we first write to a temporary file and upload it to the cloud when the user calls Sync. A similar thing holds for AppendableFile: we have to prefetch the file to a temporary local file, append to it, and overwrite the remote object when uploading to the server.

@samikama
Contributor Author

Of course there are workarounds for such issues, but they are not necessarily optimal solutions for all cases. Transactions allow users to express intent. For example, consider writing to a file and then conditionally appending to it. With the mechanism you describe, the file will be uploaded to the cloud when it is closed after write(). Then, if the append branch is taken, it will be downloaded, appended to, and uploaded again. With transactions, however, the user can close the transaction after the append() decision and operation have finished, which hints to the file system to upload only once the transaction is done, saving time, storage, and bandwidth. If you keep the file on the local system until the process is finished, there is a chance that it fails down the road, losing all the work that went into producing those files. Also, in the inference case, TF might be producing files that are being uploaded for consumption by other processes/services, so keeping the file until the end of the process is a no-go under those circumstances. Finally, there are non-classical persistent storage systems other than gcs or s3 that may benefit from knowing the user's intent.
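To illustrate the intent (all names here are hypothetical illustrations, not the proposed API): a filesystem that defers uploads while a transaction is open performs a single upload on commit, whereas an eager implementation uploads on every close/sync.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch, not the proposed TF API: a cloud filesystem that
// uploads eagerly on every write/append unless a transaction is open,
// in which case the upload is deferred until EndTransaction().
class MockCloudFs {
 public:
  void Write(const std::string& data) { Touch(data); }
  void Append(const std::string& data) { Touch(data); }
  void StartTransaction() { in_txn_ = true; }
  void EndTransaction() {
    // Commit: a single upload covers all deferred modifications.
    if (in_txn_ && dirty_) { ++uploads_; dirty_ = false; }
    in_txn_ = false;
  }
  int uploads() const { return uploads_; }

 private:
  void Touch(const std::string& data) {
    buffer_ += data;
    if (in_txn_) dirty_ = true;  // defer the upload until commit
    else ++uploads_;             // eager upload per operation
  }
  std::string buffer_;
  bool in_txn_ = false;
  bool dirty_ = false;
  int uploads_ = 0;
};
```

With the eager scheme, write() followed by append() costs two uploads; inside a transaction, the same sequence costs one.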

@vnghia

vnghia commented May 14, 2020

You can take a look at what I am doing here: GCS plugin, S3 plugin.

Basically, I keep a std::shared_ptr<std::fstream> in each struct.

```cpp
namespace tf_random_access_file {

typedef struct S3File {
  const char* bucket;
  const char* object;
  const std::shared_ptr<Aws::Transfer::TransferManager>& transfer_manager;
  // FstreamWithName is a child class of `std::fstream`
  // with a member variable `name` generated randomly,
  // something like tensorflow_tmp_abcdef_tmp_s3_filesystem,
  // to prevent it from being used by another process;
  // the file gets deleted when it goes out of scope.
  std::shared_ptr<FstreamWithName> temp_file;
} S3File;

}  // namespace tf_random_access_file
```

> For example, consider writing to a file and then conditionally appending to it. With the mechanism you describe, the file will be uploaded to the cloud when it is closed after write(). Then, if the append branch is taken, it will be downloaded, appended to, and uploaded again.

> If you keep the file on the local system until the process is finished, there is a chance that it fails down the road, losing all the work that went into producing those files. Also, in the inference case, TF might be producing files that are being uploaded for consumption by other processes/services, so keeping the file until the end of the process is a no-go under those circumstances.

It is true that I keep a temporary file until the user explicitly calls Cleanup() (Sync() and Flush() won't close the std::fstream; they just upload the file to the server).

> With transactions, however, the user can close the transaction after the append() decision and operation have finished, which hints to the file system to upload once the transaction is done, saving time, storage, and bandwidth.

So you mean transactions act like a signal to tell the cloud filesystem that everything is done, so the server can close the connection, port, and the connection stream, or something like this? As far as I know, gcs and s3 use curl HTTP GET/POST internally, and they do not work like this. If the transaction acts like a signal for TensorFlow to upload, I have implemented it in a way that Append() appends to a local temporary file, and the file is only uploaded to the cloud when Sync(), Flush() or Close() is called, so is there no need for a transaction?

> Also, there are non-classical persistent storage systems other than gcs or s3 that may benefit from knowing the user's intent.

Currently, TensorFlow only supports 3 non-classical persistent storage systems: hdfs, s3 and gcs. If we want to actually use this Transaction extension, we may have to add another filesystem. (I haven't touched hdfs yet, but I think it won't support this mechanism.)

@samikama samikama changed the title Introducing Transactions extension to Modular Filesystems RFC [RFC] Introducing Transactions extension to Modular Filesystems RFC May 15, 2020
```cpp
class Filesystem {
  // Transaction Token API extensions
  virtual Status GetTransactionTokenForFile(const string& file_name, std::unique_ptr<TransactionToken>* token) = 0;
  virtual Status StartTransaction(const string& transaction_name, std::unique_ptr<TransactionToken>* token) = 0;
  // ...
};
```
Contributor

std::unique_ptr<TransactionToken>* is a weird type (a pointer to a unique pointer to TransactionToken).

I think the signature should just be TransactionToken*, and we should clearly separate methods that can create/destroy a TransactionToken from ones that just need a reference to it.

Contributor Author

Hi @alextp, thanks for the comments. I made them unique_ptr* to be in line with existing implementations. I don't have any special preference for it; I will change it to TransactionToken*.


The transaction token will be owned by the filesystem, and use of it after `EndTransaction` will be an invalid operation.

File classes can be modified to keep a TransactionToken, assigned by the filesystem at construction using the given scope, or the default scope if none is given. Filesystems may ignore it if a transaction at that level doesn't make sense.
Contributor

I don't think saying that filesystems may ignore transaction tokens is very useful, as this makes it impossible for code to work correctly on FSs with and without transaction tokens.

Contributor Author

I am not completely sure what you meant here. I believe not all file systems will make use of transactions. For example, for a RAM-based file system, most if not all transactions are meaningless. Such a file system can return dummy transaction tokens and ignore the transaction token pointers it is given. So all file systems will accept tokens, but not necessarily create or act on different transactions. I will try to add some examples to clarify this.

Contributor

If a file system is guaranteed to have no other users, then transactions are meaningless; otherwise, transactions are meaningful.

Like, even on my local disk, if I write a model checkpoint with N files and another process goes and deletes file 1 while I'm writing file N, I have an invalid checkpoint. Similarly, if I start writing a checkpoint with N files and another process tries to read from my local FS and sees the half-written checkpoint, it'll fail. Creating a no-op transaction token for my local filesystem would then let these issues go through, as opposed to actually doing the atomic writes we would do in networked filesystems if we do have transactions.

I think my question was more about the intent of the token: is it "please try to do a transaction but don't bother if you can't", or is it "I want to be able to rely on these things being atomic, and please tell me if something failed"?

Contributor Author

It would be really hard to implement the things you describe across processes, with or without transactions, since it would require some form of synchronization between any potential reader/writer combination. Since that is not possible, it is not the intent here. The purpose of transactions is to tell the filesystem that a set of operations are complete together. In your example, the transaction will cover the N file writes. For example, if another reader tries to read a file generated in a transaction before the transaction is finalized, the FS can block that thread. So transactions are hints to the file system saying "these N operations are meaningful together".
If the underlying persistent storage doesn't have atomic operation support, the only solution is to implement a blocking mechanism. I would say it is closer to the first interpretation than the second.

For the posix file example you gave, if multiple processes are using the same posix fs plugin, it may be possible to use lock files to mimic IPC. But if the consumer of the checkpoints is, for example, a copy/move script, not much can be done.

The existing filesystem C++ API can easily be expanded by the addition of three methods, an opaque structure, and possibly a helper class to support transactions.

```cpp
struct TransactionToken {
  // Opaque, filesystem-specific contents.
};
```
Contributor

One alternative, less intrusive API modification: instead of optionally passing transaction tokens to all FS methods, we can have a variant of GetFileSystemForFile, something like StartTransactionForFile, which instead returns a FileSystem that, when released, commits the transaction. Implicitly, all operations made on this file system are then part of the transaction until it is committed.

WDYT?
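A minimal sketch of this variant (BaseFs, TransactionalFs, and StartTransactionForFile are hypothetical names, not the actual TF API), assuming the returned filesystem commits when it is released:

```cpp
#include <cassert>
#include <memory>

// Hypothetical base filesystem that can commit a transaction.
struct BaseFs {
  int committed = 0;
  void Commit() { ++committed; }
};

class TransactionalFs {
 public:
  explicit TransactionalFs(BaseFs* base) : base_(base) {}
  ~TransactionalFs() { base_->Commit(); }  // commit when the wrapper is released
  // Every operation through this object is implicitly in the transaction.
  void WriteFile(const char* /*path*/) {}

 private:
  BaseFs* base_;
};

// Hypothetical analogue of a StartTransactionForFile-style entry point.
std::unique_ptr<TransactionalFs> StartTransactionForFile(
    BaseFs* base, const char* /*file_name*/) {
  return std::make_unique<TransactionalFs>(base);
}
```

The commit is tied to the wrapper's lifetime, so callers never handle a token explicitly.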

Member

@alextp +1 for non-intrusive APIs

Contributor Author

If I understood correctly, that is quite a brilliant idea. You are talking about binding the token to the FS object, at which point all FS operations done on that object belong to the same transaction. But since most existing code uses the Env class for file operations, which forwards them to the respective FS, it would require more changes in the code base: all code using Env for file system interaction would need to be changed to use the FS directly. If this is acceptable, I like your idea better.
In the C API, things can get a bit messier though, since it would mean constructing all TF_*Ops structures and a lot of gymnastics to get it right, unless we require that the C API still take tokens as proposed in this RFC while the C++ wrappers incorporate the logic you brought up. But that would mean no direct C API interaction with file systems in framework code.

Contributor

That's a good point; the naive, less intrusive approach I suggested won't work.

I don't have a rejoinder now, but I'll try to think of other non-intrusive or less intrusive approaches.

Contributor

Just a note: Core TF would never directly call the filesystem C API outside of the C++ wrapper.

Contributor

We can block the operation that attaches a file to a token if the file is already attached to another token. This ensures blocking behavior and delays overwrites.

Maybe this is actually the better implementation for concurrent transactions on the same file.

Contributor Author

In that case, should the filesystem check whether the attached object is a directory or a file? What happens in the case of directories? Also, we have to make sure that appending to a file uses the same token if done in another method. This may require a unique, filesystem-based token key to be used when starting a transaction, which would return the existing token if there is still an ongoing transaction with the same key, or a new transaction if there isn't.

I am OK with this approach, but all of this will start adding quite complicated layers of bookkeeping to the filesystems, which may have an impact on certain operations or filesystems.

Contributor Author

Do we prefer keeping the API unchanged at the cost of complicating the implementations?

Contributor

Adding the new parameter is intrusive and requires more changes than adding a declarative API to create a token and then attach objects to it. The modular filesystem design does not allow subclassing and only creates one operations table per URI scheme, so core TF won't be able to pick between a subclass that has tokens and one that ignores them.

Contributor Author

But if we relax subclassing of the modular filesystem, or just make it something like a base subsystem and let users always interact through the wrapped filesystem, it won't be an issue, right? The op table can contain the API with the tokens, but the wrapper may keep them optional. There will still be one op table per URI and one filesystem. Users will access it through a filesystem pointer instead of Env::Default(), and that would probably be the only difference. This would be done purely on the FW side, and plugins would only implement the API.


For the newly proposed filesystem plugin mechanism, two possible approaches exist. For the `TF_RandomAccessFile`, `TF_WritableFile`, and `TF_ReadOnlyMemoryRegion` structures,

- Opaque pointers stay as is, so no changes are needed in the structures. Each filesystem then attaches tokens to its own internal structures pointed to by `void*`.
Contributor

I prefer keeping the opaque pointers.


Transactions can also help with some filesystem access inconsistencies that can happen while reading and writing checkpoints. For example, while one thread is reading files from a directory, another may be modifying the underlying files. This could lead to the reader seeing an inconsistent or corrupt set of files. With transactions, each thread can have a different transaction token, and the underlying file system can choose to postpone modification of the files by redirecting writes to a temporary location and then moving them into place when the transaction ends.
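The postpone-and-move behavior described above could look like the following sketch (hypothetical names; not the TF implementation): writes land in a staging area while the transaction is open, and all files become visible together on commit.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative sketch only: while a transaction is open, writes are
// redirected to a staging area; EndTransaction() moves them into place,
// so readers never observe a partially written set of files.
class StagingFs {
 public:
  void Write(const std::string& name, const std::string& data) {
    staged_[name] = data;  // redirected to a temporary location
  }
  void EndTransaction() {
    for (const auto& kv : staged_) visible_[kv.first] = kv.second;
    staged_.clear();  // all files become visible together
  }
  bool Visible(const std::string& name) const {
    return visible_.count(name) != 0;
  }

 private:
  std::map<std::string, std::string> staged_;
  std::map<std::string, std::string> visible_;
};
```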

## User Benefit
Contributor

One question I have is how we expect FSs with transactions to coexist with FSs without transactions.

Do we expect only user-written python code to be made aware of transactions, or do we also expect opkernel C++ code to be transaction-aware? I ask because we do a lot of IO inside opkernels today (summary writing, checkpointing, data reading, etc.), and I'd like to understand whether we plan to modify all of this to use transactions, and how.

Member

The C++/C code change will also impact the existing external file system plugins that live outside of the TensorFlow core repo. The tensorflow/io repo has several: azfs (Microsoft Azure), Apache Ignite, OSS, and HTTP. File system plugins also exist in many other places outside the tensorflow org.

Wondering if there is a non-intrusive method that would not break implementations in many places. cc @mihaimaruseac

Contributor Author

Since this proposal is an extension to the Modular File Systems RFC, it is expected that all filesystems use the API; however, as I mentioned above, this may be as little as adding an extra pointer to their methods and ignoring the passed-in value. Filesystems should not be forced to act on transactions and are allowed to ignore them or use a default transaction.

@yongtang Since all file systems are expected to migrate to the plugin API, I don't think an API change is avoidable, and transaction tokens can easily be included at that time.

The second point, about making the C++ code transaction-aware, is unfortunately more intrusive. I can think of two options there:

  • Transactions are initiated before the ops, so that the ops can call GetTransactionTokenForFile (or the equivalent from your proposal). This probably needs to be achieved by adding transaction begin/end ops to the graph, possibly by the python layer that inserts the C++ ops.

  • Transaction tokens are made inputs and outputs of these C++ ops, created by upstream begin-transaction ops and finalized by downstream end-transaction ops.

@alextp A quick search didn't turn up many ops; are you aware of any other ops doing file I/O besides save_restore_v2_ops?

Member

@samikama ReadFile and WriteFile are some of the C++ kernel ops, mostly in core/ops/io_ops.cc. But tf.data also has quite a few ops that touch the file system API implicitly. As those C++ kernel op APIs are mostly frozen, the input args of the ops may not be easily changeable (version compatibility).

@alextp
Contributor

alextp commented May 20, 2020 via email

@samikama
Contributor Author

Hi all,

I updated the proposal, incorporating some of the changes discussed above and the approaches from @mihaimaruseac and @alextp, with a trivial example of how it would be modified, for further discussion.

Contributor

@mihaimaruseac mihaimaruseac left a comment

For the stateful tokens, the existing API methods should not have the token in the implementation

@ematejska ematejska changed the title [RFC] Introducing Transactions extension to Modular Filesystems RFC RFC: Introducing Transactions extension to Modular Filesystems Jun 9, 2020
@ematejska ematejska added the RFC: Proposed RFC Design Document label Jun 9, 2020
@mihaimaruseac
Contributor

Notes from the design meeting:

1. Three proposals discussed:
   1. Original: the user creates a token and carries it around.
      - Could this break backwards compatibility? The demo implementation says no (there are default empty values for the token argument).
      - User-unfriendly; requires changes to the op graph (new ops, new inputs).
   2. WrappedFileSystem:
      - Returned transparently to the user; hides the token inside the class implementation.
      - User-friendly, but still needs changes to the op graph.
   3. Stateful tokens:
      - Least API change; ABI-stable change.
      - Less flexible; might require user code changes.
2. Could we mix some of the approaches to guarantee user friendliness, fewer changes, and ABI stability?
   1. WrappedFileSystem in TF core, stateful tokens in the API between TF core and the filesystem plugins.
   2. These two approaches counter the cons of the original proposal; we don't lose anything from the original proposal by doing this.
3. The goal is to make additions to the API, not break it. We can insert compatibility shims where needed.

Conclusions:

1. Amend the proposal with the mix of the last two proposals, then get a quick approval on the PR.
2. We need a migration schedule to ensure this lands before TF fully converts to modular filesystems.
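The stateful-token option at the TF-core/plugin boundary can be sketched as follows (hypothetical names, not the actual API): the token lives inside the filesystem object, so existing method signatures, and hence the plugin ABI, stay unchanged.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the "stateful token" option discussed above.
struct TransactionToken {
  std::string id;
};

class StatefulFs {
 public:
  // The token is set once on the filesystem object...
  void SetTransaction(TransactionToken token) { token_ = token; }
  // ...and existing signatures pick up the active token implicitly.
  // (Returns a tagged path here purely to make the effect observable.)
  std::string WriteFile(const std::string& path) {
    return token_.id.empty() ? path : token_.id + ":" + path;
  }

 private:
  TransactionToken token_;
};
```

A core-side WrappedFileSystem would then manage when SetTransaction is called, keeping the user-facing API friendly while the plugin boundary stays ABI-stable.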

Contributor

@mihaimaruseac mihaimaruseac left a comment

LGTM.

Let's add an example of how the entire checkpointing operation would be transformed (all relevant layers), and then we can merge this.

@ematejska

Looks like the example was added. @samikama, are we ready for the final LGTM from @mihaimaruseac and @alextp?

Contributor

@mihaimaruseac mihaimaruseac left a comment

LGTM. Thank you for the example

@alextp
Contributor

alextp commented Jul 15, 2020

LGTM

@samikama
Contributor Author

Thanks all for your feedback and help on this. I will start opening PRs once @ematejska finalizes the procedure.

@ematejska ematejska added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Jul 15, 2020
@ematejska ematejska merged commit 5f3e443 into tensorflow:master Jul 15, 2020
@ematejska

It is all finalized :)

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Apr 27, 2021
PiperOrigin-RevId: 370587801
Change-Id: I04f1b2e30be7ef010a4445aaf657ed0de70e3af6
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Apr 27, 2021
PiperOrigin-RevId: 370681107
Change-Id: I06d48c974357b494f875b6f1247dd4032119e9f9