Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a new storage specific Env API #5761

Closed
wants to merge 4 commits into from

Conversation

anand1976
Copy link
Contributor

@anand1976 anand1976 commented Aug 30, 2019

The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both options.env and options.file_system to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a CompositeEnvWrapper has been introduced that inherits from Env and redirects individual methods to either an Env implementation or the FileSystem as appropriate. When options are sanitized during DB::Open, options.env is replaced with a newly allocated CompositeEnvWrapper instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The CompositeEnvWrapper translates IOStatus return code to Status, and sets the severity to kSoftError if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The basic structure looks good to me! It's very hard to have confidence unless seeing an implementation. I suggest we hold the PR from being merged, and merge it together when the PR in which PosixEnv is moved to this new interface.

std::string file_path;

// To be set by the FSEnv implementation
std::string msg;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something tricky to design. Ideally, we want something that can be added up for multiple calls. For the same type of latency, for example, they can go together. But there might be something that is not counter-based. So this is a little bit tricky. Maybe both of a map<string, uint64> and a string msg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think a map is a good idea to pass back counters that can be used for debugging. We could have a map<string, string> and require the env implementation to format counters as string?

static const char* Type() { return "Environment"; }

// Loads the environment specified by the input value into the result
static IOStatus LoadEnv(const std::string& value, FSEnv** result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is supposed to implement this function? For example, if a user implements HDFSEnv, how it is supposed to return from here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a generic implementation of Env::LoadEnv that instantiates an Env object from the ObjectRegistry. I guess we need to do the same for FSEnv.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think the LoadFSEnv should return a std:::shared_ptr *. The use of the static pointers is problematic from a memory management perspective, IMO.

// The returned file will only be accessed by one thread at a time.
virtual IOStatus NewSequentialFile(const std::string& fname,
const IOOptions& io_opts,
std::unique_ptr<FSSequentialFile>* result,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output parameters after input ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, thanks for pointing it out. I tried to follow that in general, but missed a few cases.

// These values match Linux definition
// https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fcntl.h#n56
enum WriteLifeTimeHint {
WLTH_NOT_SET = 0, // No hint information set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming as kWlthNotSet

// IO_LOW - Typically background reads/writes such as compaction/flush
// IO_HIGH - Typically user reads/synchronous WAL writes
enum class IOPriority : uint8_t {
IO_LOW,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming as kIoLow or kIOLow.

@@ -18,31 +18,16 @@

#include <string>
#include "rocksdb/slice.h"
#ifdef OS_WIN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OS_WIN is a RocksDB build-time flag and may not be available when projects are built on top of RocksDB. This should not be used under include/rocksdb.

static Status CompactionTooLarge(const Slice& msg,
const Slice& msg2 = Slice()) {
return Status(kCompactionTooLarge, msg, msg2);
static T CompactionTooLarge(const Slice& msg, const Slice& msg2 = Slice()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that one point of having a separate status is to have a separate list of messages. For example, this compaction-too-large is not applicable to IOStatus. Having them sharing the same template with the same set of messages defeat the purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason for introducing IOStatus was to keep the metadata separate. Status has severity, which it doesn't make sense for the FSEnv implementation to try to interpret, and IOStatus has its own metadata with doesn't apply to user facing RocksDB APIs. We could separate out the error codes too, but there will be a fair amount of overlap, and I was thinking that it could be patterned after errno, which is a superset of all system call error codes.


namespace rocksdb {

class Status {
template <class T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RocksDB code base has very limited use of template (which follows the guideline of Google C++ Style https://google.github.io/styleguide/cppguide.html#Template_metaprogramming ), and I don't remember any public interface uses template. This makes the public interface harder to understand. I suggest we think twice before using template here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I wanted to reuse the code/subcodes and template seemed to be the way to do that. But on further reflection, it allows us to reuse the values but the names have to be qualified by classname. Maybe we can use Status::Code and Status::SubCode directly in IOStatus? That way, we can have a single set of error codes, but have a distinct set of named constructors in IOStatus to indicate what's allowable.

@anand1976 anand1976 modified the milestone: 7.0.0 Sep 11, 2019
Copy link
Contributor

@mrambacher mrambacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to review without the corresponding changes to PosixEnv et al.

This is also listed as issue: #4746


virtual ~FSEnv();

static const char* Type() { return "Environment"; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be something other than "Environment", like "FSEnvironment", to prevent it from clashing with the Env value.

// These are hints and are not necessarily guaranteed to be honored.
// More hints can be added here in the future to indicate things like
// storage media (HDD/SSD) to be used, replication level etc.
struct IOOptions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see where you might want this on a per-file basis, but why do you want them on a per-request basis? Can you give an example where one read/write from a file would be a different priority than another to the same file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reads, compaction vs user reads is one use case. The long term plan is to also allow users to specify a timeout when calling Get/MultiGet, and that can be passed through to the FSEnv.

For writes, we might want to assign different priorities/timeouts to background WAL flush vs a sync write requested by the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout might be a per-operation parameter, but the IOType is not likely to be (are you suggesting the same file might be used for different types of IO? I also doubt that read or write operations to a single IO file would have different priorities (this read his high but the next one is low does not make sense to me).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use a single TableReader object for user reads and compaction reads, so yes its possible to have different priority IOs to the same file. Even if we use a separate TableReader for compaction, I wonder if we might want to allow users to specify a priority. All user reads would go through the same file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use IOOptions for this though, rather than using the ReadOptions that are already present? Why add a new struct with new parameters rather than extending the one that already exists? It seems like there are an awful lot of places where you would need to update (SST reader/writer, trace methods, etc)

static const char* Type() { return "Environment"; }

// Loads the environment specified by the input value into the result
static IOStatus LoadEnv(const std::string& value, FSEnv** result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think the LoadFSEnv should return a std:::shared_ptr *. The use of the static pointers is problematic from a memory management perspective, IMO.

const IOOptions& io_opts,
std::unique_ptr<FSWritableFile>* result,
const EnvOptions& env_opts,
IODebugContext* dbg) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the IOOptions and EnvOptions be merged here? Why have both arguments?

Also, is there a reason to rename the file classes to FS* names. It seems like it will cause a lot of changes for not much gain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOOptions is per-IO, whereas EnvOptions is for file open. There's no overlap between the two.

Also, is there a reason to rename the file classes to FS* names. It seems like it will cause a lot of changes for not much gain.

Do you mean change the APIs but leave the class name unchanged? That would cause code compatibility issues.


// A structure to pass back some debugging information from the FSEnv
// implementation to RocksDB in case of an IO error
struct IODebugContext {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand the use of this struct. Why would there be a different one for every API call? Who is managing this class (it is usually passed in as a pointer -- can it be null? what is the lifetime, etc?)

Would this be better as something that is a member of the FSEnv?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I envisioned this being similar to PerfContext or IOStatsContext, i.e it could keep debugging counters for a single IO operation. I think its definition will have to go through a couple of iterations before it becomes meaningful. I will update the comments to clarify allocation, lifetime etc.


namespace rocksdb {

class IOStatus : public BaseStatus<IOStatus> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the need for a new IOStatus class to add all of the extensions to Status. But would it be better to have IOStatus extend Status and possibly have a new "Code" in Status for IO errors? Then code could treat the return from a FS operation as either a Status or IOStatus if they cared about the specific sub-codes for retrying the errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting idea. I tried to do something similar here, except by defining a base class that both Status and IOStatus extend. I don't think either approach would alter the impact to the code. The IOStatus needs to be propagated up through the *FileReader/*FileWriter classes and converted to the appropriate Status by the caller based on the context. Status::severity would be set based on the scope and retryability of the error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an IOStatus extends a Status, then code like:
Status s =
and
IOStatus s =
could both be valid. As you have it defined, the Status and IOStatus are not related, and therefore not interchangeable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree this would reduce the code changes required. I'm concerned that by implementing it this way, it would be very easy for developers to ignore IOStatus. Maintaining the error handling features would be difficult. @siying Any thoughts on this?

Copy link
Contributor Author

@anand1976 anand1976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrambacher Thanks for the review! I plan to port PosixEnv to this new interface as a next step. I just wanted to propose an interface first and get some early feedback on the direction.

// These are hints and are not necessarily guaranteed to be honored.
// More hints can be added here in the future to indicate things like
// storage media (HDD/SSD) to be used, replication level etc.
struct IOOptions {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reads, compaction vs user reads is one use case. The long term plan is to also allow users to specify a timeout when calling Get/MultiGet, and that can be passed through to the FSEnv.

For writes, we might want to assign different priorities/timeouts to background WAL flush vs a sync write requested by the user.

const IOOptions& io_opts,
std::unique_ptr<FSWritableFile>* result,
const EnvOptions& env_opts,
IODebugContext* dbg) = 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOOptions is per-IO, whereas EnvOptions is for file open. There's no overlap between the two.

Also, is there a reason to rename the file classes to FS* names. It seems like it will cause a lot of changes for not much gain.

Do you mean change the APIs but leave the class name unchanged? That would cause code compatibility issues.


namespace rocksdb {

class IOStatus : public BaseStatus<IOStatus> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting idea. I tried to do something similar here, except by defining a base class that both Status and IOStatus extend. I don't think either approach would alter the impact to the code. The IOStatus needs to be propagated up through the *FileReader/*FileWriter classes and converted to the appropriate Status by the caller based on the context. Status::severity would be set based on the scope and retryability of the error.


// A structure to pass back some debugging information from the FSEnv
// implementation to RocksDB in case of an IO error
struct IODebugContext {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I envisioned this being similar to PerfContext or IOStatsContext, i.e it could keep debugging counters for a single IO operation. I think its definition will have to go through a couple of iterations before it becomes meaningful. I will update the comments to clarify allocation, lifetime etc.

// To be set by the FSEnv implementation
std::string msg;

IODebugContext(IOOptions& options) { opts = options; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a IODebugContext is created from an IOOptions, why does every API require both? Can we do with just one of these?

Copy link
Contributor

@riversand963 riversand963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @anand1976 for the PR. I have taken a pass with a few comments.

// of the APIs is of type IOStatus, which can indicate an error code/sub-code,
// as well as metadata about the error such as its scope and whether its
// retryable.
class FSEnv {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name FSEnv somehow tends to lead to me thinking of it as a kind of Env, which is not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that comment. I was wondering why not call it simply "FileSystem"

virtual IOStatus NewSequentialFile(const std::string& fname,
const IOOptions& io_opts,
std::unique_ptr<FSSequentialFile>* result,
const EnvOptions& env_opts,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

env_opts should go before result.

IODebugContext* dbg) = 0;
// These values match Linux definition
// https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fcntl.h#n56
enum WriteLifeTimeHint {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using enum class in our codebase, and converting to Linux-compatible C-style enums?

std::unique_ptr<FSRandomRWFile>* /*result*/,
const EnvOptions& /*options*/,
IODebugContext* /*dbg*/) {
return IOStatus::NotSupported(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why NewRandomRWFile() returnes NotImplemented while NewWritableFile() is a pure virtual function? Does it imply that NewWritableFile() is something that an FS must support, while NewRandomRWFile() is not?

const EnvOptions& env_options,
const ImmutableDBOptions& db_options) const;

// This seems to clash with a macro on Windows, so #undef it here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably use ifndef around this undef.


private:
// No copying allowed
FSEnv(const FSEnv&);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we can use =delete to disallow copy.

void SetDataLoss(bool data_loss) { data_loss_ = data_loss; }
void SetScope(IOErrorScope scope) { scope_ = scope; }

bool retryable() { return retryable_; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: bool retryable() const

@anand1976
Copy link
Contributor Author

@siying @mrambacher @riversand963 I have updated the PR with a port of PosixEnv. I've also incorporated your comments and feedback. The PR description has also been updated with the basic design. PTAL.

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How hard is it to modify the DB code to directly use the new interface?
The way is done now introduces extra wrapper for every file system call, which I am not sure we are OK with. Also still we don't know how effective the new interface is because we are not directly using them in DB.

@anand1976
Copy link
Contributor Author

The extra wrapper for every file system call is a fair point. I think this can be solved by inlining CustomEnvWrapper. In CustomEnvWrapper, it can fall back to Env if no FileSystem has been specified.

@anand1976
Copy link
Contributor Author

@siying Your other point about not knowing how effective is the new interface is more complicated. In this PR, I have focused on basic error recovery for retryable errors. The DB should be able to automatically recover (ignoring the scope of the error) from errors that have io_status.retryable() set to true (ENOSPC being the only one from Posix as of now). It should work transparently for other such errors.

I don't think we will know the effectiveness of the other aspects of the new interface, such as timeouts and priority, until there is an end to end implementation (including a functioning FileSystem implementation from WS). So I think modifying the DB to specify timeout/priority for IO operations should be deferred to a follow-on PR. In the meantime, it should be made clear that the new interface is still experimental, in order to discourage developers from porting their Env implementations to it right away. Thoughts?

@siying
Copy link
Contributor

siying commented Oct 15, 2019

@anand1976 I don't think we can easily inline virtual functions. More importantly, this is not how we want it to eventually look like. Eventually, we want the new API used everywhere, and legacy Env implementation will be wrapped as the new one. So I'm curious how hard it is to skip the reverse wrapped stage and just do what we want to do eventually now.

@siying
Copy link
Contributor

siying commented Oct 15, 2019

@anand1976 I totally agree that we should delay the implementation of all the timeout, io priority, etc. I totally support this plan. Even the retryable can be delayed. My comment is not about having those implemented, but about which interface is used. If DB still uses the old Env interface, we do not really see how the new interface is used. I'm not talking about the new options like timeout, etc, but the functionalities already in use. I know it is very time consuming to go through all the code base and chase where they are used. But this is needed anyway. By going through it earlier, we can also look at whether the interface is a good one for all those places in longer term, which is helpful to avoid changing the interface again.

@anand1976
Copy link
Contributor Author

The core RocksDB code reads/writes to files through RandomAccessFileReader/WritableFileWriter. It should be relatively straightforward to modify those two to use the new interfaces, as well as deal with the ripple effects (since they would need to be constructed with FSRandomAccessFile/FSWritableFile). This would make it easier in the future to add support in RocksDB for fine-grained error handling, timeouts and IO prioritization.

Copy link
Contributor

@mrambacher mrambacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anand1976 @siying I see a couple different things here:

  1. The new methods take a few additional arguments (IOOptions, IODebugContext). I am not sure what the point is of these arguments -- why not extend EnvOptions for IOOptions? And if you need a debug context, why not make it part of the constructor rather than the individual methods? I do not see what the usage pattern is for this argument yet.

  2. I do not see the need for the FSXXX classes and methods. If the goal is to replace the existing methods, why not just do that rather than create this additional wrapper layer?

  3. I do not understand the relationship between Env and FileSystem. I would have thought that an Env has a FileSystem in it but this appears not to be the case. Additionally, I am concerned about the lifetime of a FileSystem. It seems like it needs to be a std:shared_ptr, which it is not currently (I think the same should be true of Env BTW).

// implementation instead of relying on this default file_system
//
// The result of Default() belongs to rocksdb and must never be deleted.
static FileSystem* Default();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this a std:shared_ptr.

Comment on lines 170 to 171
const std::string& fname, const IOOptions& io_opts,
std::unique_ptr<FSRandomAccessFile>* result, const EnvOptions& env_opts,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not sure I understand why/how we need an IOOptions and an EnvOptions and an IODebugContext. It seems like one of these would do...

IOStatus status = IOError("Unable to close log file", "", ret);
if (!status.ok()) {
return Status::IOError("Unable to close log file", "");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand this code. How does the IOError get converted to an IOStatus and how does it ever translate to ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted this change. The return from IOError() can be directly returned.

Comment on lines 62 to 68
void SetRetryable(bool retryable) { retryable_ = retryable; }
void SetDataLoss(bool data_loss) { data_loss_ = data_loss; }
void SetScope(IOErrorScope scope) { scope_ = scope; }

bool retryable() const { return retryable_; }
bool data_loss() const { return data_loss_; }
IOErrorScope scope() const { return scope_; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make these more symmetrical? It does not seem like SetDataLoss and data_loss are the equivalent setter and getter...

Comment on lines 38 to 49
if (result.file_system != nullptr) {
result.env = new CompositeEnvWrapper(result.env, result.file_system);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't an Env have a FileSystem member rather than the FileSystem bubbling up to a higher level?

@@ -0,0 +1,730 @@
// Copyright (c) 2019-present, Facebook, Inc. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this all necessary because you do not want to change the Env classes? Why not?

I am a little confused as to why you are not replacing the existing implementation of the File classes, rather than wrapping them. I understand there might be some potential API compatibility issues, but that seems like something that could be worked out in the design.

Comment on lines +576 to +554
IOStatus GetChildren(const std::string& dir, const IOOptions& /*opts*/,
std::vector<std::string>* result,
IODebugContext* /*dbg*/) override {
result->clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think every code path in the FS Posix does not use the IOOptions and IODebugContext, which makes it very hard to say that the function signatures you suggest are the ones that should be used. I am especially concerned of the IODebugContext * as I do not understand the expected behavior of this (is nullptr a valid value?) and it is not clear what the lifetime of this object is.

I would prefer having something that is specified when the FileSystem or File object is created, rather than on every function signature. Unless you can tell me why one would use a different argument for different calls for the same file object...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifying something when the File object is created would not work for retrieving fine grained statistics for a single operation. The IODebugContext is somewhat similar in concept to the IOStatsContext, which records some statistics for each IO call, but intended to provide more visibility into the Env which IOStatsContext does not. And unlike IOStatsContext, we want to move away from using thread local storage and explicitly pass a pointer to the object.

@anand1976
Copy link
Contributor Author

@mrambacher

The new methods take a few additional arguments (IOOptions, IODebugContext). I am not sure what the point is of these arguments -- why not extend EnvOptions for IOOptions? And if you need a debug context, why not make it part of the constructor rather than the individual methods? I do not see what the usage pattern is for this argument yet.

The scope of EnvOptions and IOOptions is different. EnvOptions is for file creation, whereas IOOptions is for individual file operations. For the New*File methods, I agree IOOptions and EnvOptions can be combined (maybe the latter can embed the former). IODebugContext is envisioned to be used for per file request tracing. It cannot be embedded in the file object since the file access is multi-threaded.

I do not see the need for the FSXXX classes and methods. If the goal is to replace the existing methods, why not just do that rather than create this additional wrapper layer?

Replacing the existing methods would break backward compatibility. There are a number of users with custom Env implementations and requiring changes in them would complicate things. We could overload the methods, but we wouldn't be able to define the new ones as pure virtual (doing so would break compatibility).

I do not understand the relationship between Env and FileSystem. I would have thought that an Env has a FileSystem in it but this appears not to be the case. Additionally, I am concerned about the lifetime of a FileSystem. It seems like it needs to be a std:shared_ptr, which it is not currently (I think the same should be true of Env BTW).

Agree about FileSystem being a shared_ptr. There's technically no relationship between Env and FileSystem. Going forward, we'd like to have the ability to use multiple FileSystems in the same DB - perhaps for tiering or some other use case.

@anand1976
Copy link
Contributor Author

@siying PTAL. The new commit ports most of the core RocksDB code to use the new API. The RandomAccessFileReader, SequenceFileReader and WritableFileWriter classes call the new API. Tools and tests still use the old API, either directly or through wrapper classes where needed. There may be some small cleanup needed, which I'll do once we agree on the basic structure of the code.

@mrambacher
Copy link
Contributor

I know I sound like a broken record here, but strongly believe that the additional arguments to the methods is going to be confusing and add little value. I would much prefer that the IOOptions and EnvOptions have some is-a relationship to one another, rather than passing two in.

As far as the relationship between Env and FileSystem, I still think there is a relationship. At the moment, a large percentage of the methods of Env are FileSystem related. I think an Env has three components -- FileSystem, Threads, and Clock/System. I do not believe that having an Env has-a FileSystem causes any future problems with multiple FileSystems in a single DB.

Finally, I also think that the IODebugContext should be a std::shared_ptr and should ONLY be an argument to the NewXXXFile methods. I have a hard time envisioning when you would like to debug specific calls to read for specific files and how you would implement that. I think making it a call to the File constructor could be useful however. But I also believe it should be a std::shared_ptr to guarantee its lifefime.

I also think that all of the existing Env in the code base (HDFS, Memory, test ones) should be ported to this new API. That would help validate that the API is correct for more than a single FileSystem implementation.

@siying
Copy link
Contributor

siying commented Nov 7, 2019

@mrambacher the main motivation of the separation of Env and FileSystem is that, we plan to support multiple file systems and sharing some other resource like thread pool, timing, etc. So separating file system related functionality feels something we have to do. I agree that there is going to be a painful transition, but we can't think of a better way. Do you have any suggestion to solve this problem?

@mrambacher
Copy link
Contributor

@siying I am not sure how the issues are related.

If there are multiple FileSystems, there will need to be some mechanism of getting which one you want (like by name). So Env would probably have two methods:
GetFileSystem() -- returns the "default" one
GetFileSystem(const std::string & id) -- returns the file system based on the input "id".

If you intend to support multiples, something must be the container for storing the multiples, so why not the Env?

I guess one question is which layer knows which FileSystem to use. At a certain point, someone has to make a decision.

In the places that both an Env and FileSystem are required arguments, what is needed from the Env?

Another solution would be to invert it. Instead of an Env having a FileSystem, why not have a FileSystem have an Env? What from an Env is useful/important to a FileSystem?

@siying
Copy link
Contributor

siying commented Nov 7, 2019

@mrambacher I agree that getting a file system from a string may be a good idea. It is useful for many use cases. Where to put it is interesting though. I don't have an opinion on that.
We might also need to support users to provide FileSystem class without a globally name. This is the pattern RocksDB API has been using, block cache, write buffer manager, ThreadPool, etc. These shared resource is represented as C++ objects.

I don't think there are many places where Env and FileSystem need to be passed in together. In some entrance of major functionality, maybe, but I don't expect it will be common.

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The basic interface looks good to me. I'll try to do more detailed review today.

// The returned file will only be accessed by one thread at a time.
virtual IOStatus NewSequentialFile(const std::string& fname,
const IOOptions& io_opts,
const EnvOptions& env_opts,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to put EnvOptions here? Ideally we don't. I know consolidating the two options may be painful, but fow now, can we build a two way converter so that the API can be clean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scope of EnvOptions is a file, for the duration of the file being open. The scope of IOOptions is the specific IOs that would happen as part of file open. How about having a FileOptions with IOOptions embedded in it? Other methods such as FSRandomAccessFile::Read() can take IOOptions as an argument, while methods such as FieSystem::NewRandomAccessFile() can take a FileOptions as argument.

IO_ERROR_SCOPE_FILE_SYSTEM,
IO_ERROR_SCOPE_FILE,
IO_ERROR_SCOPE_RANGE,
IO_ERROR_SCOPE_MAX,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kIoErrorScopeFileSystem?

@dhruba
Copy link
Contributor

dhruba commented Nov 7, 2019

Can we overload the existing Env to encapsulate various combinations of file systems and operating systems.? Every Env is a combination of OperatingSystem + FileSystem + hardware.

If we can do that elegantly, then we do not have to introduce the FileSystem class to propagate to all the db code.

@siying
Copy link
Contributor

siying commented Nov 7, 2019

@dhruba eventually, we will need to allow one DB to store different files to different file systems. The list of file systems to use will belong to DB's configuration, not Env's.

@dhruba
Copy link
Contributor

dhruba commented Nov 7, 2019

@siying agreed. For example, when we had to store files to local xfs as well as AWS S3 fs, we created an AwsEnv. The code that implements this AwsEnv can then decide where to store which file based on its own criteria. Also, we overloaded EnvOptions so that we can add more config parameters for this decision making.

levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 18, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 18, 2021
…in BG jobs (#6487)

Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: facebook/rocksdb#6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 18, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 18, 2021
…nv (#6414)

Summary:
In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR facebook/rocksdb#5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller.

A set of ErrorHandlerFSTests are introduced for testing
Pull Request resolved: facebook/rocksdb#6414

Test Plan: pass make asan_check, pass error_handler_fs_test.

Differential Revision: D20252421

Pulled By: zhichao-cao

fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 18, 2021
…in BG jobs (#6487)

Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: facebook/rocksdb#6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
Summary:
In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR facebook/rocksdb#5761  introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory.
Pull Request resolved: facebook/rocksdb#6468

Test Plan: pass make asan_check

Differential Revision: D20195261

Pulled By: zhichao-cao

fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
…in BG jobs (#6487)

Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: facebook/rocksdb#6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
…in BG jobs (#6487)

Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: facebook/rocksdb#6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
…nv (#6414)

Summary:
In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR facebook/rocksdb#5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller.

A set of ErrorHandlerFSTests are introduced for testing
Pull Request resolved: facebook/rocksdb#6414

Test Plan: pass make asan_check, pass error_handler_fs_test.

Differential Revision: D20252421

Pulled By: zhichao-cao

fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21
Signed-off-by: Changlong Chen <levisonchen@live.cn>
mm304321141 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 23, 2021
…in BG jobs (#6487)

Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: facebook/rocksdb#6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 24, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 24, 2021
…in BG jobs (#6487)

Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: facebook/rocksdb#6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 24, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 24, 2021
…nv (#6414)

Summary:
In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR facebook/rocksdb#5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller.

A set of ErrorHandlerFSTests are introduced for testing
Pull Request resolved: facebook/rocksdb#6414

Test Plan: pass make asan_check, pass error_handler_fs_test.

Differential Revision: D20252421

Pulled By: zhichao-cao

fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 24, 2021
…in BG jobs (#6487)

Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: facebook/rocksdb#6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 25, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Jun 25, 2021
…in BG jobs (#6487)

Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: facebook/rocksdb#6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Sep 14, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Sep 14, 2021
…in BG jobs (#6487)

Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: facebook/rocksdb#6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Sep 14, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Sep 14, 2021
…nv (#6414)

Summary:
In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR facebook/rocksdb#5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller.

A set of ErrorHandlerFSTests are introduced for testing
Pull Request resolved: facebook/rocksdb#6414

Test Plan: pass make asan_check, pass error_handler_fs_test.

Differential Revision: D20252421

Pulled By: zhichao-cao

fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Sep 14, 2021
…in BG jobs (#6487)

Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: facebook/rocksdb#6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Sep 14, 2021
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.

This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.

The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.

This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.

The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: facebook/rocksdb#5761

Differential Revision: D18868376

Pulled By: anand1976

fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Signed-off-by: Changlong Chen <levisonchen@live.cn>
levichen94 pushed a commit to bytedance/terarkdb that referenced this pull request Sep 14, 2021
…in BG jobs (#6487)

Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of facebook/rocksdb#5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.

The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: facebook/rocksdb#6487

Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check

Reviewed By: anand1976

Differential Revision: D20685017

Pulled By: zhichao-cao

fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Signed-off-by: Changlong Chen <levisonchen@live.cn>
@tabokie tabokie mentioned this pull request May 12, 2022
39 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants