Message-ID: <CAHWVdUXwzLoeqGjanWYcwQvUqZKJnQnuiSKx=_PjDOz4t1Ox7g@mail.gmail.com>
Date: Wed, 13 Feb 2019 10:30:47 -0600
From: Vijay Chidambaram <vijayc@...xas.edu>
To: Andreas Dilger <adilger@...ger.ca>
Cc: linux-ext4@...r.kernel.org, jesus.palos@...xas.edu,
Theodore Tso <tytso@....edu>
Subject: Re: Selective Data Journaling in ext4

On Mon, Feb 11, 2019 at 7:25 PM Andreas Dilger <adilger@...ger.ca> wrote:
>
> On Feb 11, 2019, at 5:14 PM, Vijay Chidambaram <vijayc@...xas.edu> wrote:
> >
> > Hi all,
> >
> > We would like to present an idea to improve the performance of data
> > journaling in ext4. Data journaling is expensive because data is
> > written twice: once to the journal and once to the actual file system.
> > Passing data through the journal provides consistency guarantees that
> > ordered journaling mode cannot provide (for example, data journaling
> > prevents a data block from being partially written).
> >
> > The idea behind Selective Data Journaling is simple: create a new
> > journaling mode by modifying ordered journaling mode to journal data
> > blocks which are already part of a file. Data blocks which are newly
> > allocated are not part of the journal, and are written out before the
> > journal blocks in accordance with ordered mode's ordering guarantees.
> > If there is a crash before transaction commit, the only side effect is
> > un-allocated data blocks getting written with new data.
> >
> > Selective Data Journaling provides a lot of the benefits of data
> > journaling, at significantly lower cost. For workloads which mostly
> > deal with new data blocks (any applications which update files via
> > atomic rename), Selective Data Journaling can increase performance
> > significantly.
>
> One major caveat here is that files are *very rarely* overwritten in
> place. This is mostly useful for database-type workloads, and most
> databases already have their own transaction journal independent of
> the filesystem journal, so AFAIK this would not be a very widely-used
> feature.

Agreed, but another way to view this feature is that it is dynamic
switching between ordered mode and data journaling mode. We switch to
data journaling mode exactly when it is required, so you are right
that most applications would never see a difference. But when it is
required, this scheme would ensure stronger semantics are provided.
Overall, it provides data-journaling guarantees all the time, and I
was thinking some applications would like that peace of mind.
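Seen that way, the per-block choice is small enough to sketch in a few
lines of C. This is only an illustration of the idea, not a patch: the
names sdj_path, sdj_choose_path and sdj_count_journaled are made up
here and are not ext4 or jbd2 interfaces, and the was_allocated flag
stands in for whatever the real block-mapping lookup would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration only; none of this is ext4 code. */
enum sdj_path {
    SDJ_WRITE_IN_PLACE, /* ordered-mode path: data is written to its final
                           location before the transaction commits */
    SDJ_JOURNAL_DATA    /* data-journaling path: the data block is logged */
};

/*
 * Choose the path for one data block.
 * was_allocated: the block already belonged to the file before this write
 * (an overwrite), as opposed to being newly allocated by it.
 */
static enum sdj_path sdj_choose_path(bool was_allocated)
{
    return was_allocated ? SDJ_JOURNAL_DATA : SDJ_WRITE_IN_PLACE;
}

/* Count how many blocks of one write would pass through the journal. */
static size_t sdj_count_journaled(const bool *was_allocated, size_t nblocks)
{
    assert(was_allocated != NULL || nblocks == 0);

    size_t journaled = 0;
    for (size_t i = 0; i < nblocks; i++) {
        if (sdj_choose_path(was_allocated[i]) == SDJ_JOURNAL_DATA)
            journaled++;
    }
    return journaled;
}
```

For a four-block write where only the first and last blocks were
already part of the file, sdj_count_journaled() reports 2: those two
blocks go through the journal, and the two new blocks are written in
place ahead of the commit.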
> That said, a related, but IMHO much more useful form of selective data
> journaling would be for "random IOPS" workloads, where there may be
> many small writes either to a single file or to many small files, and
> this IO could be aggregated and optimized with fast linear writes to
> the journal, possibly on a separate flash device. That avoids a lot
> of seeks for the main filesystem device for small IO (which would
> otherwise be IOPS limited and not bandwidth limited, so the double data
> writes are not a limiting factor), while allowing large writes to go
> directly to the filesystem device and avoid the double writes (which
> would otherwise reduce IO bandwidth by half).
>
> Since we already have delalloc to pre-stage the dirty pages before the
> write, we can make a good decision about whether the file data should
> be written to the journal or directly to the filesystem.
>
> This could likely leverage the work that was already done for SMR journal
> mode (Ted has patches, and I think they are available online as well),
> and hopefully integrate both those patches and this new work into mainline
> ext4.
>
> I'm happy to discuss this further if you are interested.

We like this idea as well, and would be happy to work on it! To make
sure we are on the same page, the proposal is to:
- Identify whether writes are sequential or random (1)
- Send random writes to the journal if Selective Data Journaling is enabled (2)

How should we do (1)? Also, would it make sense to do this per-file
instead of as a mode for the entire file system? I am thinking of
opening a file with a new flag (say, O_SDJ) so that random writes to
that file are turned into sequential journal writes, which should
increase performance.