linux-ext4 - Re: [RFC] [PATCH] vfs: Call filesystem callback when backing device caches should be flushed

Message-ID: <20090121214132.GD16133@shareable.org>
Date:	Wed, 21 Jan 2009 21:41:32 +0000
From:	Jamie Lokier <jamie@...reable.org>
To:	Jan Kara <jack@...e.cz>
Cc:	linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>,
	Theodore Tso <tytso@....EDU>
Subject: Re: [RFC] [PATCH] vfs: Call filesystem callback when backing device caches should be flushed

Jan Kara wrote:
>   Well, that would be nice but you cannot return from fsync() until you've
> done the flush. So you have to be careful not to wait for too long. JBD
> actually plays these tricks with sync transaction batching and it's not
> trivial to get this right. So I'd rather avoid it.

Didn't extN for some N do/did something similar?

> > What about O_SYNC writes though?  A device flush after each one would
> > be expensive, but that's what equivalence to fsync() implies is
> > needed.
>   Yes.
> 
> > O_DIRECT writes shouldn't do block_flush_device(), but an app may
> > still need a way to commit data for integrity.  So fsync() or
> > fdatasync() called after a series of O_DIRECT writes should call
> > block_flush_device() _even_ though there's no page-cache dirty data to
> > commit, and even if there's no inode change to commit.
>   Hmm, this is an interesting point. You're right that we currently miss
> the flushes and we probably need some dirty inode flag like needs_flush or
> so.

Proposal (both together):

  1. per-device-queue flag needs_flush.

     Set when a write is queued, cleared when a flush is queued.  While
     clear, new flushes are discarded instead of being queued.  Waiting
     on a discarded flush instead waits for the last flush that was
     queued, if it's still in flight.  So the queue must also track
     that last flush.

  2. per-inode flag needs_flush.

     Set when a write from this file is queued (writeback), cleared
     when a flush is sent on behalf of this file (i.e. the thing
     fsync/fdatasync/O_SYNC should be calling).  As above, no flush is
     sent for this file while the flag is clear, and waiting on a
     discarded flush instead waits on the last flush sent for this
     file, if it's still in flight.  So the file tracks that last flush
     command in addition to needs_flush.
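The two flags can be modelled in a few lines of C.  This is a hedged
userspace sketch, not kernel code: `struct flush_queue`,
`struct file_state` and all the function names are hypothetical,
flushes are represented by sequence numbers, and "waiting" is reduced
to comparing a flush's sequence number with the last one completed.

```c
#include <stdbool.h>

/* Hypothetical per-device-queue state (not a real block-layer struct). */
struct flush_queue {
    bool needs_flush;            /* a write was queued since the last flush was queued */
    unsigned long last_queued;   /* sequence number of the last flush queued; 0 = none yet */
    unsigned long last_done;     /* sequence number of the last flush that completed */
    unsigned long flushes_sent;  /* flushes that actually went to the device */
};

/* Hypothetical per-inode state. */
struct file_state {
    bool needs_flush;            /* this file queued writes since its last flush */
    unsigned long last_flush;    /* sequence number of the flush covering those writes */
};

/* Any write for this file (writeback or O_DIRECT) sets both flags. */
void queue_write(struct flush_queue *q, struct file_state *f)
{
    q->needs_flush = true;
    f->needs_flush = true;
}

/* Device-level flush request, e.g. from a journal commit.
 * Returns the sequence number of the flush the caller must wait for. */
unsigned long queue_flush(struct flush_queue *q)
{
    if (q->needs_flush) {
        q->needs_flush = false;
        q->last_queued++;
        q->flushes_sent++;
    }
    /* With no write since the last flush, the request is discarded
     * and the caller waits on the last flush queued instead. */
    return q->last_queued;
}

/* What fsync/fdatasync/O_SYNC would call for one file. */
unsigned long file_flush(struct flush_queue *q, struct file_state *f)
{
    if (f->needs_flush) {
        f->needs_flush = false;
        /* May be coalesced with a flush someone else already queued. */
        f->last_flush = queue_flush(q);
    }
    return f->last_flush;
}

/* Device reports that every queued flush has finished. */
void complete_flushes(struct flush_queue *q)
{
    q->last_done = q->last_queued;
}

/* True once flush number seq is no longer in flight. */
bool flush_done(const struct flush_queue *q, unsigned long seq)
{
    return q->last_done >= seq;
}
```

For example, after a write to one file, a journal commit calling
`queue_flush()` issues the real flush; a following `file_flush()` for
that file finds the device flag already clear and simply waits on the
journal's flush, so the device sees one flush instead of two.  A file
with no writes since its last flush never touches the device at all.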

Implement both.  The first does the right thing, optimising away
unnecessary journal/tree-log barriers.  The second further optimises
individual files.

You *could* have a needs_flush bit per page, to tune it further, in
the same way that fsync_range() and O_DIRECT invalidations etc. are
getting better at working with ranges, but that may be pointless
overengineering (I've no idea).

> > Since you want to avoid issuing two device flushes in a row (they're
> > not free), and a journalling fs may issue one separately, as Joel says
> > a filesystem could override this.
>   Yes, journalling filesystems usually take care themselves.
> 
> > But I suspect it would be better to keep the generic call to
> > block_flush_device() from fsync(), and at the block layer discard
> > duplicate flushes that have no writes in between.

>   Hmm, probably this won't be too hard to implement. OTOH it won't catch
> those cases where some other process manages to squeeze in some writes
> between the two flushes. So I'm not sure if we really want to design things
> this way unless really necessary.

Let me put it this way.  ext3 is a journalling fs, and it does _not_
provide integrity with fsync() or fdatasync() in all cases, even with
barriers and data=ordered turned on.

We should have something which provides flushes generically, with the
possibility for the fs to override it with a smarter method when it
knows better.
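The split between a generic default and a filesystem override might
look like the following hedged sketch.  All names are hypothetical
except `block_flush_device()`, which borrows the name used earlier in
this thread; its signature here is an assumption, and the structs are
simplified stand-ins rather than the real `struct inode` or
`struct block_device`.

```c
/* Hypothetical model: none of these types are the real VFS or block API. */
struct block_dev {
    int flushes;       /* device cache flushes issued */
    int fs_commits;    /* commits done by a filesystem-specific method */
};

struct inode_model;

struct fs_ops {
    /* Optional override; NULL means "use the generic device flush". */
    int (*flush)(struct inode_model *inode);
};

struct inode_model {
    struct block_dev *bdev;
    const struct fs_ops *ops;
};

/* Stand-in for the block_flush_device() named in this thread; signature assumed. */
int block_flush_device(struct block_dev *bdev)
{
    bdev->flushes++;
    return 0;
}

/* Generic path from fsync()/fdatasync(): flush unless the fs knows better. */
int vfs_flush_for_sync(struct inode_model *inode)
{
    if (inode->ops && inode->ops->flush)
        return inode->ops->flush(inode);
    return block_flush_device(inode->bdev);
}

/* Example override: a journalling fs whose commit ends in a single flush. */
int journal_commit_flush(struct inode_model *inode)
{
    inode->bdev->fs_commits++;
    return block_flush_device(inode->bdev);  /* the commit's own barrier */
}
```

A filesystem that sets no `flush` method still gets a device flush on
every fsync(), which is the integrity guarantee argued for above; one
that supplies a method takes full responsibility for issuing the flush
itself.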

-- Jamie