Message-ID: <20100524184024.GA9905@mail.oracle.com>
Date: Mon, 24 May 2010 11:40:25 -0700
From: Joel Becker <Joel.Becker@...cle.com>
To: Nick Piggin <npiggin@...e.de>
Cc: Dave Chinner <david@...morbit.com>,
Christoph Hellwig <hch@...radead.org>,
Josef Bacik <josef@...hat.com>, linux-fsdevel@...r.kernel.org,
chris.mason@...cle.com, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC] new ->perform_write fop
On Mon, May 24, 2010 at 04:55:19PM +1000, Nick Piggin wrote:
> On Mon, May 24, 2010 at 03:53:29PM +1000, Dave Chinner wrote:
> > Because if we fail after the allocation then ensuring we handle the
> > error *correctly* and *without further failures* is *fucking hard*.
>
> I don't think you really answered my question. Let me put it in concrete
> terms. In your proposal, why not just do the reserve+allocate *after*
> the pagecache copy? What does the "reserve" part add?
In ocfs2, we can't just crash our filesystem. We have to be
safe not just with respect to the local machine; we have to leave the
filesystem in a consistent state - structure *and* data - for the other
nodes.
The ordering and locking of allocation in get_block(s)() is so
bad that we just Don't Do It. By the time get_block(s)() is called, we
require our filesystem to have the allocation done. We do our
allocation in write_begin(). By the time we get to the page copy, we
can't ENOSPC or EDQUOT. O_DIRECT I/O falls back to sync buffered I/O if
it must allocate, pushing us through write_begin() and forcing other
nodes to honor what we've done.
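To make that ordering concrete, here's a toy userspace sketch
(invented names throughout; this is not the ocfs2 code). The only step
that can fail with ENOSPC is the reserve/allocate step, so the copy
that follows needs no error handling:

#include <errno.h>
#include <stdio.h>
#include <string.h>

static long blocks_free = 8;		/* pretend free-space counter */

/* All allocation happens here, before any data is copied. */
static int toy_write_begin(size_t len, size_t *blocks)
{
	size_t need = (len + 4095) / 4096;

	if (need > blocks_free)
		return -ENOSPC;		/* fail *before* the copy */
	blocks_free -= need;
	*blocks = need;
	return 0;
}

/* By construction this step cannot hit ENOSPC or EDQUOT. */
static void toy_copy(char *dst, const char *src, size_t len)
{
	memcpy(dst, src, len);
}

int main(void)
{
	char page[4096];
	size_t blocks;

	if (toy_write_begin(sizeof(page), &blocks))
		return 1;
	toy_copy(page, "data", 4);
	printf("copied after allocating %zu block(s)\n", blocks);
	return 0;
}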
This is easily extended to a multipage reserve operation.
It's not delalloc, because we actually allocate in the reserve
operation. We handle it just like a large version of the single-page
operation. Someday we hope to add delalloc, which would actually do
better here.
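A hypothetical multipage variant of the same toy (again, all
names invented): reserve blocks for the whole range in one shot, then
run a copy loop that has no failure points left in it:

#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define PAGE_SZ 4096

static long blocks_free = 1024;

/* Reserve+allocate for the whole multipage range up front. */
static int toy_reserve(size_t npages)
{
	if ((long)npages > blocks_free)
		return -ENOSPC;
	blocks_free -= npages;
	return 0;
}

static ssize_t toy_multipage_write(char *pages, const char *buf, size_t len)
{
	size_t npages = (len + PAGE_SZ - 1) / PAGE_SZ;
	size_t i, chunk, done = 0;
	int ret = toy_reserve(npages);

	if (ret)
		return ret;		/* nothing copied yet, nothing to undo */

	for (i = 0; i < npages; i++) {	/* this loop cannot fail */
		chunk = len - done < PAGE_SZ ? len - done : PAGE_SZ;
		memcpy(pages + i * PAGE_SZ, buf + done, chunk);
		done += chunk;
	}
	return done;
}

int main(void)
{
	static char pages[3 * PAGE_SZ], buf[3 * PAGE_SZ];

	return toy_multipage_write(pages, buf, sizeof(buf)) < 0;
}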
I guess you could call this "copy middle," as Dave describes in
his follow-up to your mail. Copy Middle also has the property that it
can handle short writes without any error handling. Copy First has to
discover it can only get half the allocation and drop the latter half of
the pagecache. Copy Last has to discover it can only do half the page
copy and drop the latter half of the allocation.
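As a toy illustration of the short-write property (invented
names, not real kernel code): if the allocator comes up short, Copy
Middle just shortens the copy and returns a short count, with nothing
to unwind:

#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define PAGE_SZ 4096

static size_t blocks_free = 2;

/* May return fewer blocks than asked for (or zero). */
static size_t toy_alloc(size_t want)
{
	size_t got = want < blocks_free ? want : blocks_free;

	blocks_free -= got;
	return got;
}

static ssize_t copy_middle_write(char *dst, const char *src, size_t len)
{
	size_t want = (len + PAGE_SZ - 1) / PAGE_SZ;
	size_t got = toy_alloc(want);

	if (!got)
		return -ENOSPC;
	if (got < want)
		len = got * PAGE_SZ;	/* short write, no unwind needed */
	memcpy(dst, src, len);
	return len;			/* caller retries the remainder */
}

int main(void)
{
	static char dst[4 * PAGE_SZ], src[4 * PAGE_SZ];

	/* Asks for 4 pages, allocator has 2: returns 2 pages' worth. */
	return copy_middle_write(dst, src, sizeof(src)) != 2 * PAGE_SZ;
}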
> > IMO, the fundamental issue with using hole punching or direct IO
> > from the zero page to handle errors is that they are complex enough
> > that there is *no guarantee that they will succeed*. e.g. Both can
> > get ENOSPC/EDQUOT because they may end up with metadata allocation
> > requirements above and beyond what was originally reserved. If the
> > error handling fails to handle the error, then where do we go from
> > there?
>
> There are already fundamental issues that seem like they are not
> handled properly if your filesystem may allocate uninitialized blocks
> over holes for writeback cache without somehow marking them as
> uninitialized.
>
> If you get a power failure or IO error before the pagecache can be
> written out, you're left with uninitialized data there, aren't you?
> Simple buffer head based filesystems are already subject to this.
Sure, ext2 does this. But don't most filesystems guaranteeing
state actually make sure to order such I/Os? If you run ext3 in
data=writeback, you get what you pay for. This sounds like a red
herring.
Dave's original point stands. ocfs2 supports unwritten extents
and punching holes. In fact, we directly copied the XFS ioctl(2)s. But
when we do punch holes, we have to adjust our tree. That may require
additional metadata, and *that* can fail with ENOSPC or EDQUOT.
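A toy model of that failure mode (invented structures; this is
not the ocfs2 extent tree): punching a hole in the middle of an extent
splits one record into two, and the second record is new metadata that
may simply not be available:

#include <errno.h>
#include <stdio.h>

struct extent { unsigned start, len; };

static struct extent tree[2] = { { 0, 100 } };
static unsigned nr_recs = 1;
static const unsigned max_recs = 2;

static int toy_punch_hole(unsigned start, unsigned len)
{
	unsigned end = start + len, i;

	for (i = 0; i < nr_recs; i++) {
		struct extent *e = &tree[i];
		unsigned e_end = e->start + e->len;

		if (start <= e->start || end >= e_end)
			continue;	/* only middle punches modeled */
		/* Middle punch: one extent becomes two. */
		if (nr_recs == max_recs)
			return -ENOSPC;	/* the "error path" itself fails */
		tree[nr_recs].start = end;
		tree[nr_recs].len = e_end - end;
		nr_recs++;
		e->len = start - e->start;
		return 0;
	}
	return 0;
}

int main(void)
{
	printf("first punch: %d\n", toy_punch_hole(10, 10));	/* 0: split ok */
	printf("second punch: %d\n", toy_punch_hole(40, 10));	/* -ENOSPC */
	return 0;
}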
Joel
--
"I always thought the hardest questions were those I could not answer.
Now I know they are the ones I can never ask."
- Charlie Watkins
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@...cle.com
Phone: (650) 506-8127