[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110526140558.GJ9520@thunk.org>
Date: Thu, 26 May 2011 10:05:58 -0400
From: Ted Ts'o <tytso@....edu>
To: Andreas Dilger <adilger@...ger.ca>
Cc: Jan Kara <jack@...e.cz>, Manish Katiyar <mkatiyar@...il.com>,
linux-ext4@...r.kernel.org
Subject: Re: [PATCH 2/3] jbd2 : Fix journal start by passing a parameter to
specify if the caller can deal with ENOMEM
On Wed, May 25, 2011 at 10:07:20PM -0600, Andreas Dilger wrote:
> > What JBD2_TOPLEVEL means is that caller is from a top-level file
> > system function, such as ext4_symlink() or ext4_chmod(), such that
> > start_this_handle() can use GFP_KERNEL instead of GFP_NOFS. GFP_NOFS
> > is needed for any function that might get called by the direct reclaim
> > path (i.e., the writepage() function). But for the top-level
> > symlink() or chmod() function, it's actually OK to allocate memory
> > using GFP_KERNEL, since there's no potential recursion problem.
>
> At this point, why not just pass GFP_KERNEL or GFP_NOFS directly,
> optionally with __GFP_NOFAIL?
Well, __GFP_NOFAIL is going away. (At least there are a number of mm
hackers, including akpm, who really want it to go away).
We could key off of GFP_NOFS, but GFP_NOFS doesn't mean loop and make
sure that we don't fail. The two concepts are in fact orthogonal;
it's just at the moment that there are most places which are called in
the fs writeback path which also can't fail.
But just to give one example, ext4_bio_write_page() is an example of a
function that allocates memory GFP_NOFS, but can fail with ENOMEM,
because its caller, mpage_da_submit_io() in fs/ext4/inode.c is
designed to cope with failure in a way that doesn't cause data loss.
(We leave the page dirty, unlock it and back out of the writeback
code, and later the bdi flusher threads will retry the writepages
request.)
- Ted
P.S. That means that there are calls to ext4_journal_start() in the
ext4_da_writepages() code path that probably could be converted to
ext4_journal_start_failok() --- or to ext4_journal_start(inode,
nblocks, JBD2_FAIL_OK) per my suggestion --- once we have fully
audited the error return paths.
P.P.S. Something that might be worth doing is having a sysfs tunable
that causes ext4_journal_start() to return an ENOMEM failure on a per
file system basis with a probability specified by the sysfs tunable.
This would allow us to actually _test_ the error handling to make sure
sane things happen....
What I'd probably do is define a new flag, JBD2_WRITEBACK_TEST, and
use it to make the ext4_journal_start() functions that are allowed to
probabilistically fail (since the retry should be happening at a
higher level), and then run a stress test with the syfs tunable
enabled. Since the flag would only cause ext4_journal_start()
functions that should have automatic fallbacks, the stress test would
be able to run to completion even though 10% of the
ext4_journal_start(... JBD2_WRITEBACK_TEST) calls were failing.
That's another example of why using a flag bitfield instead of a bool
is much more powerful.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists