Message-ID: <20180518191040.GG31737@kmo-pixel>
Date: Fri, 18 May 2018 15:10:40 -0400
From: Kent Overstreet <kent.overstreet@...il.com>
To: Andreas Dilger <adilger@...ger.ca>
Cc: linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Dave Chinner <dchinner@...hat.com>, darrick.wong@...cle.com,
tytso@....edu, linux-btrfs@...r.kernel.org, clm@...com,
jbacik@...com, viro@...iv.linux.org.uk, willy@...radead.org,
peterz@...radead.org
Subject: Re: [PATCH 10/10] Dynamic fault injection
On Fri, May 18, 2018 at 01:05:20PM -0600, Andreas Dilger wrote:
> On May 18, 2018, at 1:49 AM, Kent Overstreet <kent.overstreet@...il.com> wrote:
> >
> > Signed-off-by: Kent Overstreet <kent.overstreet@...il.com>
>
> I agree with Christoph that even if there was some explanation in the cover
> letter, there should be something at least as good in the patch itself. The
> cover letter is not saved, but the commit stays around forever, and should
> explain how this should be added to code, and how to use it from userspace.
>
>
> That said, I think this is a useful functionality. We have something similar
> in Lustre (OBD_FAIL_CHECK() and friends) that is necessary for being able to
> test a distributed filesystem, which is just a CPP macro with an unlikely()
> branch, while this looks more sophisticated. This looks like it has some
> added functionality like having more than one fault enabled at a time.
> If this lands we could likely switch our code over to using this.
This is pretty much what I was looking for: I just wanted to know whether this
patch was interesting enough to anyone that I should spend more time on it, or
whether I should just drop it :) Agreed on documentation. I think it's also
worth factoring out the functionality for the ELF section trick that dynamic
debug uses.
> Some things that are missing from this patch that are in our code:
>
> - in addition to the basic "enabled" and "oneshot" mechanisms, we have:
> - timeout: sleep for N msec to simulate network/disk/locking delays
> - race: wait with one thread until a second thread hits matching check
>
> We also have a "fail_val" that allows making the check conditional (e.g.
> only operation on server "N" should fail, only RPC opcode "N", etc).
Those all sound like good ideas... fail_val especially. I think with that we'd
have all the functionality the existing fault injection framework has (which is
way too heavyweight to actually get used, imo).
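
To make the "enabled", "oneshot" and fail_val ideas concrete, here is a rough
userspace sketch of what such a check could look like. It is not code from this
patch or from Lustre: struct fail_point, FAIL_CHECK(), fail_point_slow() and
send_rpc() are all hypothetical names, unlikely() is a local copy of the usual
GCC/Clang branch hint, and there is no locking, so it only illustrates the
control flow. The timeout and race modes are left out.

```c
#include <stdbool.h>

/* Local stand-in for the kernel's unlikely(); relies on a GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical per-call-site state; field names are illustrative only. */
struct fail_point {
	bool		enabled;	/* fault is armed */
	bool		oneshot;	/* disarm after the first hit */
	bool		match_val;	/* fire only when the caller's value == fail_val */
	int		fail_val;	/* e.g. a server index or RPC opcode */
	unsigned	hits;		/* number of times the fault fired */
};

/* Slow path, reached only when the fault is armed. Not thread safe. */
static bool fail_point_slow(struct fail_point *fp, int val)
{
	if (fp->match_val && val != fp->fail_val)
		return false;

	fp->hits++;
	if (fp->oneshot)
		fp->enabled = false;
	return true;
}

/* Fast path: a single predicted-not-taken branch when the fault is disabled. */
#define FAIL_CHECK(fp, val) \
	(unlikely((fp)->enabled) && fail_point_slow((fp), (val)))

/* Hypothetical call site: fail with an EIO-style error code when the fault fires. */
static int send_rpc(struct fail_point *fp, int opcode)
{
	if (FAIL_CHECK(fp, opcode))
		return -5;
	return 0;
}
```

With enabled, oneshot and match_val set and fail_val = 7, a send_rpc() call with
opcode 3 succeeds, the first call with opcode 7 fails, and later calls succeed
again because the oneshot hit disarmed the fault.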