Message-ID: <alpine.LRH.2.02.1703081129080.17825@file01.intranet.prod.int.rdu2.redhat.com>
Date:   Wed, 8 Mar 2017 11:40:01 -0500 (EST)
From:   Mikulas Patocka <mpatocka@...hat.com>
To:     NeilBrown <neilb@...e.com>
cc:     Mike Snitzer <snitzer@...hat.com>, Jens Axboe <axboe@...nel.dk>,
        Jack Wang <jinpu.wang@...fitbricks.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Lars Ellenberg <lars.ellenberg@...bit.com>,
        Kent Overstreet <kent.overstreet@...il.com>,
        Pavel Machek <pavel@....cz>
Subject: Re: blk: improve order of bio handling in generic_make_request()



On Wed, 8 Mar 2017, NeilBrown wrote:

> On Tue, Mar 07 2017, Mike Snitzer wrote:
> 
> > On Tue, Mar 07 2017 at 12:05pm -0500,
> > Jens Axboe <axboe@...nel.dk> wrote:
> >
> >> On 03/07/2017 09:52 AM, Mike Snitzer wrote:
> >> > On Tue, Mar 07 2017 at  3:49am -0500,
> >> > Jack Wang <jinpu.wang@...fitbricks.com> wrote:
> >> > 
> >> >>
> >> >>
> >> >> On 06.03.2017 21:18, Jens Axboe wrote:
> >> >>> On 03/05/2017 09:40 PM, NeilBrown wrote:
> >> >>>> On Fri, Mar 03 2017, Jack Wang wrote:
> >> >>>>>
> >> >>>>> Thanks Neil for pushing the fix.
> >> >>>>>
> >> >>>>> We can optimize generic_make_request a little bit:
> >> >>>>> - assign the bio_list struct hold directly instead of init and merge
> >> >>>>> - remove duplicate code
> >> >>>>>
> >> >>>>> I think it's better to squash it into your fix.
> >> >>>>
> >> >>>> Hi Jack,
> >> >>>>  I don't object to your changes, but I'd like to see a response from
> >> >>>>  Jens first.
> >> >>>>  My preference would be to get the original patch in, then other changes
> >> >>>>  that build on it, such as this one, can be added.  Until the core
> >> >>>>  change lands, any other work is pointless.
> >> >>>>
> >> >>>>  Of course if Jens wants this merged before he'll apply it, I'll
> >> >>>>  happily do that.
> >> >>>
> >> >>> I like the change, and thanks for tackling this. It's been a pending
> >> >>> issue for way too long. I do think we should squash Jack's patch
> >> >>> into the original, as it does clean up the code nicely.
> >> >>>
> >> >>> Do we have a proper test case for this, so we can verify that it
> >> >>> does indeed also work in practice?
> >> >>>
> >> >> Hi Jens,
> >> >>
> >> >> I can trigger a deadlock in RAID1 with the test below:
> >> >>
> >> >> I create one md with one local loop device and one remote scsi device
> >> >> exported by SRP. Running fio with mixed rw on top of the md, I
> >> >> force_close the session on the storage side. mdx_raid1 is waiting on
> >> >> free_array in D state, and a lot of fio processes are also in D state
> >> >> in wait_barrier.
> >> >>
> >> >> With the patch from Neil above, I can no longer trigger it.
> >> >>
> >> >> The discussion was in link below:
> >> >> http://www.spinics.net/lists/raid/msg54680.html
> >> > 
> >> > In addition to Jack's MD raid test there is a DM snapshot deadlock test,
> >> > albeit unpolished and needing some work to get running, see:
> >> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> >> 
> >> Can you run this patch with that test, reverting your DM workaround?
> >
> > Yeap, will do.  Last time Mikulas tried a similar patch it still
> > deadlocked.  But I'll give it a go (likely tomorrow).
> 
> I don't think this will fix the DM snapshot deadlock by itself.
> Rather, it makes it possible for some internal changes to DM to fix it.
> The DM change might be something vaguely like:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5664f3..06ee0960e415 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
> 
>  	len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
> 
> +	if (len < ci->sector_count) {
> +		struct bio *split = bio_split(bio, len, GFP_NOIO, fs_bio_set);

fs_bio_set is a shared bio set, so it is prone to deadlocks. For this
change, we would need two bio sets per dm device: one for the split bio
and one for the outgoing bio. (This also means having one more kernel
thread per dm device.)
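
A minimal sketch of the per-device bio sets I have in mind (assuming the
current two-argument bioset_create(); the structure and function names
are illustrative, not existing dm.c code):

#include <linux/bio.h>
#include <linux/errno.h>

/*
 * Illustrative only: one pair of bio sets per mapped device, so a
 * split on one device cannot exhaust a pool that other devices (or
 * other fs_bio_set users) depend on.
 */
struct dm_dev_bio_sets {
	struct bio_set *split_bs;	/* backs bio_split() of the incoming bio */
	struct bio_set *clone_bs;	/* backs the outgoing clone sent to the target */
};

static int dm_alloc_dev_bio_sets(struct dm_dev_bio_sets *s)
{
	s->split_bs = bioset_create(BIO_POOL_SIZE, 0);
	s->clone_bs = bioset_create(BIO_POOL_SIZE, 0);
	if (!s->split_bs || !s->clone_bs) {
		if (s->split_bs)
			bioset_free(s->split_bs);
		if (s->clone_bs)
			bioset_free(s->clone_bs);
		return -ENOMEM;
	}
	return 0;
}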

It would be possible to avoid having two bio sets if the incoming bio were
the same as the outgoing bio (allocate a small structure, move bi_end_io
and bi_private into it, replace bi_end_io and bi_private with pointers to
device mapper and send the bio to the target driver), but it would need
many more changes - basically a rewrite of the whole bio handling code in
dm.c and in the targets.
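
For reference, this is roughly the kind of structure I mean (a sketch
only; dm_hook, dm_hook_bio and dm_hook_endio are made-up names, and a
real implementation would allocate from a mempool and preserve more
state):

#include <linux/bio.h>
#include <linux/slab.h>

struct dm_hook {
	bio_end_io_t	*orig_end_io;	/* caller's completion callback */
	void		*orig_private;	/* caller's bi_private */
};

static void dm_hook_endio(struct bio *bio)
{
	struct dm_hook *h = bio->bi_private;

	/* put the caller's completion context back ... */
	bio->bi_end_io = h->orig_end_io;
	bio->bi_private = h->orig_private;
	kfree(h);

	/* ... and complete the bio for the caller */
	bio_endio(bio);
}

/* called before handing the bio to the target driver */
static int dm_hook_bio(struct bio *bio)
{
	struct dm_hook *h = kmalloc(sizeof(*h), GFP_NOIO);

	if (!h)
		return -ENOMEM;
	h->orig_end_io = bio->bi_end_io;
	h->orig_private = bio->bi_private;
	bio->bi_end_io = dm_hook_endio;
	bio->bi_private = h;
	return 0;
}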

Mikulas

> +		bio_chain(split, bio);
> +		generic_make_request(bio);
> +		bio = split;
> +		ci->sector_count = len;
> +	}
> +
>  	r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
>  	if (r < 0)
>  		return r;
> 
> Instead of looping inside DM, this change causes the remainder to be
> passed to generic_make_request() and DM only handles one region at a
> time.  So there is only one loop, in the top generic_make_request().
> That loop will now reliably handle bios in the "right" order.
> 
> Thanks,
> NeilBrown
> 
