lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 23 Sep 2022 16:15:29 -0700 (PDT)
From:   Hugh Dickins <hughd@...gle.com>
To:     Keith Busch <kbusch@...nel.org>
cc:     Hugh Dickins <hughd@...gle.com>, Jan Kara <jack@...e.cz>,
        Jens Axboe <axboe@...nel.dk>,
        Yu Kuai <yukuai1@...weicloud.com>,
        Liu Song <liusong@...ux.alibaba.com>,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH next] sbitmap: fix lockup while swapping

On Fri, 23 Sep 2022, Hugh Dickins wrote:
> On Fri, 23 Sep 2022, Keith Busch wrote:
> 
> > Does the following fix the observation? Rational being that there's no reason
> > to spin on the current wait state that is already under handling; let
> > subsequent clearings proceed to the next inevitable wait state immediately.
> 
> It's running fine without lockup so far; but doesn't this change merely
> narrow the window?  If this is interrupted in between atomic_try_cmpxchg()
> setting wait_cnt to 0 and sbq_index_atomic_inc() advancing wake_index,
> don't we run the same risk as before, of sbitmap_queue_wake_up() from
> the interrupt handler getting stuck on that wait_cnt 0?

Yes, it ran successfully for 50 minutes, then an interrupt came in
immediately after the cmpxchg, and it locked up just as before.

Easily dealt with by disabling interrupts, no doubt, but I assume it's a
badge of honour not to disable interrupts here (except perhaps in waking).

Some clever way to make the wait_cnt and wake_index adjustments atomic?

Or is this sbitmap_queue_wake_up() interrupting sbitmap_queue_wake_up()
just supposed never to happen, the counts preventing it: but some
misaccounting letting it happen by mistake?

> 
> > 
> > ---
> > diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> > index 624fa7f118d1..47bf7882210b 100644
> > --- a/lib/sbitmap.c
> > +++ b/lib/sbitmap.c
> > @@ -634,6 +634,13 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
> >  
> >  	*nr -= sub;
> >  
> > +	/*
> > +	 * Increase wake_index before updating wait_cnt, otherwise concurrent
> > +	 * callers can see valid wait_cnt in old waitqueue, which can cause
> > +	 * invalid wakeup on the old waitqueue.
> > +	 */
> > +	sbq_index_atomic_inc(&sbq->wake_index);
> > +
> >  	/*
> >  	 * When wait_cnt == 0, we have to be particularly careful as we are
> >  	 * responsible to reset wait_cnt regardless whether we've actually
> > @@ -660,13 +667,6 @@ static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
> >  	 * of atomic_set().
> >  	 */
> >  	smp_mb__before_atomic();
> > -
> > -	/*
> > -	 * Increase wake_index before updating wait_cnt, otherwise concurrent
> > -	 * callers can see valid wait_cnt in old waitqueue, which can cause
> > -	 * invalid wakeup on the old waitqueue.
> > -	 */
> > -	sbq_index_atomic_inc(&sbq->wake_index);
> >  	atomic_set(&ws->wait_cnt, wake_batch);
> >  
> >  	return ret || *nr;
> > --
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ