linux-kernel - Re: [PATCH v2 1/1] blk-mq: fix hang caused by freeze/unfreeze sequence

Message-ID: <CAJrWOzCWfMN7qKipgWn2P0dfycu4y5DQysD+7Q1UANgXBgw0+g@mail.gmail.com>
Date:	Wed, 10 Aug 2016 10:42:09 +0200
From:	Roman Penyaev <roman.penyaev@...fitbricks.com>
To:	Tejun Heo <tj@...nel.org>
Cc:	Akinobu Mita <akinobu.mita@...il.com>,
	Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>,
	linux-block@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/1] blk-mq: fix hang caused by freeze/unfreeze sequence

Hi,

On Wed, Aug 10, 2016 at 5:55 AM, Tejun Heo <tj@...nel.org> wrote:
> Hello,
>
> On Mon, Aug 08, 2016 at 01:39:08PM +0200, Roman Pen wrote:
>> Long time ago there was a similar fix proposed by Akinobu Mita[1],
>> but it seems that time everyone decided to fix this subtle race in
>> percpu-refcount and Tejun Heo[2] did an attempt (as I can see that
>> patchset was not applied).
>
> So, I probably forgot about it while waiting for confirmation of fix.
> Can you please verify that the patchset fixes the issue?  I can apply
> the patchset right away.

I have not checked your patchset, but as far as I understand it
should not fix *this* issue.  The problem here is that
percpu_ref_reinit() and percpu_ref_kill() are invoked in the wrong
order.  What was observed is the following:

 CPU#0               CPU#1
 ----------------    -----------------
 percpu_ref_kill()

                     percpu_ref_kill() << atomic reference does
 percpu_ref_reinit()                   << not guarantee the order

                     blk_mq_freeze_queue_wait() !! HANG HERE

                     percpu_ref_reinit()

blk_mq_freeze_queue_wait() on CPU#1 expects the percpu-refcount to
have been switched to ATOMIC mode (killed), but that is not the case,
because CPU#0 was faster and has already switched the percpu-refcount
back to PERCPU mode.

This race happens inside blk-mq, because the invocation of kill/reinit
is controlled by an atomic reference counter, which does not guarantee
the order of the function calls that follow it (kill/reinit).
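
To make the interleaving concrete, here is a minimal userspace sketch
that replays the sequence from the diagram on a single thread.  It is
only a model under stated assumptions, not the blk-mq code: the names
model_queue, freeze_depth, freeze_begin(), unfreeze_begin(), ref_kill()
and ref_reinit() are hypothetical, and the percpu-refcount is reduced
to a single mode flag.  The only point it shows is that the counter
update and the kill/reinit call that follows it are two separate steps.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum ref_mode { REF_PERCPU, REF_ATOMIC };

struct model_queue {
	atomic_int freeze_depth; /* assumed counter gating kill/reinit */
	enum ref_mode mode;      /* stands in for the percpu-refcount mode */
};

/* Counter step of a freeze: true if this caller must do the kill. */
static bool freeze_begin(struct model_queue *q)
{
	return atomic_fetch_add(&q->freeze_depth, 1) + 1 == 1;
}

/* Counter step of an unfreeze: true if this caller must do the reinit. */
static bool unfreeze_begin(struct model_queue *q)
{
	return atomic_fetch_sub(&q->freeze_depth, 1) - 1 == 0;
}

static void ref_kill(struct model_queue *q)   { q->mode = REF_ATOMIC; }
static void ref_reinit(struct model_queue *q) { q->mode = REF_PERCPU; }

/*
 * Replays the interleaving from the diagram on one thread.
 * Returns true if the waiter on CPU#1 finds the ref in PERCPU mode,
 * which is the state in which the wait would never complete.
 */
static bool replay_race(void)
{
	struct model_queue q = { .mode = REF_PERCPU };
	bool cpu0_kill, cpu0_reinit, cpu1_kill;

	atomic_init(&q.freeze_depth, 0);

	/* CPU#0 freezes: depth 0 -> 1, so it kills the ref. */
	cpu0_kill = freeze_begin(&q);
	if (cpu0_kill)
		ref_kill(&q);

	/* CPU#0 unfreezes: depth 1 -> 0, but its reinit is delayed. */
	cpu0_reinit = unfreeze_begin(&q);

	/* CPU#1 freezes in that window: depth 0 -> 1, so it kills too. */
	cpu1_kill = freeze_begin(&q);
	if (cpu1_kill)
		ref_kill(&q);

	/* CPU#0 finally runs the delayed reinit, undoing CPU#1's kill. */
	if (cpu0_reinit)
		ref_reinit(&q);

	/* The counter alone allowed every one of these calls. */
	assert(cpu0_kill && cpu0_reinit && cpu1_kill);

	/* CPU#1 would now wait on a ref that is not in ATOMIC mode. */
	return q.mode == REF_PERCPU;
}
```

In this model replay_race() returns true: each counter transition is
individually correct, yet the final mode is PERCPU at the moment CPU#1
starts to wait.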

So the fix is the same as originally proposed by Akinobu Mita, but
the issue is different.

But of course I can run tests on top of your series, just to verify
that everything goes smoothly and that the internal percpu-refcount
members are consistent.

--
Roman