linux-kernel - Re: [PATCH 1/1] block: System crashes when cpu hotplug + bouncing port

Message-ID: <YNp1Bho5yypHkPfW@T590>
Date:   Tue, 29 Jun 2021 09:20:27 +0800
From:   Ming Lei <ming.lei@...hat.com>
To:     wenxiong@...ibm.com
Cc:     Daniel Wagner <dwagner@...e.de>, linux-kernel@...r.kernel.org,
        james.smart@...adcom.com, wenxiong@...ibm.com, sagi@...mberg.me
Subject: Re: [PATCH 1/1] block: System crashes when cpu hotplug + bouncing
 port

Hi Wenxiong,

On Mon, Jun 28, 2021 at 01:17:34PM -0500, wenxiong wrote:
> 
> > 
> > The root cause is that blk-mq doesn't work well on tag allocation from
> > specified hctx(blk_mq_alloc_request_hctx), and blk-mq assumes that any
> > request allocation can't cross hctx inactive/offline, see
> > blk_mq_hctx_notify_offline()
> 
> Hi Ming,
> 
> I tried to pass an online cpu_id (like cpu=8 in my case) to
> blk_mq_alloc_request_hctx(),
> data.hctx = q->queue_hw_ctx[hctx_idx];
> but it looks like data.hctx came back NULL, so the system crashed when
> data.hctx was accessed later.
> 
> blk-mq request allocation can't cross hctx inactive/offline, but blk-mq still
> reallocates the hctx for offline cpus (like cpu=4,5,6,7 in my case) in
> blk_mq_realloc_hw_ctxs(), and the hctx is NULL for online cpus (cpu=8 in my case).
> 
> Below is my understanding for hctxs, please correct me if I am wrong:
> 
> Assume a system has two cores with 16 cpus:
> Before doing cpu hot plug events:
> cpu0-cpu7(core 0) : hctx->state is ACTIVE and q->hctx is not NULL.
> cpu8-cpu15(core 1): hctx->state is ACTIVE and q->hctx is not NULL
> 
> After doing cpu hot plug events(the second half of each core are offline)
> cpu0-cpu3: online, hctx->state is ACTIVE and q->hctx is not NULL.
> cpu4-cpu7: offline,hctx->state is INACTIVE and q->hctx is not NULL
> cpu8-cpu11: online, hctx->state is ACTIVE but q->hctx = NULL
> cpu12-cpu15:offline, hctx->state is INACTIVE and q->hctx = NULL
> 
> So num_online_cpus() is 8 after the cpu hotplug events. Neither way works
> for me, whether I pass 8 online cpus or 4 online/4 offline cpus.
> 
> Is this correct? If nvmf passes online cpu ids to blk-mq, why does it still
> crash/fail?

NVMe users have to pass a correct hctx_idx to blk_mq_alloc_request_hctx(), but
from the info you provided, they don't pass a valid hctx_idx to blk-mq, so
q->queue_hw_ctx[hctx_idx] is NULL and the kernel panics.
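
To illustrate the point about a valid hctx_idx, here is a minimal userspace
sketch, not kernel code. The model_hctx and model_queue types and the
model_lookup_hctx() helper are hypothetical stand-ins invented for this
example; only the field names nr_hw_queues and queue_hw_ctx are borrowed from
blk-mq. It assumes hctx_idx is an index into the hardware queue table rather
than a CPU number, and shows a lookup that refuses an out-of-range index or an
empty slot instead of handing back a pointer that gets dereferenced later.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel's structures. */
struct model_hctx {
    unsigned int queue_num;
};

struct model_queue {
    unsigned int nr_hw_queues;        /* number of slots in queue_hw_ctx */
    struct model_hctx **queue_hw_ctx; /* indexed by hctx_idx, not by CPU id */
};

/*
 * Return the hctx stored at hctx_idx, or NULL when the queue is missing,
 * the index is out of range, or the slot itself is empty. The caller must
 * check for NULL before touching the result.
 */
static struct model_hctx *model_lookup_hctx(const struct model_queue *q,
                                            unsigned int hctx_idx)
{
    if (q == NULL || q->queue_hw_ctx == NULL)
        return NULL;
    if (hctx_idx >= q->nr_hw_queues)
        return NULL;
    return q->queue_hw_ctx[hctx_idx];
}
```

In this model, passing a CPU number such as 8 to a queue with only 4 slots
returns NULL rather than reading past the table, which is the kind of check a
caller would need before using the result.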

I believe Daniel's following patch may fix this specific issue if your
controller is FC:

[1] https://lore.kernel.org/linux-nvme/YNXTaUMAFCA84jfZ@T590/T/#t


Thanks,
Ming