linux-kernel - Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

Message-Id: <2257859e-dad6-3356-93a7-e87f02104969@de.ibm.com>
Date:   Thu, 7 Dec 2017 10:20:13 +0100
From:   Christian Borntraeger <borntraeger@...ibm.com>
To:     Christoph Hellwig <hch@....de>
Cc:     Jens Axboe <axboe@...nel.dk>,
        Bart Van Assche <Bart.VanAssche@....com>,
        "linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Stefan Haberland <sth@...ux.vnet.ibm.com>,
        linux-s390 <linux-s390@...r.kernel.org>,
        Martin Schwidefsky <schwidefsky@...ibm.com>
Subject: Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with
 virtio-blk (also 4.12 stable)



On 12/07/2017 12:29 AM, Christoph Hellwig wrote:
> On Wed, Dec 06, 2017 at 01:25:11PM +0100, Christian Borntraeger wrote:
>> commit 11b2025c3326f7096ceb588c3117c7883850c068    -> bad
>>     blk-mq: create a blk_mq_ctx for each possible CPU
>> does not boot on DASD and 
>> commit 9c6ae239e01ae9a9f8657f05c55c4372e9fc8bcc    -> good
>>    genirq/affinity: assign vectors to all possible CPUs
>> does boot with DASD disks.
>>
>> Also adding Stefan Haberland if he has an idea why this fails on DASD and adding Martin (for the
>> s390 irq handling code).
> 
> That is interesting as it really isn't related to interrupts at all,
> it just ensures that possible CPUs are set in ->cpumask.
> 
> I guess we'd really want:
> 
> e005655c389e3d25bf3e43f71611ec12f3012de0
> "blk-mq: only select online CPUs in blk_mq_hctx_next_cpu"
> 
> before this commit, but it seems like the whole stack didn't work for
> you either.
> 
> I wonder if there is some weird thing about nr_cpu_ids in s390?
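Christoph's point is that after 11b2025c3326 the hardware context's ->cpumask can contain CPUs that are possible but not online. The round-robin step that remains in the context lines of the diff further down (cpumask_next(), then a wrap to cpumask_first() once the result reaches nr_cpu_ids) does not look at online state at all. The userspace model below is a sketch under stated assumptions: masks are single 32-bit bitmaps, there are 8 possible CPU ids, one hctx is mapped to all of them, and the names MODEL_NR_CPU_IDS, model_cpumask_next, model_cpu_online and pick_next_unchecked are hypothetical stand-ins, not kernel APIs.

```c
#include <stdint.h>

/* Hypothetical model parameter: 8 possible CPU ids (stand-in for nr_cpu_ids). */
#define MODEL_NR_CPU_IDS 8

/* Model of cpumask_next(): first CPU > n that is set in mask,
 * or MODEL_NR_CPU_IDS if there is none. */
static int model_cpumask_next(int n, uint32_t mask)
{
    for (int cpu = n + 1; cpu < MODEL_NR_CPU_IDS; cpu++)
        if (mask & (1u << cpu))
            return cpu;
    return MODEL_NR_CPU_IDS;
}

/* Model of cpu_online(): 1 if cpu is a valid id set in the online mask. */
static int model_cpu_online(int cpu, uint32_t online)
{
    return cpu >= 0 && cpu < MODEL_NR_CPU_IDS && (online & (1u << cpu)) != 0;
}

/* Round-robin step with no online check: advance to the next CPU in the
 * hctx mask, wrapping to the first set CPU (cpumask_first() equivalent). */
static int pick_next_unchecked(int prev, uint32_t hctx_mask)
{
    int next = model_cpumask_next(prev, hctx_mask);
    if (next >= MODEL_NR_CPU_IDS)
        next = model_cpumask_next(-1, hctx_mask);
    return next;
}
```

With all eight CPUs possible (hctx_mask 0xFF) but only CPUs 0 and 1 online (online mask 0x03), pick_next_unchecked(1, 0xFF) returns 2, and model_cpu_online(2, 0x03) reports that CPU as offline.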

The problem starts as soon as NR_CPUS is larger than the number
of real CPUs.
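The single-skip check that the diff below removes (its `-` lines) can be modeled the same way. It skips at most one offline CPU and does not re-check the CPU it skips to, so with two consecutive offline CPUs after the current one it still returns an offline CPU. This is a minimal userspace sketch, assuming 32-bit bitmap masks, 8 possible CPU ids and one hctx mapped to all of them; MODEL_NR_CPU_IDS, model_cpumask_next, model_cpu_online and pick_next_skip_one are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical model parameter: 8 possible CPU ids (stand-in for nr_cpu_ids). */
#define MODEL_NR_CPU_IDS 8

/* Model of cpumask_next(): first CPU > n set in mask, else MODEL_NR_CPU_IDS. */
static int model_cpumask_next(int n, uint32_t mask)
{
    for (int cpu = n + 1; cpu < MODEL_NR_CPU_IDS; cpu++)
        if (mask & (1u << cpu))
            return cpu;
    return MODEL_NR_CPU_IDS;
}

/* Model of cpu_online(): ids outside the valid range count as offline. */
static int model_cpu_online(int cpu, uint32_t online)
{
    return cpu >= 0 && cpu < MODEL_NR_CPU_IDS && (online & (1u << cpu)) != 0;
}

/* Mirrors the '-' lines of the diff: if the next CPU is offline, move on
 * exactly once more without checking whether that CPU is online. */
static int pick_next_skip_one(int prev, uint32_t hctx_mask, uint32_t online)
{
    int next = model_cpumask_next(prev, hctx_mask);
    if (!model_cpu_online(next, online))
        next = model_cpumask_next(next, hctx_mask);
    if (next >= MODEL_NR_CPU_IDS)
        next = model_cpumask_next(-1, hctx_mask); /* cpumask_first() */
    return next;
}
```

With only CPU 2 offline (online mask 0xFB), pick_next_skip_one(1, 0xFF, 0xFB) correctly lands on CPU 3. With CPUs 2 through 7 all offline (online mask 0x03), the same call also returns 3, which is offline.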

A question: wouldn't your change in blk_mq_hctx_next_cpu fail if there is more than one non-online CPU in a row?

E.g., don't we need something like this (whitespace and indentation damaged):

@@ -1241,11 +1241,11 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
        if (--hctx->next_cpu_batch <= 0) {
                int next_cpu;
 
+               do  {
                next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
-               if (!cpu_online(next_cpu))
-                       next_cpu = cpumask_next(next_cpu, hctx->cpumask);
                if (next_cpu >= nr_cpu_ids)
                        next_cpu = cpumask_first(hctx->cpumask);
+               } while (!cpu_online(next_cpu));
 
                hctx->next_cpu = next_cpu;
                hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;

It does not fix the issue, though (and it would be pretty inefficient for large NR_CPUS).
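As posted, each pass through the do/while recomputes cpumask_next(hctx->next_cpu, hctx->cpumask) from the same starting point, so it appears that once that candidate is offline the loop cannot advance; and any loop of that shape would spin forever if no CPU in the mask were online. One alternative, sketched here in a userspace model, is to intersect the hctx mask with the online mask (the kernel provides cpumask_next_and() for this) and take the next set bit, wrapping around at most once. Assumptions: 32-bit bitmap masks, 8 possible CPU ids, and hypothetical names MODEL_NR_CPU_IDS, model_cpumask_next_and and pick_next_online; returning -1 when no mapped CPU is online is a hypothetical convention, and this is not a tested kernel patch.

```c
#include <stdint.h>

/* Hypothetical model parameter: 8 possible CPU ids (stand-in for nr_cpu_ids). */
#define MODEL_NR_CPU_IDS 8

/* Model of cpumask_next_and(): first CPU > n set in both a and b,
 * or MODEL_NR_CPU_IDS if there is none. */
static int model_cpumask_next_and(int n, uint32_t a, uint32_t b)
{
    uint32_t both = a & b;
    for (int cpu = n + 1; cpu < MODEL_NR_CPU_IDS; cpu++)
        if (both & (1u << cpu))
            return cpu;
    return MODEL_NR_CPU_IDS;
}

/* Next online CPU in hctx_mask after prev, wrapping around once.
 * Returns -1 (hypothetical convention) if no CPU in hctx_mask is online,
 * so the caller must pick a fallback instead of looping. */
static int pick_next_online(int prev, uint32_t hctx_mask, uint32_t online)
{
    int next = model_cpumask_next_and(prev, hctx_mask, online);
    if (next >= MODEL_NR_CPU_IDS)
        next = model_cpumask_next_and(-1, hctx_mask, online); /* wrap */
    return next < MODEL_NR_CPU_IDS ? next : -1;
}
```

In this model the search looks at each CPU id at most twice, however many CPUs in the mask are offline, and it terminates even when none is online: pick_next_online(1, 0xFF, 0x03) wraps to 0, and pick_next_online(3, 0xF0, 0x0F) returns -1.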