linux-kernel - Re: bug in tag handling in blk-mq?


Message-Id: <84145CD7-B917-4B32-8A5C-310C1910DB71@linaro.org>
Date:   Mon, 7 May 2018 20:02:03 +0200
From:   Paolo Valente <paolo.valente@...aro.org>
To:     Jens Axboe <axboe@...nel.dk>
Cc:     Mike Galbraith <efault@....de>, Christoph Hellwig <hch@....de>,
        linux-block <linux-block@...r.kernel.org>,
        Ulf Hansson <ulf.hansson@...aro.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linus Walleij <linus.walleij@...aro.org>,
        Oleksandr Natalenko <oleksandr@...alenko.name>
Subject: Re: bug in tag handling in blk-mq?



> On 7 May 2018, at 18:39, Jens Axboe <axboe@...nel.dk> wrote:
> 
> On 5/7/18 8:03 AM, Paolo Valente wrote:
>> Hi Jens, Christoph, all,
>> Mike Galbraith has been experiencing hangs, on blk_mq_get_tag, only
>> with bfq [1].  Symptoms seem to clearly point to a problem in I/O-tag
>> handling, triggered by bfq because it limits the number of tags for
>> async and sync write requests (in bfq_limit_depth).
>> 
>> Fortunately, I just happened to find a way to apparently confirm it.
>> With the following one-liner for block/bfq-iosched.c:
>> 
>> @@ -554,8 +554,7 @@ static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
>>        if (unlikely(bfqd->sb_shift != bt->sb.shift))
>>                bfq_update_depths(bfqd, bt);
>> 
>> -       data->shallow_depth =
>> -               bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
>> +       data->shallow_depth = 1;
>> 
>>        bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
>>                        __func__, bfqd->wr_busy_queues, op_is_sync(op),
>> 
>> Mike's machine now crashes quickly and reproducibly, while nothing bad
>> happens on my machines, even with heavy workloads (apart from an
>> expected throughput drop).
>> 
>> This change simply reduces to 1 the maximum possible value for the sum
>> of the number of async requests and of sync write requests.
>> 
>> This email is basically a request for help to knowledgeable people.  To
>> start, here are my first doubts/questions:
>> 1) Just to be certain, I guess it is not normal that blk-mq hangs if
>> async requests and sync write requests can be at most one, right?
>> 2) Do you have any hints on where I could look to chase this bug?
>> Of course, the bug may be in bfq, i.e., it may be some unrelated
>> bfq bug that indirectly causes this hang in blk-mq.  But it is hard
>> for me to understand how.
> 
> CC Omar, since he implemented the shallow part. But we'll need some
> traces to show where we are hung, probably also the contents of the
> /sys/kernel/debug/block/<dev>/ directory. For the crash mentioned, a
> trace as well. Otherwise we'll be wasting a lot of time on this.
> 
> Is there a reproducer?
> 

Ok Mike, I guess it's your turn now, for at least a stack trace.
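
In case it helps, here is a minimal sketch of one way to collect the
blk-mq debugfs state Jens asked for.  It assumes debugfs is mounted at
/sys/kernel/debug and that the kernel was built with blk-mq debugfs
support; the helper name dump_blk_debugfs is made up for this example
and is not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): print every
# non-empty line of every regular file under DIR as "path:line",
# so the whole state fits in one text file.
dump_blk_debugfs() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # -a treats files as text, -H always prints the file name.
    # Files that are empty or unreadable are silently skipped.
    find "$dir" -type f -exec grep -aH . {} \; 2>/dev/null
}
```

As root, it could be run during the hang as
"dump_blk_debugfs /sys/kernel/debug/block/sda > blk-state.txt", where
sda is only a placeholder for the affected device.  For the stack
traces, "echo w > /proc/sysrq-trigger" followed by dmesg should show
the tasks blocked in uninterruptible sleep, provided sysrq is enabled.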

Thanks,
Paolo

> -- 
> Jens Axboe