Date:   Thu, 26 Aug 2021 12:45:56 -0600
From:   Jens Axboe <axboe@...nel.dk>
To:     Bart Van Assche <bvanassche@....org>,
        Zhen Lei <thunder.leizhen@...wei.com>,
        linux-block <linux-block@...r.kernel.org>,
        linux-kernel@...r.kernel.org
Cc:     Damien Le Moal <damien.lemoal@....com>
Subject: Re: [PATCH] block/mq-deadline: Speed up the dispatch of low-priority
 requests

On 8/26/21 12:13 PM, Jens Axboe wrote:
> On 8/26/21 12:09 PM, Bart Van Assche wrote:
>> On 8/26/21 7:40 AM, Zhen Lei wrote:
>>> lock protection needs to be added only in dd_finish_request(), which
>>> is unlikely to cause significant performance side effects.
>>
>> Not sure the above is correct. Every new atomic instruction has a
>> measurable performance overhead. But I guess in this case that
>> overhead is smaller than the time needed to sum 128 per-CPU variables.
> 
> percpu counters only really work if the summing is not in a hot path,
> or if the summing is just some "not zero" thing instead of a full sum.
> They just don't scale at all for even moderately sized systems.

Ugh, it's actually even worse in this case, since you do:

static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
{
	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
}

which ends up iterating possible CPUs _twice_!
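
For reference, each dd_sum() call has to walk every possible CPU, roughly
like the sketch below (illustrative only, one event type shown, field names
approximate; not the actual macro):

/* Rough sketch of the kind of walk a per-CPU event sum implies; the
 * stats layout and field names here are approximations. The point is
 * that one sum touches every possible CPU, and dd_queued() does it
 * twice per call.
 */
static u32 dd_sum_sketch(struct deadline_data *dd, enum dd_prio prio)
{
	unsigned int cpu;
	u32 sum = 0;

	for_each_possible_cpu(cpu)
		sum += per_cpu_ptr(dd->stats, cpu)->stats[prio].inserted;

	return sum;
}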

Just ran a quick test here, and I go from 3.55M IOPS to 1.23M switching
to deadline, of which 37% of the overhead is from dd_dispatch().

With the posted patch applied, it runs at 2.3M IOPS with mq-deadline,
which is a lot better. This is on my 3970X test box, so 32 cores, 64
threads.

Bart, either we fix this up ASAP and get rid of the percpu counters in
the hot path, or we revert this patch.
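
To be clear about what I mean by getting rid of them in the hot path,
something along these lines would do (rough sketch, not a real patch;
plain per-prio counters kept consistent under dd->lock, field names made
up for illustration):

/* Sketch only: plain per-prio counters updated under dd->lock instead of
 * per-CPU copies, so reading the queue depth in dispatch is a simple
 * subtraction rather than two walks over all possible CPUs.
 */
struct dd_prio_stats {
	u32 inserted;
	u32 completed;
};

static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
{
	lockdep_assert_held(&dd->lock);

	return dd->prio_stats[prio].inserted - dd->prio_stats[prio].completed;
}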

-- 
Jens Axboe
