Date:   Fri, 23 Jun 2023 14:37:17 +0000
From:   Chuck Lever III <chuck.lever@...cle.com>
To:     Tejun Heo <tj@...nel.org>
CC:     open list <linux-kernel@...r.kernel.org>,
        Linux NFS Mailing List <linux-nfs@...r.kernel.org>
Subject: Re: contention on pwq->pool->lock under heavy NFS workload



> On Jun 22, 2023, at 3:39 PM, Chuck Lever III <chuck.lever@...cle.com> wrote:
> 
> 
> 
>> On Jun 22, 2023, at 3:23 PM, Tejun Heo <tj@...nel.org> wrote:
>> 
>> Hello,
>> 
>> On Thu, Jun 22, 2023 at 03:45:18PM +0000, Chuck Lever III wrote:
>>> The good news:
>>> 
>>> On stock 6.4-rc7:
>>> 
>>> fio 8k [r=108k,w=46.9k IOPS]
>>> 
>>> On the affinity-scopes-v2 branch (with no other tuning):
>>> 
>>> fio 8k [r=130k,w=55.9k IOPS]
>> 
>> Ah, okay, that's probably coming from per-cpu pwq. Didn't expect that to
>> make that much difference but that's nice.
> 
> "cpu" and "smt" work equally well on this system.
> 
> "cache", "numa", and "system" work equally poorly.
> 
> I have HT disabled, and there's only one NUMA node, so
> the difference here is plausible.
> 
> 
>>> The bad news:
>>> 
>>> pool->lock is still the hottest lock on the system during the test.
>>> 
>>> I'll try some of the alternate scope settings this afternoon.
>> 
>> Yeah, in your system, there's still gonna be one pool shared across all
>> CPUs. SMT or CPU may behave better but it might make sense to add a way to
>> further segment the scope so that e.g. one can split a cache domain N-ways.
> 
> If there could be more than one pool to choose from, then these
> WQs would not be hitting the same lock. Alternately, finding a
> lockless way to queue the work on a pool would be a huge win.

Following up with a few more tests.

I'm using NFS/RDMA for my test because I can drive more IOPS with it.

I've found that setting the nfsiod and rpciod workqueues to "cpu"
scope provides the greatest benefit for this workload. Changing the
xprtiod workqueue to "cpu" had no discernible effect.
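
For anyone wanting to reproduce the setting, here is a minimal sketch of
flipping the scope from user space. It assumes the affinity-scopes branch
exposes an "affinity_scope" attribute under
/sys/devices/virtual/workqueue/<wq>/ and that rpciod and nfsiod are visible
there (WQ_SYSFS); both are assumptions on my part, not verified in this
thread.

	/*
	 * Sketch: write a scope name ("cpu", "smt", "cache", "numa",
	 * "system") into an unbound workqueue's assumed affinity_scope
	 * sysfs attribute.
	 */
	#include <stdio.h>
	#include <errno.h>

	static int set_wq_scope(const char *wq, const char *scope)
	{
		char path[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/virtual/workqueue/%s/affinity_scope",
			 wq);
		f = fopen(path, "w");
		if (!f)
			return -errno;
		fputs(scope, f);
		fclose(f);
		return 0;
	}

	int main(void)
	{
		/* "rpciod" and "nfsiod" are the workqueues discussed above. */
		set_wq_scope("rpciod", "cpu");
		set_wq_scope("nfsiod", "cpu");
		return 0;
	}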

This tracks with the number of queue_work calls for each of these
WQs: 59% of the queue_work calls during the test are for the rpciod
WQ, 21% are for nfsiod, and 2% are for xprtiod.
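
In case it's useful, here is roughly how those percentages can be
gathered, as a sketch only: enable the workqueue_queue_work tracepoint
during the run and count events per workqueue name afterwards. The
tracefs path and the "workqueue=<name>" field format are assumptions
about the local tracing setup rather than anything from this thread.

	/*
	 * Sketch: scan the trace buffer for workqueue_queue_work events
	 * and tally them per workqueue of interest.
	 */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		FILE *f = fopen("/sys/kernel/tracing/trace", "r");
		char line[1024];
		unsigned long rpciod = 0, nfsiod = 0, xprtiod = 0, total = 0;

		if (!f) {
			perror("open trace");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			if (!strstr(line, "workqueue_queue_work"))
				continue;
			total++;
			if (strstr(line, "workqueue=rpciod"))
				rpciod++;
			else if (strstr(line, "workqueue=nfsiod"))
				nfsiod++;
			else if (strstr(line, "workqueue=xprtiod"))
				xprtiod++;
		}
		fclose(f);
		printf("total=%lu rpciod=%lu nfsiod=%lu xprtiod=%lu\n",
		       total, rpciod, nfsiod, xprtiod);
		return 0;
	}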


The same test run over TCP (using IP-over-IB on the same physical network)
shows no improvement in any configuration. That suggests that, when using
TCP, throughput is limited by a bottleneck somewhere else.


--
Chuck Lever

