linux-kernel - Re: contention on pwq->pool->lock under heavy NFS workload

Message-ID: <24E8E2D2-F91B-47F6-91BF-02D02750054F@oracle.com>
Date:   Sun, 25 Jun 2023 16:01:38 +0000
From:   Chuck Lever III <chuck.lever@...cle.com>
To:     Tejun Heo <tj@...nel.org>
CC:     open list <linux-kernel@...r.kernel.org>,
        Linux NFS Mailing List <linux-nfs@...r.kernel.org>
Subject: Re: contention on pwq->pool->lock under heavy NFS workload

Hi Tejun-


> On Jun 23, 2023, at 9:44 PM, Tejun Heo <tj@...nel.org> wrote:
> 
> Hey,
> 
> On Fri, Jun 23, 2023 at 02:37:17PM +0000, Chuck Lever III wrote:
>> I'm using NFS/RDMA for my test because I can drive more IOPS with it.
>> 
>> I've found that setting the nfsiod and rpciod workqueues to "cpu"
>> scope provides the best benefit for this workload. Changing the
>> xprtiod workqueue to "cpu" had no discernible effect.
>> 
>> This tracks with the number of queue_work calls for each of these
>> WQs. 59% of queue_work calls during the test are for the rpciod
>> WQ, 21% are for nfsiod, and 2% are for xprtiod.
>> 
>> The same test with TCP (using IP-over-IB on the same physical network)
>> shows no improvement on any test. That suggests there is a bottleneck
>> somewhere else, when using TCP, that limits its throughput.
> 
> Yeah, you can make the necessary workqueues default to CPU or SMT scope
> using apply_workqueue_attrs(). The interface is a bit cumbersome and we
> probably wanna add convenience helpers to switch e.g. affinity scopes but
> it's still just several lines of code.

static ssize_t wq_affn_scope_store(struct device *dev,
                                   struct device_attribute *attr,
                                   const char *buf, size_t count)
{
        struct workqueue_struct *wq = dev_to_wq(dev);
        struct workqueue_attrs *attrs;
        int affn, ret = -ENOMEM;

        affn = parse_affn_scope(buf);
        if (affn < 0)
                return affn;

        apply_wqattrs_lock();             /* takes &wq_pool_mutex */
        attrs = wq_sysfs_prep_attrs(wq);  /* copies the wq_attrs */
        if (attrs) {
                attrs->affn_scope = affn;
                ret = apply_workqueue_attrs_locked(wq, attrs);
        }
        apply_wqattrs_unlock();
        free_workqueue_attrs(attrs);
        return ret ?: count;
}

Both wq_pool_mutex and copy_workqueue_attrs() are static, so having
only apply_workqueue_attrs() is not yet enough to carry this off
in workqueue consumers such as sunrpc.ko.

It looks like padata_setup_cpumasks(), for example, is holding the
CPU read lock, but it doesn't take the wq_pool_mutex, even though
apply_wqattrs_prepare() has a "lockdep_assert_held(&wq_pool_mutex);".
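
To make the shape of the missing helper concrete, here is a rough
userspace model of the copy/modify/apply sequence above. It is only a
sketch, not kernel code: every model_* name is hypothetical, a pthread
mutex stands in for wq_pool_mutex, and a plain flag plus assert()
stands in for lockdep_assert_held(). The point is that the lock, the
copy and the apply step would all sit behind one public entry point,
so a consumer never needs the static pieces.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for workqueue_attrs and workqueue_struct. */
struct model_attrs {
        int affn_scope;
};

struct model_wq {
        struct model_attrs attrs;
};

/* Stands in for the static wq_pool_mutex; private to this file. */
static pthread_mutex_t model_pool_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool model_pool_mutex_held;      /* crude substitute for lockdep */

/* Internal apply step: callers must already hold model_pool_mutex. */
static int model_apply_attrs_locked(struct model_wq *wq,
                                    const struct model_attrs *attrs)
{
        assert(model_pool_mutex_held);  /* models lockdep_assert_held() */
        wq->attrs = *attrs;
        return 0;
}

/*
 * Hypothetical public helper: takes the lock, copies the current
 * attrs, changes only the affinity scope, applies, then unlocks.
 */
int model_wq_set_affn_scope(struct model_wq *wq, int affn)
{
        struct model_attrs *attrs;
        int ret = -ENOMEM;

        if (affn < 0)
                return affn;

        pthread_mutex_lock(&model_pool_mutex);
        model_pool_mutex_held = true;

        attrs = malloc(sizeof(*attrs));
        if (attrs) {
                *attrs = wq->attrs;             /* copy the current attrs */
                attrs->affn_scope = affn;       /* modify one field */
                ret = model_apply_attrs_locked(wq, attrs);
        }

        model_pool_mutex_held = false;
        pthread_mutex_unlock(&model_pool_mutex);
        free(attrs);
        return ret;
}
```

In this model a caller only ever sees model_wq_set_affn_scope(); the
mutex and the locked apply step stay file-local, which is roughly what
a module like sunrpc.ko would need from a real convenience helper.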

I can wait for a v3 of this series so you can construct the public
API the way you prefer.


--
Chuck Lever