Message-ID: <350dba377281866c153b95b4a03c8b6c998b5d8f.camel@oracle.com>
Date: Sat, 22 Nov 2025 00:36:05 +0000
From: Allison Henderson <allison.henderson@...cle.com>
To: "achender@...nel.org" <achender@...nel.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"pabeni@...hat.com" <pabeni@...hat.com>
CC: "rds-devel@....oracle.com" <rds-devel@....oracle.com>,
	"horms@...nel.org" <horms@...nel.org>,
	"edumazet@...gle.com" <edumazet@...gle.com>,
	"kuba@...nel.org" <kuba@...nel.org>,
	"linux-rdma@...r.kernel.org" <linux-rdma@...r.kernel.org>
Subject: Re: [PATCH net-next v2 2/2] net/rds: Give each connection its own
workqueue
On Thu, 2025-11-20 at 11:43 +0100, Paolo Abeni wrote:
> On 11/17/25 9:23 PM, Allison Henderson wrote:
> > From: Allison Henderson <allison.henderson@...cle.com>
> >
> > RDS was written to require ordered workqueues for "cp->cp_wq":
> > Work is executed in the order scheduled, one item at a time.
> >
> > If these workqueues are shared across connections,
> > then work executed on behalf of one connection blocks work
> > scheduled for a different and unrelated connection.
> >
> > Luckily we don't need to share these workqueues.
> > While it obviously makes sense to limit the number of
> > workers (processes) that ought to be allocated on a system,
> > a workqueue that doesn't have a rescue worker attached,
> > has a tiny footprint compared to the connection as a whole:
> > A workqueue costs ~900 bytes, including the workqueue_struct,
> > pool_workqueue, workqueue_attrs, wq_node_nr_active and the
> > node_nr_active flex array. While an RDS/IB connection
> > totals only ~5 MBytes.
>
> The above accounting still looks incorrect to me. AFAICS
> pool_workqueue/cpu_pwq is a per CPU data. On recent hosts it will
> require 64K or more.
I think that's true of bound (per-CPU) queues, but an unbound ordered queue should end up with just a single pool_workqueue. If we follow the call stack __rds_conn_create -> __alloc_workqueue -> alloc_and_link_pwqs, we see this in alloc_and_link_pwqs:
static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
	if (!(wq->flags & WQ_UNBOUND)) {
		... per-cpu pwqs allocated here ...
		return 0;
	}

	if (wq->flags & __WQ_ORDERED) {
		struct pool_workqueue *dfl_pwq;

		ret = apply_workqueue_attrs_locked(wq, ordered_wq_attrs[highpri]);
		/* there should only be single pwq for ordering guarantee */
		dfl_pwq = rcu_access_pointer(wq->dfl_pwq);
		...
	}
	...
}
I just realized in my last response I mentioned the kmem_cache_alloc_node call in this function, but that appears in the
!(wq->flags & WQ_UNBOUND) code path, which is not the path we take. So apologies for the confusion there. We should
end up in the __WQ_ORDERED path, since alloc_ordered_workqueue sets "WQ_UNBOUND | __WQ_ORDERED".
Then in apply_workqueue_attrs_locked -> apply_wqattrs_commit, we have:
ctx->dfl_pwq = install_unbound_pwq(ctx->wq, -1, ctx->dfl_pwq);
The -1 is the cpu parameter, and cpu < 0 assigns the default pool_workqueue (dfl_pwq) used for unbound queues.
Let me know if this helps or if there's a different code path you're seeing that I've missed?
> Also it looks like it would be a WQ per path, up to 8 WQs per connection.
Yes that's true, it should say "connection path," not just "connection". So in the text above, how about:
"A workqueue costs ~900 bytes, including the workqueue_struct,
pool_workqueue, workqueue_attrs, wq_node_nr_active and the
node_nr_active flex array. Each connection can have up to 8
(RDS_MPATH_WORKERS) paths for a worst case of ~7 KBytes per
connection. While an RDS/IB connection totals only ~5 MBytes."
Let me know if that sounds ok, or if you think there should be more detail in the breakdown?
We can also rename the patch to "Give each connection path its own workqueue"
> > So we're getting a significant performance gain
> > (90% of connections fail over under 3 seconds vs. 40%)
> > for a less than 0.02% overhead.
> >
> > RDS doesn't even benefit from the additional rescue workers:
> > of all the reasons that RDS blocks workers, allocation under
> > memory pressure is the least of our concerns. And even if RDS
> > was stalling due to the memory-reclaim process, the work
> > executed by the rescue workers are highly unlikely to free up
> > any memory. If anything, they might try to allocate even more.
> >
> > By giving each connection its own workqueues, we allow RDS
And here, how about:
"By giving each connection path its own workqueue, ..."
?
> > to better utilize the unbound workers that the system
> > has available.
> >
> > Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@...cle.com>
> > Signed-off-by: Allison Henderson <allison.henderson@...cle.com>
> > ---
> > net/rds/connection.c | 13 ++++++++++++-
> > 1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/rds/connection.c b/net/rds/connection.c
> > index dc7323707f450..dcb554e10531f 100644
> > --- a/net/rds/connection.c
> > +++ b/net/rds/connection.c
> > @@ -269,7 +269,15 @@ static struct rds_connection *__rds_conn_create(struct net *net,
> > __rds_conn_path_init(conn, &conn->c_path[i],
> > is_outgoing);
> > conn->c_path[i].cp_index = i;
> > - conn->c_path[i].cp_wq = rds_wq;
> > + conn->c_path[i].cp_wq = alloc_ordered_workqueue(
> > + "krds_cp_wq#%lu/%d", 0,
> > + rds_conn_count, i);
> This has a reasonable chance of failure under memory pressure, what
> about falling back to rds_wq usage instead of shutting down the
> connection entirely?
Sure, we can add it as a fall back, it just means there will be a little extra handling in rds_conn_path_destroy to make
sure we don't tear down rds_wq.
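Something along these lines is what I had in mind (a rough sketch against the net/rds naming, not the actual patch):

	cp->cp_wq = alloc_ordered_workqueue("krds_cp_wq#%lu/%d", 0,
					    rds_conn_count, i);
	if (!cp->cp_wq)
		cp->cp_wq = rds_wq;	/* fall back to the shared wq */

and then on teardown:

	/* in rds_conn_path_destroy() */
	if (cp->cp_wq != rds_wq) {
		destroy_workqueue(cp->cp_wq);
		cp->cp_wq = NULL;
	}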
I hope this has helped some? Let me know what you think or if you think the queue accounting needs more digging. Thanks
for the reviews!
Allison
>
> /P
>