Message-ID: <d2b270014a2c6343ffc50fd18700fa53040b5203.camel@oracle.com>
Date: Sat, 8 Nov 2025 01:24:45 +0000
From: Allison Henderson <allison.henderson@...cle.com>
To: "achender@...nel.org" <achender@...nel.org>,
        "netdev@...r.kernel.org"
	<netdev@...r.kernel.org>,
        "pabeni@...hat.com" <pabeni@...hat.com>
Subject: Re: [PATCH net-next v1 2/2] net/rds: Give each connection its own
 workqueue

On Thu, 2025-11-06 at 11:52 +0100, Paolo Abeni wrote:
> On 11/4/25 10:23 PM, Allison Henderson wrote:
> > On Tue, 2025-11-04 at 15:57 +0100, Paolo Abeni wrote:
> > > On 10/29/25 6:46 PM, Allison Henderson wrote:
> > > > From: Allison Henderson <allison.henderson@...cle.com>
> > > > 
> > > > RDS was written to require ordered workqueues for "cp->cp_wq":
> > > > Work is executed in the order scheduled, one item at a time.
> > > > 
> > > > If these workqueues are shared across connections,
> > > > then work executed on behalf of one connection blocks work
> > > > scheduled for a different and unrelated connection.
> > > > 
> > > > Luckily we don't need to share these workqueues.
> > > > While it obviously makes sense to limit the number of
> > > > workers (processes) that ought to be allocated on a system,
> > > > a workqueue that doesn't have a rescue worker attached,
> > > > has a tiny footprint compared to the connection as a whole:
> > > > A workqueue costs ~800 bytes, while an RDS/IB connection
> > > > totals ~5 MBytes.
> > > 
> > > Still, a workqueue per connection feels like overkill. Have you considered
> > > moving to WQ_PERCPU for rds_wq? Why does it not fit?
> > > 
> > > Thanks,
> > > 
> > > Paolo
> > > 
> > Hi Paolo
> > 
> > I hadn't thought of WQ_PERCPU before, so I did some digging on it.  In our case, though, we need FIFO behavior per-
> > connection, so if we switched to per-CPU queues, we'd have to pin each connection to a CPU to get the right behavior.
> > That brings back head-of-line blocking, since all the items on that queue would have to share that CPU even if the
> > other CPUs were idle.  So it wouldn't be an equivalent solution for what we're trying to do here.  I hope
> > that makes sense?  Let me know what you think.
> 
> Still, the workqueue per connection adds significantly more overhead than
> your estimate above. I guess ~800 bytes is sizeof(struct workqueue_struct)?
> 
> Please note that such struct contains several dynamically allocated
> pointers, among them per_cpu ones: the overall amount of memory used is
> significantly greater than your estimate. You should provide a more
> accurate one.
> 

Sure, I've done some digging through the workqueue allocation code for our case:

First, here's the size of workqueue_struct (320 bytes):
achender@...kbox:~/mnt/net-next_coverage$ pahole -C workqueue_struct vmlinux | grep size
	/* size: 320, cachelines: 5, members: 24 */

Then that plus other members that get allocated:
achender@...kbox:~/mnt/net-next_coverage$ pahole -C pool_workqueue  vmlinux | grep size
	/* size: 512, cachelines: 8, members: 15 */

achender@...kbox:~/mnt/net-next_coverage$ pahole -C workqueue_attrs | grep size
	/* size: 24, cachelines: 1, members: 3 */
	
achender@...kbox:~/mnt/net-next_coverage$ pahole -C wq_node_nr_active vmlinux  | grep size
	/* size: 32, cachelines: 1, members: 4 */

Plus at least one 8-byte pointer in the node_nr_active flex array at the bottom of workqueue_struct.

So that brings us up to 896 bytes on a single-node x86_64 system.  The node_nr_active flex array is as large as the
number of nodes, so in a multi-node environment the total grows by (32+8) bytes per node.  So the 800-byte
hand-wave is a little low.  If you agree with the above accounting, we can include the breakdown in the commit message,
or just bump up the ballpark figure if that's too wordy.

> Much more importantly, using a workqueue per connection provides a
> scalability gain only to the extent that each workqueue uses a
> different pool and thus creates additional kthread(s). I haven't dived
> into the workqueue implementation, but I think this is not the case. My
> current guesstimate is that you measure some gain because the per-
> connection WQ actually creates (or just uses) a single pool different
> from rds_wq's one.
> 
> Please double check the above.

Sure, so I don't think they get their own pool; they are allocated a part of a shared pool.  In
__rds_conn_create, we have this call stack where the pool_workqueue gets linked to the queue: 
__rds_conn_create -> __alloc_workqueue -> alloc_and_link_pwqs -> kmem_cache_alloc_node

They do, however, get their own kworker.  If we look at alloc_ordered_workqueue in workqueue.h, we get this:

#define alloc_ordered_workqueue(fmt, flags, args...)                    \
        alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)

So the queues are unbound and ordered, and the 1 is the max_active limit: they are allowed to spawn at most one
kworker for that queue. 
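As a userspace analogy (not kernel code), max_active == 1 behaves like a single-worker executor: items complete strictly in submission order, one at a time:

```python
# Sketch: an ordered workqueue is analogous to a single-worker executor,
# so queued items run FIFO, one at a time.
from concurrent.futures import ThreadPoolExecutor

completed = []
with ThreadPoolExecutor(max_workers=1) as wq:  # analogue of max_active == 1
    for i in range(5):
        wq.submit(completed.append, i)
print(completed)  # [0, 1, 2, 3, 4] -- strict FIFO
```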

> 
> Out of sheer ignorance I suspect/hope that replacing the current
> workqueue with alloc_ordered_workqueue() (possibly UNBOUND?!?) will give
> the same scalability improvement with no cost.
> 
> /P
> 

No worries; from looking at the alloc_ordered_workqueue macro, the queues are unbound by default.  But if
we had only one unbound queue with one worker, we'd still have the head-of-line blocking we had before. 
It looks to me like ordered workqueues imply having only one worker (trying to set more than that on an ordered queue
will just error out in workqueue_set_max_active).  And without the ordering, we lose the FIFO behavior we need.  I
think the performance bump we're seeing isn't so much from more workers or pools; it's from getting away from the cross-
connection serialization.

I hope that helps?  Let me know if I missed anything or if you think the commit message should be amended in any way.

Thank you!
Allison
