Message-ID: <1968222200.618999.1750242067613@office-sso.mailbox.org>
Date: Wed, 18 Jun 2025 12:21:07 +0200 (CEST)
From: Jakub Wartak <jakub.wartak@...lbox.org>
To: "anthony.yznaga@...cle.com" <anthony.yznaga@...cle.com>
Cc: "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"andreyknvl@...il.com" <andreyknvl@...il.com>,
"arnd@...db.de" <arnd@...db.de>,
"brauner@...nel.org" <brauner@...nel.org>,
"catalin.marinas@....com" <catalin.marinas@....com>,
"dave.hansen@...el.com" <dave.hansen@...el.com>,
"david@...hat.com" <david@...hat.com>,
"ebiederm@...ssion.com" <ebiederm@...ssion.com>,
"khalid@...nel.org" <khalid@...nel.org>,
"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"luto@...nel.org" <luto@...nel.org>,
"markhemm@...glemail.com" <markhemm@...glemail.com>,
"maz@...nel.org" <maz@...nel.org>,
"mhiramat@...nel.org" <mhiramat@...nel.org>,
"neilb@...e.de" <neilb@...e.de>, "pcc@...gle.com" <pcc@...gle.com>,
"rostedt@...dmis.org" <rostedt@...dmis.org>,
"vasily.averin@...ux.dev" <vasily.averin@...ux.dev>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"willy@...radead.org" <willy@...radead.org>,
"xhao@...ux.alibaba.com" <xhao@...ux.alibaba.com>
Subject: Re: [PATCH v2 00/20] Add support for shared PTEs across processes
Hi all,
I wanted to share some results. I modified PostgreSQL (master) to use the msharefs patchset (v2) proposed here, on top of a linux-6.14.7 kernel, as I suspected that sharing PTEs might help in some cases, especially with high process counts. Traditionally, high process counts are an anti-pattern in PostgreSQL, and running that many backends (processes) is not recommended for various reasons. I was researching the exact reasons why (there are plenty of others too); in short, that is how I came to suspect dTLB misses, followed up on PTEs, and finally arrived here: msharefs.
I've tried it in a couple of scenarios and it always helps (+5% .. 40%) in artificial pgbench read-only measurements on any machine, but here I'm posting results for this setup:
a. a properly isolated legacy SMP box in my homelab (4s32c64/4xNUMA nodes, Xeon 46xx, 128GB RAM)
b. PostgreSQL's pgbench OLTP-like benchmark was used with -c $c -j 64 -S -T 60 -P 1
c. PostgreSQL's shared_buffers (shared memory) = 32GB
d. pgbench -i -s 2000 (~31GB; all used data was in shared memory, not in the VFS cache, to avoid syscalls)
e. no hugepages were used, as msharefs does not seem to support them yet (but Anthony already told me he's on it)
f. I also used cpupower with the performance governor, D0 and no_turbo, and the data was prewarmed.
Again, running PostgreSQL with 8k or 16k processes is not the way to go, but it illustrates well that the fork() model (1 client = 1 process) can really benefit from msharefs:
shared_memory_type=mmap (default on Linux is mmap(MAP_SHARED)+fork())
c=8000 tps = 143-150k (~4s to init all conns)
c=16000 tps = 130-140k (~50s-70s! to init all conns! had to extend benchmark, lots of fork()!)
shared_memory_type=msharefs (literally the same as above, but with open()/fallocate()/ioctl()/mmap()+fork()):
c=8000 tps = ~189k (3s to init all conns)
c=16000 tps = ~189k (6s to init all conns)
That's 1.35x - 1.45x at c=16000 (and roughly 1.26x - 1.32x at c=8000).
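For anyone who wants to re-derive the ratios from the tps figures quoted above, here is a minimal Python sketch. It uses only the numbers in this mail; the helper name `speedup_range` is made up for illustration.

```python
def speedup_range(baseline_tps, new_tps):
    """Return (min, max) speedup of new_tps over a (low, high) baseline tps range."""
    low, high = baseline_tps
    # Smallest speedup is measured against the best baseline run,
    # largest against the worst baseline run.
    return new_tps / high, new_tps / low


# tps in thousands, copied from the results above
mmap_tps = {8000: (143, 150), 16000: (130, 140)}
msharefs_tps = {8000: 189, 16000: 189}

for clients in sorted(mmap_tps):
    lo, hi = speedup_range(mmap_tps[clients], msharefs_tps[clients])
    print(f"c={clients}: {lo:.2f}x - {hi:.2f}x")
```

The 1.35x - 1.45x range corresponds to the c=16000 runs; the c=8000 runs come out somewhat lower.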
An illustrative one-second sample of `perf stat -a -e ...` during the runs with 16k processes:
# mmap:
#          time        counts unit events
  190.223101118   15257144598      cycles
  190.223101118   10485389437      instructions        #  0.69 insn per cycle
  190.223101118         34413      context-switches
  190.223101118           703      cpu-migrations
  190.223101118             0      major-faults
  190.223101118        256302      minor-faults
  190.223101118    3922621887      dTLB-loads
  190.223101118      12520660      dTLB-load-misses    #  0.32% of all dTLB cache accesses
# msharefs:
#          time        counts unit events
  105.122916131   15256454170      cycles
  105.122916131   10732582790      instructions        #  0.70 insn per cycle
  105.122916131         38420      context-switches
  105.122916131          1125      cpu-migrations
  105.122916131             0      major-faults
  105.122916131         34304      minor-faults
  105.122916131    4143569524      dTLB-loads
  105.122916131      12179260      dTLB-load-misses    #  0.29% of all dTLB cache accesses
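To compare the two samples side by side, the derived metrics can be recomputed from the raw counters. This is only a sketch over the two one-second samples quoted above, not a statistical comparison; the dictionary layout and the `derived` helper are made up for illustration.

```python
# Counter values copied from the two one-second perf stat samples above.
samples = {
    "mmap": {
        "cycles": 15257144598,
        "instructions": 10485389437,
        "minor-faults": 256302,
        "dTLB-loads": 3922621887,
        "dTLB-load-misses": 12520660,
    },
    "msharefs": {
        "cycles": 15256454170,
        "instructions": 10732582790,
        "minor-faults": 34304,
        "dTLB-loads": 4143569524,
        "dTLB-load-misses": 12179260,
    },
}


def derived(counters):
    """Compute instructions per cycle and the dTLB load miss rate in percent."""
    return {
        "ipc": counters["instructions"] / counters["cycles"],
        "dtlb_miss_pct": 100.0 * counters["dTLB-load-misses"] / counters["dTLB-loads"],
    }


for name, counters in samples.items():
    d = derived(counters)
    print(f"{name}: ipc={d['ipc']:.2f} dTLB miss rate={d['dtlb_miss_pct']:.2f}%")

# Ratio of minor faults per sampled second, mmap relative to msharefs.
fault_ratio = samples["mmap"]["minor-faults"] / samples["msharefs"]["minor-faults"]
print(f"minor-faults ratio (mmap / msharefs): {fault_ratio:.1f}x")
```

In these two samples the IPC and dTLB miss rates are close to each other, while the minor-fault counts differ by roughly 7.5x.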
On smaller hardware and on a single socket there are similar gains even at lower process counts, but the more processes run concurrently and access shared memory, the bigger the performance boost. I hope this feedback is useful (so it's not only about lowering memory use for PTEs, but also quite a nice perf. boost). I would also like to thank Anthony and Khalid for answering some initial questions outside the mailing list.
BTW, I have not yet posted this to the main PostgreSQL hacking mailing list, well... because there's no kernel to support it in the first place ;)
-J.
p.s. I'm not subscribed to linux-mm, so please CC me.