linux-kernel - Re: [PATCH v2 00/20] Add support for shared PTEs across processes

Message-ID: <1968222200.618999.1750242067613@office-sso.mailbox.org>
Date: Wed, 18 Jun 2025 12:21:07 +0200 (CEST)
From: Jakub Wartak <jakub.wartak@...lbox.org>
To: "anthony.yznaga@...cle.com" <anthony.yznaga@...cle.com>
Cc: "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"andreyknvl@...il.com" <andreyknvl@...il.com>,
	"arnd@...db.de" <arnd@...db.de>,
	"brauner@...nel.org" <brauner@...nel.org>,
	"catalin.marinas@....com" <catalin.marinas@....com>,
	"dave.hansen@...el.com" <dave.hansen@...el.com>,
	"david@...hat.com" <david@...hat.com>,
	"ebiederm@...ssion.com" <ebiederm@...ssion.com>,
	"khalid@...nel.org" <khalid@...nel.org>,
	"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"luto@...nel.org" <luto@...nel.org>,
	"markhemm@...glemail.com" <markhemm@...glemail.com>,
	"maz@...nel.org" <maz@...nel.org>,
	"mhiramat@...nel.org" <mhiramat@...nel.org>,
	"neilb@...e.de" <neilb@...e.de>, "pcc@...gle.com" <pcc@...gle.com>,
	"rostedt@...dmis.org" <rostedt@...dmis.org>,
	"vasily.averin@...ux.dev" <vasily.averin@...ux.dev>,
	"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
	"willy@...radead.org" <willy@...radead.org>,
	"xhao@...ux.alibaba.com" <xhao@...ux.alibaba.com>
Subject: Re: [PATCH v2 00/20] Add support for shared PTEs across processes

Hi all,

I wanted to share some results. I modified PostgreSQL (master) to use the msharefs patchset (v2) proposed here, on top of a linux-6.14.7 kernel, as I suspected that sharing PTEs might help in some cases, especially with high process counts. Traditionally, high process counts are an anti-pattern in PostgreSQL, and running that many backends (processes) is not recommended for various reasons. I was researching the exact reasons why (there are plenty of others too), and in short that is how I came to suspect dTLB misses, followed up on PTEs, and finally arrived here: msharefs.

I've tried it on a couple of scenarios and it always helps (+5% .. 40%) in artificial pgbench read-only measurements on any machine, but the results I'm posting here come from the following setup:
a. a properly isolated legacy SMP box in my homelab (4s32c64/4xNUMA nodes, Xeon 46xx, 128GB RAM)
b. PostgreSQL's pgbench OLTP-like benchmark, run with -c $c -j 64 -S -T 60 -P 1
c. PostgreSQL's shared_buffers (shared memory) = 32GB
d. pgbench -i -s 2000 (~31GB; all used data was in shared memory, not in the VFS cache, to avoid syscalls)
e. no hugepages, as msharefs does not seem to support them yet (but Anthony already told me he's on it)
f. cpupower with the performance governor, D0 and no_turbo; the data was prewarmed.

Again, running PostgreSQL with 8k or 16k processes is not the way to go, but it illustrates well that the fork() model (1 client = 1 process) can really benefit from msharefs:

shared_memory_type=mmap (default on Linux is mmap(MAP_SHARED)+fork())
 c=8000 tps  = 143-150k (~4s to init all conns)
 c=16000 tps = 130-140k (~50s-70s! to init all conns! had to extend benchmark, lots of fork()!)

shared_memory_type=msharefs (literally same as above, open()/fallocate()/ioctl()/mmap()+fork()):
 c=8000 tps  = ~189k (3s to init all conns)
 c=16000 tps = ~189k (6s to init all conns)

That's 1.35x - 1.45x at c=16000 (and roughly 1.26x - 1.32x at c=8000).
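
For reference, the speedup figures follow directly from the tps numbers listed above. The snippet below is only a sketch of that arithmetic; the dictionary and function names are made up for illustration, and the "~189k" msharefs result is treated as exactly 189000.

```python
# tps numbers copied from the results above (k = 1000).
# Names below are illustrative only; they do not come from any tool.
baseline_tps = {8000: (143_000, 150_000), 16000: (130_000, 140_000)}  # mmap: (low, high)
msharefs_tps = {8000: 189_000, 16000: 189_000}                        # "~189k" taken as 189000


def speedup_range(clients):
    """Return (min, max) speedup of msharefs over mmap for a client count."""
    low, high = baseline_tps[clients]
    new = msharefs_tps[clients]
    # The smallest speedup is against the best baseline run, the largest against the worst.
    return new / high, new / low


for clients in sorted(baseline_tps):
    lo, hi = speedup_range(clients)
    print(f"c={clients}: {lo:.2f}x - {hi:.2f}x")
```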

Illustrative sample of 1 second of `perf stat -a -e ...` during those runs with 16k processes:

# mmap:
#           time             counts unit events
   190.223101118        15257144598      cycles
   190.223101118        10485389437      instructions                     #    0.69  insn per cycle
   190.223101118              34413      context-switches
   190.223101118                703      cpu-migrations
   190.223101118                  0      major-faults
   190.223101118             256302      minor-faults
   190.223101118         3922621887      dTLB-loads
   190.223101118           12520660      dTLB-load-misses                 #    0.32% of all dTLB cache accesses
   
# msharefs:
#           time             counts unit events
   105.122916131        15256454170      cycles
   105.122916131        10732582790      instructions                     #    0.70  insn per cycle
   105.122916131              38420      context-switches
   105.122916131               1125      cpu-migrations
   105.122916131                  0      major-faults
   105.122916131              34304      minor-faults
   105.122916131         4143569524      dTLB-loads
   105.122916131           12179260      dTLB-load-misses                 #    0.29% of all dTLB cache accesses 
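
The derived ratios in the two samples can be recomputed from the raw counters. The sketch below does that and also compares minor faults, which perf does not annotate. The counter values are copied from the samples above; the variable and function names are illustrative.

```python
# Raw counters from the two 1-second perf stat samples above.
samples = {
    "mmap": {
        "cycles": 15257144598,
        "instructions": 10485389437,
        "minor_faults": 256302,
        "dtlb_loads": 3922621887,
        "dtlb_load_misses": 12520660,
    },
    "msharefs": {
        "cycles": 15256454170,
        "instructions": 10732582790,
        "minor_faults": 34304,
        "dtlb_loads": 4143569524,
        "dtlb_load_misses": 12179260,
    },
}


def ipc(s):
    """Instructions per cycle."""
    return s["instructions"] / s["cycles"]


def dtlb_miss_pct(s):
    """dTLB load misses as a percentage of dTLB loads."""
    return 100.0 * s["dtlb_load_misses"] / s["dtlb_loads"]


for name, s in samples.items():
    print(f"{name}: IPC={ipc(s):.2f} dTLB miss={dtlb_miss_pct(s):.2f}% "
          f"minor-faults={s['minor_faults']}")

# How many times fewer minor faults the msharefs sample shows in this one second.
fault_ratio = samples["mmap"]["minor_faults"] / samples["msharefs"]["minor_faults"]
print(f"minor-fault ratio (mmap / msharefs): {fault_ratio:.1f}x")
```

In these samples the largest relative difference is in minor faults, while IPC and the dTLB miss rate move only slightly.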

On smaller, single-socket hardware there are similar gains even at lower process counts, but the more processes that run concurrently and access shared memory, the bigger the performance boost. I hope this feedback is useful (so it's not only about lowering memory use for PTEs, but also quite a nice perf. boost). I would also like to thank Anthony and Khalid for answering some initial questions outside the mailing list.
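
To put the PTE memory aspect in numbers, here is a back-of-the-envelope sketch. It assumes x86-64 with 4 KiB pages and 8-byte PTEs, counts only last-level page table entries, and assumes the worst case in which every process has touched the whole 32GB of shared memory. None of this was measured in the runs above; real backends touch only part of the region.

```python
# Assumptions (not measured): x86-64, 4 KiB pages, 8-byte PTEs, last-level tables only.
PAGE_SIZE = 4096
PTE_SIZE = 8
SHARED_BYTES = 32 * 2**30          # shared_buffers = 32GB, taken as 32 GiB
MIB = 2**20

ptes_per_process = SHARED_BYTES // PAGE_SIZE        # one PTE per mapped 4 KiB page
pte_bytes_per_process = ptes_per_process * PTE_SIZE


def worst_case_pte_bytes(processes):
    """Upper bound on private PTE memory if every process maps and touches every page."""
    return processes * pte_bytes_per_process


print(f"PTEs per process: {ptes_per_process}")
print(f"PTE memory per process: {pte_bytes_per_process // MIB} MiB")
for n in (8000, 16000):
    print(f"{n} processes, unshared worst case: {worst_case_pte_bytes(n) // MIB} MiB")
```

With shared page tables, that per-process cost for the shared region would in principle be paid once instead of once per process.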

BTW I have not yet posted it to the main PostgreSQL hacking mailing list, well... because there's no kernel to support it in the first place ;)

-J.

p.s. I'm not subscribed to linux-mm, so please CC me.