Message-ID: <m2y0nqwbzg.wl-thehajime@gmail.com>
Date: Fri, 28 Nov 2025 21:57:55 +0900
From: Hajime Tazaki <thehajime@...il.com>
To: johannes@...solutions.net
Cc: hch@...radead.org,
	linux-um@...ts.infradead.org,
	ricarkol@...gle.com,
	Liam.Howlett@...cle.com,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v13 00/13] nommu UML


On Tue, 25 Nov 2025 18:58:53 +0900,
Johannes Berg wrote:
> 
> On Wed, 2025-11-12 at 17:52 +0900, Hajime Tazaki wrote:
> > > >   What is it for ?
> > > >   ================
> > > >   
> > > >   - Alleviate syscall hook overhead implemented with ptrace(2)
> > > >   - To exercises nommu code over UML (and over KUnit)
> > > >   - Less dependency to host facilities
> > > 
> > > FWIW, in some way, this order of priorities is exactly why this hasn't
> > > been going anywhere, and every time I looked at it I got somewhat
> > > annoyed by what seems to me like choices made to support especially the
> > > first bullet.
> > 
> > over the past versions, I've been emphasized that the 2nd bullet (testing)
> > is the primary usecase as I saw several actually cases from mm folks,
> > 
> > https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
> > https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d@lucifer.local/
> > 
> > and I think this is not limited to mm code.
> 
> Not sure there's much value in testing much else in no-MMU, but sure,
> I'll give you that it's useful for testing.

Under the tree:

% global -xr CONFIG_MMU | grep ifndef  | grep -v -E "arch/|mm/" | wc -l
45

This is only a rough picture, but there are places to be tested other
than the mm codebase.

> > other 2 bullets are additional benefits which we observed in a
> > comment, and our experience.
> 
> But are they really _worthwhile_ benefits? A lot of this design adds
> additional complexity, and it doesn't really seem necessary for the
> testing use case. Making it faster is nice, but it's not like the
> speedup really is 20x for arbitrary tests, that's just for corner cases
> like "sit in a loop of gettimeofday()". And for kunit there's no syscall
> boundary at all, so there's no speedup.

I agree, and as I said, the reason for taking a single-host-process
approach is the speed and the simplicity of removing interaction
between host processes.

I have never claimed that tests should execute fast, and I agree that
kunit doesn't benefit from the speedup since there is no syscall
boundary (unless the kunit-uapi patch goes in).

> > > I suspect that the first and third bullet are not even really true any
> > > more, since you moved to seccomp (per our request), yet I think design
> > > choices influenced by them persist.
> > 
> > this observation is not true; the first bullet is still true even
> > using seccomp.  please look at the benchmark result in the patch
> > [12/13], quoted below.
> 
> > [snip]
> 
> So thanks for the correction. If that's the case, however, it means the
> speedup can't be due to the syscall boundary itself (seccomp) but must
> rather be due to some pagefault/mapping handling issue? Which would be
> inherent in no-MMU, even taking an approach of using two host processes
> rather than embedding everything into one.

I'll explain this later in this email.

# nommu doesn't have page faults as there are only physical addresses.

> > > However, I'm not yet convinced that all of the complexities presented in
> > > this patchset (such as completely separate seccomp implementation) are
> > > actually necessary in support of _just_ the second bullet. These seem to
> > > me like design choices necessary to support the _first_ bullet [1].
> > 
> > separate seccomp implementation is indeed needed due to the design
> > choice we made, to use a single process to host a (um) userspace.
> 
> That sounds misleading or even wrong to me, I'd say it's due to putting
> the (um) userspace in the same host process as the kernel space?

I'm not sure how this differs from my explanation...

> > I don't see why you see this as a _complexity_, as functionally both
> > seccomp handling don't interfere each other.
> 
> The complexity isn't so much in the separate code, which is a small
> factor, but in the "put everything into the same process" aspect of it.
> That has consequences around the host context state handling, things we
> didn't really need to consider before suddenly become crucially
> important. In the current (with-MMU) design, we only need to worry about
> being able to correctly switch between userspace tasks/threads within a
> userspace mm (host) process. With the no-MMU design you propose, we also
> need to be able to correctly switch between kernel and userspace tasks
> within the same single (host) process.
> 
> I think this is a pretty significant difference, and saying "there's no
> complexity here" is simply pretending it isn't a relevant difference. I
> believe you're not even handling this correctly right now in this patch
> set, specifically wrt. the GS register which has been pointed out
> before, but I wouldn't say that I even have a complete picture in my
> head over what state handling would be necessary and sufficient.
> 
> So yeah, I think this warrants taking another look as to whether or not
> the approach of putting everything into the same host process is even
> worth it. I tend to believe that it isn't, given the use cases. And if
> you say the speedup still is with seccomp, that kills the speed argument
> too.

I understand your concern about complexity, thanks for the details.

The host context state handling is indeed a new thing.  We've only
verified a limited set of code paths, with basic operation of um +
drivers and some userspace programs.  It is surely not perfect at this
moment, but it can be improved.

> > > I've thought about what would happen if we stuck to creating a (single)
> > > separate process on the host to execute userspace, and just used
> > > CLONE_VM for it. That way, it's still no-MMU with full memory access,
> > > but there's some implicit isolation between the kernel and userspace
> > > processes which will likely remove complexities around FP/SSE/AVX
> > > handling, may completely remove the need for a separate seccomp
> > > implementation, etc.
> > 
> > this would be doable I think, but we went the different way, as
> > using separate host processes (with ptrace/seccomp) is slow and add
> > complexity by the synchronization between processes, which we think
> > it's not easy to maintain in the future.
> 
> Which one is it then, slow or not? Not sure I follow. You just said you
> do have seccomp when comparing speeds, so that in itself doesn't make it
> slow. What synchronization? It'd (have to) be CLONE_VM, but that
> actually _simplifies_ state transfer/synchronization, and we already
> have (to have) state transfer between different userspace threads in the
> same host process for the with-MMU case.

Since I included speed characteristics in the document, I should
explain more about their impact, compared to the existing
design/implementation of uml.

Many documents and articles say uml is slow (the in-tree uml document
also mentions it briefly), but I could not find a detailed analysis,
so I looked closely at how nommu (w/ seccomp) and mmu (w/ seccomp)
behave.

Suppose we have a userspace program running under uml (on seccomp-mmu
or seccomp-nommu).


	struct timespec ts1, ts2;
	clock_gettime(CLOCK_REALTIME, &ts1);  // 1)
	getpid();                             // 2)
	clock_gettime(CLOCK_REALTIME, &ts2);  // 3)

# this is a chunk from the benchmark program used in the document.
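
For clarity, a self-contained sketch of that measurement could look
like the following (this is my reconstruction for illustration, not
the exact program behind the numbers in the document):

	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	int main(void)
	{
		struct timespec ts1, ts2;
		long nsec;

		clock_gettime(CLOCK_REALTIME, &ts1);  // 1)
		getpid();                             // 2)
		clock_gettime(CLOCK_REALTIME, &ts2);  // 3)

		// elapsed time between 1) and 3); under uml each of the
		// three calls above crosses the (um) syscall boundary
		nsec = (ts2.tv_sec - ts1.tv_sec) * 1000000000L +
		       (ts2.tv_nsec - ts1.tv_nsec);
		printf("%ld nsec\n", nsec);
		return 0;
	}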

Then I collected several events (sched_switch, signal_generate, and
sys_enter_futex) via ftrace.
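
A trace like the one below can be collected with something along
these lines (a sketch; the exact invocation and uml boot options I
used may have differed):

% trace-cmd record -e sched:sched_switch -e signal:signal_generate \
      -e syscalls:sys_enter_futex ./vmlinux <uml boot options>
% trace-cmd report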

Looking at the 3 SIGSYS (sig=31) signals generated by the above code,
below is the output of `trace-cmd report`.

- ftrace seccomp-mmu, 2)-3) = 11 usec
 uml-userspace-3092637 [002] 1749286.670199: signal_generate:      sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0    => 1)
 uml-userspace-3092637 [002] 1749286.670200: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x7fffffffdf8c val=1
 uml-userspace-3092637 [002] 1749286.670201: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x7fffffffdf8c val=0x00000001 utime=0x00000000
 uml-userspace-3092637 [002] 1749286.670202: sched_switch:         uml-userspace:3092637 [120] S ==> swapper/2:0 [120]
          <idle>-0     [028] 1749286.670203: sched_switch:         swapper/28:0 [120] R ==> vmlinux:3092631 [120]
       vmlinux-3092631 [028] 1749286.670205: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x60b64f8c val=1
       vmlinux-3092631 [028] 1749286.670206: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x60b64f8c val=0x00000000 utime=0x00000000
       vmlinux-3092631 [028] 1749286.670207: sched_switch:         vmlinux:3092631 [120] S ==> swapper/28:0 [120]
          <idle>-0     [002] 1749286.670209: sched_switch:         swapper/2:0 [120] R ==> uml-userspace:3092637 [120]
 uml-userspace-3092637 [002] 1749286.670211: signal_generate:      sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0    => 2)
 uml-userspace-3092637 [002] 1749286.670212: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x7fffffffdf8c val=1
 uml-userspace-3092637 [002] 1749286.670213: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x7fffffffdf8c val=0x00000001 utime=0x00000000
 uml-userspace-3092637 [002] 1749286.670214: sched_switch:         uml-userspace:3092637 [120] S ==> swapper/2:0 [120]
          <idle>-0     [028] 1749286.670215: sched_switch:         swapper/28:0 [120] R ==> vmlinux:3092631 [120]
       vmlinux-3092631 [028] 1749286.670216: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x60b64f8c val=1
       vmlinux-3092631 [028] 1749286.670217: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x60b64f8c val=0x00000000 utime=0x00000000
       vmlinux-3092631 [028] 1749286.670218: sched_switch:         vmlinux:3092631 [120] S ==> swapper/28:0 [120]
          <idle>-0     [002] 1749286.670220: sched_switch:         swapper/2:0 [120] R ==> uml-userspace:3092637 [120]
 uml-userspace-3092637 [002] 1749286.670222: signal_generate:      sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0    => 3)


- ftrace seccomp-nommu, 2)-3) =  3 usec
       vmlinux-3092542 [006] 1749158.829292: signal_generate:      sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0    => 1)
       vmlinux-3092542 [006] 1749158.829294: signal_generate:      sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0    => 2)
       vmlinux-3092542 [006] 1749158.829297: signal_generate:      sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0    => 3)

With seccomp-mmu, the host process for userspace (uml-userspace) is
notified with SIGSYS (sig=31) upon a syscall from userspace, the host
then switches tasks to vmlinux (the um kernel) via the futex wake/wait
synchronization (this is what I meant by synchronization in my
previous email), and then switches back to uml-userspace to continue
the userspace process.

So there are at least 4 host sched_switches per single um syscall.

With the current nommu design using a single host process, the
notification via SIGSYS is the same as with seccomp-mmu, but after
that there is no host context switch for a syscall issued by
userspace; execution stays in the same context until the next syscall.

A nommu implementation with CLONE_VM (btw, the host process
uml-userspace is already created with the CLONE_VM flag IIUC) might
face a situation similar to seccomp-mmu, seeing the same switches
between processes.

This accounts for the difference in the getpid benchmark results,
where um-mmu (seccomp) vs. um-nommu (seccomp) is roughly 10x (26.242
vs. 2.599 usec) (this was described as an example benchmark in the
patchset).

I didn't look at the ptrace mode of MMU, but I expect a similar (or
longer) duration for a single syscall.




In addition to the ftrace measurement above, I ran more practical
benchmarks with iperf3 (forward/reverse path) and netperf
(TCP_STREAM/MAERTS), which I believe aren't corner cases, and below
are the results.

All of them use the vector driver with GRO enabled, via host tap
devices.  The iperf3/netperf server runs on the host and the client
runs inside uml.

# I can give a complete script to reproduce this if needed.
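
For reference, a rough sketch of the client/server commands (the host
IP, option placement, and the tap/vector setup here are assumptions,
not the actual script):

# on the host (server side)
% iperf3 -s &
% netserver

# inside uml (client side), over the vector/tap interface
% iperf3 -c $HOST_IP          # iperf3(f), forward path
% iperf3 -c $HOST_IP -R       # iperf3(r), reverse path
% netperf -H $HOST_IP -t TCP_STREAM -- -m 65507
% netperf -H $HOST_IP -t TCP_MAERTS -- -M 65507   # remote send size for the reverse direction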


- iperf3 (Mbps)
                  um-mmu(seccomp)   um-nommu(seccomp)
----------------------------------------------------
iperf3(f)             7984              13152
iperf3(r)             8009              14363

- netperf (Mbps, bufsize=65507 bytes)
                  um-mmu(seccomp)   um-nommu(seccomp)
----------------------------------------------------
netperf(STREAM)      5912.93          10792.02
netperf(MAERTS)     29263.53          33970.06


The difference is not as significant as with the simple getpid(2)
syscall benchmark, but there is still a visible impact.

I would say these results only show part of what UML can do, and
different workloads may show different results, but it is still
valuable to present them as one of the benefits, to see the nature of
the feature (of what the single-process design can do).

Of course, nommu comes with various limitations, as I described in the
document; applications have to be aware that the kernel is nommu
(i.e., they need to use vfork, PIE binaries, etc.).  So traditional
uml is more generic and has broader usage, but given this speed
characteristic of nommu, I think it is worthwhile, and users who need
speed will benefit from it.

I hope this clarifies a bit.

-- Hajime
