linux-kernel - Re: [RFC PATCH 0/3 v3] futex/sched: introduce FUTEX

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <c85541c4-6ae5-8898-8044-99e403335f48@torproject.org>
Date:   Wed, 17 Mar 2021 12:57:58 -0500
From:   Jim Newsome <jnewsome@...project.org>
To:     Peter Oskolkov <posk@...gle.com>
Cc:     linux-kernel@...r.kernel.org,
        Rob Jansen <rob.g.jansen@....navy.mil>,
        Ryan Wails <ryan.wails@....navy.mil>
Subject: Re: [RFC PATCH 0/3 v3] futex/sched: introduce FUTEX_SWAP operation

I'm not well versed in this part of the kernel (ok, any part, really),
but I wanted to chime in from a user perspective that I'm very
interested in this functionality.

We (Rob + Ryan + I, cc'd) are currently developing the second generation
of the Shadow simulator <https://shadow.github.io/>, which is used by
various researchers and the Tor Project. In this new architecture,
simulated network-application processes (such as tor, browsers, and web
servers) are each run as a native OS process, started by forking and
exec'ing its unmodified binary. We are interested in supporting large
simulations (e.g. 50k+ processes), and expect them to take on the order
of hours or even days to execute, so scalability and performance matters.

We've prototyped two mechanisms for controlling these simulated
processes, and a third hybrid mechanism that combines the two. I've
mentioned one of these (ptrace) in another thread ("do_wait: make
PIDTYPE_PID case O(1) instead of O(n)"). The other mechanism is to use
an LD_PRELOAD'd shim that implements the libc interface, and
communicates with Shadow via a syscall-like API over IPC.

So far the most performant version we've tried of this IPC is with a bit
of shared memory and a pair of semaphores. It looks much like the
example in Peter's proposal:

> a. T1: futex-wake T2, futex-wait
> b. T2: wakes, does what it has been woken to do
> c. T2: futex-wake T1, futex-wait

We've been able to get the switching costs down using CPU pinning and
SCHED_FIFO. Each physical CPU spends most of its time swapping back and
forth between a Shadow worker thread and an emulated process. Even so,
the new architecture is so far slower than the first generation of
Shadow, which multiplexes the simulated processes into its own handful
of OS processes (but is complex and fragile).

> With FUTEX_SWAP, steps a and c above can be reduced to one futex
> operation that runs 5-10 times faster.

IIUC the proposed primitives could let us further improve performance,
and perhaps drop some of the complexity of attempting to control the
scheduler via pinning and SCHED_FIFO.