Message-ID: <CAFTs51WOGye9EiJEinA=k4rzBptKmzZheg8ZsELwpZ71bZsJ3A@mail.gmail.com>
Date: Wed, 17 Mar 2021 11:43:32 -0700
From: Peter Oskolkov <posk@...k.io>
To: Jim Newsome <jnewsome@...project.org>
Cc: Peter Oskolkov <posk@...gle.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Rob Jansen <rob.g.jansen@....navy.mil>,
Ryan Wails <ryan.wails@....navy.mil>,
Paul Turner <pjt@...gle.com>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>, Ben Segall <bsegall@...gle.com>
Subject: Re: [RFC PATCH 0/3 v3] futex/sched: introduce FUTEX_SWAP operation
Hi Jim, thank you for your interest!
While FUTEX_SWAP seems to be a nonstarter, there is a discussion
off-list on how to approach the larger problem of userspace
scheduling. A full userspace scheduling patchset is likely to take
some time to shape out, but the "core" patches of wait/wake/swap are
more or less ready, so I'll probably post an early RFC version here in
the next week or two.
CC-ing the maintainers.
Thanks,
Peter
On Wed, Mar 17, 2021 at 10:59 AM Jim Newsome <jnewsome@...project.org> wrote:
>
> I'm not well versed in this part of the kernel (ok, any part, really),
> but I wanted to chime in from a user perspective that I'm very
> interested in this functionality.
>
> We (Rob + Ryan + I, cc'd) are currently developing the second generation
> of the Shadow simulator <https://shadow.github.io/>, which is used by
> various researchers and the Tor Project. In this new architecture,
> simulated network-application processes (such as tor, browsers, and web
> servers) are each run as a native OS process, started by forking and
> exec'ing its unmodified binary. We are interested in supporting large
> simulations (e.g. 50k+ processes), and expect them to take on the order
> of hours or even days to execute, so scalability and performance matter.
>
> We've prototyped two mechanisms for controlling these simulated
> processes, and a third hybrid mechanism that combines the two. I've
> mentioned one of these (ptrace) in another thread ("do_wait: make
> PIDTYPE_PID case O(1) instead of O(n)"). The other mechanism is to use
> an LD_PRELOAD'd shim that implements the libc interface, and
> communicates with Shadow via a syscall-like API over IPC.
>
> So far the most performant version we've tried of this IPC is with a bit
> of shared memory and a pair of semaphores. It looks much like the
> example in Peter's proposal:
>
> > a. T1: futex-wake T2, futex-wait
> > b. T2: wakes, does what it has been woken to do
> > c. T2: futex-wake T1, futex-wait
>
> We've been able to get the switching costs down using CPU pinning and
> SCHED_FIFO. Each physical CPU spends most of its time swapping back and
> forth between a Shadow worker thread and an emulated process. Even so,
> the new architecture is so far slower than the first generation of
> Shadow, which multiplexes the simulated processes into its own handful
> of OS processes (but is complex and fragile).
>
> > With FUTEX_SWAP, steps a and c above can be reduced to one futex
> > operation that runs 5-10 times faster.
>
> IIUC the proposed primitives could let us further improve performance,
> and perhaps drop some of the complexity of attempting to control the
> scheduler via pinning and SCHED_FIFO.
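
For readers unfamiliar with the pattern, the wake/wait handoff quoted above (steps a to c) can be sketched with two threads and a pair of semaphores. This is only an illustration, not Shadow's code and not the FUTEX_SWAP patch: the names ping_pong, to_worker, to_main and shared are hypothetical, the dictionary stands in for the shared memory, and Python semaphores stand in for raw futex calls.

```python
import threading


def ping_pong(rounds):
    """Hand control back and forth between two threads (T1 and T2)."""
    to_worker = threading.Semaphore(0)  # T1 posts this to wake T2
    to_main = threading.Semaphore(0)    # T2 posts this to wake T1
    shared = {"request": None, "reply": None}  # stand-in for shared memory
    log = []  # records which side ran, to show strict alternation

    def worker():  # plays the role of T2
        for _ in range(rounds):
            to_worker.acquire()                      # wait until T1 wakes us
            log.append(("T2", shared["request"]))
            shared["reply"] = shared["request"] * 2  # step b: do the work
            to_main.release()                        # step c: wake T1, then wait again

    t2 = threading.Thread(target=worker)
    t2.start()

    replies = []
    for i in range(rounds):  # the calling thread plays the role of T1
        shared["request"] = i
        log.append(("T1", i))
        to_worker.release()  # step a: wake T2 ...
        to_main.acquire()    # ... then wait for T2 to hand control back
        replies.append(shared["reply"])

    t2.join()
    return replies, log


if __name__ == "__main__":
    replies, log = ping_pong(3)
    print(replies)
    print(log)
```

Each handoff here takes two separate operations (a release followed by an acquire), which mirrors the wake-then-wait pair in steps a and c. As described in the quoted proposal, FUTEX_SWAP would merge each such pair into a single operation.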