Message-ID: <20241105121837.GI24862@noisy.programming.kicks-ass.net>
Date: Tue, 5 Nov 2024 13:18:37 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Florian Weimer <fweimer@...hat.com>
Cc: André Almeida <andrealmeid@...lia.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Darren Hart <dvhart@...radead.org>,
Davidlohr Bueso <dave@...olabs.net>, Arnd Bergmann <arnd@...db.de>,
sonicadvance1@...il.com, linux-kernel@...r.kernel.org,
kernel-dev@...lia.com, linux-api@...r.kernel.org,
Nathan Chancellor <nathan@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Subject: Re: [PATCH v2 0/3] futex: Create set_robust_list2

On Mon, Nov 04, 2024 at 01:36:43PM +0100, Florian Weimer wrote:
> * Peter Zijlstra:
>
> > On Sat, Nov 02, 2024 at 10:58:42PM +0100, Florian Weimer wrote:
> >
> >> QEMU hints towards further problems (in linux-user/syscall.c):
> >>
> >>     case TARGET_NR_set_robust_list:
> >>     case TARGET_NR_get_robust_list:
> >>         /* The ABI for supporting robust futexes has userspace pass
> >>          * the kernel a pointer to a linked list which is updated by
> >>          * userspace after the syscall; the list is walked by the kernel
> >>          * when the thread exits. Since the linked list in QEMU guest
> >>          * memory isn't a valid linked list for the host and we have
> >>          * no way to reliably intercept the thread-death event, we can't
> >>          * support these. Silently return ENOSYS so that guest userspace
> >>          * falls back to a non-robust futex implementation (which should
> >>          * be OK except in the corner case of the guest crashing while
> >>          * holding a mutex that is shared with another process via
> >>          * shared memory).
> >>          */
> >>         return -TARGET_ENOSYS;
> >
> > I don't think we can sanely fix that. Can't QEMU track the robust thing
> > itself and use waitpid() to discover the thread is gone and fudge things
> > from there?
>
> There are race conditions with munmap, I think, and they probably get a
> lot worse if QEMU does that.
>
> See Rich Felker's bug report:
>
> | The corruption is performed by the kernel when it walks the robust
> | list. The basic situation is the same as in PR #13690, except that
> | here there's actually a potential write to the memory rather than just
> | a read.
> |
> | The sequence of events leading to corruption goes like this:
> |
> | 1. Thread A unlocks the process-shared, robust mutex and is preempted
> | after the mutex is removed from the robust list and atomically
> | unlocked, but before it's removed from the list_op_pending field of
> | the robust list header.
> |
> | 2. Thread B locks the mutex, and, knowing by program logic that it's
> | the last user of the mutex, unlocks and unmaps it, allocates/maps
> | something else that gets assigned the same address as the shared mutex
> | mapping, and then exits.
> |
> | 3. The kernel destroys the process, which involves walking each
> | thread's robust list and processing each thread's list_op_pending
> | field of the robust list header. Since thread A has a list_op_pending
> | pointing at the address previously occupied by the mutex, the kernel
> | obliviously "unlocks the mutex" by writing a 0 to the address and
> | futex-waking it. However, the kernel has instead overwritten part of
> | whatever mapping thread A created. If this is private memory it
> | (probably) doesn't matter since the process is ending anyway (but are
> | there race conditions where this can be seen?). If this is shared
> | memory or a shared file mapping, however, the kernel corrupts it.
> |
> | I suspect the race is difficult to hit since thread A has to get
> | preempted at exactly the wrong time AND thread B has to do a fair
> | amount of work without thread A getting scheduled again. So I'm not
> | sure how much luck we'd have getting a test case.
>
>
> <https://sourceware.org/bugzilla/show_bug.cgi?id=14485#c3>
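
To make the window in step 1 of that report concrete, here is a minimal single-threaded model of the userspace unlock sequence. The struct and function names (model_mutex, model_robust_head, model_unlock_until_window, model_unlock_finish) are made up for illustration and are neither glibc's nor the kernel's; the real header is struct robust_list_head in the uapi futex header. It only models the userspace side and assumes the mutex sits at the head of the list.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the robust list structures. */
struct model_mutex {
    struct model_mutex *next;   /* link in the owner's robust list */
    _Atomic uint32_t futex;     /* 0 = unlocked, otherwise owner TID */
};

struct model_robust_head {
    struct model_mutex *list;             /* mutexes currently held */
    struct model_mutex *list_op_pending;  /* mutex being locked/unlocked */
};

/* Unlock steps up to the point where thread A gets preempted. */
static void model_unlock_until_window(struct model_robust_head *h,
                                      struct model_mutex *m)
{
    h->list_op_pending = m;       /* announce the operation */
    h->list = m->next;            /* unlink (assumes m is the list head) */
    atomic_store(&m->futex, 0);   /* others may now lock, unlock, munmap */
    /* Race window: list_op_pending still points at m here. */
}

/* The step thread A never reaches if the process exits first. */
static void model_unlock_finish(struct model_robust_head *h)
{
    h->list_op_pending = NULL;
}
```

Between the two calls the mutex is already free for thread B to unmap, yet the header still carries its address, and that stale pointer is what the exit-time walk ends up acting on.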

So I've only managed to conjure up two horrible solutions for this:

 - put the robust futex operations under user-space RCU, and mandate a
   matching synchronize_rcu() before any munmap() calls.

 - add a robust-barrier syscall that waits until all list_op_pending are
   either NULL or changed since invocation. And mandate this call before
   munmap().

Neither is particularly pretty, I admit, but at least they should work.

But doing this and mandating the alignment thing should at least make
this qemu thing workable, no?

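
Purely as a sketch of the second option's completion condition: no such syscall exists, and robust_barrier_done() below is a hypothetical name. It assumes the caller snapshotted every thread's list_op_pending value at invocation and then re-reads them until the predicate holds.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true once every thread's list_op_pending
 * value is either NULL or differs from the value snapshotted when the
 * barrier was invoked.
 */
static bool robust_barrier_done(void *const *snapshot,
                                void *const *current, size_t nthreads)
{
    for (size_t i = 0; i < nthreads; i++) {
        /* Same non-NULL pointer: thread i may still be mid-operation. */
        if (current[i] != NULL && current[i] == snapshot[i])
            return false;
    }
    return true;
}
```

A thread that finishes and then starts a new operation on the same address would make this wait longer than strictly needed, which errs on the safe side.
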
> We also have a silent unlocking failure because userspace does not know
> about ROBUST_LIST_LIMIT:
>
> Bug 19089 - Robust mutexes do not take ROBUST_LIST_LIMIT into account
> <https://sourceware.org/bugzilla/show_bug.cgi?id=19089>
>
> (I think we may have discussed this one before, and you may have
> suggested to just hard-code 2048 in userspace because the constant is
> not expected to change.)
>
> So the in-mutex linked list has quite a few problems even outside of
> emulation. 8-(

It's futex, of course it's a pain in the arse :-)

And yeah, no better ideas on that limit for now...
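
For reference, hard-coding the limit in userspace could look roughly like the sketch below. The 2048 mirrors ROBUST_LIST_LIMIT from the kernel's uapi futex header; everything else (the model_* names, the per-thread counter, returning EAGAIN) is an assumption made for illustration and not what glibc does today.

```c
#include <errno.h>
#include <stddef.h>

/* Mirrors ROBUST_LIST_LIMIT from the kernel's uapi futex header. */
#define MODEL_ROBUST_LIST_LIMIT 2048

/* Hypothetical per-thread bookkeeping; not glibc's actual data structure. */
struct model_robust_state {
    size_t held;   /* robust mutexes currently on this thread's list */
};

/* Returns 0 if another robust mutex may be enqueued, EAGAIN otherwise. */
static int model_robust_enqueue(struct model_robust_state *s)
{
    if (s->held >= MODEL_ROBUST_LIST_LIMIT)
        return EAGAIN;   /* entries past the limit would not be walked */
    s->held++;
    return 0;
}

/* Called when a robust mutex is removed from the thread's list. */
static void model_robust_dequeue(struct model_robust_state *s)
{
    if (s->held > 0)
        s->held--;
}
```

Failing the lock loudly at least turns the silent unlocking failure into something the caller can see.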