Message-ID: <93ae88a4-1dac-77bf-37f6-f8708a6d83b7@intel.com>
Date: Mon, 15 May 2023 14:36:54 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"keescook@...omium.org" <keescook@...omium.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"x86@...nel.org" <x86@...nel.org>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [GIT PULL] x86/shstk for 6.4
On 5/12/23 14:55, Linus Torvalds wrote:
>> There's always a race there because mm->mm_users can always get bumped
>> after the fork()er checks it.
> Ugh. I was assuming that if they don't already have a reference to the
> mm, they can't get one (because the '1' comes from 'current->mm', and
> nobody else has that mm).
>
> But maybe that's just not true. Looking around, we have things like
>
> pages->source_mm = current->mm;
> mmgrab(pages->source_mm);
>
> that creates a ref to the mm with a grab, and then later it gets used.
>
> So maybe the "no races can happen" is limited to *both* mm_users and
> mm_count being 1. What would increment it in that case, when 'current'
> is obviously busy forking?
>
> That "both are one" might still be the common case for fork(). Hmm?
get_task_mm() is the devil here. It goes right from having a
task_struct to bumping ->mm_users, no ->mm_count needed. It also bumps
->mm_users while holding task_lock(), which means we can't do something
simple like take mmap_lock in there to avoid racing with fork().
I did hack something together that seems to work for fork() and
get_task_mm(). Basically, we keep get_task_mm()'s legacy behavior as
the fast path. But it diverts over to a slow path if get_task_mm()
notices that an mm's refcounts and mmap_lock are consistent with a
fork() happening elsewhere.
The slow path releases the task_lock() and acquires mmap_lock so it can
sleep until the (potential) fork() is finished.
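The shape of that split might look roughly like this. This is a userspace
sketch, not the actual patch: the struct, the field names, the global
task_lock, and the "doing_fast_fork" flag are all illustrative stand-ins
(a pthread mutex plays the role of mmap_lock), and the real kernel code
would obviously look different:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel structures; field names are
 * illustrative, not taken from the real patch. */
struct mm_sim {
	atomic_int mm_users;		/* like mm->mm_users */
	atomic_int mm_count;		/* like mm->mm_count */
	pthread_mutex_t mmap_lock;	/* stands in for mmap_lock */
	atomic_bool doing_fast_fork;	/* hypothetical "fork active" flag */
};

static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;

/* Sketch of a get_task_mm() that diverts to a slow path when a
 * fork() elsewhere may be depending on stable refcounts. */
struct mm_sim *get_task_mm_sim(struct mm_sim *mm)
{
	pthread_mutex_lock(&task_lock);
	if (!atomic_load(&mm->doing_fast_fork)) {
		/* Fast path: legacy behavior, just bump ->mm_users
		 * under task_lock. */
		atomic_fetch_add(&mm->mm_users, 1);
		pthread_mutex_unlock(&task_lock);
		return mm;
	}
	/* Slow path: release task_lock, then sleep on mmap_lock until
	 * the (potential) fork() is finished before taking the ref. */
	pthread_mutex_unlock(&task_lock);
	pthread_mutex_lock(&mm->mmap_lock);
	atomic_fetch_add(&mm->mm_users, 1);
	pthread_mutex_unlock(&mm->mmap_lock);
	return mm;
}
```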
On the other side, the fork() code checks ->mm_users and ->mm_count. It
can now depend on them being stable because it holds mmap_lock and it
diverts the get_task_mm() callers over to the slow path.
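The fork()-side check might be shaped like the sketch below. Again a
userspace stand-in, not the patch itself: the struct, field names, and
the "doing_fast_fork" flag are hypothetical, with a pthread mutex in
place of mmap_lock:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel structures; not the real patch. */
struct mm_sim {
	atomic_int mm_users;
	atomic_int mm_count;
	pthread_mutex_t mmap_lock;
	atomic_bool doing_fast_fork;
};

/* fork()-side sketch: with mmap_lock held and new get_task_mm()
 * callers diverted to wait on it, the refcounts are stable, so the
 * "both are one" test can safely choose a copy path with reduced
 * atomic cost. */
bool fork_can_skip_atomics(struct mm_sim *mm)
{
	bool fast = false;

	pthread_mutex_lock(&mm->mmap_lock);
	/* Divert any get_task_mm() callers that show up from here on. */
	atomic_store(&mm->doing_fast_fork, true);

	if (atomic_load(&mm->mm_users) == 1 &&
	    atomic_load(&mm->mm_count) == 1)
		fast = true;	/* nobody else can take a reference now */

	/* ... the fork() copy work would happen here ... */

	atomic_store(&mm->doing_fast_fork, false);
	pthread_mutex_unlock(&mm->mmap_lock);
	return fast;
}
```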
This works for two important cases:
1. get_task_mm() callers since they now conditionally use mmap_lock
2. mmgrab() -> mmget_not_zero() users that later take the mmap_lock
I'm also fairly sure it misses some cases outside of those two. The
patch is also quite ugly. The "->task_doing_fast_fork" mechanism is
pure hackery, for instance.
This seems to stay on the fast paths pretty well, even with 'top' or
some other /proc poking going on. In the end, this is balancing the
extra cost of the get_task_mm() slow path with reduced atomic cost in
the fork() path. It looks promising so far.
Is this worth pursuing?