[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Ya30xsrQnwyT/R92@hirez.programming.kicks-ass.net>
Date: Mon, 6 Dec 2021 12:32:22 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Peter Oskolkov <posk@...k.io>
Cc: Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>,
Linux Memory Management List <linux-mm@...ck.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-api@...r.kernel.org, Paul Turner <pjt@...gle.com>,
Ben Segall <bsegall@...gle.com>,
Peter Oskolkov <posk@...gle.com>,
Andrei Vagin <avagin@...gle.com>, Jann Horn <jannh@...gle.com>,
Thierry Delisle <tdelisle@...terloo.ca>
Subject: Re: [PATCH v0.9.1 3/6] sched/umcg: implement UMCG syscalls
Sorry, I haven't been feeling too well and as such procastinated on this
because thinking is required :/ Trying to pick up the bits.
On Mon, Nov 29, 2021 at 03:38:38PM -0800, Peter Oskolkov wrote:
> On Mon, Nov 29, 2021 at 1:08 PM Peter Zijlstra <peterz@...radead.org> wrote:
> [...]
> > > > > Another big concern I have is that you removed UMCG_TF_LOCKED. I
> > > >
> > > > OOh yes, I forgot to mention that. I couldn't figure out what it was
> > > > supposed to do.
> [...]
> >
> > So then A does:
> >
> > A::next_tid = C.tid;
> > sys_umcg_wait();
> >
> > Which will:
> >
> > pin(A);
> > pin(S0);
> >
> > cmpxchg(A::state, RUNNING, RUNNABLE);
>
> Hmm.... That's another difference between your patch and mine: my
> approach was "the side that initiates the change updates the state".
> So in my code the userspace changes the current task's state RUNNING
> => RUNNABLE and the next task's state,
I couldn't make that work for wakeups; when a thread blocks in a
random syscall there is no userspace to wake the next thread. And since
it seems required in this case, it's easier and more consistent to
always do it.
> or the server's state, RUNNABLE
> => RUNNING before calling sys_umcg_wait().
Yes, this is indeed required; I've found the same when trying to build
the userspace server loop. And yes, I'm starting to see where you're
coming from.
> I'm still not sure we can live without UMCG_TF_LOCKED. What if worker
> A transfers its server to worker B that A intends to context switch
S0 running A
Therefore:
S0::state == RUNNABLE
A::server_tid = S0.tid
A::state == RUNNING
you want A to switch to B, therefore:
B::state == RUNNABLE
if B is not yet on S0 then:
B::server_tid = S0.tid;
finally:
0:
A::next_tid = B.tid;
1:
A::state = RUNNABLE:
2:
sys_umcg_wait();
3:
> into, and then worker A pagefaults or gets interrupted before calling
> sys_umcg_wait()?
So the problem is tripping umcg_notify_resume() on the labels 1 and 2,
right? tripping it on 0 and 3 is trivially correct.
If we trip it on 1 and !(A::state & TG_PREEMPT), then nothing, since
::state == RUNNING we'll just continue onwards and all is well. That is,
nothing has happened yet.
However, if we trip it on 2: we're screwed. Because at that point
::state is scribbled.
> The server will be woken up and will see that it is
> assigned to worker B; now what? If worker A is "locked" before the
> whole thing starts, the pagefault/interrupt will not trigger
> block/wake detection, worker A will keep RUNNING for all intended
> purposes, and eventually will call sys_umcg_wait() as it had
> intended...
No, the failure case is different; umcg_notify_resume() will simply
block A until someone sets A::state == RUNNING and kicks it, which will
be no-one.
Now, the above situation is actually simple to fix, but it gets more
interesting when we're using sys_umcg_wait() to build wait primitives.
Because in that case we get stuff like:
for (;;) {
self->state = RUNNABLE;
smp_mb();
if (cond)
break;
sys_umcg_wait();
}
self->state = RUNNING;
And we really need to not block and also not do sys_umcg_wait() early.
So yes, I agree that we need a special case here that ensures
umcg_notify_resume() doesn't block. Let me ponder naming and comments.
Either a TF_COND_WAIT or a whole new state. I can't decide yet.
Now, obviously if you do a random syscall anywhere around here, you get
to keep the pieces :-)
I've also added ::next_tid to the whole umcg_pin_pages() thing, and made
it so that ::next_tid gets cleared when it's been used. That way things
like:
self->next_tid = pick_from_runqueue();
sys_that_is_expected_to_sleep();
if (self->next_tid) {
return_to_runqueue(self->next_tid);
self->next_tid = 0;
}
Are much simpler to manage. Either it did sleep and ::next_tid is
consumed, or it didn't sleep and it needs to be returned to the
runqueue.
Powered by blists - more mailing lists