Message-ID: <87mu70psqq.fsf@x220.int.ebiederm.org>
Date: Fri, 24 Apr 2020 14:51:25 -0500
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: LKML <linux-kernel@...r.kernel.org>,
Linux FS Devel <linux-fsdevel@...r.kernel.org>,
Alexey Dobriyan <adobriyan@...il.com>,
Alexey Gladkov <legion@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Oleg Nesterov <oleg@...hat.com>,
Alexey Gladkov <gladkov.alexey@...il.com>
Subject: Re: [PATCH v2 2/2] proc: Ensure we see the exit of each process tid exactly
Linus Torvalds <torvalds@...ux-foundation.org> writes:
> On Thu, Apr 23, 2020 at 8:36 PM Eric W. Biederman <ebiederm@...ssion.com> wrote:
>>
>> At one point I had forgotten that xchg can not take two memory
>> arguments and had hoped to be able to provide stronger guarantees than I
>> can. Which is where I think the structure of exchange_pids came from.
>
> Note that even if we were to have a "exchange two memory locations
> atomically" instruction (and we don't - even a "double cmpxchg" is
> actually just a double-_sized_ one, not a two different locations
> one), I'm not convinced it makes sense.
>
> There's no way to _walk_ two lists atomically. Any user will only ever
> walk one or the other, so it's not sensible to try to make the two
> list updates be atomic.
>
> And if a user for some reason walks both, the walking itself will
> obviously then be racy - it does one or the other first, and can see
> either the old state, or the new state - or see _neither_ (ie if you
> walk it twice, you might see neither task, or you might see both, just
> depending on order or walk).
>
>> I do agree the clearer we can write things, the easier it is for
>> someone else to come along and follow.
>
> Your alternate write of the function seems a bit more readable to me,
> even if the main effect might be just that it was split up a bit and
> added a few comments and whitespace.
>
> So I'm more happier with that one. That said:
>
>> We can not use a remove and reinsert model because that does break rcu
>> accesses, and complicates everything else. With a swap model we have
>> the struct pids pointer at either of the tasks that are swapped but
>> never at nothing.
>
> I'm not suggesting removing the pid entirely - like making task->pid
> be NULL. I'm literally suggesting just doing the RCU list operations
> as "remove and re-insert".
>
> And that shouldn't break anything, for the same reason that an atomic
> exchange doesn't make sense: you can only ever walk one of the lists
> at a time. And regardless of how you walk it, you might not see the
> new state (or the old state) reliably.
>
> Put another way:
>
>> void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
>> {
>> struct hlist_node **lpprev = left->pprev;
>> struct hlist_node **rpprev = right->pprev;
>>
>> rcu_assign_pointer(*lpprev, right);
>> rcu_assign_pointer(*rpprev, left);
>
> These are the only two assignments that matter for anything that walks
> the list (the pprev ones are for things that change the list, and they
> have to have exclusions in place).
>
> And those two writes cannot be atomic anyway, so you fundamentally
> will always be in the situation that a walker can miss one of the
> tasks.
>
> Which is why I think it would be ok to just do the RCU list swap as a
> "remove left, remove right, add left, add right" operation. It doesn't
> seem fundamentally different to a walker than the "switch left/right"
> operation, and it seems much simpler.
>
> Is there something I'm missing?

The problem with

	remove
	remove
	add
	add

is that a lookup that hits between the remove and the add could return
nothing. The function kill_pid_info does everything it can to handle
this case. Today it does:

int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
{
	int error = -ESRCH;
	struct task_struct *p;

	for (;;) {
		rcu_read_lock();
		p = pid_task(pid, PIDTYPE_PID);
		if (p)
			error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
		rcu_read_unlock();
		if (likely(!p || error != -ESRCH))
			return error;

		/*
		 * The task was unhashed in between, try again.  If it
		 * is dead, pid_task() will return NULL, if we race with
		 * de_thread() it will find the new leader.
		 */
	}
}

Now kill_pid_info is signalling the entire task and is just using
PIDTYPE_PID to find a thread in the task.
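
The exit condition of that loop can be modelled in user space. The sketch below is only an illustration, not kernel code: struct model_task, struct model_lookup and every model_* function are hypothetical names, and the "lookup" simply replays a fixed sequence of results instead of walking a real pid hash under RCU.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a task: only records whether it is still hashed. */
struct model_task { int hashed; };

/* Hypothetical lookup: replays a fixed sequence of results (NULL = found nothing). */
struct model_lookup {
	struct model_task **results;
	size_t count;
	size_t next;
};

static struct model_task *model_pid_task(struct model_lookup *l)
{
	assert(l != NULL);
	if (l->next >= l->count)
		return NULL;
	return l->results[l->next++];
}

/* Sending fails with -ESRCH when the task was unhashed after the lookup. */
static int model_send(const struct model_task *p)
{
	return p->hashed ? 0 : -ESRCH;
}

/* Same control flow as the loop quoted above, minus RCU and the real signal. */
int model_kill_pid_info(struct model_lookup *l)
{
	int error = -ESRCH;

	for (;;) {
		struct model_task *p = model_pid_task(l);

		if (p)
			error = model_send(p);
		if (!p || error != -ESRCH)
			return error;
		/* Found a task that was unhashed in between: look again. */
	}
}

/* The lookup always finds a task: first the old leader, then the new one. */
int model_race_with_swap(void)
{
	struct model_task old_leader = { 0 }, new_leader = { 1 };
	struct model_task *seq[] = { &old_leader, &new_leader };
	struct model_lookup l = { seq, 2, 0 };

	return model_kill_pid_info(&l);
}

/* The first lookup lands in a window where nothing is hashed. */
int model_race_with_empty_window(void)
{
	struct model_task new_leader = { 1 };
	struct model_task *seq[] = { NULL, &new_leader };
	struct model_lookup l = { seq, 2, 0 };

	return model_kill_pid_info(&l);
}
```

In this model a lookup that returns a stale task is retried and succeeds, while a lookup that returns nothing ends the loop with -ESRCH even though a live task exists.
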

With the remove then add model there will be a point where pid_task
will return nothing, because ever so briefly the lists will be
empty.

However with an actual swap we will find a task and kill_pid_info
will work. In pathological cases lock_task_sighand might have to loop
and we would need to find the new task that has the given pid. But
kill_pid_info is guaranteed to work with swaps and will fail with
remove add.

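
The difference between the two orderings can be checked with a small single-threaded model that inspects the state a reader could observe after each store. This is a rough user-space sketch under stated assumptions: struct hnode and struct hhead only imitate the shape of the kernel's hlist, each list holds exactly one node (as in the PIDTYPE_PID case), all helper names are hypothetical, and memory ordering and rcu_assign_pointer are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal imitation of an hlist: the head points at the first node,
 * each node points back at whatever pointer refers to it. */
struct hnode { struct hnode *next; struct hnode **pprev; };
struct hhead { struct hnode *first; };

static void model_add_head(struct hnode *n, struct hhead *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

static void model_del(struct hnode *n)
{
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;
}

/* How many of the two heads would a reader currently find empty? */
static int empty_heads(const struct hhead *a, const struct hhead *b)
{
	return (a->first == NULL) + (b->first == NULL);
}

static int max_int(int a, int b) { return a > b ? a : b; }

/* Swap model: the two reader-visible stores, then the pprev fixups. */
int swap_worst_empty(void)
{
	struct hhead ha = { NULL }, hb = { NULL };
	struct hnode left, right;
	struct hnode **lpprev, **rpprev;
	int worst = 0;

	model_add_head(&left, &ha);
	model_add_head(&right, &hb);
	lpprev = left.pprev;
	rpprev = right.pprev;

	*lpprev = &right;	/* both heads briefly point at right */
	worst = max_int(worst, empty_heads(&ha, &hb));
	*rpprev = &left;
	worst = max_int(worst, empty_heads(&ha, &hb));

	left.pprev = rpprev;	/* only writers look at pprev */
	right.pprev = lpprev;
	assert(ha.first == &right && hb.first == &left);
	return worst;
}

/* Remove/add model: remove left, remove right, add left, add right. */
int remove_add_worst_empty(void)
{
	struct hhead ha = { NULL }, hb = { NULL };
	struct hnode left, right;
	int worst = 0;

	model_add_head(&left, &ha);
	model_add_head(&right, &hb);

	model_del(&left);
	worst = max_int(worst, empty_heads(&ha, &hb));
	model_del(&right);	/* both heads are empty here */
	worst = max_int(worst, empty_heads(&ha, &hb));
	model_add_head(&left, &hb);
	worst = max_int(worst, empty_heads(&ha, &hb));
	model_add_head(&right, &ha);
	worst = max_int(worst, empty_heads(&ha, &hb));

	assert(ha.first == &right && hb.first == &left);
	return worst;
}
```

Both functions end in the same final state. In the swap model no head is ever observed empty, whereas in the remove/add model there is a step at which both heads are empty.
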
> But I'm *not* suggesting that we change these simple parts to be
> "remove thread_pid or pid pointer, and then insert a new one":
>
>> /* Swap thread_pid */
>> rpid = left->thread_pid;
>> lpid = right->thread_pid;
>> rcu_assign_pointer(left->thread_pid, lpid);
>> rcu_assign_pointer(right->thread_pid, rpid);
>>
>> /* Swap the cached pid value */
>> WRITE_ONCE(left->pid, pid_nr(lpid));
>> WRITE_ONCE(right->pid, pid_nr(rpid));
>> }
>
> because I agree that for things that don't _walk_ the list, but just
> look up "thread_pid" vs "pid" atomically but asynchronously, we
> obviously need to get one or the other, not some kind of "empty"
> state.

For PIDTYPE_PID and PIDTYPE_TGID these practically aren't lists but
pointers to the appropriate task. Only for PIDTYPE_PGID and PIDTYPE_SID
do these become lists in practice.

That not-really-a-list status allows signal delivery to individual
processes to happen in rcu context, which is where we would get into
trouble with add/remove.

Since signals are guaranteed to be delivered to the entire session
or the entire process group, all of the list walking currently happens
under the tasklist_lock, which really keeps list walking from
being a concern.

>> Does that look a little more readable?
>
> Regardless, I find your new version at least a lot more readable, so
> I'm ok with it.
Good. Then I will finish cleaning it up and go with that version.
> It looks like Oleg found an independent issue, though.
Yes, and I will definitely work through those.
Eric