Message-ID: <20211014130220.GA5812@fuller.cnet>
Date: Thu, 14 Oct 2021 10:02:20 -0300
From: Marcelo Tosatti <mtosatti@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Nitesh Lal <nilal@...hat.com>,
Nicolas Saenz Julienne <nsaenzju@...hat.com>,
Frederic Weisbecker <frederic@...nel.org>,
Christoph Lameter <cl@...ux.com>,
Juri Lelli <juri.lelli@...hat.com>,
Alex Belits <abelits@...its.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [patch v4 1/8] add basic task isolation prctl interface
<snip>
> What are the requirements of the signal exactly (and why it is popular) ?
> Because the interruption event can be due to:
>
> * An IPI.
> * A system call.
Also IRQs (which are easy to trace) and exceptions.
> In the "full task isolation mode" patchset (the one from Alex), a system call
> will automatically generate a SIGKILL once a system call is performed
> (after the prctl to enable task isolated mode, but
> before the prctl to disable task isolated mode).
> This can be implemented, if desired, by SECCOMP syscall blocking
> (which already exists).
>
> For other interruptions, which happen through IPIs, one can print
> the stack trace of the program (or interrupt) that generated
> the IPI to find out the cause (which is what rt-trace-bpf.py is doing).
>
> An alternative would be to add tracepoints so that one can
> find out which function in the kernel caused the CPU and
> task to become "a target for interruptions".
For example, adding a tracepoint to the mark_vmstat_dirty() function
(allowing one to see how that function was invoked on a given CPU, and
by whom) appears to provide sufficient information to debug such
problems, rather than a coredump image from a SIGKILL sent at that point.
(mark_vmstat_dirty() is from
[patch v4 5/8] task isolation: sync vmstats conditional on changes.)
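Such a tracepoint could look roughly like the sketch below (kernel-side code; the event and field names are made up here, not from the patchset):

```c
/* Hypothetical sketch: a trace event firing whenever mark_vmstat_dirty()
 * marks a CPU's vmstat as needing a sync, recording which task did it. */
TRACE_EVENT(vmstat_dirty,
	TP_PROTO(int cpu),
	TP_ARGS(cpu),
	TP_STRUCT__entry(
		__field(int, cpu)
		__field(pid_t, pid)
	),
	TP_fast_assign(
		__entry->cpu = cpu;
		__entry->pid = current->pid;
	),
	TP_printk("cpu=%d marked vmstat dirty by pid=%d",
		  __entry->cpu, __entry->pid)
);
```

Combined with the ftrace stacktrace trigger on the event, this would show the full call path that made the isolated CPU a target for interruption, which is the information the SIGKILL coredump was meant to provide.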
Looking at
https://github.com/abelits/libtmc
one can see that notification via SIGUSR1 is being used.
To support something similar, one would add a new bit to the
flags field of:
+struct task_isol_activate_control {
+ __u64 flags;
+ __u64 quiesce_oneshot_mask;
+ __u64 pad[6];
+};
Remove
+ ret = -EINVAL;
+ if (act_ctrl.flags)
+ goto out;
from the handler, then shrink the padded space and use it.
>
> > > > Also, see:
> > > >
> > > > https://lkml.kernel.org/r/20210929152429.186930629@infradead.org
> > >
> > > As you can see from the below pseudocode, we were thinking of queueing
> > > the (invalidate icache or TLB flush) in case app is in userspace,
> > > to perform on return to kernel space, but the approach in your patch might be
> > > superior (will take sometime to parse that thread...).
> >
> > Let me assume you're talking about kernel TLB invalidates, otherwise it
> > would be terribly broken.
> >
> > > > Suppose:
> > > >
> > > > CPU0 CPU1
> > > >
> > > > sys_prctl()
> > > > <kernel entry>
> > > > // marks task 'important'
> > > > text_poke_sync()
> > > > // checks CPU0, not userspace, queues IPI
> > > > <kernel exit>
> > > >
> > > > $important userspace arch_send_call_function_ipi_mask()
> > > > <IPI>
> > > > // finds task is 'important' and
> > > > // can't take interrupts
> > > > sigkill()
> > > >
> > > > *Whoopsie*
> > > >
> > > >
> > > > Fundamentally CPU1 can't elide the IPI until CPU0 is in userspace,
> > > > therefore CPU0 can't wait for quescence in kernelspace, but if it goes
> > > > to userspace, it'll get killed on interruption. Catch-22.
To reiterate this point:
> > > > CPU0 CPU1
> > > >
> > > > sys_prctl()
> > > > <kernel entry>
> > > > // marks task 'important'
> > > > text_poke_sync()
> > > > // checks CPU0, not userspace, queues IPI
> > > > <kernel exit>
1) Such races can be fixed by proper use of atomic variables.
2) If delivering a signal to the application is desired, I fail to see why
this interface (ignoring bugs related to the particular mechanism) does not
allow it.
So hopefully this addresses your comments.