Message-ID: <20211014130220.GA5812@fuller.cnet>
Date: Thu, 14 Oct 2021 10:02:20 -0300
From: Marcelo Tosatti <mtosatti@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Nitesh Lal <nilal@...hat.com>,
Nicolas Saenz Julienne <nsaenzju@...hat.com>,
Frederic Weisbecker <frederic@...nel.org>,
Christoph Lameter <cl@...ux.com>,
Juri Lelli <juri.lelli@...hat.com>,
Alex Belits <abelits@...its.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [patch v4 1/8] add basic task isolation prctl interface
<snip>
> What are the requirements of the signal exactly (and why it is popular) ?
> Because the interruption event can be due to:
>
> * An IPI.
> * A system call.
Also IRQs (which are easy to trace) and exceptions.
> In the "full task isolation mode" patchset (the one from Alex), a system call
> will automatically generate a SIGKILL once a system call is performed
> (after the prctl to enable task isolated mode, but
> before the prctl to disable task isolated mode).
> This can be implemented, if desired, by SECCOMP syscall blocking
> (which already exists).
>
> For other interruptions, which happen through IPIs, one can print
> the stack trace of the program (or interrupt) that generated
> the IPI to find out the cause (which is what rt-trace-bpf.py is doing).
>
> An alternative would be to add tracepoints so that one can
> find out which function in the kernel caused the CPU and
> task to become "a target for interruptions".
For example, adding a tracepoint to the mark_vmstat_dirty() function
(allowing one to see how that function was invoked on a given CPU, and
by whom) appears to provide sufficient information to debug such
problems, rather than a coredump image from a SIGKILL sent at that point.
(mark_vmstat_dirty() is from
[patch v4 5/8] task isolation: sync vmstats conditional on changes.)
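Such a tracepoint could look roughly like the sketch below (kernel-side code; the event and field names are made up here, not from the patchset):

```c
/* Hypothetical sketch: a trace event firing whenever mark_vmstat_dirty()
 * marks a CPU's vmstat as needing a sync, recording which task did it. */
TRACE_EVENT(vmstat_dirty,
	TP_PROTO(int cpu),
	TP_ARGS(cpu),
	TP_STRUCT__entry(
		__field(int, cpu)
		__field(pid_t, pid)
	),
	TP_fast_assign(
		__entry->cpu = cpu;
		__entry->pid = current->pid;
	),
	TP_printk("cpu=%d marked vmstat dirty by pid=%d",
		  __entry->cpu, __entry->pid)
);
```

Combined with the ftrace stacktrace trigger on the event, this would show the full call path that made the isolated CPU a target for interruption, which is the information the SIGKILL coredump was meant to provide.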
Looking at
https://github.com/abelits/libtmc
one can see that notification via SIGUSR1 is being used.
To support something similar, one would add a new bit to the
flags field of:
+struct task_isol_activate_control {
+ __u64 flags;
+ __u64 quiesce_oneshot_mask;
+ __u64 pad[6];
+};
Remove
+ ret = -EINVAL;
+ if (act_ctrl.flags)
+ goto out;
from the handler, then shrink the padded space and use it.
>
> > > > Also, see:
> > > >
> > > > https://lkml.kernel.org/r/20210929152429.186930629@infradead.org
> > >
> > > As you can see from the below pseudocode, we were thinking of queueing
> > > the (invalidate icache or TLB flush) in case app is in userspace,
> > > to perform on return to kernel space, but the approach in your patch might be
> > > superior (will take sometime to parse that thread...).
> >
> > Let me assume you're talking about kernel TLB invalidates, otherwise it
> > would be terribly broken.
> >
> > > > Suppose:
> > > >
> > > > CPU0 CPU1
> > > >
> > > > sys_prctl()
> > > > <kernel entry>
> > > > // marks task 'important'
> > > > text_poke_sync()
> > > > // checks CPU0, not userspace, queues IPI
> > > > <kernel exit>
> > > >
> > > > $important userspace arch_send_call_function_ipi_mask()
> > > > <IPI>
> > > > // finds task is 'important' and
> > > > // can't take interrupts
> > > > sigkill()
> > > >
> > > > *Whoopsie*
> > > >
> > > >
> > > > Fundamentally CPU1 can't elide the IPI until CPU0 is in userspace,
> > > > therefore CPU0 can't wait for quescence in kernelspace, but if it goes
> > > > to userspace, it'll get killed on interruption. Catch-22.
To reiterate this point:
> > > > CPU0 CPU1
> > > >
> > > > sys_prctl()
> > > > <kernel entry>
> > > > // marks task 'important'
> > > > text_poke_sync()
> > > > // checks CPU0, not userspace, queues IPI
> > > > <kernel exit>
1) Such races can be fixed by proper use of atomic variables.
2) If delivering a signal to the application is desired, I fail to see why
this interface (ignoring bugs related to the particular mechanism) does not
allow it.
So hopefully this addresses your comments.