linux-kernel - Re: [patch v4 1/8] add basic task isolation prctl interface

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20211013105637.GA88322@fuller.cnet>
Date:   Wed, 13 Oct 2021 07:56:37 -0300
From:   Marcelo Tosatti <mtosatti@...hat.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, Nitesh Lal <nilal@...hat.com>,
        Nicolas Saenz Julienne <nsaenzju@...hat.com>,
        Frederic Weisbecker <frederic@...nel.org>,
        Christoph Lameter <cl@...ux.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Alex Belits <abelits@...its.com>, Peter Xu <peterx@...hat.com>
Subject: Re: [patch v4 1/8] add basic task isolation prctl interface

Hi Peter,

On Tue, Oct 12, 2021 at 03:05:34PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 07, 2021 at 04:23:47PM -0300, Marcelo Tosatti wrote:
> > Add basic prctl task isolation interface, which allows
> > informing the kernel that application is executing 
> > latency sensitive code (where interruptions are undesired).
> > 
> > Interface is described by task_isolation.rst (added by
> > next patch).
> 
> That does not absolve you from actually writing a changelog here.
> Life is too short to try and read rst shit.

The rst is concise and contains all necessary information.

Changelog is on the patch header (I would appreciate reviews of
the interface itself, not sure why the changelog is important).

The rst compiled in PDF form is attached. Its only 6 pages long, it
described the interface (please if you think of any improvement 
to that document, and not only the interface).

Also check the examples on the second patch.

"===============================
Task isolation prctl interface
===============================

Certain types of applications benefit from running uninterrupted by
background OS activities. Realtime systems and high-bandwidth networking
applications with user-space drivers can fall into the category.

To create an OS noise free environment for the application, this
interface allows userspace to inform the kernel the start and
end of the latency sensitive application section (with configurable
system behaviour for that section).

Note: the prctl interface is independent of nohz_full=.

The prctl options are:


        - PR_ISOL_FEAT_GET: Retrieve supported features.
        - PR_ISOL_CFG_GET: Retrieve task isolation configuration.
        - PR_ISOL_CFG_SET: Set task isolation configuration.
        - PR_ISOL_ACTIVATE_GET: Retrieve task isolation activation state.
        - PR_ISOL_ACTIVATE_SET: Set task isolation activation state."

> What is the envisioned usage of these isolating prctl() thingies,

To inform the kernel that an application is entering/exiting 
task isolated mode, where the kernel can behave in different ways
(it depends on what commands you implement using the interface).

At the moment, only a single action is performed: to sync any pending
per-CPU vm statistics that might wake up a CPU's vmstat_worker 
thread.

The point of the interface, as requested, is that it is extensible 
allowing for new features to be implemented.

For example, one could implement blocking of certain kernel activities
which require interruption to userspace.

> including the kill-me-on-any-interruption thing, vs the inherently racy
> nature of some of the don't disturb me stuff.

The kill-me-on-any-interruption thing is not included in this patchset
(maybe it can be implemented using this interface, if desired, but we
do not see the need for such feature at the moment).

The inherently racy nature of some of the don't disturb me stuff:

> 
> Also, see:
> 
>   https://lkml.kernel.org/r/20210929152429.186930629@infradead.org

As you can see from the below pseudocode, we were thinking of queueing
the (invalidate icache or TLB flush) in case app is in userspace,
to perform on return to kernel space, but the approach in your patch might be
superior (will take sometime to parse that thread...).

> Suppose:
> 
> 	CPU0					CPU1
> 
> 	sys_prctl()
> 	<kernel entry>
> 	  // marks task 'important'
> 						text_poke_sync()
> 						  // checks CPU0, not userspace, queues IPI
> 	<kernel exit>
> 
> 	$important userspace			  arch_send_call_function_ipi_mask()
> 	<IPI>
> 	  // finds task is 'important' and
> 	  // can't take interrupts
> 	  sigkill()
> 
> *Whoopsie*
> 
> 
> Fundamentally CPU1 can't elide the IPI until CPU0 is in userspace,
> therefore CPU0 can't wait for quescence in kernelspace, but if it goes
> to userspace, it'll get killed on interruption. Catch-22.

We have been using a BPF tool for logging and monitoring of
interruptions:
https://github.com/xzpeter/rt-trace-bpf

But there is no such thing as 

         // finds task is 'important' and
         // can't take interrupts
         sigkill()

On this patchset.

We have been discussing something like this to avoid TLBs / invalidate
i-cache IPIs, but 

 #define CPU_REQ_FLUSH_TLB       0x1     /*      __flush_tlb_all()       */ 
 #define CPU_REQ_INV_ICACHE      0x2     /*      sync_core()             */ 
 
 #define IN_USER_MODE            (0x1 << 16) 
 
 /* when CPU is in kernel mode, should IPI rather than queuing the 
    request on per-cpu state variable */ 
 #define IN_KERNEL_MODE                (0) 
 
 Then on entry/exit would have to add a conditional branch: 
 
 Exit to userspace: 
 ----------------- 
 
       cpu = smp_processor_id(); 
 
       if (isolation_enabled(cpu)) { 
               atomic_or(IN_USER_MODE, &percpudata->user_kernel_state); 
       } 
 
 Kernel entry: 
 ------------- 
 
       cpu = smp_processor_id(); 
 
       if (isolation_enabled(cpu)) { 
               reqs = atomic_xchg(&percpudata->user_kernel_state, IN_KERNEL_MODE); 
               if (reqs & CPU_REQ_FLUSH_TLB) 
                       flush_tlb_all(); 
               if (reqs & CPU_REQ_INV_ICACHE) 
                       invalidate_icache(); 
       } 
 
 Request side: 
 ------------- 
 
       int targetcpu; 
 
       do { 
               struct percpudata *pcpudata = per_cpu(&percpudata, targetcpu); 
 
               old_state = pcpudata->user_kernel_state; 
 
               /* in kernel mode ? */ 
               if (!(old_state & IN_USER_MODE)) { 
                       smp_call_function_single(request_fn, targetcpu, 1); 
                       break; 
               } 
               new_state = remote_state | CPU_REQ_FLUSH_TLB; // (or CPU_REQ_INV_ICACHE)
       } while (atomic_cmpxchg(&pcpudata->user_kernel_state, old_state, new_state) != old_state); 


Thanks for your comments!

Download attachment "task_isolation.pdf" of type "application/pdf" (90947 bytes)