linux-kernel - Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral infrastructure

Message-ID: <20230705223956.GD2813335@hirez.programming.kicks-ass.net>
Date:   Thu, 6 Jul 2023 00:39:56 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Valentin Schneider <vschneid@...hat.com>
Cc:     linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org, kvm@...r.kernel.org, linux-mm@...ck.org,
        bpf@...r.kernel.org, x86@...nel.org,
        Nicolas Saenz Julienne <nsaenzju@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Jonathan Corbet <corbet@....net>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Andy Lutomirski <luto@...nel.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Uladzislau Rezki <urezki@...il.com>,
        Christoph Hellwig <hch@...radead.org>,
        Lorenzo Stoakes <lstoakes@...il.com>,
        Josh Poimboeuf <jpoimboe@...nel.org>,
        Kees Cook <keescook@...omium.org>,
        Sami Tolvanen <samitolvanen@...gle.com>,
        Ard Biesheuvel <ardb@...nel.org>,
        Nicholas Piggin <npiggin@...il.com>,
        Juerg Haefliger <juerg.haefliger@...onical.com>,
        Nicolas Saenz Julienne <nsaenz@...nel.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Nadav Amit <namit@...are.com>,
        Dan Carpenter <error27@...il.com>,
        Chuang Wang <nashuiliang@...il.com>,
        Yang Jihong <yangjihong1@...wei.com>,
        Petr Mladek <pmladek@...e.com>,
        "Jason A. Donenfeld" <Jason@...c4.com>, Song Liu <song@...nel.org>,
        Julian Pidancet <julian.pidancet@...cle.com>,
        Tom Lendacky <thomas.lendacky@....com>,
        Dionna Glaze <dionnaglaze@...gle.com>,
        Thomas Weißschuh <linux@...ssschuh.net>,
        Juri Lelli <juri.lelli@...hat.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Marcelo Tosatti <mtosatti@...hat.com>,
        Yair Podemsky <ypodemsk@...hat.com>
Subject: Re: [RFC PATCH 11/14] context-tracking: Introduce work deferral
 infrastructure

On Wed, Jul 05, 2023 at 07:12:53PM +0100, Valentin Schneider wrote:

> Note: A previous approach by PeterZ [1] used an extra bit in
> context_tracking.state to flag the presence of deferred callbacks to
> execute, and the actual callbacks were stored in a separate atomic
> variable.
> 
> This meant that the atomic read of context_tracking.state was sufficient to
> determine whether there are any deferred callbacks to execute.
> Unfortunately, it presents a race window. Consider the work setting
> function as:
> 
>   preempt_disable();
>   seq = atomic_read(&ct->seq);
>   if (__context_tracking_seq_in_user(seq)) {
> 	  /* ctrl-dep */
> 	  atomic_or(work, &ct->work);
> 	  ret = atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
>   }
>   preempt_enable();
> 
>   return ret;
> 
> Then the following can happen:
> 
>   CPUx                                             CPUy
> 						     CT_SEQ_WORK \in context_tracking.state
>     atomic_or(WORK_N, &ct->work);
> 						      ct_kernel_enter()
> 							ct_state_inc();
>     atomic_try_cmpxchg(&ct->seq, &seq, seq|CT_SEQ_WORK);
> 
> The cmpxchg() would fail, ultimately causing an IPI for WORK_N to be
> sent. Unfortunately, the work bit would remain set, and it can't be sanely
> cleared in case another CPU set it concurrently - this would ultimately
> lead to a double execution of the callback, one as a deferred callback and
> one in the IPI. As not all IPI callbacks are idempotent, this is
> undesirable.
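
For illustration only, the interleaving quoted above can be replayed in a
single thread with C11 atomics. This is a rough userspace sketch, not kernel
code: struct ct_model, the bit values and replay_race() are all made up here,
and the only thing it tries to show is that WORK_N ends up running twice on
CPUy once the cmpxchg fails.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit layout for this sketch only; not the kernel's. */
#define CT_SEQ_WORK 0x1u /* "deferred work pending" flag */
#define CT_SEQ_USER 0x2u /* CPU is currently in userspace */
#define CT_SEQ_INC  0x4u /* sequence value after one transition */
#define WORK_N      0x1u /* one deferred callback */

struct ct_model {
	atomic_uint seq;  /* models ct->seq */
	atomic_uint work; /* models ct->work */
};

/*
 * Replay the quoted interleaving in one thread and return how many
 * times WORK_N's callback ends up running on CPUy.
 */
static int replay_race(void)
{
	struct ct_model ct;
	unsigned int seq, old, pending;
	int runs = 0;
	bool deferred;

	/* CPUy is in userspace; another CPU has already set CT_SEQ_WORK. */
	atomic_init(&ct.seq, CT_SEQ_USER | CT_SEQ_WORK);
	atomic_init(&ct.work, 0);

	/* CPUx: sample the state, see "in user", queue the work. */
	seq = atomic_load(&ct.seq);
	if (!(seq & CT_SEQ_USER))
		return 0;
	atomic_fetch_or(&ct.work, WORK_N);

	/* CPUy: kernel entry changes ct->seq (models ct_state_inc()). */
	old = atomic_exchange(&ct.seq, CT_SEQ_INC);

	/* CPUx: the cmpxchg fails because seq changed, so an IPI is sent. */
	deferred = atomic_compare_exchange_strong(&ct.seq, &seq,
						  seq | CT_SEQ_WORK);

	/* CPUy: the old state had CT_SEQ_WORK, so flush ct->work. */
	if (old & CT_SEQ_WORK) {
		pending = atomic_exchange(&ct.work, 0);
		if (pending & WORK_N)
			runs++; /* first run: as a deferred callback */
	}

	/* CPUy: the IPI sent after the failed cmpxchg arrives. */
	if (!deferred)
		runs++; /* second run: in the IPI handler */

	return runs;
}
```

Calling replay_race() from a test harness returns 2 in this model: once from
the deferred flush and once from the fallback IPI.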

So adding another atomic is arguably worse.

The thing is, if the NOHZ_FULL CPU is actually doing context transitions
(SYSCALLs etc.) then everything is fundamentally racy; there is no
winning that game. We could find the remote CPU is in-kernel, send an
IPI, and then the remote CPU does return-to-user and receives the IPI.

And then the USER is upset... because he got an IPI.

The whole NOHZ_FULL thing really only works if userspace does not do
SYSCALLs.

But the sad sad state of affairs is that some people think it is
acceptable to do SYSCALLs while running NOHZ_FULL and then cry about
how slow stuff is.