linux-kernel - Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a user->kernel transition

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <57D81DB6-2D96-4A12-9FD5-6F0702AC49F6@vmware.com>
Date:   Wed, 5 Jul 2023 18:48:19 +0000
From:   Nadav Amit <namit@...are.com>
To:     Valentin Schneider <vschneid@...hat.com>
CC:     Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        "linux-trace-kernel@...r.kernel.org" 
        <linux-trace-kernel@...r.kernel.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        linux-mm <linux-mm@...ck.org>, bpf <bpf@...r.kernel.org>,
        the arch/x86 maintainers <x86@...nel.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Jonathan Corbet <corbet@....net>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Andy Lutomirski <luto@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Uladzislau Rezki <urezki@...il.com>,
        Christoph Hellwig <hch@...radead.org>,
        Lorenzo Stoakes <lstoakes@...il.com>,
        Josh Poimboeuf <jpoimboe@...nel.org>,
        Kees Cook <keescook@...omium.org>,
        Sami Tolvanen <samitolvanen@...gle.com>,
        Ard Biesheuvel <ardb@...nel.org>,
        Nicholas Piggin <npiggin@...il.com>,
        Juerg Haefliger <juerg.haefliger@...onical.com>,
        Nicolas Saenz Julienne <nsaenz@...nel.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Dan Carpenter <error27@...il.com>,
        Chuang Wang <nashuiliang@...il.com>,
        Yang Jihong <yangjihong1@...wei.com>,
        Petr Mladek <pmladek@...e.com>,
        "Jason A. Donenfeld" <Jason@...c4.com>, Song Liu <song@...nel.org>,
        Julian Pidancet <julian.pidancet@...cle.com>,
        Tom Lendacky <thomas.lendacky@....com>,
        Dionna Glaze <dionnaglaze@...gle.com>,
        Thomas Weißschuh <linux@...ssschuh.net>,
        Juri Lelli <juri.lelli@...hat.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Marcelo Tosatti <mtosatti@...hat.com>,
        Yair Podemsky <ypodemsk@...hat.com>
Subject: Re: [RFC PATCH 00/14] context_tracking,x86: Defer some IPIs until a
 user->kernel transition



> On Jul 5, 2023, at 11:12 AM, Valentin Schneider <vschneid@...hat.com> wrote:
> 
> Deferral approach
> =================
> 
> Storing each and every callback, like a secondary call_single_queue turned out
> to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
> userspace for as long as possible - no signal of any form would be sent when
> deferring an IPI. This means that any form of queuing for deferred callbacks
> would end up as a convoluted memory leak.
> 
> Deferred IPIs must thus be coalesced, which this series achieves by assigning
> IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
> kernel entry.

I have some experience with similar an optimization. Overall, it can make
sense and as you show, it can reduce the number of interrupts.

The main problem of such an approach might be in cases where a process
frequently enters and exits the kernel between deferred-IPIs, or even worse -
the IPI is sent while the remote CPU is inside the kernel. In such cases, you
pay the extra cost of synchronization and cache traffic, and might not even
get the benefit of reducing the number of IPIs.

In a sense, it's a more extreme case of the overhead that x86’s lazy-TLB
mechanism introduces while tracking whether a process is running or not. But
lazy-TLB would change is_lazy much less frequently than context tracking,
which means that the deferring the IPIs as done in this patch-set has a
greater potential to hurt performance than lazy-TLB.

tl;dr - it would be beneficial to show some performance number for both a
“good” case where a process spends most of the time in userspace, and “bad”
one where a process enters and exits the kernel very frequently. Reducing
the number of IPIs is good but I don’t think it is a goal by its own.

[ BTW: I did not go over the patches in detail. Obviously, there are
  various delicate points that need to be checked, as avoiding the
  deferring of IPIs if page-tables are freed. ]