linux-kernel - Re: [RFC PATCH v3 11/15] context-tracking: Introduce work deferral infrastructure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <xhsmhttbqm1s2.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Fri, 29 Nov 2024 17:40:29 +0100
From: Valentin Schneider <vschneid@...hat.com>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
 kvm@...r.kernel.org, linux-mm@...ck.org, bpf@...r.kernel.org,
 x86@...nel.org, rcu@...r.kernel.org, linux-kselftest@...r.kernel.org,
 Nicolas Saenz Julienne <nsaenzju@...hat.com>, Steven Rostedt
 <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>, Jonathan
 Corbet <corbet@....net>, Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar
 <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
 <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, Paolo
 Bonzini <pbonzini@...hat.com>, Wanpeng Li <wanpengli@...cent.com>, Vitaly
 Kuznetsov <vkuznets@...hat.com>, Andy Lutomirski <luto@...nel.org>, Peter
 Zijlstra <peterz@...radead.org>, "Paul E. McKenney" <paulmck@...nel.org>,
 Neeraj Upadhyay <quic_neeraju@...cinc.com>, Joel Fernandes
 <joel@...lfernandes.org>, Josh Triplett <josh@...htriplett.org>, Boqun
 Feng <boqun.feng@...il.com>, Mathieu Desnoyers
 <mathieu.desnoyers@...icios.com>, Lai Jiangshan <jiangshanlai@...il.com>,
 Zqiang <qiang.zhang1211@...il.com>, Andrew Morton
 <akpm@...ux-foundation.org>, Uladzislau Rezki <urezki@...il.com>,
 Christoph Hellwig <hch@...radead.org>, Lorenzo Stoakes
 <lstoakes@...il.com>, Josh Poimboeuf <jpoimboe@...nel.org>, Jason Baron
 <jbaron@...mai.com>, Kees Cook <keescook@...omium.org>, Sami Tolvanen
 <samitolvanen@...gle.com>, Ard Biesheuvel <ardb@...nel.org>, Nicholas
 Piggin <npiggin@...il.com>, Juerg Haefliger
 <juerg.haefliger@...onical.com>, Nicolas Saenz Julienne
 <nsaenz@...nel.org>, "Kirill A. Shutemov"
 <kirill.shutemov@...ux.intel.com>, Nadav Amit <namit@...are.com>, Dan
 Carpenter <error27@...il.com>, Chuang Wang <nashuiliang@...il.com>, Yang
 Jihong <yangjihong1@...wei.com>, Petr Mladek <pmladek@...e.com>, "Jason A.
 Donenfeld" <Jason@...c4.com>, Song Liu <song@...nel.org>, Julian Pidancet
 <julian.pidancet@...cle.com>, Tom Lendacky <thomas.lendacky@....com>,
 Dionna Glaze <dionnaglaze@...gle.com>, Thomas Weißschuh
 <linux@...ssschuh.net>, Juri Lelli <juri.lelli@...hat.com>, Marcelo
 Tosatti <mtosatti@...hat.com>, Yair Podemsky <ypodemsk@...hat.com>, Daniel
 Wagner <dwagner@...e.de>, Petr Tesarik <ptesarik@...e.com>
Subject: Re: [RFC PATCH v3 11/15] context-tracking: Introduce work deferral
 infrastructure

On 24/11/24 22:46, Frederic Weisbecker wrote:
> Le Fri, Nov 22, 2024 at 03:56:59PM +0100, Valentin Schneider a écrit :
>> On 20/11/24 18:30, Frederic Weisbecker wrote:
>> > Le Wed, Nov 20, 2024 at 06:10:43PM +0100, Valentin Schneider a écrit :
>> >> On 20/11/24 15:23, Frederic Weisbecker wrote:
>> >>
>> >> > Ah but there is CT_STATE_GUEST and I see the last patch also applies that to
>> >> > CT_STATE_IDLE.
>> >> >
>> >> > So that could be:
>> >> >
>> >> > bool ct_set_cpu_work(unsigned int cpu, unsigned int work)
>> >> > {
>> >> >    struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
>> >> >    unsigned int old;
>> >> >    bool ret = false;
>> >> >
>> >> >    preempt_disable();
>> >> >
>> >> >    old = atomic_read(&ct->state);
>> >> >
>> >> >    /* CT_STATE_IDLE can be added to last patch here */
>> >> >    if (!(old & (CT_STATE_USER | CT_STATE_GUEST))) {
>> >> >            old &= ~CT_STATE_MASK;
>> >> >            old |= CT_STATE_USER;
>> >> >    }
>> >>
>> >> Hmph, so that lets us leverage the cmpxchg for a !CT_STATE_KERNEL check,
>> >> but we get an extra loop if the target CPU exits kernelspace not to
>> >> userspace (e.g. vcpu or idle) in the meantime - not great, not terrible.
>> >
>> > The thing is, what you read with atomic_read() should be close to reality.
>> > If it already is != CT_STATE_KERNEL then you're good (minus racy changes).
>> > If it is CT_STATE_KERNEL then you still must do a failing cmpxchg() in any case,
>> > at least to make sure you didn't miss a context tracking change. So the best
>> > you can do is a bet.
>> >
>> >>
>> >> At the cost of one extra bit for the CT_STATE area, with CT_STATE_KERNEL=1
>> >> we could do:
>> >>
>> >>   old = atomic_read(&ct->state);
>> >>   old &= ~CT_STATE_KERNEL;
>> >
>> > And perhaps also old |= CT_STATE_IDLE (I'm seeing the last patch now),
>> > so you at least get a chance of making it right (only ~CT_STATE_KERNEL
>> > will always fail) and CPUs usually spend most of their time idle.
>> >
>> 
>> I'm thinking with:
>> 
>>         CT_STATE_IDLE		= 0,
>>         CT_STATE_USER		= 1,
>>         CT_STATE_GUEST		= 2,
>>         CT_STATE_KERNEL		= 4, /* Keep that as a standalone bit */
>
> Right!
>
>> 
>> we can stick with old &= ~CT_STATE_KERNEL; and that'll let the cmpxchg
>> succeed for any of IDLE/USER/GUEST.
>
> Sure but if (old & CT_STATE_KERNEL), cmpxchg() will consistently fail.
> But you can make a bet that it has switched to CT_STATE_IDLE between
> the atomic_read() and the first atomic_cmpxchg(). This way you still have
> a tiny chance to succeed.
>
> That is:
>
>    old = atomic_read(&ct->state);
>    if (old & CT_STATE_KERNEl)
>       old |= CT_STATE_IDLE;
>    old &= ~CT_STATE_KERNEL;
>
>
>    do {
>       atomic_try_cmpxchg(...)
>
> Hmm?

But it could equally be CT_STATE_{USER, GUEST}, right? That is, if we have
all of this enabled them we assume the isolated CPUs spend the least amount
of time in the kernel, if they don't we get to blame the user.