linux-kernel - Re: [RFC] sched,livepatch: call klp_try_switch_task in __cond


Message-ID: <78DFED12-571B-489C-A662-DA333555266B@fb.com>
Date:   Wed, 11 May 2022 16:33:57 +0000
From:   Song Liu <songliubraving@...com>
To:     Petr Mladek <pmladek@...e.com>
CC:     Josh Poimboeuf <jpoimboe@...nel.org>, Rik van Riel <riel@...com>,
        "song@...nel.org" <song@...nel.org>,
        "joe.lawrence@...hat.com" <joe.lawrence@...hat.com>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
        "live-patching@...r.kernel.org" <live-patching@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "jpoimboe@...hat.com" <jpoimboe@...hat.com>
Subject: Re: [RFC] sched,livepatch: call klp_try_switch_task in __cond_resched



> On May 11, 2022, at 2:24 AM, Petr Mladek <pmladek@...e.com> wrote:
> 
> On Tue 2022-05-10 17:33:31, Josh Poimboeuf wrote:
>> On Tue, May 10, 2022 at 11:57:04PM +0000, Song Liu wrote:
>>>> If it's a real bug, we should fix it everywhere, not just for Facebook.
>>>> Otherwise CONFIG_PREEMPT and/or non-x86 arches become second-class
>>>> citizens.
>>> 
>>> I think "is it a real bug?" is the top question for me. So maybe we 
>>> should take a step back.
>>> 
>>> The behavior we see is: A busy kernel thread blocks klp transition 
>>> for more than a minute. But the transition eventually succeeded after 
>>> < 10 retries on most systems. The kernel thread is well-behaved, as 
>>> it calls cond_resched() at a reasonable frequency, so this is not a 
>>> deadlock. 
>>> 
>>> If I understand Petr correctly, this behavior is expected, and thus 
>>> is not a bug or issue for the livepatch subsystem. This is different
>>> to our original expectation, but if this is what we agree on, we 
>>> will look into ways to incorporate long wait time for patch 
>>> transition in our automations. 
>> 
>> That's how we've traditionally looked at it, though apparently Red Hat
>> and SUSE have implemented different ideas of what a long wait time is.
>> 
>> In practice, one minute has always been enough for all of kpatch's users
>> -- AFAIK, everybody except SUSE -- up until now.
> 
> I am actually surprised that nobody met the problem yet. There are
> "only" 60 attempts to transition the pending tasks.

Maybe we should consider increasing the frequency of our attempts, say
to 10 times per second? I guess this would resolve most of the failures
we are seeing in the current case.
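
To put rough numbers on this, here is a back-of-the-envelope sketch, not
kernel code. It assumes each attempt independently catches the kthread
in a switchable state with some fixed probability. That is a
simplification: attempts made 100 ms apart are likely more correlated
than attempts made 1 s apart, so the real gain would be smaller. The
function name and the example probability are made up for illustration.

```python
def transition_failure_probability(p_success: float,
                                   attempts_per_second: float,
                                   timeout_seconds: float) -> float:
    """Probability that every attempt within the timeout fails.

    Assumption: attempts are independent and each one succeeds with
    probability p_success. Real attempts are correlated, so treat the
    result as an optimistic estimate.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be between 0 and 1")
    # Total number of attempts that fit into the timeout window.
    attempts = int(attempts_per_second * timeout_seconds)
    return (1.0 - p_success) ** attempts


if __name__ == "__main__":
    p = 0.02  # hypothetical per-attempt success chance for a busy kthread
    for rate in (1, 10):
        fail = transition_failure_probability(p, rate, 60)
        print(f"{rate:>2} attempts/s over 60 s: P(fail) = {fail:.3g}")
```

Under these assumptions, going from 60 to 600 attempts per minute drops
the failure probability by several orders of magnitude, which is why a
higher retry rate looks attractive as a first step.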

> 
> Well, the problem is mainly with kthreads. User space processes are
> migrated also on the kernel boundary. And the fake signal is likely
> pretty effective here. And it probably is not that common that
> a kthread would occupy a single CPU all the time.
> 
> 
>> Though, these options might be considered workarounds, as it's
>> theoretically possible for a kthread to be CPU-bound indefinitely,
>> beyond any arbitrarily chosen timeout.  But maybe that's not realistic
>> beyond a certain timeout value of X and we don't care?  I dunno.
> 
> I agree that it might happen theoretically. And it would be great
> to be prepared for this. My only concern is the complexity and risk.
> We should know that it is worth it.
> 
> 
>> As I have been trying to say, that won't work for PREEMPT+!ORC, because,
>> when the kthread gets preempted, the stack trace will be attempted from
>> an IRQ and will be reported as unreliable.
> 
> This limits the range of possible solutions quite a lot. But it is
> how it is.
> 
>> Ideally we'd have the ORC unwinder for all arches, that would make this
>> much easier.  But we're not there yet.
> 
> The alternative solution is that the process has to migrate itself
> on some safe location.
> 
> One crazy idea. It still might be possible to find the called
> functions on the stack even when it is not reliable. Then it
> might be possible to add another ftrace handler on
> these found functions. This other ftrace handler might migrate
> the task when it calls this function again.
> 
> It assumes that the task will call the same functions again
> and again. Also it might require that the task checks its
> own stack from the ftrace handler. I am not sure if this
> is possible.
> 
> There might be other variants of this approach.

This might be the ultimate solution! As ftrace allows filtering based
on pid (/sys/kernel/tracing/set_ftrace_pid), we can technically trigger
klp_try_switch_task() on every function entry of the pending tasks. If
this works, we should finish most of the transition in seconds. The only
failures would then be threads that have a to-be-patched function at the
very bottom of their stack. Am I too optimistic here?
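
To make the failure case concrete, here is a toy model of the idea, not
kernel code. It assumes a task can be switched at a function entry only
when none of the functions currently on its stack is being patched. The
helper names and the function names in the example are hypothetical.

```python
from typing import Iterable, Optional, Sequence, Set


def can_switch(stack: Sequence[str], patched: Set[str]) -> bool:
    """True if no to-be-patched function is on the stack.

    The stack is listed bottom (outermost caller) first.
    """
    return not any(fn in patched for fn in stack)


def first_switch_point(trace: Iterable[Sequence[str]],
                       patched: Set[str]) -> Optional[int]:
    """Index of the first function entry at which the task could switch.

    trace holds one stack snapshot per function entry, in time order.
    Returns None if the task never becomes switchable.
    """
    for i, stack in enumerate(trace):
        if can_switch(stack, patched):
            return i
    return None


if __name__ == "__main__":
    patched = {"do_work"}  # hypothetical function being patched

    # The kthread leaves do_work() and re-enters it on every iteration.
    ok = [["kthread_fn", "do_work", "helper"],
          ["kthread_fn", "cond_resched"]]
    print(first_switch_point(ok, patched))

    # The patched function sits at the bottom of the stack and never returns.
    stuck = [["do_work", "helper"],
             ["do_work", "cond_resched"]]
    print(first_switch_point(stuck, patched))
```

In the first trace the task becomes switchable as soon as do_work()
returns and another function is entered. In the second it never does,
which is the remaining failure case mentioned above.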

Thanks,
Song