Message-ID: <YnoawYtoCSvrK7lb@alley>
Date: Tue, 10 May 2022 09:56:49 +0200
From: Petr Mladek <pmladek@...e.com>
To: Song Liu <songliubraving@...com>
Cc: Song Liu <song@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
"live-patching@...r.kernel.org" <live-patching@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
Josh Poimboeuf <jpoimboe@...hat.com>,
"joe.lawrence@...hat.com" <joe.lawrence@...hat.com>,
Kernel Team <Kernel-team@...com>
Subject: Re: [RFC] sched,livepatch: call klp_try_switch_task in __cond_resched
On Mon 2022-05-09 16:22:11, Song Liu wrote:
>
>
> > On May 9, 2022, at 8:07 AM, Petr Mladek <pmladek@...e.com> wrote:
> >
> > On Sat 2022-05-07 10:46:28, Song Liu wrote:
> >> Busy kernel threads may block the transition of livepatch. Call
> >> klp_try_switch_task from __cond_resched to make the transition easier.
> >
> > Do you have some numbers how this speeds up the transition
> > and how it slows down the scheduler, please?
>
> We don't have numbers on how much this would slow down the scheduler.
> For the transition, we see cases where the transition cannot finish
> within 60 seconds (how long "kpatch load" waits by default).
60s might be too low a limit; see below.
> > cond_resched() is typically called in cycles with many interactions
> > where the task might spend a lot of time. There are two possibilities.
> > cond_resched() is called in:
> >
> > + livepatched function
> >
> > In this case, klp_try_switch_task(current) will always fail.
> > And it will unnecessarily slow down every iteration by
> > checking the very same stack.
> >
> >
> > + non-livepatched function
> >
> > In this case, the transition will succeed on the first attempt.
> > OK, but it would succeed also without that patch. The task would
> > most likely sleep in this cond_resched() so that it might
> > be successfully transitioned on the next occasion.
>
> We are in the non-livepatched case. But the transition didn't happen
> in time, because the kernel thread doesn't go to sleep. While there is
> clearly something weird with this thread, we think livepatching should
> work because the thread does call cond_resched() from time to time.
I guess that it does go to sleep. Otherwise it would trigger a soft
lockup report if you have the watchdog enabled.
IMHO, the problem is that klp_transition_work_fn() tries the
transition "only" once per second, see
void klp_try_complete_transition(void)
{
	[...]
	schedule_delayed_work(&klp_transition_work,
			      round_jiffies_relative(HZ));
	[...]
}
It means that there are "only" 60 attempts to migrate the busy process.
Each attempt fails when the process is in the running state or sleeping
in a livepatched function. There is a _non-zero_ chance of bad luck.
It would be great to measure how long it takes to complete
the transition if you remove the 60s limit.
Anyway, the 60s limit looks like a bad idea to me. It is too low.
For example, we do not use any limit at all in SUSE products.
And the only report was that some thread from a 3rd-party
module could not be migrated. It was stuck with a livepatched
function on the stack. The kthread had a really problematic
design. I am afraid that even this patch would not help
in that case.
Now, back to the limit. There are basically two problems when
the transition takes too long:
+ It blocks another transition. But the frequency of new
livepatches is typically counted in days or even weeks.
+ It means that a process keeps running the buggy/vulnerable
code for a longer time. But a few hours should be acceptable
given how long it takes to prepare the livepatch.
Also it looks better to have 99.9% of processes running
the fixed code than to revert the fix just because a single
process needs a longer time to get transitioned.
I could imagine that there really is a process that is almost
impossible to livepatch. It might get worse on a NO_HZ system.
The question is whether this happens in real life.
I would personally start with prolonging or removing the limit.
Best Regards,
Petr