Message-ID: <Yn5QHpc9YlAbP1li@alley>
Date: Fri, 13 May 2022 14:33:34 +0200
From: Petr Mladek <pmladek@...e.com>
To: Song Liu <songliubraving@...com>
Cc: Josh Poimboeuf <jpoimboe@...nel.org>, Rik van Riel <riel@...com>,
"song@...nel.org" <song@...nel.org>,
"joe.lawrence@...hat.com" <joe.lawrence@...hat.com>,
"peterz@...radead.org" <peterz@...radead.org>,
"mingo@...hat.com" <mingo@...hat.com>,
"vincent.guittot@...aro.org" <vincent.guittot@...aro.org>,
"live-patching@...r.kernel.org" <live-patching@...r.kernel.org>,
Kernel Team <Kernel-team@...com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"jpoimboe@...hat.com" <jpoimboe@...hat.com>
Subject: Re: [RFC] sched,livepatch: call klp_try_switch_task in __cond_resched
On Wed 2022-05-11 16:33:57, Song Liu wrote:
>
>
> > On May 11, 2022, at 2:24 AM, Petr Mladek <pmladek@...e.com> wrote:
> >
> > On Tue 2022-05-10 17:33:31, Josh Poimboeuf wrote:
> >> On Tue, May 10, 2022 at 11:57:04PM +0000, Song Liu wrote:
> >>>> If it's a real bug, we should fix it everywhere, not just for Facebook.
> >>>> Otherwise CONFIG_PREEMPT and/or non-x86 arches become second-class
> >>>> citizens.
> >>>
> >>> I think "is it a real bug?" is the top question for me. So maybe we
> >>> should take a step back.
> >>>
> >>> The behavior we see is: A busy kernel thread blocks klp transition
> >>> for more than a minute. But the transition eventually succeeded after
> >>> < 10 retries on most systems. The kernel thread is well-behaved, as
> >>> it calls cond_resched() at a reasonable frequency, so this is not a
> >>> deadlock.
> >>>
> >>> If I understand Petr correctly, this behavior is expected, and thus
> >>> is not a bug or issue for the livepatch subsystem. This is different
> >>> to our original expectation, but if this is what we agree on, we
> > >>> will look into ways to incorporate long wait times for patch
> > >>> transitions into our automation.
> >>
> >> That's how we've traditionally looked at it, though apparently Red Hat
> >> and SUSE have implemented different ideas of what a long wait time is.
> >>
> >> In practice, one minute has always been enough for all of kpatch's users
> >> -- AFAIK, everybody except SUSE -- up until now.
> >
> > I am actually surprised that nobody has hit the problem yet. There are
> > "only" 60 attempts to transition the pending tasks.
>
> Maybe we should consider increasing how often we retry? Say, to 10 times
> per second? I guess this would solve most of the failures we are seeing
> in the current case.

My concern is that klp_try_complete_transition() checks all processes
under read_lock(&tasklist_lock). It might create some contention
on this lock. I am not sure whether this lock is fair. It might slow
down or even block writers (creating/deleting tasks).

Best Regards,
Petr