[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y9OzJzHIASUeIrzO@alley>
Date: Fri, 27 Jan 2023 12:19:03 +0100
From: Petr Mladek <pmladek@...e.com>
To: "Seth Forshee (DigitalOcean)" <sforshee@...italocean.com>
Cc: Jason Wang <jasowang@...hat.com>,
"Michael S. Tsirkin" <mst@...hat.com>,
Jiri Kosina <jikos@...nel.org>,
Miroslav Benes <mbenes@...e.cz>,
Joe Lawrence <joe.lawrence@...hat.com>,
Josh Poimboeuf <jpoimboe@...nel.org>,
virtualization@...ts.linux-foundation.org, kvm@...r.kernel.org,
netdev@...r.kernel.org, live-patching@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/2] vhost: improve livepatch switching for heavily
loaded vhost worker kthreads
On Thu 2023-01-26 15:12:35, Seth Forshee (DigitalOcean) wrote:
> On Thu, Jan 26, 2023 at 06:03:16PM +0100, Petr Mladek wrote:
> > On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> > > We've fairly regularaly seen liveptches which cannot transition within kpatch's
> > > timeout period due to busy vhost worker kthreads.
> >
> > I have missed this detail. Miroslav told me that we have solved
> > something similar some time ago, see
> > https://lore.kernel.org/all/20220507174628.2086373-1-song@kernel.org/
>
> Interesting thread. I had thought about something along the lines of the
> original patch, but there are some ideas in there that I hadn't
> considered.
Could you please provide some more details about the test system?
Is there anything important to make it reproducible?
The following aspects come to my mind. It might require:
+ more workers running on the same system
+ have a dedicated CPU for the worker
+ livepatching the function called by work->fn()
+ running the same work again and again
+ huge and overloaded system
> > Honestly, kpatch's timeout 1 minute looks incredible low to me. Note
> > that the transition is tried only once per minute. It means that there
> > are "only" 60 attempts.
> >
> > Just by chance, does it help you to increase the timeout, please?
>
> To be honest my test setup reproduces the problem well enough to make
> KLP wait significant time due to vhost threads, but it seldom causes it
> to hit kpatch's timeout.
>
> Our system management software will try to load a patch tens of times in
> a day, and we've seen real-world cases where patches couldn't load
> within kpatch's timeout for multiple days. But I don't have such an
> environment readily accessible for my own testing. I can try to refine
> my test case and see if I can get it to that point.
My understanding is that you try to load the patch repeatedly but
it always fails after the 1 minute timeout. It means that it always
starts from the beginning (no livepatched process).
Is there any chance to try it with a longer timeout, for example, one
hour? It should increase the chance if there are more problematic kthreads.
> > This low timeout might be useful for testing. But in practice, it does
> > not matter when the transition is lasting one hour or even longer.
> > It takes much longer time to prepare the livepatch.
>
> Agreed. And to be clear, we cope with the fact that patches may take
> hours or even days to get applied in some cases. The patches I sent are
> just about improving the only case I've identified which has lead to
> kpatch failing to load a patch for a day or longer.
If it is acceptable to wait hours or even days then the 1 minute
timeout is quite contra-productive. We actually do not use any timeout
at all in livepatches provided by SUSE.
Best Regards,
Petr
Powered by blists - more mailing lists