netdev - Re: [PATCH 0/2] vhost: improve livepatch switching for heavily loaded vhost worker kthreads

Message-ID: <Y9OzJzHIASUeIrzO@alley>
Date:   Fri, 27 Jan 2023 12:19:03 +0100
From:   Petr Mladek <pmladek@...e.com>
To:     "Seth Forshee (DigitalOcean)" <sforshee@...italocean.com>
Cc:     Jason Wang <jasowang@...hat.com>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        Jiri Kosina <jikos@...nel.org>,
        Miroslav Benes <mbenes@...e.cz>,
        Joe Lawrence <joe.lawrence@...hat.com>,
        Josh Poimboeuf <jpoimboe@...nel.org>,
        virtualization@...ts.linux-foundation.org, kvm@...r.kernel.org,
        netdev@...r.kernel.org, live-patching@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/2] vhost: improve livepatch switching for heavily
 loaded vhost worker kthreads

On Thu 2023-01-26 15:12:35, Seth Forshee (DigitalOcean) wrote:
> On Thu, Jan 26, 2023 at 06:03:16PM +0100, Petr Mladek wrote:
> > On Fri 2023-01-20 16:12:20, Seth Forshee (DigitalOcean) wrote:
> > > We've fairly regularly seen livepatches which cannot transition within kpatch's
> > > timeout period due to busy vhost worker kthreads.
> > 
> > I have missed this detail. Miroslav told me that we have solved
> > something similar some time ago, see
> > https://lore.kernel.org/all/20220507174628.2086373-1-song@kernel.org/
> 
> Interesting thread. I had thought about something along the lines of the
> original patch, but there are some ideas in there that I hadn't
> considered.

Could you please provide some more details about the test system?
Is there anything important to make it reproducible?

The following aspects come to mind. It might require:

   + more workers running on the same system
   + a dedicated CPU for the worker
   + livepatching the function called by work->fn()
   + running the same work again and again
   + a huge and overloaded system


> > Honestly, kpatch's 1 minute timeout looks incredibly low to me. Note
> > that the transition is retried only once per second. It means that there
> > are "only" 60 attempts.
> > 
> > Just by chance, does it help you to increase the timeout, please?
> 
> To be honest my test setup reproduces the problem well enough to make
> KLP wait significant time due to vhost threads, but it seldom causes it
> to hit kpatch's timeout.
> 
> Our system management software will try to load a patch tens of times in
> a day, and we've seen real-world cases where patches couldn't load
> within kpatch's timeout for multiple days. But I don't have such an
> environment readily accessible for my own testing. I can try to refine
> my test case and see if I can get it to that point.

My understanding is that you try to load the patch repeatedly but
it always fails after the 1 minute timeout. It means that every attempt
starts from the beginning (no task has been livepatched yet).

Is there any chance to try it with a longer timeout, for example, one
hour? It should increase the chance of success if there are several
problematic kthreads.
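
To make such an experiment easier to evaluate, here is a minimal sketch
of a watcher that waits for the transition with a configurable timeout
and, if it expires, reports which tasks are still blocking. It relies
only on the documented interfaces /sys/kernel/livepatch/<patch>/transition,
/sys/kernel/livepatch/<patch>/enabled and /proc/<pid>/patch_state. The
function names, the defaults, and the patch name in the usage note are
made up for illustration; this is not how kpatch itself implements its
timeout.

```python
import os
import time


def read_int(path):
    """Read a single integer from a sysfs/procfs style file."""
    with open(path) as f:
        return int(f.read().strip())


def blocking_tasks(target_state, proc_root="/proc"):
    """Return PIDs whose patch_state differs from the transition target.

    Only top-level /proc/<pid> entries are scanned. That covers kthreads
    such as vhost workers; threads of user processes would additionally
    need /proc/<pid>/task/<tid>/patch_state.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            state = read_int(os.path.join(proc_root, entry, "patch_state"))
        except (OSError, ValueError):
            continue  # task exited meanwhile or file not available
        if state != target_state:
            pids.append(int(entry))
    return sorted(pids)


def wait_for_transition(patch, timeout_s=3600, poll_s=1.0,
                        sysfs_root="/sys/kernel/livepatch",
                        proc_root="/proc"):
    """Poll until the transition finishes or timeout_s expires.

    Returns (True, []) on success, otherwise (False, [blocking PIDs]).
    """
    patch_dir = os.path.join(sysfs_root, patch)
    deadline = time.monotonic() + timeout_s
    while True:
        if read_int(os.path.join(patch_dir, "transition")) == 0:
            return True, []
        if time.monotonic() >= deadline:
            # "enabled" holds the target state while in transition:
            # 1 when patching, 0 when unpatching.
            target = read_int(os.path.join(patch_dir, "enabled"))
            return False, blocking_tasks(target, proc_root)
        time.sleep(poll_s)
```

For example, wait_for_transition("livepatch_example", timeout_s=3600)
would wait up to one hour for a hypothetical patch of that name. The
returned PID list should show whether it is always the same vhost
workers that block the transition.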

> > This low timeout might be useful for testing. But in practice, it does
> > not matter whether the transition lasts one hour or even longer.
> > It takes much longer to prepare the livepatch.
> 
> Agreed. And to be clear, we cope with the fact that patches may take
> hours or even days to get applied in some cases. The patches I sent are
> just about improving the only case I've identified which has led to
> kpatch failing to load a patch for a day or longer.

If it is acceptable to wait hours or even days, then the 1 minute
timeout is quite counterproductive. We actually do not use any timeout
at all in livepatches provided by SUSE.

Best Regards,
Petr