Message-ID: <ZJrvhACxmaQmmwYP@dhcp22.suse.cz>
Date:   Tue, 27 Jun 2023 16:17:40 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        virtualization@...ts.linux-foundation.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        John Hubbard <jhubbard@...dia.com>,
        Oscar Salvador <osalvador@...e.de>,
        Jason Wang <jasowang@...hat.com>,
        Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
Subject: Re: [PATCH v1 3/5] mm/memory_hotplug: make
 offline_and_remove_memory() timeout instead of failing on fatal signals

On Tue 27-06-23 15:14:11, David Hildenbrand wrote:
> On 27.06.23 14:40, Michal Hocko wrote:
> > On Tue 27-06-23 13:22:18, David Hildenbrand wrote:
> > > John Hubbard writes [1]:
> > > 
> > >          Some device drivers add memory to the system via memory hotplug.
> > >          When the driver is unloaded, that memory is hot-unplugged.
> > > 
> > >          However, memory hot unplug can fail. And these days, it fails a
> > >          little too easily, with respect to the above case. Specifically, if
> > >          a signal is pending on the process, hot unplug fails.
> > > 
> > >          [...]
> > > 
> > >          So in this case, other things (unmovable pages, un-splittable huge
> > >          pages) can also cause the above problem. However, those are
> > >          demonstrably less common than simply having a pending signal. I've
> > >          got bug reports from users who can trivially reproduce this by
> > >          killing their process with a "kill -9", for example.
> > 
> > This looks like a bug in said driver, no? If the tear down process is
> > killed, the kill could very well happen right before offlining, so you
> > end up in the very same state. Or what am I missing?
> 
> IIUC (John can correct me if I am wrong):
> 
> 1) The process holds the device node open
> 2) The process gets killed or quits
> 3) As the process gets torn down, it closes the device node
> 4) Closing the device node results in the driver removing the device and
>    calling offline_and_remove_memory()
> 
> So it's not a "tear down process" that explicitly triggers the offlining
> and removal; it's just a by-product of the process letting go of the
> device node as it gets torn down.
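
Purely for illustration, the flow above might look roughly like this in a
hypothetical driver. offline_and_remove_memory() is the real hotplug helper
(two-argument form of recent kernels); every other name here is made up:

#include <linux/fs.h>               /* struct file, struct inode */
#include <linux/memory_hotplug.h>   /* offline_and_remove_memory() */

/* Hypothetical per-device state, for illustration only. */
struct example_dev {
        u64 mem_start;  /* base of the hotplugged range */
        u64 mem_size;   /* size of the hotplugged range */
};

static int example_remove_device(struct example_dev *dev)
{
        /*
         * Best effort: today, with a fatal signal pending in the dying
         * task, this fails almost immediately with -EINTR.
         */
        return offline_and_remove_memory(dev->mem_start, dev->mem_size);
}

/* ->release() of the device node, called as the killed process exits. */
static int example_release(struct inode *inode, struct file *file)
{
        return example_remove_device(file->private_data);
}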

Isn't that just fragile? The operation might fail for other reasons. Why
can't there be an explicit hold on the resource to control the tear down?

> > > Especially with ZONE_MOVABLE, offlining is supposed to work in most
> > > cases when offlining actually hotplugged (not boot) memory, and only fail
> > > in rare corner cases (e.g., some driver holds a reference to a page in
> > > ZONE_MOVABLE, turning it unmovable).
> > > 
> > > In these corner cases we really don't want to be stuck forever in
> > > offline_and_remove_memory(). But in the general cases, we really want to
> > > do our best to make memory offlining succeed -- in a reasonable
> > > timeframe.
> > > 
> > > Reliably failing in the described case when there is a fatal signal pending
> > > is sub-optimal. The pending signal check is mostly only relevant when user
> > > space explicitly triggers offlining of memory using sysfs device attributes
> > > ("state" or "online" attribute), but not when coming via
> > > offline_and_remove_memory().
> > > 
> > > So let's use a timer instead and ignore fatal signals, because they are
> > > not really meaningful for offline_and_remove_memory() users. Let's
> > > default to 30 seconds if no timeout was specified, and limit the timeout
> > > to 120 seconds.
> > 
> > I really hate having timeouts back. They have proven to be hard to get
> > right, and a timeout is essentially a policy implemented in the kernel.
> > Policies simply do not belong in kernel space IMHO.
> 
> As much as I agree with you for offlining triggered from user space (e.g.,
> writing the "state" or "online" attribute), where user space is actually
> in charge and can do something reasonable (timeout, retry, whatever), in
> the offline_and_remove_memory() case it's the driver that wants a
> best-effort memory offlining+removal.
> 
> If it times out, virtio-mem will simply try another block or retry later.
> Right now, it could get stuck forever in offline_and_remove_memory(), which
> is obviously "not great". Fortunately, for virtio-mem it's configurable,
> and we use the alloc_contig_range() method as the default for now.
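
To make that caller side concrete, here is a rough sketch of my reading
(not virtio-mem's actual code); a best-effort caller would just move on
when the call fails or times out:

/* Sketch of a best-effort caller; try_unplug_block() is made up. */
static void try_unplug_block(u64 start, u64 size)
{
        int rc = offline_and_remove_memory(start, size);

        if (rc) {
                /*
                 * Offlining failed (or, with this patch, timed out):
                 * skip the block and retry later, or try another block,
                 * instead of blocking forever.
                 */
                pr_debug("unplug of block at %#llx failed: %d\n",
                         start, rc);
        }
}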

It seems that offline_and_remove_memory is using the wrong operation then,
if what it wants is opportunistic offlining with some sort of policy. A
timeout might be just one policy to use, but a failure mode or a retry
count might be a better fit for some users. So rather than (ab)using
offline_pages, would it make more sense to extract the basic offlining
steps and allow drivers like virtio-mem to reuse them and define their own
policy?
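
Very roughly, and purely hypothetically (nothing like this exists today),
something along these lines:

/* Hypothetical interface sketch; not a real API. */
struct offline_policy {
        unsigned int  max_retries;        /* 0 == single attempt */
        unsigned long timeout_ms;         /* 0 == no deadline */
        bool          heed_fatal_signals; /* the sysfs path would set this */
};

/*
 * The core would export the basic offlining steps (isolate, migrate,
 * dissolve, remove) and loop or bail out according to the caller's
 * policy, instead of hard-coding a signal check or a timer for everybody.
 */
int offline_and_remove_memory_policy(u64 start, u64 size,
                                     const struct offline_policy *policy);

virtio-mem could then ask for a few retries, John's driver for a deadline,
and the sysfs path would keep reacting to fatal signals as it does today.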

> If it times out for John's driver, we most certainly don't want to be
> stuck in offline_and_remove_memory(), blocking device/driver unloading
> (and even a reboot IIRC) possibly forever.

Now I am confused. John's driver wants to tear down the device now? AFAIU
from the initial problem description, there is no way to release that
memory otherwise.
-- 
Michal Hocko
SUSE Labs
