linux-kernel - Re: [PATCH v1 3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals

Message-ID: <59ed032f-cfde-7eda-f755-9d05c15d2828@nvidia.com>
Date:   Tue, 27 Jun 2023 14:34:19 -0700
From:   John Hubbard <jhubbard@...dia.com>
To:     Michal Hocko <mhocko@...e.com>,
        David Hildenbrand <david@...hat.com>
CC:     <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
        <virtualization@...ts.linux-foundation.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        "Oscar Salvador" <osalvador@...e.de>,
        Jason Wang <jasowang@...hat.com>,
        Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
Subject: Re: [PATCH v1 3/5] mm/memory_hotplug: make
 offline_and_remove_memory() timeout instead of failing on fatal signals

On 6/27/23 08:14, Michal Hocko wrote:
> On Tue 27-06-23 16:57:53, David Hildenbrand wrote:
...
>>>> IIUC (John can correct me if I am wrong):
>>>>
>>>> 1) The process holds the device node open
>>>> 2) The process gets killed or quits
>>>> 3) As the process gets torn down, it closes the device node
>>>> 4) Closing the device node results in the driver removing the device and
>>>>      calling offline_and_remove_memory()
>>>>
>>>> So it's not a "tear down process" that triggers that offlining and removal
>>>> somehow explicitly, it's just a side effect of it letting go of the device
>>>> node as the process gets torn down.
>>>
>>> Isn't that just fragile? The operation might fail for other reasons. Why
>>> cannot there be a hold on the resource to control the tear down
>>> explicitly?
>>
>> I'll let John comment on that. But from what I understood, in most setups
>> where ZONE_MOVABLE gets used for hotplugged memory
>> offline_and_remove_memory() succeeds and allows for reusing the device later
>> without a reboot.
>>
>> For the cases where it doesn't work, a reboot is required.
  
That is exactly correct. That's what we ran into.

And there are workarounds (for example: kthreads don't have any signals
pending...), but I did want to follow through here, make -mm aware of the
problem, and see if there is a better way.

...
>>> It seems that offline_and_remove_memory is using the wrong operation then,
>>> if it wants opportunistic offlining with some sort of policy. A timeout
>>> might be just one policy to use, but a failure mode or a retry count might
>>> be a better fit for some users. So rather than (ab)using offline_pages,
>>> would it make more sense to extract the basic offlining steps and allow
>>> drivers like virtio-mem to reuse them and define their own policy?

...like this, perhaps. Sounds promising!


thanks,
-- 
John Hubbard
NVIDIA