linux-kernel - Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <f7511bd9-edc7-4ad7-ba1f-2920d7d94517@redhat.com>
Date: Fri, 10 May 2024 09:54:20 +0200
From: David Hildenbrand <david@...hat.com>
To: John Hubbard <jhubbard@...dia.com>, Alistair Popple <apopple@...dia.com>,
 Jason Gunthorpe <jgg@...dia.com>
Cc: Christoph Hellwig <hch@...radead.org>,
 Andrew Morton <akpm@...ux-foundation.org>,
 LKML <linux-kernel@...r.kernel.org>, linux-rdma@...r.kernel.org,
 linux-mm@...ck.org, Mike Marciniszyn <mike.marciniszyn@...el.com>,
 Leon Romanovsky <leon@...nel.org>, Artemy Kovalyov <artemyko@...dia.com>,
 Michael Guralnik <michaelgur@...dia.com>, Pak Markthub
 <pmarkthub@...dia.com>, Vivek Kasireddy <vivek.kasireddy@...el.com>
Subject: Re: [RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to
 migration glitches

On 02.05.24 20:10, John Hubbard wrote:
> On 5/1/24 11:56 PM, David Hildenbrand wrote:
>> On 02.05.24 03:05, Alistair Popple wrote:
>>> Jason Gunthorpe <jgg@...dia.com> writes:
> ...
>>>>> This doesn't make sense.  IFF a blind retry is all that is needed it
>>>>> should be done in the core functionality.  I fear it's not that easy,
>>>>> though.
>>>>
>>>> +1
>>>>
>>>> This migration retry weirdness is a GUP issue, it needs to be solved
>>>> in the mm not exposed to every pin_user_pages caller.
>>>>
>>>> If it turns out ZONE_MOVEABLE pages can't actually be reliably moved
>>>> then it is pretty broken..
>>>
>>> I wonder if we should remove the arbitrary retry limit in
>>> migrate_pages() entirely for ZONE_MOVEABLE pages and just loop until
>>> they migrate? By definition there should only be transient references on
>>> these pages so why do we need to limit the number of retries in the
>>> first place?
>>
>> There are some weird things that still needs fixing: vmsplice() is the
>> example that doesn't use FOLL_LONGTERM.
>>
> 
> Hi David!
> 

Sorry for the late reply!

> Do you have any other call sites in mind? It sounds like one way forward
> is to fix each call site...

We also have udmabuf, that is currently getting fixed [1] similarly to 
how we handle GUP. Could you and/or Jason also have a look at the 
GUP-related bits? I acked it but the patch set does not seem to make 
progress.

https://lkml.kernel.org/r/20240411070157.3318425-1-vivek.kasireddy@intel.com

The sad story is:

(a) vmsplice() is harder to fix (identify all put_page() and replace 
them by unpin_user_page()), but will get fixed at some point.

(b) even !longterm DMA can take double-digit seconds

(c) other wrong code is likely to exist or to appear again and it's
     rather hard to identify+prevent reliably

IMHO we should expect migration to take a longer time and maybe never 
happening in some cases.

Memory offlining (e.g., echo "offline" > 
sys/devices/system/memory/memory0/state) currently tries forever to 
migrate pages and can be killed if started from user space using a fatal 
signal. If memory offlining happens from kernel context (unplugging 
DIMM, ACPI code triggers offlining), we'd much rather want to fail at 
some point instead of looping forever, but it hasn't really popped up as 
a problem so far.

virtio-mem uses alloc_contig_range() for best-effort allocation and will 
skip such temporarily unmovable ranges to try again later. Here, we 
really don't want to loop forever in migration code but rather fail 
earlier and try unplug of another memory block.

So as long as page pinning is triggered from user context where the user 
can simply abort the process (e.g., kill the process), sleep+retry on 
ZONE_MOVABLE + MIGRATE_CMA sounds reasonable.

> 
> This is an unhappy story right now: the pin_user_pages*() APIs are
> significantly worse than before the "migrate pages away automatically"
> upgrade, from a user point of view. Because now, the APIs fail
> intermittently for callers who follow the rules--because
> pin_user_pages() is not fully working yet, basically.
> 
> Other ideas, large or small, about how to approach a fix?

What Jason says makes sense to me: sleep+retry. My only concern is when 
pin_user_pages_*() is called from non-killable context where failing at 
some point might be more reasonable. But maybe that use case doesn't 
really exist?

-- 
Cheers,

David / dhildenb