linux-kernel - Re: [PATCH] mm: Fix __wp_page_copy_user fallback path for remote mm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0977a33b-8318-43a5-a5a1-4eb8c93ca270@redhat.com>
Date: Thu, 7 Nov 2024 18:14:37 +0100
From: David Hildenbrand <david@...hat.com>
To: Asahi Lina <lina@...hilina.net>, Andrew Morton <akpm@...ux-foundation.org>
Cc: Sergio Lopez Pascual <slp@...hat.com>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, asahi@...ts.linux.dev
Subject: Re: [PATCH] mm: Fix __wp_page_copy_user fallback path for remote mm

On 07.11.24 17:43, Asahi Lina wrote:
> On 11/5/24 9:03 PM, David Hildenbrand wrote:
>> On 01.11.24 13:08, Asahi Lina wrote:
>>> If the source page is a PFN mapping, we copy back from userspace.
>>> However, if this fault is a remote access, we cannot use
>>> __copy_from_user_inatomic. Instead, use access_remote_vm() in this case.
>>>
>>> Fixes WARN and incorrect zero-filling when writing to CoW mappings in
>>> a remote process, such as when using gdb on a binary present on a DAX
>>> filesystem.
>>>
>>> [  143.683782] ------------[ cut here ]------------
>>> [  143.683784] WARNING: CPU: 1 PID: 350 at mm/memory.c:2904
>>> __wp_page_copy_user+0x120/0x2bc
>>> [  143.683793] CPU: 1 PID: 350 Comm: gdb Not tainted 6.6.52 #1
>>> [  143.683794] Hardware name: linux,dummy-virt (DT)
>>> [  143.683795] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS
>>> BTYPE=--)
>>> [  143.683796] pc : __wp_page_copy_user+0x120/0x2bc
>>> [  143.683798] lr : __wp_page_copy_user+0x254/0x2bc
>>> [  143.683799] sp : ffff80008272b8b0
>>> [  143.683799] x29: ffff80008272b8b0 x28: 0000000000000000 x27:
>>> ffff000083bad580
>>> [  143.683801] x26: 0000000000000000 x25: 0000fffff7fd5000 x24:
>>> ffff000081db04c0
>>> [  143.683802] x23: ffff00014f24b000 x22: fffffc00053c92c0 x21:
>>> ffff000083502150
>>> [  143.683803] x20: 0000fffff7fd5000 x19: ffff80008272b9d0 x18:
>>> 0000000000000000
>>> [  143.683804] x17: ffff000081db0500 x16: ffff800080fe52a0 x15:
>>> 0000fffff7fd5000
>>> [  143.683804] x14: 0000000000bb1845 x13: 0000000000000080 x12:
>>> ffff80008272b880
>>> [  143.683805] x11: ffff000081d13600 x10: ffff000081d13608 x9 :
>>> ffff000081d1360c
>>> [  143.683806] x8 : ffff000083a16f00 x7 : 0000000000000010 x6 :
>>> ffff00014f24b000
>>> [  143.683807] x5 : ffff00014f24c000 x4 : 0000000000000000 x3 :
>>> ffff000083582000
>>> [  143.683807] x2 : 0000000000000f80 x1 : 0000fffff7fd5000 x0 :
>>> 0000000000001000
>>> [  143.683808] Call trace:
>>> [  143.683809]  __wp_page_copy_user+0x120/0x2bc
>>> [  143.683810]  wp_page_copy+0x98/0x5c0
>>> [  143.683813]  do_wp_page+0x250/0x530
>>> [  143.683814]  __handle_mm_fault+0x278/0x284
>>> [  143.683817]  handle_mm_fault+0x64/0x1e8
>>> [  143.683819]  faultin_page+0x5c/0x110
>>> [  143.683820]  __get_user_pages+0xc8/0x2f4
>>> [  143.683821]  get_user_pages_remote+0xac/0x30c
>>> [  143.683823]  __access_remote_vm+0xb4/0x368
>>> [  143.683824]  access_remote_vm+0x10/0x1c
>>> [  143.683826]  mem_rw.isra.0+0xc4/0x218
>>> [  143.683831]  mem_write+0x18/0x24
>>> [  143.683831]  vfs_write+0xa0/0x37c
>>> [  143.683834]  ksys_pwrite64+0x7c/0xc0
>>> [  143.683834]  __arm64_sys_pwrite64+0x20/0x2c
>>> [  143.683835]  invoke_syscall+0x48/0x10c
>>> [  143.683837]  el0_svc_common.constprop.0+0x40/0xe0
>>> [  143.683839]  do_el0_svc+0x1c/0x28
>>> [  143.683841]  el0_svc+0x3c/0xdc
>>> [  143.683846]  el0t_64_sync_handler+0x120/0x12c
>>> [  143.683848]  el0t_64_sync+0x194/0x198
>>> [  143.683849] ---[ end trace 0000000000000000 ]---
>>>
>>> Signed-off-by: Asahi Lina <lina@...hilina.net>
>>> ---
>>>    mm/memory.c | 7 ++++++-
>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index
>>> 3ccee51adfbbd007b24331fe6874265f231a877b..dba25d9734063ac02cdaeb0a5cd5432473f6372e 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3081,13 +3081,18 @@ static inline int __wp_page_copy_user(struct
>>> page *dst, struct page *src,
>>>                update_mmu_cache_range(vmf, vma, addr, vmf->pte, 1);
>>>        }
>>>    +    /* If the mm is a remote mm, copy in the page using
>>> access_remote_vm() */
>>> +    if (current->mm != mm) {
>>> +        if (access_remote_vm(mm, (unsigned long)uaddr, kaddr,
>>> PAGE_SIZE, 0) != PAGE_SIZE)
>>
>> access_remote_vm() will do a mmap_read_lock_killable() and then call
>> into get_user_page_vma_remote() -- fortunately read-access, otherwise
>> we'd be in trouble :) .
>>
>> So we should already be holding the mmap read lock from the previous
>> access_remote_vm() users (who we end up here) ... doesn't this complain
>> with lockdep about recursive locking?
>>
>> I keep forgetting locking rules, so I might just be wrong.
> 
> You're right, this complains with lockdep:
> 
> [   23.154031]
> [   23.154093] ============================================
> [   23.154193] WARNING: possible recursive locking detected
> [   23.154229] 6.6.52 #2 Not tainted
> [   23.154270] --------------------------------------------
> [   23.154306] gdb/349 is trying to acquire lock:
> [   23.154343] ffff0000862e3450 (&mm->mmap_lock){++++}-{3:3}, at:
> __access_remote_vm+0x3c/0x3a8
> [   23.154431]
> [   23.154431] but task is already holding lock:
> [   23.154474] ffff0000862e3450 (&mm->mmap_lock){++++}-{3:3}, at:
> __access_remote_vm+0x3c/0x3a8
> [   23.154553]
> [   23.154553] other info that might help us debug this:
> [   23.154598]  Possible unsafe locking scenario:
> [   23.154598]
> [   23.154641]        CPU0
> [   23.154665]        ----
> [   23.154685]   lock(&mm->mmap_lock);
> [   23.154712]   lock(&mm->mmap_lock);
> [   23.154741]
> [   23.154741]  *** DEADLOCK ***
> [   23.154741]
> [   23.154790]  May be due to missing lock nesting notation
> [   23.154790]
> [   23.154838] 2 locks held by gdb/349:
> [   23.154868]  #0: ffff0000835b53f8 (sb_writers#4){.+.+}-{0:0}, at:
> vfs_write+0x84/0x2e0
> [   23.154945]  #1: ffff0000862e3450 (&mm->mmap_lock){++++}-{3:3}, at:
> __access_remote_vm+0x3c/0x3a8
> [   23.155023]
> [   23.155023] stack backtrace:
> [   23.155060] CPU: 5 PID: 349 Comm: gdb Not tainted 6.6.52 #2
> [   23.155112] Hardware name: linux,dummy-virt (DT)
> [   23.155148] Call trace:
> [   23.155167]  dump_backtrace+0x98/0x118
> [   23.155209]  show_stack+0x18/0x24
> [   23.155240]  dump_stack_lvl+0x60/0xac
> [   23.155292]  dump_stack+0x18/0x24
> [   23.155320]  print_deadlock_bug+0x260/0x34c
> [   23.155364]  validate_chain+0x364/0x4c0
> [   23.155393]  __lock_acquire+0x564/0xb64
> [   23.155420]  lock_acquire.part.0+0x9c/0x1bc
> [   23.155448]  lock_acquire+0x9c/0x140
> [   23.155477]  down_read_killable+0x44/0x158
> [   23.155521]  __access_remote_vm+0x3c/0x3a8
> [   23.155562]  __wp_page_copy_user+0x13c/0x3a8
> [   23.155611]  wp_page_copy+0x98/0x4d8
> [   23.155640]  do_wp_page+0x290/0x594
> [   23.155671]  __handle_mm_fault+0x258/0x25c
> [   23.155712]  handle_mm_fault+0x64/0x1f0
> [   23.155755]  faultin_page+0x64/0x138
> [   23.155798]  __get_user_pages+0x11c/0x340
> [   23.155843]  get_user_pages_remote+0xc4/0x404
> [   23.155895]  __access_remote_vm+0xf4/0x3a8
> [   23.155922]  access_remote_vm+0x10/0x1c
> [   23.155952]  mem_rw.isra.0+0xc4/0x218
> [   23.155996]  mem_write+0x18/0x24
> [   23.156023]  vfs_write+0xa4/0x2e0
> [   23.156066]  ksys_pwrite64+0x7c/0xc0
> [   23.156109]  __arm64_sys_pwrite64+0x20/0x2c
> [   23.156152]  invoke_syscall+0x48/0x10c
> [   23.156196]  el0_svc_common.constprop.0+0x40/0xe0
> [   23.156249]  do_el0_svc+0x1c/0x28
> [   23.156293]  el0_svc+0x54/0x140
> [   23.156334]  el0t_64_sync_handler+0x120/0x12c
> [   23.156384]  el0t_64_sync+0x194/0x198
> 
> I guess the locking implementation is recursive so that's why this
> didn't actually deadlock...
> 
> I'm not sure what the right way to do this is then. The underlying
> reason why the fallback code is being called is that do_wp_page() calls
> vm_normal_page(), which returns NULL for VM_PFNMAP pages. So vmf->page
> is NULL and __wp_page_copy_user has to use the fallback path. However,
> the reason GUP works is that follow_page_pte() and friends have a
> specific fallback path for the pte_devmap() case that grabs a struct
> page anyway. Maybe similar logic should be in do_wp_page() so it can
> grab a struct page for PFN mappings too?

There is currently WIP to remove pte_devmap() and make vm_normal_page() 
return these pages as well.

But that would not be in VM_PFNMAP mappings, because VM_PFNMAP means 
"don't you ever look at the struct page".

Likely, you do not have a VM_PFNMAP mapping here but instead a 
VM_MIXEDMAP mapping(or likely no special mapping at all)?

vm_normal_page() returns NULL for pte_devmap(), independent of 
VM_PFNMAP, because pte_special() should succeed on them.



I recall that there is still a problem with false-positives on 
folio_test_anon() with ZONE_DEVICE pages, so it's maybe not that easy 
... and the whole get_dev_pagemap() stuff is nasty.

Likely we would have to do what GUP does, and temporarily grab a pgmap 
reference. Gah.


So if we sort out the pagemap stuff and the possibly wrong 
folio_test_anon() on some ZONE_DEVICE pages (but not all, because IIRC 
DEVICE_PRIVATE can be anon ...), it might be doable.

But it sounds ugly, especially because that code might change soon and 
not require messing with ZONE_DEVICE pages on that level.

And then, we'd not be able to handle VM_PFNMAP cleanly ...


Maybe we could test if the PFN has a directmap and simply read using 
that? I mean, that's what kmap_local_page() ends up doing on systems 
without highmem ... and without !defined(HASHED_PAGE_VIRTUAL) && 
!defined(WANT_PAGE_VIRTUAL) the kmap_local_page() really just is a 
page_to_virt(), which is mostly mapping a PFN to the corresponding 
virtual address ...

But it doesn't universally work ...

> 
> Or if the problem is just the lock, would just eliding the locking work?
> I guess that only works if all the calls into wp_page_copy() are
> guaranteed to hold the mmap lock already, but I don't know if that is
> true...

The whole "GUP recursively calling into GUP" code looks concerning. 
Could we even trigger a case where we get a recursive page fault handler 
call, because of some odd race? (concurrent MADV_DONTNEED or similar)

I think we should much rather fail this remote fault if there is no easy 
way to make it work right now.

At least I suspect this is primarily a "debugger" scenario that didn't 
work so far and we could leave it "not working because not supported" in 
a nicer way?


If this really must succeed, I can spend some time thinking about how to 
do this cleaner ...

-- 
Cheers,

David / dhildenb