linux-kernel - Re: [PATCH 2/3] hugetlbfs: close race between MADV

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20231002043958.GB11194@monkey>
Date:   Sun, 1 Oct 2023 21:39:58 -0700
From:   Mike Kravetz <mike.kravetz@...cle.com>
To:     riel@...riel.com
Cc:     linux-kernel@...r.kernel.org, kernel-team@...a.com,
        linux-mm@...ck.org, akpm@...ux-foundation.org,
        muchun.song@...ux.dev, leit@...a.com, willy@...radead.org
Subject: Re: [PATCH 2/3] hugetlbfs: close race between MADV_DONTNEED and page
 fault

On 09/30/23 20:55, riel@...riel.com wrote:
> From: Rik van Riel <riel@...riel.com>
> 
> Malloc libraries, like jemalloc and tcalloc, take decisions on when
> to call madvise independently from the code in the main application.
> 
> This sometimes results in the application page faulting on an address,
> right after the malloc library has shot down the backing memory with
> MADV_DONTNEED.
> 
> Usually this is harmless, because we always have some 4kB pages
> sitting around to satisfy a page fault. However, with hugetlbfs
> systems often allocate only the exact number of huge pages that
> the application wants.
> 
> Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
> any lock taken on the page fault path, which can open up the following
> race condition:
> 
>        CPU 1                            CPU 2
> 
>        MADV_DONTNEED
>        unmap page
>        shoot down TLB entry
>                                        page fault
>                                        fail to allocate a huge page
>                                        killed with SIGBUS
>        free page
> 
> Fix that race by pulling the locking from __unmap_hugepage_final_range
> into helper functions called from zap_page_range_single. This ensures
> page faults stay locked out of the MADV_DONTNEED VMA until the
> huge pages have actually been freed.
> 
> Signed-off-by: Rik van Riel <riel@...riel.com>
> ---
>  include/linux/hugetlb.h | 35 +++++++++++++++++++++++++++++++++--
>  mm/hugetlb.c            | 20 +++++++++++---------
>  mm/memory.c             | 13 ++++++++-----

Hi Rik,

Something is not right here.  I have not looked closely at the patch,
but running libhugetlbfs test suite hits this NULL deref in misalign (2M: 32).

[   51.891236] BUG: kernel NULL pointer dereference, address: 00000000000001c0
[   51.892420] #PF: supervisor read access in kernel mode
[   51.893353] #PF: error_code(0x0000) - not-present page
[   51.894207] PGD 80000001eeac0067 P4D 80000001eeac0067 PUD 1fa577067 PMD 0 
[   51.895299] Oops: 0000 [#1] PREEMPT SMP PTI
[   51.896010] CPU: 0 PID: 1004 Comm: misalign Not tainted 6.6.0-rc3-next-20230925+ #13
[   51.897285] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[   51.898674] RIP: 0010:__hugetlb_zap_begin+0x76/0x90
[   51.899488] Code: 06 48 8b 3a 48 39 cf 73 11 48 81 c7 ff ff ff 3f 48 81 e7 00 00 00 c0 48 89 3a 48 89 df e8 42 cd ff ff 48 8b 83 88 00 00 00 5b <48> 8b b8 c0 01 00 00 48 81 c7 28 01 00 00 e9 87 3b 91 00 0f 1f 80
[   51.902194] RSP: 0018:ffffc9000487bbf0 EFLAGS: 00010246
[   51.903019] RAX: 0000000000000000 RBX: 00000000f7a00000 RCX: 00000000c0000000
[   51.904088] RDX: 0000000000440073 RSI: ffffc9000487bc00 RDI: ffff8881fa71dcb8
[   51.905207] RBP: 00000000f7800000 R08: 00000000f7a00000 R09: 00000000f7a00000
[   51.906284] R10: ffff8881fb5b8040 R11: ffff8881fb5b89b0 R12: ffff8881fa71dcb8
[   51.907351] R13: ffffc9000487bd80 R14: ffffc9000487bc78 R15: 0000000000000001
[   51.908648] FS:  0000000000000000(0000) GS:ffff888277c00000(0063) knlGS:00000000f7c99700
[   51.910613] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[   51.911983] CR2: 00000000000001c0 CR3: 00000001fa412005 CR4: 0000000000370ef0
[   51.913417] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   51.914535] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   51.915602] Call Trace:
[   51.916069]  <TASK>
[   51.916480]  ? __die+0x1f/0x70
[   51.917057]  ? page_fault_oops+0x159/0x450
[   51.917737]  ? do_user_addr_fault+0x65/0x850
[   51.919360]  ? exc_page_fault+0x6d/0x1c0
[   51.920021]  ? asm_exc_page_fault+0x22/0x30
[   51.920712]  ? __hugetlb_zap_begin+0x76/0x90
[   51.921440]  unmap_vmas+0xb3/0x100
[   51.922057]  unmap_region.constprop.0+0xcc/0x140
[   51.922837]  ? lock_release+0x142/0x290
[   51.923474]  ? preempt_count_add+0x47/0xa0
[   51.924150]  mmap_region+0x565/0xab0
[   51.924809]  do_mmap+0x35a/0x520
[   51.925384]  vm_mmap_pgoff+0xdf/0x200
[   51.926008]  ksys_mmap_pgoff+0x18f/0x200
[   51.926834]  ? syscall_enter_from_user_mode_prepare+0x19/0x60
[   51.928006]  __do_fast_syscall_32+0x68/0x100
[   51.928962]  do_fast_syscall_32+0x2f/0x70
[   51.929896]  entry_SYSENTER_compat_after_hwframe+0x7b/0x8d

I think you previously built libhugetlbfs, so hopefully you can recreate.

The stack trace (and test) suggest hugetlbfs_file_mmap returns an error
due to misalignment, and then we unmap the vma just previously created.
Looks code is now calling hugetlb_zap_begin before unmap_single_vma.
The code/comment in unmap_single_vma mentions this special cleanup
case.  Looks like vma->vm_file is NULL and __hugetlb_zap_begin is doing
a i_mmap_lock_write(vma->vm_file->f_mapping).

        if (start != end) {
                if (unlikely(is_vm_hugetlb_page(vma))) {
                        /*
                         * It is undesirable to test vma->vm_file as it
                         * should be non-null for valid hugetlb area.
                         * However, vm_file will be NULL in the error
                         * cleanup path of mmap_region. When
                         * hugetlbfs ->mmap method fails,
                         * mmap_region() nullifies vma->vm_file
                         * before calling this function to clean up.
                         * Since no pte has actually been setup, it is
                         * safe to do nothing in this case.
                         */
                        if (vma->vm_file) {
                                zap_flags_t zap_flags = details ?
                                    details->zap_flags : 0;
                                __unmap_hugepage_range_final(tlb, vma, start, end,
                                                             NULL, zap_flags);
                        }

Looks like vma->vm_file is NULL and __hugetlb_zap_begin is trying to do
i_mmap_lock_write(vma->vm_file->f_mapping).

Guess I did look closely. :)
-- 
Mike Kravetz