linux-kernel - Re: [PATCH] mm/khugepaged: fix filemap page_to

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20201010150901.GX20115@casper.infradead.org>
Date:   Sat, 10 Oct 2020 16:09:01 +0100
From:   Matthew Wilcox <willy@...radead.org>
To:     Hugh Dickins <hughd@...gle.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Song Liu <songliubraving@...com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Denis Lisov <dennis.lissov@...il.com>, Qian Cai <cai@....pw>,
        Suren Baghdasaryan <surenb@...gle.com>,
        David Rientjes <rientjes@...gle.com>,
        Minchan Kim <minchan@...nel.org>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH] mm/khugepaged: fix filemap page_to_pgoff(page) != offset

On Fri, Oct 09, 2020 at 08:07:59PM -0700, Hugh Dickins wrote:
> There have been elusive reports of filemap_fault() hitting its
> VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
> with CONFIG_READ_ONLY_THP_FOR_FS=y.
> 
> Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
> CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged
> without NUMA reuses the same huge page after collapse_file() failed
> (whereas NUMA targets its allocation to the respective node each time).
> And most of us were usually testing with CONFIG_NUMA=y kernels.

Good catch.  There have been at least three bugs in recent times which
can cause this VM_BUG_ON_PAGE() to trigger.  This one, one where swapping
out a THP led to all 512 entries pointing to the same non-huge page on
swapin (fixed in -mm) and one that I introduced for a few weeks in -mm
where failing to split a THP would lead to random tree corruption due
to a non-zeroed node being freed to the slab cache.

There may yet be a fourth.  I've seen it occasionally in recent testing
so I'll add this patch and see if it disappears.

> Instead, non-NUMA khugepaged_prealloc_page() release the old page
> if anyone else has a reference to it (1% of cases when I tested).

I think this is a good way to fix the problem.  We could also change
khugepaged to insert a frozen page, ensuring that find_get_entry()
would spin until the collapse has succeeded or the page was removed
from the cache again.  But I have no problem with this approach.

I want to note that this is a silent data corruption for reads.
generic_file_buffered_read() has a reference to the page, so this
patch will fix it, but before it could be copying the wrong data
to userspace.

Reviewed-by: Matthew Wilcox (Oracle) <willy@...radead.org>