[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date: Sat, 10 Oct 2020 16:09:01 +0100
From: Matthew Wilcox <willy@...radead.org>
To: Hugh Dickins <hughd@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Song Liu <songliubraving@...com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Yang Shi <yang.shi@...ux.alibaba.com>,
Denis Lisov <dennis.lissov@...il.com>, Qian Cai <cai@....pw>,
Suren Baghdasaryan <surenb@...gle.com>,
David Rientjes <rientjes@...gle.com>,
Minchan Kim <minchan@...nel.org>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [PATCH] mm/khugepaged: fix filemap page_to_pgoff(page) != offset
On Fri, Oct 09, 2020 at 08:07:59PM -0700, Hugh Dickins wrote:
> There have been elusive reports of filemap_fault() hitting its
> VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
> with CONFIG_READ_ONLY_THP_FOR_FS=y.
>
> Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
> CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged
> without NUMA reuses the same huge page after collapse_file() failed
> (whereas NUMA targets its allocation to the respective node each time).
> And most of us were usually testing with CONFIG_NUMA=y kernels.
Good catch. There have been at least three bugs in recent times which
can cause this VM_BUG_ON_PAGE() to trigger. This one, one where swapping
out a THP led to all 512 entries pointing to the same non-huge page on
swapin (fixed in -mm) and one that I introduced for a few weeks in -mm
where failing to split a THP would lead to random tree corruption due
to a non-zeroed node being freed to the slab cache.
There may yet be a fourth. I've seen it occasionally in recent testing
so I'll add this patch and see if it disappears.
> Instead, non-NUMA khugepaged_prealloc_page() release the old page
> if anyone else has a reference to it (1% of cases when I tested).
I think this is a good way to fix the problem. We could also change
khugepaged to insert a frozen page, ensuring that find_get_entry()
would spin until the collapse has succeeded or the page was removed
from the cache again. But I have no problem with this approach.
I want to note that this is a silent data corruption for reads.
generic_file_buffered_read() has a reference to the page, so this
patch will fix it, but before it could be copying the wrong data
to userspace.
Reviewed-by: Matthew Wilcox (Oracle) <willy@...radead.org>
Powered by blists - more mailing lists