linux-kernel - Re: read() data corruption with CONFIG_READ_ONLY_THP_FOR

Message-ID: <YhZFr+kXIJFgiMaf@casper.infradead.org>
Date:   Wed, 23 Feb 2022 14:33:19 +0000
From:   Matthew Wilcox <willy@...radead.org>
To:     Vlastimil Babka <vbabka@...e.cz>
Cc:     "stable@...r.kernel.org" <stable@...r.kernel.org>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Christoph Hellwig <hch@...radead.org>, Jan Kara <jack@...e.cz>,
        Takashi Iwai <tiwai@...e.de>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>, patches@...ts.linux.dev,
        LKML <linux-kernel@...r.kernel.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Subject: Re: read() data corruption with CONFIG_READ_ONLY_THP_FOR_FS=y

On Wed, Feb 23, 2022 at 02:54:43PM +0100, Vlastimil Babka wrote:
> we have found a bug involving CONFIG_READ_ONLY_THP_FOR_FS=y, introduced in
> 5.12 by cbd59c48ae2b ("mm/filemap: use head pages in
> generic_file_buffered_read")
> and apparently fixed in 5.17-rc1 by 6b24ca4a1a8d ("mm: Use multi-index
> entries in the page cache")
> The latter commit is part of folio rework so likely not stable material, so
> it would be nice to have a small fix for e.g. 5.15 LTS. Preferably from
> someone who understands xarray :)

[...]

> I've hacked some printk on top of 5.16 (attached debug.patch)
> which gives this output:
> 
> i=0 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
> i=1 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
> i=2 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=3 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=4 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=5 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=6 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=7 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=8 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=9 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=10 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=11 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=12 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=13 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=14 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> 
> It seems filemap_get_read_batch() should be returning pages ffffea0004340000
> and ffffea0004470000 consecutively in the pvec, but returns the first one 8
> times, so it's read twice and then the rest is just skipped over as it's
> beyond the requested read size.
> 
> I suspect these lines:
>   xas.xa_index = head->index + thp_nr_pages(head) - 1;
>   xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
> 
> commit 6b24ca4a1a8d changes those to xas_advance() (introduced one patch
> earlier), so some self-contained fix should be possible for prior kernels?
> But I don't understand xarray well enough.

I figured it out!

In v5.15 (indeed, everything before commit 6b24ca4a1a8d), an order-9
page is stored in 512 consecutive slots.  Each XArray node holds 64
entries, so those slots are spread over 8 leaf nodes.  What happens is
that we start looking at index 0, walk down to the bottom of the tree
and find the THP at index 0.  Then we do this:

                xas.xa_index = head->index + thp_nr_pages(head) - 1;
                xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;

So we've advanced xas.xa_index to 511, but advanced xas.xa_offset to 63.
Then we call xas_next() which calls __xas_next(), which moves us along to
array index 64 while we think we're looking at index 512.
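
To make the mismatch concrete, here is a small userspace sketch of the
arithmetic.  It is a toy model, not kernel code: struct toy_cursor and
every toy_* name are made up for illustration, and it assumes 64-entry
leaf nodes and a cursor that only remembers the first index of the leaf
it is standing in.

```c
#include <assert.h>

/* Assumption: 64 slots per node, as in the explanation above. */
#define TOY_CHUNK_SHIFT 6
#define TOY_CHUNK_SIZE  (1UL << TOY_CHUNK_SHIFT)
#define TOY_CHUNK_MASK  (TOY_CHUNK_SIZE - 1)

/* Hypothetical stand-in for the iterator state; not the real xa_state. */
struct toy_cursor {
	unsigned long index;     /* index the caller believes it is at */
	unsigned long node_base; /* first index covered by the current leaf */
	unsigned long offset;    /* slot within the current leaf */
};

/* Slot the cursor actually refers to. */
unsigned long toy_position(const struct toy_cursor *c)
{
	return c->node_base + c->offset;
}

/* Models the two open-coded assignments: the leaf is left unchanged. */
void toy_skip_buggy(struct toy_cursor *c, unsigned long last)
{
	c->index = last;
	c->offset = last & TOY_CHUNK_MASK;
}

/* Models a reset: the leaf is looked up again from the new index. */
void toy_skip_reset(struct toy_cursor *c, unsigned long last)
{
	c->index = last;
	c->node_base = last & ~TOY_CHUNK_MASK;
	c->offset = last & TOY_CHUNK_MASK;
}

/* Step forward one slot, moving to the next leaf when this one ends. */
void toy_next(struct toy_cursor *c)
{
	assert(c->offset < TOY_CHUNK_SIZE);
	c->index++;
	c->offset++;
	if (c->offset == TOY_CHUNK_SIZE) {
		c->node_base += TOY_CHUNK_SIZE;
		c->offset = 0;
	}
}
```

Starting from a cursor at index 0, toy_skip_buggy(&c, 511) followed by
toy_next(&c) leaves c.index at 512 while toy_position(&c) is 64.  With
toy_skip_reset() instead, both are 512.  In this model, repeating the
buggy skip lands on slots 64, 128, ... 448, all still inside the same
512-slot THP, which would be consistent with the first page showing up
8 times in the debug output.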

We could make __xas_next() more resistant to this kind of abuse (by
extracting the correct offset in the parent node from xa_index), but
as you say, we're looking for a small fix for LTS.  I suggest this
will probably do the right thing:

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2354,8 +2354,7 @@ static void filemap_get_read_batch(struct address_space *mapping,
                        break;
                if (PageReadahead(head))
                        break;
-               xas.xa_index = head->index + thp_nr_pages(head) - 1;
-               xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
+               xas_set(&xas, head->index + thp_nr_pages(head) - 1);
                continue;
 put_page:
                put_page(head);

but I'll start trying the reproducer now.