linux-kernel - Re: [PATCH v5 4/9] mm: filemap: use xa_get

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <d3dc75e2-40a7-8439-734c-19d83707164c@google.com>
Date: Sun, 25 Aug 2024 14:55:41 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: Baolin Wang <baolin.wang@...ux.alibaba.com>
cc: akpm@...ux-foundation.org, hughd@...gle.com, willy@...radead.org, 
    david@...hat.com, wangkefeng.wang@...wei.com, chrisl@...nel.org, 
    ying.huang@...el.com, 21cnbao@...il.com, ryan.roberts@....com, 
    shy828301@...il.com, ziy@...dia.com, ioworker0@...il.com, 
    da.gomez@...sung.com, p.raghav@...sung.com, linux-mm@...ck.org, 
    linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 4/9] mm: filemap: use xa_get_order() to get the swap
 entry order

On Mon, 12 Aug 2024, Baolin Wang wrote:

> In the following patches, shmem will support the swap out of large folios,
> which means the shmem mappings may contain large order swap entries, so
> using xa_get_order() to get the folio order of the shmem swap entry to
> update the '*start' correctly.
> 
> Signed-off-by: Baolin Wang <baolin.wang@...ux.alibaba.com>
> ---
>  mm/filemap.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4130be74f6fd..4c312aab8b1f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2056,6 +2056,8 @@ unsigned find_get_entries(struct address_space *mapping, pgoff_t *start,
>  		folio = fbatch->folios[idx];
>  		if (!xa_is_value(folio))
>  			nr = folio_nr_pages(folio);
> +		else
> +			nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
>  		*start = indices[idx] + nr;
>  	}
>  	return folio_batch_count(fbatch);
> @@ -2120,6 +2122,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t *start,
>  		folio = fbatch->folios[idx];
>  		if (!xa_is_value(folio))
>  			nr = folio_nr_pages(folio);
> +		else
> +			nr = 1 << xa_get_order(&mapping->i_pages, indices[idx]);
>  		*start = indices[idx] + nr;
>  	}
>  	return folio_batch_count(fbatch);
> -- 

Here we have a problem, but I'm not suggesting a fix for it yet: I
need to get other fixes out first, then turn to thinking about this -
it's not easy.

That code is almost always right, so it works well enough for most
people not to have noticed, but there are two issues with it.

The first issue is that it's assuming indices[idx] is already aligned
to nr: not necessarily so.  I believe it was already wrong in the
folio_nr_pages() case, but the time I caught it wrong with a printk
was in the swap (value) case.  (I may not be stating this correctly:
again more thought needed but I can't spend so long writing.)

And immediately afterwards that kernel build failed with a corrupted
(all 0s) .o file - I'm building on ext4 on /dev/loop0 on huge tmpfs while
swapping, and happen to be using the "-o discard" option to ext4 mount.

I've been chasing these failures (maybe a few minutes in, maybe half an
hour) for days, then had the idea of trying without the "-o discard":
whereupon these builds can be repeated successfully for many hours.
IIRC ext4 discard to /dev/loop0 entails hole-punch to the tmpfs.

The alignment issue can easily be corrected, but that might not help.
(And those two functions would look better with the rcu_read_unlock()
moved down to just before returning: but again, wouldn't really help.)

The second issue is that swap is more slippery to work with than
folios or pages: in the folio_nr_pages() case, that number is stable
because we hold a refcount (which stops a THP from being split), and
later we'll be taking folio lock too.  None of that in the swap case,
so (depending on how a large entry gets split) the xa_get_order() result
is not reliable. Likewise for other uses of xa_get_order() in this series.

There needs to be some kind of locking or retry to make the order usable,
and to avoid shmem_free_swap() occasionally freeing more than it ought.
I'll give it thought after.

Hugh