Message-ID: <731904cf-d862-4c0e-ae5b-26444faff253@linux.alibaba.com>
Date: Mon, 24 Feb 2025 11:21:09 +0800
From: Baolin Wang <baolin.wang@...ux.alibaba.com>
To: Kairui Song <ryncsn@...il.com>, "Alex Xu (Hello71)" <alex_y_xu@...oo.ca>
Cc: Lance Yang <ioworker0@...il.com>, linux-mm@...ck.org,
Daniel Gomez <da.gomez@...sung.com>, Barry Song <baohua@...nel.org>,
David Hildenbrand <david@...hat.com>, Hugh Dickins <hughd@...gle.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>,
Matthew Wilcox <willy@...radead.org>, Ryan Roberts <ryan.roberts@....com>,
linux-kernel@...r.kernel.org, Andrew Morton <akpm@...ux-foundation.org>,
"ziy@...dia.com" <ziy@...dia.com>
Subject: Re: Hang when swapping huge=within_size tmpfs from zram
Hi Kairui,
On 2025/2/24 02:22, Kairui Song wrote:
> On Mon, Feb 24, 2025 at 1:53 AM Kairui Song <ryncsn@...il.com> wrote:
>>
>> On Fri, Feb 7, 2025 at 3:24 PM Baolin Wang
>> <baolin.wang@...ux.alibaba.com> wrote:
>>>
>>> On 2025/2/5 22:39, Lance Yang wrote:
>>>> On Wed, Feb 5, 2025 at 2:38 PM Baolin Wang
>>>> <baolin.wang@...ux.alibaba.com> wrote:
>>>>> On 2025/2/5 09:55, Baolin Wang wrote:
>>>>>> Hi Alex,
>>>>>>
>>>>>> On 2025/2/5 09:23, Alex Xu (Hello71) wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> On 6.14-rc1, I found that creating a lot of files in tmpfs then deleting
>>>>>>> them reliably hangs when tmpfs is mounted with huge=within_size, and it
>>>>>>> is swapped out to zram (zstd/zsmalloc/no backing dev). I bisected this
>>>>>>> to acd7ccb284b "mm: shmem: add large folio support for tmpfs".
>>>>>>>
>>>>>>> When the issue occurs, rm uses 100% CPU, cannot be killed, and has no
>>>>>>> output in /proc/pid/stack or wchan. Eventually, an RCU stall is
>>>>>>> detected:
>>>>>>
>>>>>> Thanks for your report. Let me try to reproduce the issue locally and
>>>>>> see what happens.
>>>>>>
>>>>>>> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>>>>>> rcu: Tasks blocked on level-0 rcu_node (CPUs 0-11): P25160
>>>>>>> rcu: (detected by 10, t=2102 jiffies, g=532677, q=4997 ncpus=12)
>>>>>>> task:rm state:R running task stack:0 pid:25160
>>>>>>> tgid:25160 ppid:24309 task_flags:0x400000 flags:0x00004004
>>>>>>> Call Trace:
>>>>>>> <TASK>
>>>>>>> ? __schedule+0x388/0x1000
>>>>>>> ? kmem_cache_free.part.0+0x23d/0x280
>>>>>>> ? sysvec_apic_timer_interrupt+0xa/0x80
>>>>>>> ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>>>>>>> ? xas_load+0x12/0xc0
>>>>>>> ? xas_load+0x8/0xc0
>>>>>>> ? xas_find+0x144/0x190
>>>>>>> ? find_lock_entries+0x75/0x260
>>>>>>> ? shmem_undo_range+0xe6/0x5f0
>>>>>>> ? shmem_evict_inode+0xe4/0x230
>>>>>>> ? mtree_erase+0x7e/0xe0
>>>>>>> ? inode_set_ctime_current+0x2e/0x1f0
>>>>>>> ? evict+0xe9/0x260
>>>>>>> ? _atomic_dec_and_lock+0x31/0x50
>>>>>>> ? do_unlinkat+0x270/0x2b0
>>>>>>> ? __x64_sys_unlinkat+0x30/0x50
>>>>>>> ? do_syscall_64+0x37/0xe0
>>>>>>> ? entry_SYSCALL_64_after_hwframe+0x50/0x58
>>>>>>> </TASK>
>>>>>>>
>>>>>>> Let me know what information is needed to further troubleshoot this
>>>>>>> issue.
>>>>>
>>>>> Sorry, I can't reproduce this issue, and my testing process is as follows:
>>>>> 1. Mount tmpfs with huge=within_size
>>>>> 2. Create and write a tmpfs file
>>>>> 3. Swap out the large folios of the tmpfs file to zram
>>>>> 4. Execute 'rm' command to remove the tmpfs file
>>>>
>>>> I’m unable to reproduce the issue either, following steps similar
>>>> to Baolin's process:
>>>>
>>>> 1) Mount tmpfs with the huge=within_size option and enable swap (using
>>>> zstd/zsmalloc without a backing device).
>>>> 2) Create and write over 10,000 files in the tmpfs.
>>>> 3) Swap out the large folios of these tmpfs files to zram.
>>>> 4) Use the rm command to delete all the files from the tmpfs.
>>>>
>>>> I tested with both 2MiB and 64KiB large folio sizes, and with
>>>> shmem_enabled=within_size, but everything works as expected.
>>>
>>> Thanks Lance for confirming again.
>>>
>>> Alex, could you give more hints on how to reproduce this issue?
>>>
>>
>> Hi Baolin,
>>
>> I can reproduce this issue very easily with a Linux kernel build
>> test, and the failure rate is very high. I'm not sure it is exactly
>> the same bug, but it is very likely. My test steps:
>>
>> 1. Create a 10G ZRAM device and set up SWAP on it.
>> 2. Create a 1G memcg, and spawn a shell in it.
>> 3. Mount tmpfs with huge=within_size, and then untar linux kernel
>> source code into it.
>> 4. Build with make -j32 (a higher or lower job count may also work);
>> the build always fails within 10s due to corrupted files.
Much appreciated for your reproducer; I can now reproduce the issue
locally.
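
For reference, the steps above translate roughly into the following
shell sketch (device names, mount point, cgroup path, and the tarball
name are placeholders; adjust for your environment):

  # 10G zram device used as swap
  modprobe zram
  echo 10G > /sys/block/zram0/disksize
  mkswap /dev/zram0 && swapon /dev/zram0

  # 1G memcg (cgroup v2); run the build shell inside it
  mkdir /sys/fs/cgroup/test
  echo 1G > /sys/fs/cgroup/test/memory.max
  echo $$ > /sys/fs/cgroup/test/cgroup.procs

  # tmpfs with huge=within_size; untar and build a kernel tree in it
  mount -t tmpfs -o huge=within_size tmpfs /mnt/test
  tar -xf linux.tar.xz -C /mnt/test
  cd /mnt/test/linux && make defconfig && make -j32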
>> After some debugging, the reason is in shmem_swapin_folio(): when the
>> swap cache is hit, `folio = swap_cache_get_folio(swap, NULL, 0);` sets
>> folio to an order-0 folio, and the following shmem_add_to_page_cache()
>> then inserts an order-0 folio, overriding a high-order entry in shmem's
>> xarray, so data is lost. A swap cache hit can happen for many reasons;
>> in this case it is caused by readahead.
Yes, thanks for your analysis. I missed that swap readahead can swap
in order-0 folios asynchronously.
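
To make the failure mode concrete, below is a toy userspace sketch of
the override (purely illustrative; it is not the kernel code, and the
order-4 size is just an example):

#include <stdio.h>

#define NR_PAGES 16	/* pages covered by one order-4 swap entry */

int main(void)
{
	const char *mapping[NR_PAGES];
	int i;

	/* One large swap entry conceptually covers all 16 indices. */
	for (i = 0; i < NR_PAGES; i++)
		mapping[i] = "order-4 swap entry";

	/*
	 * Readahead swapped index 3 back in as an order-0 folio; storing
	 * it without first splitting the large entry replaces the single
	 * covering entry wholesale, losing the other 15 pages' state.
	 */
	for (i = 0; i < NR_PAGES; i++)
		mapping[i] = (i == 3) ? "order-0 folio" : "LOST";

	for (i = 0; i < NR_PAGES; i++)
		printf("index %2d: %s\n", i, mapping[i]);
	return 0;
}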
>>
>> One quick fix is to always split the large entry upon a shmem fault
>> on an order-0 folio, like this:
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 4ea6109a8043..c8e5c419c675 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2341,6 +2341,10 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>  		}
>>  	}
>>
>> +	/* Swapin of an order-0 folio must always ensure the large entry is split */
>> +	if (!folio_order(folio))
>> +		shmem_split_large_entry(inode, index, swap, gfp);
>> +
>>  alloced:
>>  	/* We have to do this with folio locked to prevent races */
>>  	folio_lock(folio);
I don't think we should always split the large entry when getting the
folio from the swap cache. Instead, we should split only when the order
stored in the shmem mapping is inconsistent with the folio's order, and
update the swap value accordingly.

Could you try the fix below? I tested it, and it works well with your
reproducer. Thanks a lot.
diff --git a/mm/shmem.c b/mm/shmem.c
index 671f63063fd4..7e70081a96d4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2253,7 +2253,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct folio *folio = NULL;
 	bool skip_swapcache = false;
 	swp_entry_t swap;
-	int error, nr_pages;
+	int error, nr_pages, order, split_order;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
 	swap = radix_to_swp_entry(*foliop);
@@ -2272,10 +2272,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
+	order = xa_get_order(&mapping->i_pages, index);
 	if (!folio) {
-		int order = xa_get_order(&mapping->i_pages, index);
 		bool fallback_order0 = false;
-		int split_order;
 
 		/* Or update major stats only when swapin succeeds?? */
 		if (fault_type) {
@@ -2339,6 +2338,29 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			error = -ENOMEM;
 			goto failed;
 		}
+	} else if (order != folio_order(folio)) {
+		/*
+		 * Swap readahead may swap in order-0 folios into the swapcache
+		 * asynchronously, while the shmem mapping can still store
+		 * large swap entries. In such cases, we should split the
+		 * large swap entry to prevent possible data loss.
+		 */
+		split_order = shmem_split_large_entry(inode, index, swap, gfp);
+		if (split_order < 0) {
+			error = split_order;
+			goto failed;
+		}
+
+		/*
+		 * If the large swap entry has already been split, it is
+		 * necessary to recalculate the new swap entry based on
+		 * the old order alignment.
+		 */
+		if (split_order > 0) {
+			pgoff_t offset = index - round_down(index, 1 << split_order);
+
+			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+		}
 	}
 
 alloced:
@@ -2346,7 +2368,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
 	    folio->swap.val != swap.val ||
-	    !shmem_confirm_swap(mapping, index, swap)) {
+	    !shmem_confirm_swap(mapping, index, swap) ||
+	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
 		error = -EEXIST;
 		goto unlock;
 	}
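
As a worked example of the swap-entry recalculation above, here is a
standalone userspace sketch (the index, order, and swap offset are
made-up values, and round_down() is re-declared locally):

#include <stdio.h>

/* round_down() as used above: clear the low bits (y must be a power of 2) */
#define round_down(x, y) ((x) & ~((y) - 1UL))

int main(void)
{
	unsigned long index = 0x213;	/* faulting page index (made up) */
	int split_order = 4;		/* the old large entry covered 16 pages */
	unsigned long swp_off = 0x1000;	/* swap offset of the old entry (made up) */

	/* 0x213 - round_down(0x213, 16) = 0x213 - 0x210 = 3 */
	unsigned long offset = index - round_down(index, 1UL << split_order);

	printf("offset within old entry = %lu\n", offset);
	printf("new swap offset = 0x%lx\n", swp_off + offset);
	return 0;
}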
>> And Hi Alex, can you help confirm whether the above patch fixes the
>> bug you reported?
>>
>> If we are OK with this, it should be merged into 6.14, I think. But
>> for the long term, it might be a good idea to share similar logic with
>> (or just reuse) __filemap_add_folio for shmem; __filemap_add_folio
>> splits the entry on insert, and the code would be much cleaner.
>
> Some extra comments on the above patch: if it races with another
> split, or the entry used for the swap cache lookup is wrongly aligned
> due to a large entry, the shmem_add_to_page_cache() below will fail
> with -EEXIST and retry. So it seems to be working well in my test.