linux-kernel - Re: [PATCH] f2fs: avoid deadlock in gc thread under low memory

Message-ID: <ba76460f-4e22-634a-d46f-78e1fd4ac10e@tcl.com>
Date:   Thu, 14 Apr 2022 10:27:44 +0800
From:   Wu Yan <wu-yan@....com>
To:     Jaegeuk Kim <jaegeuk@...nel.org>
CC:     <linux-f2fs-devel@...ts.sourceforge.net>,
        <linux-kernel@...r.kernel.org>, <tang.ding@....com>
Subject: Re: [PATCH] f2fs: avoid deadlock in gc thread under low memory

On 4/14/22 10:18, Jaegeuk Kim wrote:
> On 04/14, Wu Yan wrote:
>> On 4/14/22 01:00, Jaegeuk Kim wrote:
>>> On 04/13, Rokudo Yan wrote:
>>>> There is a potential deadlock in the gc thread that may happen
>>>> under low memory, as below:
>>>>
>>>> gc_thread_func
>>>>    -f2fs_gc
>>>>     -do_garbage_collect
>>>>      -gc_data_segment
>>>>       -move_data_block
>>>>        -set_page_writeback(fio.encrypted_page);
>>>>        -f2fs_submit_page_write
>>>> As f2fs_submit_page_write tries to merge I/O when possible, the
>>>> encrypted_page is marked PG_writeback but may not be submitted to the
>>>> block layer immediately. If the system enters low memory when the gc
>>>> thread tries to move the next data block, it may do direct reclaim and
>>>> enter the fs layer as below:
>>>>      -move_data_block
>>>>       -f2fs_grab_cache_page(index=?, for_write=false)
>>>>        -grab_cache_page
>>>>         -find_or_create_page
>>>>          -pagecache_get_page
>>>>           -__page_cache_alloc --  __GFP_FS is set
>>>>            -alloc_pages_node
>>>>             -__alloc_pages
>>>>              -__alloc_pages_slowpath
>>>>               -__alloc_pages_direct_reclaim
>>>>                -__perform_reclaim
>>>>                 -try_to_free_pages
>>>>                  -do_try_to_free_pages
>>>>                   -shrink_zones
>>>>                    -mem_cgroup_soft_limit_reclaim
>>>>                     -mem_cgroup_soft_reclaim
>>>>                      -mem_cgroup_shrink_node
>>>>                       -shrink_node_memcg
>>>>                        -shrink_list
>>>>                         -shrink_inactive_list
>>>>                          -shrink_page_list
>>>>                           -wait_on_page_writeback -- the page is marked
>>>>                          writeback during previous move_data_block call
>>>>
>>>> The gc thread waits for the encrypted_page writeback to complete,
>>>> but as the gc thread holds sbi->gc_lock, the writeback & sync threads
>>>> may be blocked waiting for sbi->gc_lock, so the bio containing the
>>>> encrypted_page may never be submitted to the block layer to complete
>>>> the writeback, which causes a deadlock. To avoid this deadlock,
>>>> we mark the gc thread with the PF_MEMALLOC_NOFS flag, so that it never
>>>> enters the fs layer when allocating a cache page during move_data_block.
>>>>
>>>> Signed-off-by: Rokudo Yan <wu-yan@....com>
>>>> ---
>>>>    fs/f2fs/gc.c | 6 ++++++
>>>>    1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
>>>> index e020804f7b07..cc71f77b98c8 100644
>>>> --- a/fs/f2fs/gc.c
>>>> +++ b/fs/f2fs/gc.c
>>>> @@ -38,6 +38,12 @@ static int gc_thread_func(void *data)
>>>>    	wait_ms = gc_th->min_sleep_time;
>>>> +	/*
>>>> +	 * Make sure that no allocations from gc thread will ever
>>>> +	 * recurse to the fs layer to avoid deadlock as it will
>>>> +	 * hold sbi->gc_lock during garbage collection
>>>> +	 */
>>>> +	memalloc_nofs_save();
>>>
>>> I think this cannot cover all the f2fs_gc() call cases. Can we just avoid it by:
>>>
>>> --- a/fs/f2fs/gc.c
>>> +++ b/fs/f2fs/gc.c
>>> @@ -1233,7 +1233,7 @@ static int move_data_block(struct inode *inode, block_t bidx,
>>>                                   CURSEG_ALL_DATA_ATGC : CURSEG_COLD_DATA;
>>>
>>>           /* do not read out */
>>> -       page = f2fs_grab_cache_page(inode->i_mapping, bidx, false);
>>> +       page = f2fs_grab_cache_page(inode->i_mapping, bidx, true);
>>>           if (!page)
>>>                   return -ENOMEM;
>>>
>>> Thanks,
>>>
>>>>    	set_freezable();
>>>>    	do {
>>>>    		bool sync_mode, foreground = false;
>>>> -- 
>>>> 2.25.1
>>
>> Hi, Jaegeuk
>>
>> I'm not sure if any other case may trigger the issue, but the stack traces I
>> have caught so far are all the same as below:
>>
>> f2fs_gc-253:12  D 226966.808196 572 302561 150976 0x1200840 0x0 572 237207473347056
>> <ffffff889d88668c> __switch_to+0x134/0x150
>> <ffffff889e764b6c> __schedule+0xd5c/0x1100
>> <ffffff889e76554c> io_schedule+0x90/0xc0
>> <ffffff889d9fb880> wait_on_page_bit+0x194/0x208
>> <ffffff889da167b4> shrink_page_list+0x62c/0xe74
>> <ffffff889da1d354> shrink_inactive_list+0x2c0/0x698
>> <ffffff889da181f4> shrink_node_memcg+0x3dc/0x97c
>> <ffffff889da17d44> mem_cgroup_shrink_node+0x144/0x218
>> <ffffff889da6610c> mem_cgroup_soft_limit_reclaim+0x188/0x47c
>> <ffffff889da17a40> do_try_to_free_pages+0x204/0x3a0
>> <ffffff889da176c8> try_to_free_pages+0x35c/0x4d0
>> <ffffff889da05d60> __alloc_pages_nodemask+0x7a4/0x10d0
>> <ffffff889d9fc82c> pagecache_get_page+0x184/0x2ec
> 
> Is this deadlock trying to grab a lock, instead of waiting for writeback?
> Could you share all the backtraces of the tasks?
> 
> For writeback above, looking at the code, f2fs_gc uses three mappings, meta,
> node, and data, and meta/node inodes are masking GFP_NOFS in f2fs_iget(),
> while data inode does not. So, the above f2fs_grab_cache_page() in
> move_data_block() is actually calling w/o NOFS.
> 
>> <ffffff889dbf8860> do_garbage_collect+0xfe0/0x2828
>> <ffffff889dbf7434> f2fs_gc+0x4a0/0x8ec
>> <ffffff889dbf6bf4> gc_thread_func+0x240/0x4d4
>> <ffffff889d8de9b0> kthread+0x17c/0x18c
>> <ffffff889d88567c> ret_from_fork+0x10/0x18
>>
>> Thanks
>> yanwu

Hi, Jaegeuk

The gc thread is blocked in wait_on_page_writeback() (on the encrypted page
submitted earlier) when it tries to grab a data inode page. The parsed stack
trace is below:

ppid=572 pid=572 D cpu=1 prio=120 wait=378s f2fs_gc-253:12
    Native callstack:
	vmlinux  wait_on_page_bit_common(page=0xFFFFFFBF7D2CD700, state=2, lock=false) + 304  <mm/filemap.c:1035>
	vmlinux  wait_on_page_bit(page=0xFFFFFFBF7D2CD700, bit_nr=15) + 400  <mm/filemap.c:1074>
	vmlinux  wait_on_page_writeback(page=0xFFFFFFBF7D2CD700) + 36  <include/linux/pagemap.h:557>
	vmlinux  shrink_page_list(page_list=0xFFFFFF8011E83418, pgdat=contig_page_data, sc=0xFFFFFF8011E835B8, ttu_flags=0, stat=0xFFFFFF8011E833F0, force_reclaim=false) + 1576  <mm/vmscan.c:1171>
	vmlinux  shrink_inactive_list(lruvec=0xFFFFFFE003C304C0, sc=0xFFFFFF8011E835B8, lru=LRU_INACTIVE_FILE) + 700  <mm/vmscan.c:1966>
	vmlinux  shrink_list(lru=LRU_INACTIVE_FILE, lruvec=0xFFFFFF8011E834B8, sc=0xFFFFFF8011E835B8) + 128  <mm/vmscan.c:2350>
	vmlinux  shrink_node_memcg(pgdat=contig_page_data, memcg=0xFFFFFFE003C1A300, sc=0xFFFFFF8011E835B8, lru_pages=0xFFFFFF8011E835B0) + 984  <mm/vmscan.c:2726>
	vmlinux  mem_cgroup_shrink_node(memcg=0xFFFFFFE003C1A300, gfp_mask=21102794, noswap=false, pgdat=contig_page_data, nr_scanned=0xFFFFFF8011E836A0) + 320  <mm/vmscan.c:3416>
	vmlinux  mem_cgroup_soft_reclaim(root_memcg=0xFFFFFFE003C1A300, pgdat=contig_page_data) + 164  <mm/memcontrol.c:1643>
	vmlinux  mem_cgroup_soft_limit_reclaim(pgdat=contig_page_data, order=0, gfp_mask=21102794, total_scanned=0xFFFFFF8011E83720) + 388  <mm/memcontrol.c:2913>
	vmlinux  shrink_zones(zonelist=contig_page_data + 14784, sc=0xFFFFFF8011E83790) + 352  <mm/vmscan.c:3094>
	vmlinux  do_try_to_free_pages(zonelist=contig_page_data + 14784, sc=0xFFFFFF8011E83790) + 512  <mm/vmscan.c:3164>
	vmlinux  try_to_free_pages(zonelist=contig_page_data + 14784, order=0, gfp_mask=21102794, nodemask=0) + 856  <mm/vmscan.c:3370>
	vmlinux  __perform_reclaim(gfp_mask=300431548, order=0, ac=0xFFFFFF8011E83900) + 60  <mm/page_alloc.c:3831>
	vmlinux  __alloc_pages_direct_reclaim(gfp_mask=300431548, order=0, alloc_flags=300431604, ac=0xFFFFFF8011E83900) + 60  <mm/page_alloc.c:3853>
	vmlinux  __alloc_pages_slowpath(gfp_mask=300431548, order=0, ac=0xFFFFFF8011E83900) + 1244  <mm/page_alloc.c:4240>
	vmlinux  __alloc_pages_nodemask() + 1952  <mm/page_alloc.c:4463>
	vmlinux  __alloc_pages(gfp_mask=21102794, order=0, preferred_nid=0) + 16  <include/linux/gfp.h:515>
	vmlinux  __alloc_pages_node(nid=0, gfp_mask=21102794, order=0) + 16  <include/linux/gfp.h:528>
	vmlinux  alloc_pages_node(nid=0, gfp_mask=21102794, order=0) + 16  <include/linux/gfp.h:542>
	vmlinux  __page_cache_alloc(gfp=21102794) + 16  <include/linux/pagemap.h:226>
	vmlinux  pagecache_get_page() + 384  <mm/filemap.c:1520>
	vmlinux  find_or_create_page(offset=209) + 112  <include/linux/pagemap.h:333>
	vmlinux  grab_cache_page(index=209) + 112  <include/linux/pagemap.h:399>
	vmlinux  f2fs_grab_cache_page(index=209, for_write=false) + 112  <fs/f2fs/f2fs.h:2429>
	vmlinux  move_data_block(inode=0xFFFFFFDFD578EEA0, gc_type=300432152, segno=21904, off=145) + 3584  <fs/f2fs/gc.c:1119>
	vmlinux  gc_data_segment(sbi=0xFFFFFFE007C03000, sum=0xFFFFFF8011E83B10, gc_list=0xFFFFFF8011E83AB8, segno=21904, gc_type=300432152) + 3644  <fs/f2fs/gc.c:1475>
	vmlinux  do_garbage_collect(sbi=0xFFFFFFE007C03000, start_segno=21904, gc_list=0xFFFFFF8011E83CF0, gc_type=0) + 4060  <fs/f2fs/gc.c:1592>
	vmlinux  f2fs_gc(sbi=0xFFFFFFE007C03000, background=true, segno=4294967295) + 1180  <fs/f2fs/gc.c:1684>
	vmlinux  gc_thread_func(data=0xFFFFFFE007C03000) + 572  <fs/f2fs/gc.c:118>
	vmlinux  kthread() + 376  <kernel/kthread.c:232>
	vmlinux  ret_from_fork() +
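
For reference, the gfp_mask=21102794 seen in the page cache allocation
frames above can be decoded to check whether reclaim is allowed to enter
the fs layer. This is only a userspace sketch, not kernel code. The bit
values (0x40 for the IO bit, 0x80 for the FS bit) are an assumption taken
from a 4.14-era include/linux/gfp.h and may differ on other kernels. The
helper names may_enter_fs() and apply_nofs_scope() are hypothetical, and
apply_nofs_scope() only models the effect a NOFS scope is expected to
have on the mask.

```c
/* Assumed bit values (4.14-era include/linux/gfp.h); other kernels may differ. */
#define SKETCH_GFP_IO 0x40u
#define SKETCH_GFP_FS 0x80u

/* Hypothetical helper: returns 1 if reclaim with this mask may re-enter the fs layer. */
static int may_enter_fs(unsigned int gfp_mask)
{
	return (gfp_mask & SKETCH_GFP_FS) != 0;
}

/*
 * Hypothetical model of a NOFS allocation scope: the FS bit is
 * stripped from the mask, all other bits are left untouched.
 */
static unsigned int apply_nofs_scope(unsigned int gfp_mask)
{
	return gfp_mask & ~SKETCH_GFP_FS;
}
```

Under these assumptions, 21102794 is 0x14200ca, so the FS bit is set and
may_enter_fs() returns 1 for it; apply_nofs_scope() turns it into
0x142004a, for which may_enter_fs() returns 0.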

Thanks
yanwu