Date:	Mon, 18 Jul 2011 12:22:26 +1000
From:	Dave Chinner <david@...morbit.com>
To:	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Cc:	Christoph Hellwig <hch@...radead.org>,
	Mel Gorman <mgorman@...e.de>, Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>, XFS <xfs@....sgi.com>,
	Johannes Weiner <jweiner@...hat.com>,
	Wu Fengguang <fengguang.wu@...el.com>, Jan Kara <jack@...e.cz>,
	Rik van Riel <riel@...hat.com>,
	Minchan Kim <minchan.kim@...il.com>
Subject: Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in
 direct reclaim

On Fri, Jul 15, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 14 Jul 2011 00:46:43 -0400
> > Christoph Hellwig <hch@...radead.org> wrote:
> > 
> > > On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > +			/*
> > > > > +			 * Only kswapd can writeback filesystem pages to
> > > > > +			 * avoid risk of stack overflow
> > > > > +			 */
> > > > > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > > > +				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > > > +				goto keep_locked;
> > > > > +			}
> > > > > +
> > > > 
> > > > 
> > > > This will cause tons of memcg OOM kills because we get no help from kswapd (for now).
> > > 
> > > XFS and btrfs already disable writeback from memcg context, as does ext4
> > > for the typical non-overwrite workloads, and none has fallen apart.
> > > 
> > > In fact there's no way we can enable them as the memcg calling contexts
> > > tend to have massive stack usage.
> > > 
> > 
> > Hmm, do XFS/btrfs add pages to the radix-tree from a deep stack?
> 
> Here's an example writeback stack trace. Notice how deep it is from
> the __writepage() call?
....
> 
> So from ->writepage, there is about 3.5k of stack usage here.  2.5k
> of that is in XFS, and the worst I've seen is around 4k before
> getting to the IO subsystem, which in the worst case I've seen
> consumed 2.5k of stack. IOWs, I've seen stack usage from .writepage
> down to IO take over 6k of stack space on x86_64....
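
For reference, the filesystem-side check Christoph is referring to
looks roughly like the sketch below. This is just the pattern, not the
actual XFS/btrfs code, and do_real_writepage() is a made-up
placeholder for the normal writeback path:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <linux/writeback.h>

/*
 * Sketch only: refuse to write back from direct (and memcg) reclaim
 * while still allowing kswapd. PF_MEMALLOC is set for every reclaim
 * context, PF_KSWAPD only for kswapd, so "PF_MEMALLOC set, PF_KSWAPD
 * clear" identifies direct reclaim. The page is redirtied and left
 * for the flusher threads or kswapd, which run on shallow stacks.
 */
static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	if ((current->flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	return do_real_writepage(page, wbc);	/* hypothetical helper */
}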

BTW, here's a stack trace that shows swap IO being issued from direct reclaim:

dave@...t-4:~$ cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (46 entries)
        -----    ----   --------
  0)     5080      40   zone_statistics+0xad/0xc0
  1)     5040     272   get_page_from_freelist+0x2ad/0x7e0
  2)     4768     288   __alloc_pages_nodemask+0x133/0x7b0
  3)     4480      48   kmem_getpages+0x62/0x160
  4)     4432     112   cache_grow+0x2d1/0x300
  5)     4320      80   cache_alloc_refill+0x219/0x260
  6)     4240      64   kmem_cache_alloc+0x182/0x190
  7)     4176      16   mempool_alloc_slab+0x15/0x20
  8)     4160     144   mempool_alloc+0x63/0x140
  9)     4016      16   scsi_sg_alloc+0x4c/0x60
 10)     4000     112   __sg_alloc_table+0x66/0x140
 11)     3888      32   scsi_init_sgtable+0x33/0x90
 12)     3856      48   scsi_init_io+0x31/0xc0
 13)     3808      32   scsi_setup_fs_cmnd+0x79/0xe0
 14)     3776     112   sd_prep_fn+0x150/0xa90
 15)     3664      64   blk_peek_request+0xc7/0x230
 16)     3600      96   scsi_request_fn+0x68/0x500
 17)     3504      16   __blk_run_queue+0x1b/0x20
 18)     3488      96   __make_request+0x2cb/0x310
 19)     3392     192   generic_make_request+0x26d/0x500
 20)     3200      96   submit_bio+0x64/0xe0
 21)     3104      48   swap_writepage+0x83/0xd0
 22)     3056     112   pageout+0x122/0x2f0
 23)     2944     192   shrink_page_list+0x458/0x5f0
 24)     2752     192   shrink_inactive_list+0x1ec/0x410
 25)     2560     224   shrink_zone+0x468/0x500
 26)     2336     144   do_try_to_free_pages+0x2b7/0x3f0
 27)     2192     176   try_to_free_pages+0xa4/0x120
 28)     2016     288   __alloc_pages_nodemask+0x43f/0x7b0
 29)     1728      48   kmem_getpages+0x62/0x160
 30)     1680     128   fallback_alloc+0x192/0x240
 31)     1552      96   ____cache_alloc_node+0x9a/0x170
 32)     1456      16   __kmalloc+0x17d/0x200
 33)     1440     128   kmem_alloc+0x77/0xf0
 34)     1312     128   xfs_log_commit_cil+0x95/0x3d0
 35)     1184      96   _xfs_trans_commit+0x1e9/0x2a0
 36)     1088     208   xfs_create+0x57a/0x640
 37)      880      96   xfs_vn_mknod+0xa1/0x1b0
 38)      784      16   xfs_vn_create+0x10/0x20
 39)      768      64   vfs_create+0xb1/0xe0
 40)      704      96   do_last+0x5f5/0x770
 41)      608     144   path_openat+0xd5/0x400
 42)      464     224   do_filp_open+0x49/0xa0
 43)      240      96   do_sys_open+0x107/0x1e0
 44)      144      16   sys_open+0x20/0x30
 45)      128     128   system_call_fastpath+0x16/0x1b
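
(For anyone wanting to reproduce this: the trace comes from the ftrace
stack tracer, read back through the files cat'd above. A minimal
userspace sketch of capturing it, assuming CONFIG_STACK_TRACER=y,
debugfs mounted at /sys/kernel/debug, and root privileges:)

#include <stdio.h>
#include <stdlib.h>

/* Dump a text file to stdout, bailing out on error. */
static void dump_file(const char *path)
{
	char buf[512];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		exit(1);
	}
	while (fgets(buf, sizeof(buf), f))
		fputs(buf, stdout);
	fclose(f);
}

int main(void)
{
	/* Enable the stack tracer; it records the deepest stack seen. */
	FILE *f = fopen("/proc/sys/kernel/stack_tracer_enabled", "w");

	if (!f) {
		perror("stack_tracer_enabled");
		return 1;
	}
	fputs("1\n", f);
	fclose(f);

	printf("max stack depth (bytes):\n");
	dump_file("/sys/kernel/debug/tracing/stack_max_size");
	printf("\n");
	dump_file("/sys/kernel/debug/tracing/stack_trace");
	return 0;
}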


That's pretty damn bad. From kmem_alloc to the top of the stack is
more than 3.5k (5080 - 1440 = 3640 bytes in the trace above) through
the direct reclaim swap IO path. That, to me, kind of indicates that
even doing swap IO on dirty anonymous pages from direct reclaim risks
overflowing the 8k stack on x86_64....
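
(To redo that arithmetic mechanically rather than by eyeballing the
Depth column: a rough, illustrative-only helper that subtracts the
Depth at a named symbol from the Depth at the top of the trace. Fed
the trace above with "kmem_alloc" as the argument it prints 3640.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	char line[256], loc[128];
	long depth, size, top = -1;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <symbol> < stack_trace\n", argv[0]);
		return 1;
	}

	while (fgets(line, sizeof(line), stdin)) {
		/* Lines look like: " 33)   1440   128   kmem_alloc+0x77/0xf0" */
		if (sscanf(line, " %*d) %ld %ld %127s", &depth, &size, loc) != 3)
			continue;
		if (top < 0)
			top = depth;	/* frame 0 carries the total depth */
		if (!strncmp(loc, argv[1], strlen(argv[1]))) {
			printf("%s..top of stack: %ld bytes\n", argv[1], top - depth);
			return 0;
		}
	}
	fprintf(stderr, "%s not found in trace\n", argv[1]);
	return 1;
}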

Umm, hold on a second, WTF is my standard
create-lots-of-zero-length-inodes-in-parallel workload doing swapping?
Oh, shit, it's also running about 50% slower (50-60k files/s instead
of 110-120k files/s)....

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
