linux-kernel - Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20170127093735.GB4143@dhcp22.suse.cz>
Date:   Fri, 27 Jan 2017 10:37:35 +0100
From:   Michal Hocko <mhocko@...nel.org>
To:     Theodore Ts'o <tytso@....edu>
Cc:     Jan Kara <jack@...e.cz>, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        Dave Chinner <david@...morbit.com>, djwong@...nel.org,
        Chris Mason <clm@...com>, David Sterba <dsterba@...e.cz>,
        ceph-devel@...r.kernel.org, cluster-devel@...hat.com,
        linux-nfs@...r.kernel.org, logfs@...fs.org,
        linux-xfs@...r.kernel.org, linux-ext4@...r.kernel.org,
        linux-btrfs@...r.kernel.org, linux-mtd@...ts.infradead.org,
        reiserfs-devel@...r.kernel.org,
        linux-ntfs-dev@...ts.sourceforge.net,
        linux-f2fs-devel@...ts.sourceforge.net,
        linux-afs@...ts.infradead.org, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

On Fri 27-01-17 01:13:18, Theodore Ts'o wrote:
> On Thu, Jan 26, 2017 at 08:44:55AM +0100, Michal Hocko wrote:
> > > > I'm convinced the current series is OK, only real life will tell us whether
> > > > we missed something or not ;)
> > > 
> > > I would like to extend the changelog of "jbd2: mark the transaction
> > > context with the scope GFP_NOFS context".
> > > 
> > > "
> > > Please note that setups without journal do not suffer from potential
> > > recursion problems and so they do not need the scope protection because
> > > neither ->releasepage nor ->evict_inode (which are the only fs entry
> > > points from the direct reclaim) can reenter a locked context which is
> > > doing the allocation currently.
> > > "
> > 
> > Could you comment on this Ted, please?
> 
> I guess....   so there still is one way this could screw us, and it's this reason for GFP_NOFS:
> 
>         - to prevent from stack overflows during the reclaim because
> 	          the allocation is performed from a deep context already
> 
> The writepages call stack can be pretty deep.  (Especially if we're
> using ext4 in no journal mode over, say, iSCSI.)
> 
> How much stack space can get consumed by a reclaim?

./scripts/stackusage with allyesconfig says:

./mm/page_alloc.c:3745  __alloc_pages_nodemask  264     static
./mm/page_alloc.c:3531  __alloc_pages_slowpath  520     static
./mm/vmscan.c:2946      try_to_free_pages       216     static
./mm/vmscan.c:2753      do_try_to_free_pages    304     static
./mm/vmscan.c:2517      shrink_node     	352     static
./mm/vmscan.c:2317      shrink_node_memcg       560     static
./mm/vmscan.c:1692      shrink_inactive_list    688     static
./mm/vmscan.c:908       shrink_page_list        608     static

So this would be 3512 for the standard LRUs reclaim whether we have
GFP_FS or not. shrink_page_list can recurse to releasepage but there is
no NOFS protection there so it doesn't make much sense to check this
path. So we are left with the slab shrinkers path

./mm/page_alloc.c:3745  __alloc_pages_nodemask  264     static
./mm/page_alloc.c:3531  __alloc_pages_slowpath  520     static
./mm/vmscan.c:2946      try_to_free_pages       216     static
./mm/vmscan.c:2753      do_try_to_free_pages    304     static
./mm/vmscan.c:2517      shrink_node     	352     static
./mm/vmscan.c:427       shrink_slab     	336     static
./fs/super.c:56 	super_cache_scan        104     static << here we have the NOFS protection
./fs/dcache.c:1089      prune_dcache_sb 	152     static
./fs/dcache.c:939       shrink_dentry_list      96      static
./fs/dcache.c:509       __dentry_kill   	72      static
./fs/dcache.c:323       dentry_unlink_inode     64      static
./fs/inode.c:1527       iput    		80      static
./fs/inode.c:532        evict   		72      static

This is where the fs specific callbacks play role and I am not sure
which paths can pass through for ext4 in the nojournal mode and how much
of the stack this can eat. But currently we are at +536 wrt. NOFS
context. This is quite a lot but still much less (2632 vs. 3512) than
the regular reclaim. So there is quite some stack space to eat... I am
wondering whether we have to really treat nojournal mode any special
just because of the stack usage?

If this ever turn out to be a problem and with the vmapped stacks we
have good chances to get a proper stack traces on a potential overflow
we can add the scope API around the problematic code path with the
explanation why it is needed.

Does that make sense to you?

Thanks!
-- 
Michal Hocko
SUSE Labs