Message-ID: <CAGWkznG7_=zjKZBO-sj=79F3a3tgZuXqCXbvddDDG2Atv5043g@mail.gmail.com>
Date: Thu, 12 Sep 2024 17:10:44 +0800
From: Zhaoyang Huang <huangzhaoyang@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: "Theodore Ts'o" <tytso@....edu>, "zhaoyang.huang" <zhaoyang.huang@...soc.com>, 
	Andreas Dilger <adilger.kernel@...ger.ca>, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org, 
	steve.kang@...soc.com
Subject: Re: [RFC PATCHv2 1/1] fs: ext4: Don't use CMA for buffer_head

On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@...e.cz> wrote:
>
> On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@....edu> wrote:
> > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > > >
> > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > > > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > > > getting ejected from memory to try to make room for some contiguous
> > > > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > > > upside of using GFP_MOVEABLE for the buffer cache?
> > > >
> > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce
> > > > extra IO as they just be migrated to free pages instead of ejected
> > > > directly when they are the target memory area. In terms of reclaiming,
> > > > all migrate types of page blocks possess the same position.
> > >
> > > Where is that being done?  I don't see any evidence of this kind of
> > > migration in fs/buffer.c.
> > The journaled pages which carry jh->bh are treated as file pages
> > during isolation of a range of PFNs in the callstack below[1]. The bh
> > will be migrated via each aops's migrate_folio and performs what you
> > described below such as copy the content and reattach the bh to a new
> > page. In terms of the journal enabled ext4 partition, the inode is a
> > blockdev inode which applies buffer_migrate_folio_norefs as its
> > migrate_folio[2].
> >
> > [1]
> > cma_alloc/alloc_contig_range
> >     __alloc_contig_migrate_range
> >         migrate_pages
> >             migrate_folio_move
> >                 move_to_new_folio
> >
> > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio)
> >
> > [2]
> > static int __buffer_migrate_folio(struct address_space *mapping,
> >                 struct folio *dst, struct folio *src, enum migrate_mode mode,
> >                 bool check_refs)
> > {
> > ...
> >         if (check_refs) {
> >                 bool busy;
> >                 bool invalidated = false;
> >
> > recheck_buffers:
> >                 busy = false;
> >                 spin_lock(&mapping->i_private_lock);
> >                 bh = head;
> >                 do {
> >                         if (atomic_read(&bh->b_count)) {
> >           //My case failed here as bh is referred by a journal head.
> >                                 busy = true;
> >                                 break;
> >                         }
> >                         bh = bh->b_this_page;
> >                 } while (bh != head);
>
> Correct. Currently pages with journal heads attached cannot be migrated
> mostly out of pure caution that the generic code isn't sure what's
> happening with them. As I wrote in [1] we could make pages with jhs on
> checkpoint list only migratable as for them the buffer lock is enough to
> stop anybody from touching the bh data. Bhs which are part of a running /
> committing transaction are not realistically migratable but then these
> states are more shortlived so it shouldn't be a big problem.
From what we observe in our test case, the jh remains there for a long time as long as
journal->j_free stays bigger than j_max_transaction_buffers, and that is what makes
cma_alloc fail. So do you think this is rare or abnormal?

[6] j_free & j_max_transaction_buffers
crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
  j_free = 0x3f1,
  j_max_transaction_buffers = 0x100,
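
To make this concrete, below is a tiny user-space model of my understanding of the
jbd2 behaviour (my own sketch, not kernel code; the struct and field names are only
borrowed from journal_t, and the values are taken from the crash dump above):
checkpointing is only kicked when the log runs short of space, so with j_free well
above j_max_transaction_buffers nothing ever cleans the checkpoint list, the jhs keep
their bh references, and __buffer_migrate_folio() keeps bailing out on b_count:

#include <stdbool.h>
#include <stdio.h>

/* toy model only -- simplified from my reading of fs/jbd2, not the real journal_t */
struct journal_model {
	unsigned int j_free;			/* free blocks left in the log */
	unsigned int j_max_transaction_buffers;	/* credits a new transaction may need */
};

/* roughly: jbd2 only waits for log space (and thus checkpoints) when the
 * remaining space cannot cover a new transaction */
static bool would_checkpoint(const struct journal_model *j)
{
	return j->j_free < j->j_max_transaction_buffers;
}

int main(void)
{
	/* values from the crash dump above */
	struct journal_model j = {
		.j_free = 0x3f1,
		.j_max_transaction_buffers = 0x100,
	};

	printf("checkpoint triggered: %s\n",
	       would_checkpoint(&j) ? "yes" :
	       "no -> jhs stay around, bh->b_count stays elevated");
	return 0;
}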

>
> > > > > Just curious, because in general I'm blessed by not having to use CMA
> > > > > in the first place (not having I/O devices too primitive so they can't
> > > > > do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> > > > > missing some of the design considerations behind CMA.  I thought in
> > > > > general CMA tends to used in early boot to allocate things like frame
> > > > > buffers, and after that CMA doesn't tend to get used at all?  That's
> > > > > clearly not the case for you, apparently?
> > > >
> > > > Yes. CMA is designed for contiguous physical memory and has been used
> > > > via cma_alloc during the whole lifetime especially on the system
> > > > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page
> > > > blocks, they also could have compaction path retry for many times
> > > > which is common during high-order alloc_pages.
> > >
> > > But then what's the point of using CMA-eligible memory for the buffer
> > > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer
> > > cache allocations?  After all, that's what is being proposed for
> > > ext4's ext4_getblk().  What's the downside of avoiding the use of
> > > CMA-eligible memory for ext4's buffer cache?  Why not do this for
> > > *all* buffers in the buffer cache?
> > Since migration which arised from alloc_pages or cma_alloc always
> > happens, we need appropriate users over MOVABLE pages. AFAIU, buffer
> > cache pages under regular files are the best candidate for migration
> > as we just need to modify page cache and PTE. Actually, all FSs apply
> > GFP_MOVABLE on their regular files via the below functions.
> >
> > new_inode
> >     alloc_inode
> >         inode_init_always(struct super_block *sb, struct inode *inode)
> >         {
> >          ...
> >             mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>
> Here you speak about data page cache pages. Indeed they can be allocated
> from CMA area. But when Ted speaks about "buffer cache" he specifically
> means page cache of the block device inode and there I can see:
>
> struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
> {
> ...
>         mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> ...
> }
>
> so at this point I'm confused how come you can see block device pages in
> CMA area. Are you using data=journal mode of ext4 in your setup by any
> chance? That would explain it but then that is a horrible idea as well...
The page 'fffffffe01a51c00' [1], which has a bh attached, comes from creating
bitmap_blk via sb_getblk in the call stack shown in [2], where the gfp mask includes
__GFP_MOVABLE. IMO, the GFP_USER on the bdev mapping covers the block device inode's
page cache, but it does not end up governing these metadata buffers, right?
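
For reference, this is roughly where the __GFP_MOVABLE comes from as I read it. The
snippet below is paraphrased from include/linux/buffer_head.h (the exact helpers vary
between kernel versions, so treat it as a sketch rather than a verbatim quote); the
point is that the metadata path supplies its own movable gfp on top of whatever mask
the bdev mapping carries:

/* paraphrased, not verbatim; exact form differs across kernel versions */
static inline struct buffer_head *sb_getblk(struct super_block *sb,
					    sector_t block)
{
	/* metadata reads ask for movable pages no matter that the block
	 * device inode's mapping was set up with GFP_USER */
	return __getblk_gfp(sb->s_bdev, block, sb->s_blocksize, __GFP_MOVABLE);
}

So once such a bh lands in a MOVABLE pageblock inside the CMA area and a jh pins it,
cma_alloc has no way to move it.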

[1]
crash_arm64_v8.0.4++> kmem -p | grep ffffff808f0aa150   (sb->s_bdev->bd_inode->i_mapping)
fffffffe01a51c00  e9470000 ffffff808f0aa150        3  2 8000000008020    lru,private   // within CMA area
fffffffe03d189c0 174627000 ffffff808f0aa150        4  2 2004000000008020 lru,private
fffffffe03d88e00 176238000 ffffff808f0aa150      3f9  2 2008000000008020 lru,private
fffffffe03d88e40 176239000 ffffff808f0aa150        6  2 2008000000008020 lru,private
fffffffe03d88e80 17623a000 ffffff808f0aa150        5  2 2008000000008020 lru,private
fffffffe03d88ec0 17623b000 ffffff808f0aa150        1  2 2008000000008020 lru,private
fffffffe03d88f00 17623c000 ffffff808f0aa150        0  2 2008000000008020 lru,private
fffffffe040e6540 183995000 ffffff808f0aa150      3f4  2 2004000000008020 lru,private

[2]
02148 < 4> [   14.133703] [08-11 18:38:25.133] __find_get_block+0x29c/0x634
02149 < 4> [   14.133711] [08-11 18:38:25.133] __getblk_gfp+0xa8/0x290
0214A < 4> [   14.133716] [08-11 18:38:25.133] ext4_read_inode_bitmap+0xa0/0x6c4
0214B < 4> [   14.133725] [08-11 18:38:25.133] __ext4_new_inode+0x34c/0x10d4
0214C < 4> [   14.133730] [08-11 18:38:25.133] ext4_create+0xdc/0x1cc
0214D < 4> [   14.133737] [08-11 18:38:25.133] path_openat+0x4fc/0xc84
0214E < 4> [   14.133745] [08-11 18:38:25.133] do_filp_open+0xc0/0x16c
0214F < 4> [   14.133751] [08-11 18:38:25.133] do_sys_openat2+0x8c/0xf8
02150 < 4> [   14.133758] [08-11 18:38:25.133] __arm64_sys_openat+0x78/0xa4
02151 < 4> [   14.133764] [08-11 18:38:25.133] invoke_syscall+0x60/0x11c
02152 < 4> [   14.133771] [08-11 18:38:25.133] el0_svc_common+0xb4/0xe8
02153 < 4> [   14.133777] [08-11 18:38:25.133] do_el0_svc+0x24/0x30
02154 < 4> [   14.133783] [08-11 18:38:25.133] el0_svc+0x3c/0x70
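
For completeness, the direction of the RFC as I understand it (a sketch only, not the
literal diff; sb_getblk_gfp() already exists for this kind of override) is simply to
stop passing __GFP_MOVABLE for these ext4 metadata buffers, e.g. in ext4_getblk():

	/* sketch only -- the actual patch may differ */
	bh = sb_getblk_gfp(inode->i_sb, map.m_pblk, 0);	/* no __GFP_MOVABLE, so the bh
							   cannot be placed in a CMA pageblock */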

>
>                                                                 Honza
> --
> Jan Kara <jack@...e.com>
> SUSE Labs, CR
