linux-ext4 - Re: [RFC PATCHv2 1/1] fs: ext4: Don't use CMA for buffer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240912084119.j3oqfikuavymctlm@quack3>
Date: Thu, 12 Sep 2024 10:41:19 +0200
From: Jan Kara <jack@...e.cz>
To: Zhaoyang Huang <huangzhaoyang@...il.com>
Cc: Theodore Ts'o <tytso@....edu>,
	"zhaoyang.huang" <zhaoyang.huang@...soc.com>,
	Andreas Dilger <adilger.kernel@...ger.ca>,
	Baolin Wang <baolin.wang@...ux.alibaba.com>,
	linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
	steve.kang@...soc.com
Subject: Re: [RFC PATCHv2 1/1] fs: ext4: Don't use CMA for buffer_head

On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@....edu> wrote:
> > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > >
> > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > > getting ejected from memory to try to make room for some contiguous
> > > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > > upside of using GFP_MOVEABLE for the buffer cache?
> > >
> > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce
> > > extra IO as they just be migrated to free pages instead of ejected
> > > directly when they are the target memory area. In terms of reclaiming,
> > > all migrate types of page blocks possess the same position.
> >
> > Where is that being done?  I don't see any evidence of this kind of
> > migration in fs/buffer.c.
> The journaled pages which carry jh->bh are treated as file pages
> during isolation of a range of PFNs in the callstack below[1]. The bh
> will be migrated via each aops's migrate_folio and performs what you
> described below such as copy the content and reattach the bh to a new
> page. In terms of the journal enabled ext4 partition, the inode is a
> blockdev inode which applies buffer_migrate_folio_norefs as its
> migrate_folio[2].
> 
> [1]
> cma_alloc/alloc_contig_range
>     __alloc_contig_migrate_range
>         migrate_pages
>             migrate_folio_move
>                 move_to_new_folio
> 
> mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio)
> 
> [2]
> static int __buffer_migrate_folio(struct address_space *mapping,
>                 struct folio *dst, struct folio *src, enum migrate_mode mode,
>                 bool check_refs)
> {
> ...
>         if (check_refs) {
>                 bool busy;
>                 bool invalidated = false;
> 
> recheck_buffers:
>                 busy = false;
>                 spin_lock(&mapping->i_private_lock);
>                 bh = head;
>                 do {
>                         if (atomic_read(&bh->b_count)) {
>           //My case failed here as bh is referred by a journal head.
>                                 busy = true;
>                                 break;
>                         }
>                         bh = bh->b_this_page;
>                 } while (bh != head);

Correct. Currently pages with journal heads attached cannot be migrated
mostly out of pure caution that the generic code isn't sure what's
happening with them. As I wrote in [1] we could make pages with jhs on
checkpoint list only migratable as for them the buffer lock is enough to
stop anybody from touching the bh data. Bhs which are part of a running /
committing transaction are not realistically migratable but then these
states are more shortlived so it shouldn't be a big problem.

> > > > Just curious, because in general I'm blessed by not having to use CMA
> > > > in the first place (not having I/O devices too primitive so they can't
> > > > do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> > > > missing some of the design considerations behind CMA.  I thought in
> > > > general CMA tends to used in early boot to allocate things like frame
> > > > buffers, and after that CMA doesn't tend to get used at all?  That's
> > > > clearly not the case for you, apparently?
> > >
> > > Yes. CMA is designed for contiguous physical memory and has been used
> > > via cma_alloc during the whole lifetime especially on the system
> > > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page
> > > blocks, they also could have compaction path retry for many times
> > > which is common during high-order alloc_pages.
> >
> > But then what's the point of using CMA-eligible memory for the buffer
> > cache, as opposed to just always using !__GFP_MOVEABLE for all buffer
> > cache allocations?  After all, that's what is being proposed for
> > ext4's ext4_getblk().  What's the downside of avoiding the use of
> > CMA-eligible memory for ext4's buffer cache?  Why not do this for
> > *all* buffers in the buffer cache?
> Since migration which arised from alloc_pages or cma_alloc always
> happens, we need appropriate users over MOVABLE pages. AFAIU, buffer
> cache pages under regular files are the best candidate for migration
> as we just need to modify page cache and PTE. Actually, all FSs apply
> GFP_MOVABLE on their regular files via the below functions.
> 
> new_inode
>     alloc_inode
>         inode_init_always(struct super_block *sb, struct inode *inode)
>         {
>          ...
>             mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);

Here you speak about data page cache pages. Indeed they can be allocated
from CMA area. But when Ted speaks about "buffer cache" he specifically
means page cache of the block device inode and there I can see:

struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
{
...
        mapping_set_gfp_mask(&inode->i_data, GFP_USER);
...
}

so at this point I'm confused how come you can see block device pages in
CMA area. Are you using data=journal mode of ext4 in your setup by any
chance? That would explain it but then that is a horrible idea as well...

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR