Message-ID: <CAGWkznGQkoJbUW7hkUK1+i4ww9ihtY2cUTZbC_jqwFq3HDqE4g@mail.gmail.com>
Date: Fri, 13 Sep 2024 09:39:57 +0800
From: Zhaoyang Huang <huangzhaoyang@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: "Theodore Ts'o" <tytso@....edu>, "zhaoyang.huang" <zhaoyang.huang@...soc.com>, 
	Andreas Dilger <adilger.kernel@...ger.ca>, Baolin Wang <baolin.wang@...ux.alibaba.com>, 
	linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org, 
	steve.kang@...soc.com
Subject: Re: [RFC PATCHv2 1/1] fs: ext4: Don't use CMA for buffer_head

On Thu, Sep 12, 2024 at 6:16 PM Jan Kara <jack@...e.cz> wrote:
>
> On Thu 12-09-24 17:10:44, Zhaoyang Huang wrote:
> > On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@...e.cz> wrote:
> > >
> > > On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> > > > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@....edu> wrote:
> > > > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > > > > >
> > > > > > > After all, using GFP_MOVABLE memory seems to mean that the buffer
> > > > > > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > > > > > getting ejected from memory to try to make room for some contiguous
> > > > > > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > > > > > upside of using GFP_MOVABLE for the buffer cache?
> > > > > >
> > > > > > To my understanding, no. Using GFP_MOVABLE memory doesn't introduce
> > > > > > extra I/O: when such pages fall inside the target memory area they are
> > > > > > migrated to free pages rather than ejected directly. In terms of
> > > > > > reclaim, all migrate types of page blocks are treated the same way.
> > > > >
> > > > > Where is that being done?  I don't see any evidence of this kind of
> > > > > migration in fs/buffer.c.
> > > > The journaled pages which carry jh->bh are treated as file pages
> > > > while a range of PFNs is isolated in the callstack below [1]. The bh
> > > > is migrated via the mapping's aops->migrate_folio, which does what you
> > > > described: it copies the content and reattaches the bh to a new
> > > > page. For a journal-enabled ext4 partition, the inode is a
> > > > blockdev inode which uses buffer_migrate_folio_norefs as its
> > > > migrate_folio [2].
> > > >
> > > > [1]
> > > > cma_alloc/alloc_contig_range
> > > >     __alloc_contig_migrate_range
> > > >         migrate_pages
> > > >             migrate_folio_move
> > > >                 move_to_new_folio
> > > >
> > > > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio)
> > > >
> > > > [2]
> > > > static int __buffer_migrate_folio(struct address_space *mapping,
> > > >                 struct folio *dst, struct folio *src, enum migrate_mode mode,
> > > >                 bool check_refs)
> > > > {
> > > > ...
> > > >         if (check_refs) {
> > > >                 bool busy;
> > > >                 bool invalidated = false;
> > > >
> > > > recheck_buffers:
> > > >                 busy = false;
> > > >                 spin_lock(&mapping->i_private_lock);
> > > >                 bh = head;
> > > >                 do {
> > > >                         if (atomic_read(&bh->b_count)) {
> > > >           //My case failed here as bh is referred by a journal head.
> > > >                                 busy = true;
> > > >                                 break;
> > > >                         }
> > > >                         bh = bh->b_this_page;
> > > >                 } while (bh != head);
> > >
> > > Correct. Currently pages with journal heads attached cannot be migrated,
> > > mostly out of pure caution, because the generic code isn't sure what's
> > > happening with them. As I wrote in [1], we could make pages whose jhs are
> > > only on the checkpoint list migratable, since for them the buffer lock is
> > > enough to stop anybody from touching the bh data. Bhs which are part of a
> > > running / committing transaction are not realistically migratable, but
> > > those states are more short-lived so it shouldn't be a big problem.
> > From what we observe in our test case, the jh remains there for a long
> > time when journal->j_free is bigger than j_max_transaction_buffers, which
> > makes cma_alloc fail. So do you think this is rare or abnormal?
> >
> > [6] j_free & j_max_transaction_buffers
> > crash_arm64_v8.0.4++> struct
> > journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
> >   j_free = 0x3f1,
> >   j_max_transaction_buffers = 0x100,
>
> So jh can stay attached to the bh for a very long time (basically only
> memory pressure will evict it) and this is what blocks migration. But what
> I meant is that in fact, most of the time we can migrate bh with jh
> attached just fine. There are only relatively short moments (max 5s) where
> a buffer (jh) is part of a running or committing transaction and then we
> cannot really migrate.
Please correct me if I am wrong. According to __buffer_migrate_folio,
the bh cannot be migrated as long as it has a jh attached, which could
remain until the next checkpoint transaction is launched. In my case,
jbd2's log space is big enough (j_free = 0x3f1 >
j_max_transaction_buffers = 0x100) that the launch never happens.

static int __buffer_migrate_folio(struct address_space *mapping,
                struct folio *dst, struct folio *src, enum migrate_mode mode,
                bool check_refs)
{
 ...
recheck_buffers:
                busy = false;
                spin_lock(&mapping->i_private_lock);
                bh = head;
                do {
                        if (atomic_read(&bh->b_count)) {
//migration fails here: the attached jh holds a reference, so b_count != 0
                                busy = true;
                                break;
                        }
                        bh = bh->b_this_page;
                } while (bh != head);
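
For context, my understanding of why no checkpoint gets forced here (a rough
sketch of the jbd2 logic, simplified; the exact code in
fs/jbd2/transaction.c differs by kernel version):

        /*
         * Sketch: jbd2 only forces log-space reclaim (checkpointing of old
         * transactions, which detaches their jhs) when the free log space
         * drops below j_max_transaction_buffers. With j_free = 0x3f1 and
         * j_max_transaction_buffers = 0x100 this branch is never taken, so
         * the stale jh stays attached to the bh and keeps blocking migration.
         */
        if (jbd2_log_space_left(journal) < journal->j_max_transaction_buffers)
                __jbd2_log_wait_for_space(journal);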

>
> > > > > > > Just curious, because in general I'm blessed by not having to use CMA
> > > > > > > in the first place (not having I/O devices too primitive so they can't
> > > > > > > do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> > > > > > > missing some of the design considerations behind CMA.  I thought in
> > > > > > > general CMA tends to be used in early boot to allocate things like frame
> > > > > > > buffers, and after that CMA doesn't tend to get used at all?  That's
> > > > > > > clearly not the case for you, apparently?
> > > > > >
> > > > > > Yes. CMA is designed for contiguous physical memory and is used via
> > > > > > cma_alloc during the system's whole lifetime, especially on systems
> > > > > > without an SMMU, e.g. by DRM drivers. As for MIGRATE_MOVABLE page
> > > > > > blocks, the compaction path can also retry them many times, which is
> > > > > > common during high-order alloc_pages.
> > > > >
> > > > > But then what's the point of using CMA-eligible memory for the buffer
> > > > > cache, as opposed to just always using !__GFP_MOVABLE for all buffer
> > > > > cache allocations?  After all, that's what is being proposed for
> > > > > ext4's ext4_getblk().  What's the downside of avoiding the use of
> > > > > CMA-eligible memory for ext4's buffer cache?  Why not do this for
> > > > > *all* buffers in the buffer cache?
> > > > Since migration arising from alloc_pages or cma_alloc always
> > > > happens, we need appropriate users of MOVABLE pages. AFAIU, buffer
> > > > cache pages under regular files are the best candidates for migration,
> > > > as we just need to update the page cache and the PTEs. Actually, all
> > > > FSs apply GFP_MOVABLE to their regular files via the functions below.
> > > >
> > > > new_inode
> > > >     alloc_inode
> > > >         inode_init_always(struct super_block *sb, struct inode *inode)
> > > >         {
> > > >          ...
> > > >             mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> > >
> > > Here you speak about data page cache pages. Indeed they can be allocated
> > > from CMA area. But when Ted speaks about "buffer cache" he specifically
> > > means page cache of the block device inode and there I can see:
> > >
> > > struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
> > > {
> > > ...
> > >         mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> > > ...
> > > }
> > >
> > > so at this point I'm confused how come you can see block device pages in
> > > CMA area. Are you using data=journal mode of ext4 in your setup by any
> > > chance? That would explain it but then that is a horrible idea as well...
> > The page 'fffffffe01a51c00' [1], which has a bh attached, comes from
> > creating bitmap_blk via ext4_getblk->sb_getblk in process [2], where the
> > gfp mask includes __GFP_MOVABLE. IMO, GFP_USER is used for regular file
> > pages under the super_block but does not apply to metadata, right?
>
> Ah, right, __getblk() overrides the GFP mode set in bdev's mapping_gfp_mask
> and sets __GFP_MOVABLE there. This behavior goes way back to 2007
> (769848c03895b ("Add __GFP_MOVABLE for callers to flag allocations from
> high memory that may be migrated")). So another clean and simple way to fix
> this (as Ted suggests) would be to stop setting __GFP_MOVABLE in
> __getblk(). For ext4 there would be no practical difference from the current
> situation, where metadata pages are practically unmovable due to attached
> jhs; other filesystems can set __GFP_MOVABLE in the bdev's mapping_gfp_mask if
> they care (but I don't think there are other big buffer cache users on
> systems which need CMA).
>
>                                                                 Honza
> --
> Jan Kara <jack@...e.com>
> SUSE Labs, CR
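
For reference, a rough sketch of the change I understand Jan to be suggesting
(going by his description above that __getblk() ORs __GFP_MOVABLE into the
gfp mask taken from the bdev's mapping_gfp_mask; the exact helpers and file
layout differ across kernel versions, so treat this as an illustration, not
a patch):

        /*
         * Sketch only: allocate buffer cache pages without __GFP_MOVABLE.
         * "mapping" stands for the block device's page-cache mapping (the
         * field that holds it differs across kernel versions).
         */
        gfp_t gfp = mapping_gfp_constraint(mapping, ~__GFP_FS);

        gfp |= __GFP_NOFAIL;            /* was: __GFP_MOVABLE | __GFP_NOFAIL */
        bh = bdev_getblk(bdev, block, size, gfp);

With that, ext4's metadata buffers would no longer land in movable
(CMA-eligible) pageblocks by default, and a filesystem that still wants
movable buffer cache pages could opt back in via the bdev mapping's gfp mask.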
