Message-ID: <CAGWkznFGDJsyMUhn5Y8DPmhba9h4GNkX_CaqEMev4z23xa-s6g@mail.gmail.com>
Date: Wed, 4 Sep 2024 14:56:29 +0800
From: Zhaoyang Huang <huangzhaoyang@...il.com>
To: "Theodore Ts'o" <tytso@....edu>
Cc: "zhaoyang.huang" <zhaoyang.huang@...soc.com>, Andreas Dilger <adilger.kernel@...ger.ca>, 
	Baolin Wang <baolin.wang@...ux.alibaba.com>, linux-fsdevel@...r.kernel.org, 
	linux-ext4@...r.kernel.org, steve.kang@...soc.com
Subject: Re: [RFC PATCHv2 1/1] fs: ext4: Don't use CMA for buffer_head

On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@....edu> wrote:
>
> On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > >
> > > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > getting ejected from memory to try to make room for some contiguous
> > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > upside of using GFP_MOVEABLE for the buffer cache?
> >
> > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce
> > extra IO as they just be migrated to free pages instead of ejected
> > directly when they are the target memory area. In terms of reclaiming,
> > all migrate types of page blocks possess the same position.
>
> Where is that being done?  I don't see any evidence of this kind of
> migration in fs/buffer.c.
The journaled pages that carry jh->bh are treated as file pages when a
range of PFNs is isolated, as in the call stack below [1]. The bh is
migrated through each mapping's a_ops->migrate_folio, which does what
you describe below: it copies the content and reattaches the bh to a
new page. For an ext4 partition with the journal enabled, the inode is
a blockdev inode, which uses buffer_migrate_folio_norefs as its
migrate_folio [2].

[1]
cma_alloc/alloc_contig_range
    __alloc_contig_migrate_range
        migrate_pages
            migrate_folio_move
                move_to_new_folio
                    mapping->a_ops->migrate_folio
                    (buffer_migrate_folio_norefs -> __buffer_migrate_folio)

[2]
static int __buffer_migrate_folio(struct address_space *mapping,
                struct folio *dst, struct folio *src, enum migrate_mode mode,
                bool check_refs)
{
...
        if (check_refs) {
                bool busy;
                bool invalidated = false;

recheck_buffers:
                busy = false;
                spin_lock(&mapping->i_private_lock);
                bh = head;
                do {
                        if (atomic_read(&bh->b_count)) {
                                /* My case fails here: the bh is referenced by a journal head. */
                                busy = true;
                                break;
                        }
                        bh = bh->b_this_page;
                } while (bh != head);
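
To make the check in [2] concrete, here is a minimal userspace model of
the b_this_page walk. It is only a sketch: struct model_bh and both
helper names are made up for illustration, b_count is a plain int
rather than an atomic_t, and there is no locking. It shows that a
single held reference, such as the one a journal head keeps, is enough
to mark the whole folio busy.

```c
/*
 * Userspace model only. The field names mirror the kernel's
 * buffer_head (b_count, b_this_page), but these are NOT the kernel
 * definitions, and all model_* names are hypothetical.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_bh {
	int b_count;                  /* stands in for atomic_t b_count */
	struct model_bh *b_this_page; /* circular list of buffers on one folio */
};

/* Return true if any buffer on the ring still holds a reference. */
static bool model_folio_buffers_busy(struct model_bh *head)
{
	struct model_bh *bh = head;

	assert(head != NULL);
	do {
		if (bh->b_count)
			return true;  /* one busy buffer blocks the whole folio */
		bh = bh->b_this_page;
	} while (bh != head);

	return false;
}

/* Link n buffers into a ring with zero references and return the head. */
static struct model_bh *model_link_ring(struct model_bh *bhs, size_t n)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		bhs[i].b_count = 0;
		bhs[i].b_this_page = &bhs[(i + 1) % n];
	}
	return &bhs[0];
}
```

With a four-buffer ring, setting bhs[2].b_count = 1 makes
model_folio_buffers_busy() return true, and dropping it back to 0 makes
it return false again.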

>
> It's *possile* I suppose, but you'd have to remove the buffer_head so
> it can't be found by getblk(), and then wait for bh->b_count to go to
> zero, and then allocate a new page, and then copy buffer_head's page,
> update the buffer_head, and then rechain the bh into the buffer cache.
> And as I said, I can't see any kind of code like that.  It would be
> much simpler to just try to eject the bh from the buffer cache.  And
> that's consistent which what you've observed, which is that if the
> buffer_head is prevented from being ejected because it's held by the
> jbd2 layer until the buffer has been checkpointed.
All of the above is right, except that the buffer_head will be
reattached to a new page instead of being ejected, as it still points
to checkpoint data.
>
> > > Just curious, because in general I'm blessed by not having to use CMA
> > > in the first place (not having I/O devices too primitive so they can't
> > > do scatter-gather :-).  So I don't tend to use CMA, and obviously I'm
> > > missing some of the design considerations behind CMA.  I thought in
> > > general CMA tends to used in early boot to allocate things like frame
> > > buffers, and after that CMA doesn't tend to get used at all?  That's
> > > clearly not the case for you, apparently?
> >
> > Yes. CMA is designed for contiguous physical memory and has been used
> > via cma_alloc during the whole lifetime especially on the system
> > without SMMU, such as DRM driver. In terms of MIGRATE_MOVABLE page
> > blocks, they also could have compaction path retry for many times
> > which is common during high-order alloc_pages.
>
> But then what's the point of using CMA-eligible memory for the buffer
> cache, as opposed to just always using !__GFP_MOVEABLE for all buffer
> cache allocations?  After all, that's what is being proposed for
> ext4's ext4_getblk().  What's the downside of avoiding the use of
> CMA-eligible memory for ext4's buffer cache?  Why not do this for
> *all* buffers in the buffer cache?
Since migration triggered by alloc_pages or cma_alloc happens all the
time, MOVABLE pages need appropriate users. AFAIU, buffer cache pages
under regular files are the best candidates for migration, since only
the page cache and the PTEs need to be updated. In fact, all
filesystems apply a movable GFP mask (GFP_HIGHUSER_MOVABLE) to their
regular files through the functions below.

new_inode
    alloc_inode
        inode_init_always(struct super_block *sb, struct inode *inode)
        {
         ...
            mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);

static int filemap_create_folio(struct file *file,
                struct address_space *mapping, pgoff_t index,
                struct folio_batch *fbatch)
{
        struct folio *folio;
        int error;

        folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);

>
>                                         - Ted
