linux-ext4 - Re: [RFC PATCHv2 1/1] fs: ext4: Don't use CMA for buffer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20240913104938.onpgr3h6crtbmsmc@quack3>
Date: Fri, 13 Sep 2024 12:49:38 +0200
From: Jan Kara <jack@...e.cz>
To: Zhaoyang Huang <huangzhaoyang@...il.com>
Cc: Jan Kara <jack@...e.cz>, Theodore Ts'o <tytso@....edu>,
	"zhaoyang.huang" <zhaoyang.huang@...soc.com>,
	Andreas Dilger <adilger.kernel@...ger.ca>,
	Baolin Wang <baolin.wang@...ux.alibaba.com>,
	linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
	steve.kang@...soc.com
Subject: Re: [RFC PATCHv2 1/1] fs: ext4: Don't use CMA for buffer_head

On Fri 13-09-24 09:39:57, Zhaoyang Huang wrote:
> On Thu, Sep 12, 2024 at 6:16 PM Jan Kara <jack@...e.cz> wrote:
> >
> > On Thu 12-09-24 17:10:44, Zhaoyang Huang wrote:
> > > On Thu, Sep 12, 2024 at 4:41 PM Jan Kara <jack@...e.cz> wrote:
> > > >
> > > > On Wed 04-09-24 14:56:29, Zhaoyang Huang wrote:
> > > > > On Wed, Sep 4, 2024 at 10:44 AM Theodore Ts'o <tytso@....edu> wrote:
> > > > > > On Wed, Sep 04, 2024 at 08:49:10AM +0800, Zhaoyang Huang wrote:
> > > > > > > >
> > > > > > > > After all, using GFP_MOVEABLE memory seems to mean that the buffer
> > > > > > > > cache might get thrashed a lot by having a lot of cached disk buffers
> > > > > > > > getting ejected from memory to try to make room for some contiguous
> > > > > > > > frame buffer memory, which means extra I/O overhead.  So what's the
> > > > > > > > upside of using GFP_MOVEABLE for the buffer cache?
> > > > > > >
> > > > > > > To my understanding, NO. using GFP_MOVEABLE memory doesn't introduce
> > > > > > > extra IO as they just be migrated to free pages instead of ejected
> > > > > > > directly when they are the target memory area. In terms of reclaiming,
> > > > > > > all migrate types of page blocks possess the same position.
> > > > > >
> > > > > > Where is that being done?  I don't see any evidence of this kind of
> > > > > > migration in fs/buffer.c.
> > > > > The journaled pages which carry jh->bh are treated as file pages
> > > > > during isolation of a range of PFNs in the callstack below[1]. The bh
> > > > > will be migrated via each aops's migrate_folio and performs what you
> > > > > described below such as copy the content and reattach the bh to a new
> > > > > page. In terms of the journal enabled ext4 partition, the inode is a
> > > > > blockdev inode which applies buffer_migrate_folio_norefs as its
> > > > > migrate_folio[2].
> > > > >
> > > > > [1]
> > > > > cma_alloc/alloc_contig_range
> > > > >     __alloc_contig_migrate_range
> > > > >         migrate_pages
> > > > >             migrate_folio_move
> > > > >                 move_to_new_folio
> > > > >
> > > > > mapping->aops->migrate_folio(buffer_migrate_folio_norefs->__buffer_migrate_folio)
> > > > >
> > > > > [2]
> > > > > static int __buffer_migrate_folio(struct address_space *mapping,
> > > > >                 struct folio *dst, struct folio *src, enum migrate_mode mode,
> > > > >                 bool check_refs)
> > > > > {
> > > > > ...
> > > > >         if (check_refs) {
> > > > >                 bool busy;
> > > > >                 bool invalidated = false;
> > > > >
> > > > > recheck_buffers:
> > > > >                 busy = false;
> > > > >                 spin_lock(&mapping->i_private_lock);
> > > > >                 bh = head;
> > > > >                 do {
> > > > >                         if (atomic_read(&bh->b_count)) {
> > > > >           //My case failed here as bh is referred by a journal head.
> > > > >                                 busy = true;
> > > > >                                 break;
> > > > >                         }
> > > > >                         bh = bh->b_this_page;
> > > > >                 } while (bh != head);
> > > >
> > > > Correct. Currently pages with journal heads attached cannot be migrated
> > > > mostly out of pure caution that the generic code isn't sure what's
> > > > happening with them. As I wrote in [1] we could make pages with jhs on
> > > > checkpoint list only migratable as for them the buffer lock is enough to
> > > > stop anybody from touching the bh data. Bhs which are part of a running /
> > > > committing transaction are not realistically migratable but then these
> > > > states are more shortlived so it shouldn't be a big problem.
> > > By observing from our test case, the jh remains there for a long time
> > > when journal->j_free is bigger than j_max_transaction_buffers which
> > > failed cma_alloc. So you think this is rare or abnormal?
> > >
> > > [6] j_free & j_max_transaction_buffers
> > > crash_arm64_v8.0.4++> struct
> > > journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x
> > >   j_free = 0x3f1,
> > >   j_max_transaction_buffers = 0x100,
> >
> > So jh can stay attached to the bh for a very long time (basically only
> > memory pressure will evict it) and this is what blocks migration. But what
> > I meant is that in fact, most of the time we can migrate bh with jh
> > attached just fine. There are only relatively short moments (max 5s) where
> > a buffer (jh) is part of a running or committing transaction and then we
> > cannot really migrate.
> Please correct me if I am wrong. According to __buffer_migrate_folio,
> the bh can not be migrated as long as it has jh attached which could
> remain until the next cp transaction is launched. In my case, the
> jbd2' log space is big enough( j_free = 0x3f1 >
> j_max_transaction_buffers = 0x100) to escape the launch.

Currently yes. My proposal was to make bh migratable even with jh attached
(which obviously needs some tweaks to __buffer_migrate_folio()).

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR