Message-ID: <Zf1B1cPj/aO21pjZ@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com>
Date: Fri, 22 Mar 2024 14:01:17 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Frederick Lawler <fred@...udflare.com>
Cc: "Theodore Ts'o" <tytso@....edu>, linux-ext4@...r.kernel.org,
        Ritesh Harjani <ritesh.list@...il.com>, linux-kernel@...r.kernel.org,
        Jan Kara <jack@...e.cz>, glandvador@...oo.com, bugzilla@...l.emu.id.au,
        kernel-team@...udflare.com
Subject: Re: [PATCH 0/1] Fix for recent bugzilla reports related to long
 halts during block allocation

On Wed, Mar 20, 2024 at 11:52:58AM -0500, Frederick Lawler wrote:
> Hi Theodore and Ojaswin,
> 
> On Mon, Jan 08, 2024 at 09:53:18PM -0500, Theodore Ts'o wrote:
> > 
> > On Fri, 15 Dec 2023 16:49:49 +0530, Ojaswin Mujoo wrote:
> > > This patch intends to fix the recent bugzilla [1] report where the
> > > kworker flush thread seemed to be taking 100% CPU utilization and was
> > > slowing down the whole system. The backtrace indicated that we were
> > > stuck in the mballoc allocation path. The issue was only seen on kernel
> > > 6.5+ and when ext4 was mounted with -o stripe (or the stripe option was
> > > implicitly added due to the mkfs flags used).
> > > 
> > > [...]
> > 
> > Applied, thanks!
> 
> I backported this patch to at least 6.6 and tested on our fleet of
> software RAID 0 NVME SSD nodes. This change worked very nicely
> for us. We're interested in backporting this to at least 6.6.
> 
> I tried looking at xfstests, and didn't really see a good match
> (user error?) to validate the fix via that. So I'm a little unclear what
> the path forward here is.
> 
> Although we experienced this issue in 6.1, I didn't backport to 6.1 and
> test to verify that this also works there; however, setting stripe to 0
> did work in the 6.1 case.
> 
> Best,
> Fred

Hi Fred,

If I understand correctly, you are looking for a test case that you
could use to confirm whether the issue exists and whether the backport
solves it, right?

Actually, I was never able to replicate this at my end, so I had to rely
on people hitting the bug to confirm whether the fix works. I did set out
to write a test case that could help us reliably replicate this issue, but
it needs a very specially crafted FS that is a bit difficult to achieve
from user space. I was using debugfs to create an FS that could hit it,
but I kept running into issues where it wouldn't mount, etc. Maybe there's
a better way to craft such an FS that I'm not aware of.

One more option is that maybe we can have a KUnit test for this in the
mballoc code, but I'd need to read some more about the KUnit
infrastructure to see if it's possible/feasible.
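
Short of a real reproducer, one thing that can at least be checked from
user space is whether a machine matches the configuration from the
original report, i.e. ext4 mounted with a non-zero stripe. The sketch
below is only an illustration and not part of the patch: the function
name ext4_stripe_mounts is hypothetical, and it assumes the mount table
text is in the /proc/mounts format with the stripe value appearing as a
"stripe=N" mount option. It does not trigger or detect the allocation
stall itself.

```python
def ext4_stripe_mounts(mounts_text):
    """Return (device, mountpoint, stripe) for ext4 mounts with stripe > 0.

    mounts_text is assumed to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not ext4.
        if len(fields) < 4 or fields[2] != "ext4":
            continue
        for opt in fields[3].split(","):
            if not opt.startswith("stripe="):
                continue
            try:
                stripe = int(opt.split("=", 1)[1])
            except ValueError:
                # Ignore a value we cannot parse rather than guess.
                continue
            if stripe > 0:
                found.append((fields[0], fields[1], stripe))
    return found
```

To use it, pass in the contents of /proc/mounts, for example
ext4_stripe_mounts(open("/proc/mounts").read()). An empty list only
means no ext4 mount reported a non-zero stripe in that text; it says
nothing about whether the fix is present in the running kernel.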

Regards,
ojaswin
> 
> > 
> > [1/1] ext4: fallback to complex scan if aligned scan doesn't work
> >       commit: a26b6faf7f1c9c1ba6edb3fea9d1390201f2ed50
> > 
> > Best regards,
> > -- 
> > Theodore Ts'o <tytso@....edu>
