Message-Id: <D49C9D27-7523-41C9-8B8D-82B2A7CBE97B@flyingcircus.io>
Date: Thu, 19 Sep 2024 12:19:19 +0200
From: Christian Theune <ct@...ingcircus.io>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Dave Chinner <david@...morbit.com>,
 Matthew Wilcox <willy@...radead.org>,
 Chris Mason <clm@...a.com>,
 Jens Axboe <axboe@...nel.dk>,
 linux-mm@...ck.org,
 "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
 linux-fsdevel@...r.kernel.org,
 linux-kernel@...r.kernel.org,
 Daniel Dao <dqminh@...udflare.com>,
 regressions@...ts.linux.dev,
 regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)



> On 19. Sep 2024, at 08:57, Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> 
> Yeah, right now Jens is still going to run some more testing, but I
> think the plan is to just backport
> 
>  a4864671ca0b ("lib/xarray: introduce a new helper xas_get_order")
>  6758c1128ceb ("mm/filemap: optimize filemap folio adding")
> 
> and I think we're at the point where you might as well start testing
> that if you have the cycles for it. Jens is mostly trying to confirm
> the root cause, but even without that, I think you running your load
> with those two changes back-ported is worth it.
> 
> (Or even just try running it on plain 6.10 or 6.11, both of which
> already have those commits)

I’ve discussed this with my team, and we’re preparing to switch over all
our non-production machines, as well as those production machines that
have shown the error before.

This will require a bit of user communication and reboot scheduling.
Our release preparation should let us roll this out starting early next
week, with the production machines in question following around Sept 30.

We would run 6.11, as our understanding so far is that running the most
current kernel generates the most insight and is easiest for you all to
work with. Is that right?

(Generally we run a mostly vanilla LTS kernel once it has passed
x.y.50, so we might later downgrade to 6.6 once this is fixed.)
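For that later step, a minimal sketch of how the backport could look on
our side, assuming the two commits cherry-pick cleanly onto a 6.6.y
stable tag (the base tag and branch name below are illustrative, untested):

  # hypothetical 6.6.y base; substitute whatever stable tag we ship then
  git checkout -b folio-fix-backport v6.6.51
  git cherry-pick a4864671ca0b  # lib/xarray: introduce a new helper xas_get_order
  git cherry-pick 6758c1128ceb  # mm/filemap: optimize filemap folio adding

(The xarray helper has to go in first, since the filemap fix depends on it.)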

> So considering how well the reproducer works for Jens and Chris, my
> main worry is whether your load might have some _additional_ issue.
> 
> Unlikely, but still .. The two commits fix the reproducer, so I think
> the important thing to make sure is that it really fixes the original
> issue too.
> 
> And yeah, I'd be surprised if it doesn't, but at the same time I would
> _not_ suggest you try to make your load look more like the case we
> already know gets fixed.
> 
> So yes, it will be "weeks of not seeing crashes" until we'd be
> _really_ confident it's all the same thing, but I'd rather still have
> you test that, than test something other than what caused issues
> originally, if you see what I mean.

Agreed, I’m fully on board with that.

Kind regards,
Christian Theune

-- 
Christian Theune · ct@...ingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Germany
HR Stendal HRB 21169 · Managing Directors: Christian Theune, Christian Zagrodnick

