lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wg51pT_b+tgHuAaO6O0PT19WY9p3CXEqTn=LO8Zjaf=7A@mail.gmail.com>
Date: Thu, 12 Sep 2024 15:56:17 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Jens Axboe <axboe@...nel.dk>
Cc: Matthew Wilcox <willy@...radead.org>, Christian Theune <ct@...ingcircus.io>, linux-mm@...ck.org, 
	"linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, Daniel Dao <dqminh@...udflare.com>, 
	Dave Chinner <david@...morbit.com>, clm@...a.com, regressions@...ts.linux.dev, 
	regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)

On Thu, 12 Sept 2024 at 15:30, Jens Axboe <axboe@...nel.dk> wrote:
>
> It might be an iomap thing... Other file systems do use it, but to
> various degrees, and XFS is definitely the primary user.

I have to say, I looked at the iomap code, and it's disgusting.

The "I don't support large folios" check doesn't even say "don't do
large folios". That's what the regular __filemap_get_folio() code does
for reads, and that's the sane thing to do. But that's not what the
iomap code does. AT ALL.

No, the iomap code limits "len" of a write in iomap_write_begin() to
be within one page, and then magically depends on

 (a) __iomap_get_folio() using that length to decide how big a folio to allocate

 (b) iomap_write_begin() doing its own "what is the real length:" based on that.

 (c) the *caller* then having to do the same thing, to see what length
iomap_write_begin() _actually_ used (because it wasn't the 'bytes'
that was passed in).

Honestly, the iomap code is just odd. Having these kinds of subtle
interdependencies doesn't make sense. The two code sequences don't
even use the same logic, with iomap_write_begin() doing

        if (!mapping_large_folio_support(iter->inode->i_mapping))
                len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
        [... alloc folio ...]
        if (pos + len > folio_pos(folio) + folio_size(folio))
                len = folio_pos(folio) + folio_size(folio) - pos;

and the caller (iomap_write_iter) doing

                offset = offset_in_folio(folio, pos);
                if (bytes > folio_size(folio) - offset)
                        bytes = folio_size(folio) - offset;

and yes, the two completely different ways of picking 'len' (called
'bytes' in the second case) had *better* match.

I do think they match, but code shouldn't be organized this way.

It's not just the above kind of odd thing either, it's things like
iomap_get_folio() using that fgf_set_order(len), which does

        unsigned int shift = ilog2(size);

        if (shift <= PAGE_SHIFT)
                return 0;

so now it has done that potentially expensive ilog2() for the common
case of "len < PAGE_SIZE", but dammit, it should never have even
bothered looking at 'len' if the inode didn't support large folios in
the first place, and we shouldn't have had that special odd 'len =
min_t(..)" magic rule to force an order-0 thing, because

Yeah, yeah, modern CPU's all have reasonably cheap bit finding
instructions. But the code simply shouldn't have this kind of thing in
the first place.

The folio should have been allocated *before* iomap_write_begin(), the
"no large folios" should just have fixed the order to zero there, and
the actual real-life length of the write should have been limited in
*one* piece of code after the allocation point instead of then having
two different pieces of code depending on matching (subtle and
undocumented) logic.

Put another way: I most certainly don't see the bug here - it may look
_odd_, but not wrong - but at the same time, looking at that code
doesn't make me get the warm and fuzzies about the iomap large-folio
situation either.

                Linus

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ