linux-kernel - Raid5 Reshape Data Corruption Bug

Message-ID: <f79f67ff-abb0-123d-064b-a826ec2c6666@gunth.ca>
Date:   Fri, 10 Jun 2022 16:05:48 -0600
From:   Logan Gunthorpe <logan@...th.ca>
To:     LKML <linux-kernel@...r.kernel.org>,
        linux-raid <linux-raid@...r.kernel.org>,
        Song Liu <song@...nel.org>
Cc:     Donald Buczek <buczek@...gen.mpg.de>,
        Guoqing Jiang <guoqing.jiang@...ux.dev>,
        Xiao Ni <xni@...hat.com>, Stephen Bates <sbates@...thlin.com>,
        Martin Oliveira <Martin.Oliveira@...eticom.com>,
        David Sloan <David.Sloan@...eticom.com>
Subject: Raid5 Reshape Data Corruption Bug

Hey,

I've diagnosed a bug in the reshape code that corrupts data. However, I
don't have a good solution to the problem, and a solution may be quite
complicated. I suspect this is the cause of the random failures I see with
01r5integ and 01raid6integ. (Though I can't say for certain, as I have a
quicker reproduction method.)

The bug is that, during reshape, the reads for EXPAND_SOURCE stripes do
not necessarily come back from the disk in order. If the read for a later
stripe comes back before an earlier EXPAND_SOURCE stripe has read the
disk, then an EXPAND_READY stripe might write a block before the source
stripe was able to read it. The write overwrites data before that data
has been moved, which results in corrupt data on the disk. This happens
reasonably frequently.
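
To make the ordering problem concrete, here is a toy Python model of the
race. It is only a sketch and not kernel code: it assumes a reshape from
3 to 4 data disks, ignores parity, treats each chunk as a single block,
and writes a destination stripe as soon as all of its blocks have been
read. The name simulate_reshape and the constants are made up for
illustration.

```python
N_OLD, N_NEW = 3, 4   # data disks before and after the reshape (parity ignored)
N_BLOCKS = 24         # logical blocks; a multiple of both N_OLD and N_NEW


def simulate_reshape(read_completion_order):
    """Return the logical block contents after a toy reshape.

    read_completion_order lists each old-layout (source) stripe number
    exactly once, in the order in which its read finishes.
    """
    n_src = N_BLOCKS // N_OLD
    n_dst = N_BLOCKS // N_NEW
    # disk[d][s] holds the logical block number stored on disk d, stripe s.
    disk = [[None] * n_src for _ in range(N_NEW)]
    for blk in range(N_BLOCKS):
        disk[blk % N_OLD][blk // N_OLD] = blk

    dest = [dict() for _ in range(n_dst)]  # destination stripe buffers
    for src in read_completion_order:
        # The source read completes now and sees whatever is on disk now.
        for d in range(N_OLD):
            blk = src * N_OLD + d
            dest[blk // N_NEW][blk % N_NEW] = disk[d][src]
        # Any destination stripe that just became full is written at once.
        for s, buf in enumerate(dest):
            if buf is not None and len(buf) == N_NEW:
                for d in range(N_NEW):
                    disk[d][s] = buf[d]
                dest[s] = None  # mark as written
    # Read everything back through the new layout.
    return [disk[blk % N_NEW][blk // N_NEW] for blk in range(N_BLOCKS)]


in_order = simulate_reshape([0, 1, 2, 3, 4, 5, 6, 7])
out_of_order = simulate_reshape([0, 1, 2, 4, 5, 3, 6, 7])
print(in_order == list(range(N_BLOCKS)))      # True: data intact
print(out_of_order == list(range(N_BLOCKS)))  # False: blocks 9..11 are corrupt
```

In the second run, the reads for source stripes 4 and 5 finish before the
read for source stripe 3. Destination stripe 3 becomes ready and is
written over the physical location of source stripe 3, so the late read
of source stripe 3 picks up new data instead of the old blocks 9 to 11.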

I suspect this is made worse by modern SSDs; spinning disks would
be less likely to exhibit this problem, as they would naturally tend to
order the reads by sector.

So somehow there needs to be a way to prevent an EXPAND_READY stripe
from writing the data for a specific device block before the
corresponding EXPAND_SOURCE block has read the data. And I don't see a
trivial way to get that done.
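
As an illustration of that constraint only, the following toy Python
model defers the write of a ready destination stripe until the source
stripe occupying the same physical location has been read. It is a sketch
under simplifying assumptions (3 to 4 data disks, no parity, one block
per chunk) and says nothing about how such a dependency could be tracked
in the real raid5 code. The name simulate_guarded_reshape and the
read_done set are hypothetical.

```python
N_OLD, N_NEW = 3, 4   # data disks before and after the reshape (parity ignored)
N_BLOCKS = 24         # logical blocks; a multiple of both N_OLD and N_NEW


def simulate_guarded_reshape(read_completion_order):
    """Toy reshape in which a ready stripe waits for the overlapping source.

    read_completion_order lists each old-layout (source) stripe number
    exactly once, in the order in which its read finishes.
    """
    n_src = N_BLOCKS // N_OLD
    n_dst = N_BLOCKS // N_NEW
    # disk[d][s] holds the logical block number stored on disk d, stripe s.
    disk = [[None] * n_src for _ in range(N_NEW)]
    for blk in range(N_BLOCKS):
        disk[blk % N_OLD][blk // N_OLD] = blk

    dest = [dict() for _ in range(n_dst)]  # destination stripe buffers
    read_done = set()                      # source stripes already read
    for src in read_completion_order:
        # The source read completes now and sees whatever is on disk now.
        for d in range(N_OLD):
            blk = src * N_OLD + d
            dest[blk // N_NEW][blk % N_NEW] = disk[d][src]
        read_done.add(src)
        for s, buf in enumerate(dest):
            if buf is None or len(buf) < N_NEW:
                continue  # already written, or not ready yet
            # Guard: in this model, destination stripe s lands on the
            # physical location of source stripe s, so wait for that read.
            if s not in read_done:
                continue
            for d in range(N_NEW):
                disk[d][s] = buf[d]
            dest[s] = None  # mark as written
    # Read everything back through the new layout.
    return [disk[blk % N_NEW][blk // N_NEW] for blk in range(N_BLOCKS)]


# Source stripes 4 and 5 complete before source stripe 3.
result = simulate_guarded_reshape([0, 1, 2, 4, 5, 3, 6, 7])
print(result == list(range(N_BLOCKS)))  # True: data intact
```

The guard is this simple only because the toy layout maps each
destination stripe onto exactly one source stripe; with parity, larger
chunks and the stripe cache involved, the real dependency is harder to
express.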

If anyone has any clever solutions, it would be good to hear them.
Otherwise, I don't think I'll have time to find a solution myself.

Logan