linux-kernel - Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.

Message-ID: <20150917164054.GA28032@kmo-pixel>
Date:	Thu, 17 Sep 2015 08:40:54 -0800
From:	Kent Overstreet <kent.overstreet@...il.com>
To:	Denis Bychkov <manover@...il.com>
Cc:	Vojtech Pavlik <vojtech@...e.com>, linux-kernel@...r.kernel.org,
	linux-bcache@...r.kernel.org,
	Emmanuel Florac <eflorac@...ellique.com>,
	Jiri Kosina <jkosina@...e.com>, Jens Axboe <axboe@...nel.dk>
Subject: Re: [PATCH] bcache: Fix writeback_thread never writing back
 incomplete stripes.

On Thu, Sep 17, 2015 at 11:30:17AM -0400, Denis Bychkov wrote:
> Well, it turns out my celebration was a bit premature.
> 
> PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech
> posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION.
> 
> The interesting thing is that it somehow damaged the partition that
> was not supposed to receive any writes (the file system was mounted
> read-only), so my guess is that the patch causes the blocks residing
> in the write-back cache to flush to the wrong blocks on the backing
> device.
> Everything was going great until I rebooted and saw this in the log:
> 
> [   19.639082] attempt to access beyond end of device
> [   19.643984] md1p2: rw=1, want=75497520, limit=62914560
> [   19.659033] attempt to access beyond end of device
> [   19.663929] md1p2: rw=1, want=75497624, limit=62914560
> [   19.669447] attempt to access beyond end of device
> [   19.674338] md1p2: rw=1, want=75497752, limit=62914560
> [   19.679195] attempt to access beyond end of device
> [   19.679199] md1p2: rw=1, want=75498080, limit=62914560
> [   19.689007] attempt to access beyond end of device
> [   19.689011] md1p2: rw=1, want=75563376, limit=62914560
> [   19.699055] attempt to access beyond end of device
> [   19.699059] md1p2: rw=1, want=79691816, limit=62914560
> [   19.719246] attempt to access beyond end of device
> [   19.724144] md1p2: rw=1, want=79691928, limit=62914560
> ......
> (it's a small example, the list was much longer)
> And the next thing I found out the super block on my 10-Tb XFS RAID was gone. :)
> Oh well, it's a good thing I have backups.
> I knew what I was doing when trying the untested patches. I should
> have made the RAID md partition read-only, not the file system. I kind
> of expected that something could have gone wrong with the file system
> I was testing, just did not expect it would fire nukes at the innocent
> bystanders.

Aw, shit. That's just _bizarre_.

I have a theory - it appears that last_scanned isn't getting initialized before
it's used, so it's going to be all 0s the very first time... which it appears
could cause it to slurp up keys from the wrong device (and if that device was
bigger than the correct device, that could explain the accesses beyond the end
of the device).

Currently just a theory though, and I have no clue why it would only be exposed
with my patch.
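
To make the theory concrete, here is a minimal toy model of it. This is not
bcache code: struct toy_key, toy_key_cmp() and toy_scan() are hypothetical
names, and the assumption that keys sort by device first and sector second is
part of the sketch. It only shows how a scan whose start position is all 0s,
instead of the start of the target device's own key range, could pick up keys
belonging to a lower-numbered, larger device and so produce writes past the end
of the smaller one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified key: the device a dirty extent belongs to
 * and the sector it should be written to. Not bcache's struct bkey. */
struct toy_key {
	uint32_t dev;
	uint64_t sector;
};

/* Order keys by device first, then by sector (an assumption of this sketch). */
static int toy_key_cmp(const struct toy_key *a, const struct toy_key *b)
{
	if (a->dev != b->dev)
		return a->dev < b->dev ? -1 : 1;
	if (a->sector != b->sector)
		return a->sector < b->sector ? -1 : 1;
	return 0;
}

/*
 * Count the keys in [start, end] that a writeback pass for a device of
 * `limit` sectors would pick up. *beyond_end receives how many of those
 * keys point at or past the end of that device.
 */
static size_t toy_scan(const struct toy_key *keys, size_t n,
		       struct toy_key start, struct toy_key end,
		       uint64_t limit, size_t *beyond_end)
{
	size_t picked = 0;

	assert(toy_key_cmp(&start, &end) <= 0);
	*beyond_end = 0;

	for (size_t i = 0; i < n; i++) {
		/* Skip keys outside the scanned range. */
		if (toy_key_cmp(&keys[i], &start) < 0 ||
		    toy_key_cmp(&keys[i], &end) > 0)
			continue;

		picked++;
		if (keys[i].sector >= limit)
			(*beyond_end)++;
	}
	return picked;
}
```

In this model, scanning for device 1 (62914560 sectors) from a zeroed start
{0, 0} up to {1, UINT64_MAX} also collects every key of device 0, including
any at sectors such as 75497520 that lie beyond device 1's limit. Starting
from {1, 0} collects only device 1's own keys.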