Date:	Sat, 27 Oct 2012 19:47:00 +0100
From:	Nix <nix@...eri.org.uk>
To:	"Theodore Ts'o" <tytso@....edu>
Cc:	Eric Sandeen <sandeen@...hat.com>, linux-ext4@...r.kernel.org,
	linux-kernel@...r.kernel.org, gregkh@...uxfoundation.org
Subject: Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

On 27 Oct 2012, Theodore Ts'o said:

> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
>> Ah! it's turned on by journal_async_commit. OK, that alone argues
>> against use of journal_async_commit, tested or not, and I'd not have
>> turned it on if I'd noticed that.
>> 
>> (So, the combinations I'll be trying for effect on this bug are:
>> 
>>  journal_async_commit (as now)
>>  journal_checksum
>>  none
>
> Can you also check and see whether the presence or absence of
> "nobarrier" makes a difference?

Done. (I also tested the patches you posted earlier this week: no effect,
I'm afraid, at least not under the fails-even-on-3.6.1 test I was
running, in which umount -l'ing /var is the very last thing I do
before /sbin/reboot -f.)

nobarrier makes a difference that I, at least, did not expect:

[no options]                    No corruption

nobarrier                       No corruption

          journal_checksum      Corruption
                                Corrupted transaction, journal aborted
                                
nobarrier,journal_checksum      Corruption
                                Corrupted transaction, journal aborted

          journal_async_commit  Corruption
                                Corrupted transaction, journal aborted

nobarrier,journal_async_commit  Corruption
                                No corrupted transaction or aborted journal

I didn't expect the last case at all, and it adequately explains why you
are mostly seeing corrupted-journal messages in your tests but I was
not. It also explains why, when I first saw this, I was able to mount
the resulting corrupted filesystem read-write and corrupt it further
before I noticed that anything was wrong.
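
For anyone wanting to repeat this, here is a minimal sketch of the test
matrix above in Python. It only enumerates the six mount-option
combinations and records the results transcribed from the table; it does
not mount or reboot anything. The names (BARRIER_OPTS, JOURNAL_OPTS,
mount_option_matrix, silent_corruption) are made up for illustration.

```python
from itertools import product

# Option sets taken from the table above; the names are illustrative only.
BARRIER_OPTS = ["", "nobarrier"]
JOURNAL_OPTS = ["", "journal_checksum", "journal_async_commit"]


def mount_option_matrix():
    """Return every combination as a comma-separated -o string ('' = defaults)."""
    combos = []
    for journal, barrier in product(JOURNAL_OPTS, BARRIER_OPTS):
        combos.append(",".join(opt for opt in (barrier, journal) if opt))
    return combos


# Results transcribed from the table: (filesystem corrupted, journal aborted).
OBSERVED = {
    "": (False, False),
    "nobarrier": (False, False),
    "journal_checksum": (True, True),
    "nobarrier,journal_checksum": (True, True),
    "journal_async_commit": (True, True),
    "nobarrier,journal_async_commit": (True, False),
}


def silent_corruption(observed):
    """Combinations that corrupt the filesystem without aborting the journal."""
    return [opts for opts, (corrupt, aborted) in observed.items()
            if corrupt and not aborted]


if __name__ == "__main__":
    for opts in mount_option_matrix():
        corrupt, aborted = OBSERVED[opts]
        label = opts or "[no options]"
        print(f"{label:32} corruption={corrupt} journal_aborted={aborted}")
    print("silent corruption:", silent_corruption(OBSERVED))
```

The only combination that corrupts without any journal complaint is
nobarrier,journal_async_commit, which is the case that let me remount
read-write unawares.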

It is also clear that journal_checksum and all that relies on it is
worse than useless right now, as Eric reported while I was testing this.
It should probably be marked CONFIG_BROKEN in future 3.[346].* stable
kernels, if CONFIG_BROKEN existed anymore, which it doesn't.

It's a shame journal_async_commit depends on a broken feature: it might
be notionally unsafe, but on some of my systems (without nobarrier or
flashy caching controllers) it was associated with a noticeable speedup
of metadata-heavy workloads, though that was way back in 2009. However,
"safety first" definitely applies in this case.

-- 
NULL && (void)