Message-ID: <20130617122518.GA24403@thunk.org>
Date:	Mon, 17 Jun 2013 08:25:18 -0400
From:	Theodore Ts'o <tytso@....edu>
To:	Lukáš Czerner <lczerner@...hat.com>
Cc:	linux-ext4@...r.kernel.org
Subject: Re: [PATCH v4 15/20] ext4: use ext4_zero_partial_blocks in punch_hole

On Mon, Jun 17, 2013 at 11:08:32AM +0200, Lukáš Czerner wrote:
> > Correction...  reverting patches #15 through #19 (which is what I did in
> > the dev-with-revert branch found on ext4.git) causes the problem to go
> > away in the nojournal case, but it causes a huge number of other
> > problems.  Some of the reverts weren't clean, so it's possible I
> > screwed up one of the reverts.  It's also possible that only applying
> > part of this series leaves the tree in an unstable state.
> > 
> > I'd much rather figure out how to fix the problem on the dev branch,
> > so thank you for looking into this!
> 
> Wow, this looks bad. Theoretically, reverting patches #15 through
> #19 should not have any real impact. So far I do not see what is
> causing that, but I am looking into this.

I've been looking into this more intensively over the weekend.  I'm
now beginning to think we have a pre-existing race, and the changes
in question have simply changed the timing.  I tried a version of the
dev branch (you can find it as the dev2 branch in my ext4.git tree on
kernel.org) which only had patches 1 through 10 of the invalidate
page range series (dropping patches 11 through 20), and I found that
generic/300 was failing in the ext3 configuration (a file system with
nodelalloc, no flex_bg, and no extents).  I also found the same
failure with 3.10-rc2 in the same configuration.

Your changes seem to make generic/300 fail consistently for me in the
nojournal configuration, but looking at the patches in question, I
don't think they could have directly caused the problem.  Instead, I
think they just changed the timing enough to unmask it.

Given that I've seen generic/300 failures on several different
baselines going all the way back to 3.9-rc4, this isn't a recent
regression.  And given that it does seem to be timing-sensitive,
bisecting it is going to be difficult.  On the other hand, on the dev
(or master) branch generic/300 fails with greater than 70%
probability using kvm with 2 CPUs, 2 megs of RAM, and 5400 rpm laptop
drives in nojournal mode, so the fact that it reproduces relatively
reliably will hopefully make it easier to find the problem.
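
(In case anyone else wants to try to reproduce this, the invocation
below is only a sketch of how it could be driven with the
kvm-xfstests wrapper from xfstests-bld, not necessarily the exact
command line used here; the -C repeat count is just to catch a
failure that only hits roughly 70% of the time.)

    # repeat generic/300 a few times in the nojournal and ext3 configs
    kvm-xfstests -c nojournal -C 10 generic/300
    kvm-xfstests -c ext3 -C 10 generic/300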

> I see that there are problems in other modes, not just nojournal. Are
> those caused by this as well, or are you seeing those even without
> the patchset?

I think the other problems in my dev-with-revert branch were caused
by some screw-up on my part when I did the revert using git.  I found
that dropping the patches from a copy of the guilt patch stack, and
then applying all of the patches except for the last half of the
invalidate page range series, resulted in a clean branch that didn't
have any of these failures.  That's what I should have done late last
week, instead of trying to use "git revert".
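
(For reference, the guilt-based approach is roughly the following;
this is only a sketch, and the series-file path assumes guilt's
default per-branch layout rather than anything specific to my setup.)

    # pop all applied patches back to the base of the queue
    guilt pop -a

    # drop the second half of the invalidate page range series by
    # removing its entries from the series file for the dev branch
    $EDITOR .git/patches/dev/series

    # re-apply everything that is left
    guilt push -a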

Cheers,

					- Ted