lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Sat, 18 Aug 2012 12:09:27 -0700 (PDT)
From:	Dan Magenheimer <dan.magenheimer@...cle.com>
To:	Seth Jennings <sjenning@...ux.vnet.ibm.com>
Cc:	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Nitin Gupta <ngupta@...are.org>,
	Minchan Kim <minchan@...nel.org>,
	Konrad Wilk <konrad.wilk@...cle.com>,
	Robert Jennings <rcj@...ux.vnet.ibm.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, devel@...verdev.osuosl.org
Subject: RE: [PATCH 0/4] promote zcache from staging

> From: Seth Jennings [mailto:sjenning@...ux.vnet.ibm.com]
> Sent: Friday, August 17, 2012 5:33 PM
> To: Dan Magenheimer
> Cc: Greg Kroah-Hartman; Andrew Morton; Nitin Gupta; Minchan Kim; Konrad Wilk; Robert Jennings; linux-
> mm@...ck.org; linux-kernel@...r.kernel.org; devel@...verdev.osuosl.org; Kurt Hackel
> Subject: Re: [PATCH 0/4] promote zcache from staging
> 
> >
> > Sorry to beat a dead horse, but I meant to report this
> > earlier in the week and got tied up by other things.
> >
> > I finally got my test scaffold set up earlier this week
> > to try to reproduce my "bad" numbers with the RHEL6-ish
> > config file.
> >
> > I found that with "make -j28" and "make -j32" I experienced
> > __DATA CORRUPTION__.  This was repeatable.
> 
> I actually hit this for the first time a few hours ago when
> I was running performance for your rewrite.  I didn't know
> what to make of it yet.  The 24-thread kernel build failed
> when both frontswap and cleancache were enabled.
> 
> > The type of error led me to believe that the problem was
> > due to concurrency of cleancache reclaim.  I did not try
> > with cleancache disabled to prove/support this theory
> > but it is consistent with the fact that you (Seth) have not
> > seen a similar problem and has disabled cleancache.
> >
> > While this problem is most likely in my code and I am
> > suitably chagrined, it re-emphasizes the fact that
> > the current zcache in staging is 20-month old "demo"
> > code.  The proposed new zcache codebase handles concurrency
> > much more effectively.
> 
> I imagine this can be solved without rewriting the entire
> codebase.  If your new code contains a fix for this, can we
> just pull it as a single patch?

Hi Seth --

I didn't even observe this before this week, let alone fix this
as an individual bug.  The redesign takes into account LRU ordering
and zombie pageframes (which have valid pointers to the contained
zbuds and possibly valid data, so can't be recycled yet),
taking races and concurrency carefully into account.

The demo codebase is pretty dumb about concurrency, really
a hack that seemed to work.  Given the above, I guess the
hack only works _most_ of the time... when it doesn't
data corruption can occur.

It would be an interesting challenge, but likely very
time-consuming, to fix this one bug while minimizing other
changes so that the fix could be delivered as a self-contained
incremental patch.  I suspect if you try, you will learn why
the rewrite was preferable and necessary.

(Away from email for a few days very soon now.)
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ