lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <200712201219.13873.nickpiggin@yahoo.com.au>
Date:	Thu, 20 Dec 2007 12:19:13 +1100
From:	Nick Piggin <nickpiggin@...oo.com.au>
To:	Jan Kara <jack@...e.cz>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Krzysztof Oledzki <olel@....pl>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Osterried <osterried@...se.de>, protasnb@...il.com,
	bugme-daemon@...zilla.kernel.org
Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

On Thursday 20 December 2007 12:05, Jan Kara wrote:
> > On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > > I'll confirm this tomorrow but it seems that even switching to
> > > data=ordered (AFAIK default o ext3) is indeed enough to cure this
> > > problem.
> >
> > Ok, do we actually have any ext3 expert following this? I have no idea
> > about what the journalling code does, but I have painful memories of ext3
> > doing really odd buffer-head-based IO and totally bypassing all the
> > normal page dirty logic.
> >
> > Judging by the symptoms (sorry for not following this well, it came up
> > while I was mostly away travelling), something probably *does* clear the
> > dirty bit on the pages, but the dirty *accounting* is not done properly,
> > so the kernel keeps thinking it has dirty pages.
> >
> > Now, a simple "grep" shows that ext3 does not actually do any
> > ClearPageDirty() or similar on its own, although maybe I missed some
> > other subtle way this can happen. And the *normal* VFS routines that do
> > ClearPageDirty should all be doing the proper accounting.
> >
> > So I see a couple of possible cases:
> >
> >  - actually clearing the PG_dirty bit somehow, without doing the
> >    accounting.
> >
> >    This looks very unlikely. PG_dirty is always cleared by some variant
> > of "*ClearPageDirty()", and that bit definition isn't used for anything
> > else in the whole kernel judging by "grep" (the page allocator tests the
> > bit, that's it).
> >
> >    And there aren't that many hits for ClearPageDirty, and they all seem
> >    to do the proper "dec_zone_page_state(page, NR_FILE_DIRTY);" etc if
> > the mapping has dirty state accounting.
> >
> >    The exceptions seem to be:
> >     - the page freeing path, but that path checks that "mapping" is NULL
> >       (so no accounting), and would complain loudly if it wasn't
> >     - the swap state stuff ("move_from_swap_cache()"), but that should
> >       only ever trigger for swap cache pages (we have a BUG_ON() in that
> >       path), and those don't do dirty accounting anyway.
> >     - pageout(), but again only for pages that have a NULL mapping.
> >
> >  - ext3 might be clearing (probably indirectly) the "page->mapping" thing
> >    or similar, which in turn will make the VFS think that even a dirty
> >    page isn't actually to be accounted for - so when the page *turned*
> >    dirty, it was accounted as a dirty page, but then, when it was
> > cleaned, the accounting wasn't reversed because ->mapping had become
> > NULL.
> >
> >    This would be some interaction with the truncation logic, and quite
> >    frankly, that should be all shared with the non-journal case, so I
> > find this all very unlikely.
> >
> > However, that second case is interesting, because the pageout case
> > actually has a comment like this:
> >
> > 	/*
> > 	 * Some data journaling orphaned pages can have
> > 	 * page->mapping == NULL while being dirty with clean buffers.
> > 	 */
> >
> > which really sounds like the case in question.
> >
> > I may know the VM, but that special case was added due to insane
> > journaling filesystems, and I don't know what insane things they do.
> > Which is why I'm wondering if there is any ext3 person who knows the
> > journaling code?
>
>   Yes, I'm looking into the problem... I think those orphan pages
> without mapping are created because we cannot drop truncated
> buffers/pages immediately.  There can be a committing transaction that
> still needs the data in those buffers and until it commits we have to
> keep the pages (and even maybe write them to disk etc.). But eventually,
> we should write the buffers, call try_to_free_buffers() which calls
> cancel_dirty_page() and everything should be happy... in theory ;)

If mapping is NULL, then try_to_free_buffers won't call cancel_dirty_page,
I think?

I don't know whether ext3 can be changed to not require/allow these dirty
pages, but I would rather Linus's dirty page accounting fix to go into that
path (the /* can this still happen? */ in try_to_free_buffers()), if possible.

Then you could also have a WARN_ON in __remove_from_page_cache().
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ