Message-ID: <alpine.LFD.0.9999.0712200753040.21557@woody.linux-foundation.org>
Date: Thu, 20 Dec 2007 08:25:56 -0800 (PST)
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Björn Steinbrink <B.Steinbrink@....de>
cc: Krzysztof Oledzki <olel@....pl>,
Andrew Morton <akpm@...ux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Nick Piggin <nickpiggin@...oo.com.au>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Osterried <osterried@...se.de>, protasnb@...il.com,
bugme-daemon@...zilla.kernel.org
Subject: Re: [Bug 9182] Critical memory leak (dirty pages)
On Thu, 20 Dec 2007, Björn Steinbrink wrote:
>
> OK, so I looked for PG_dirty anyway.
>
> In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> bail out if the page is dirty.
>
> Then in 3e67c0987d7567ad666641164a153dca9a43b11d, Andrew fixed
> truncate_complete_page, because it called cancel_dirty_page (and thus
> cleared PG_dirty) after try_to_free_buffers was called via
> do_invalidatepage.
>
> Now, if I'm not mistaken, we can end up as follows.
>
> truncate_complete_page()
>   cancel_dirty_page()                    // PG_dirty cleared, decr. dirty pages
>   do_invalidatepage()
>     ext3_invalidatepage()
>       journal_invalidatepage()
>         journal_unmap_buffer()
>           __dispose_buffer()
>             __journal_unfile_buffer()
>               __journal_temp_unlink_buffer()
>                 mark_buffer_dirty();     // PG_dirty set, incr. dirty pages
Good, this seems to be the exact path that actually triggers it. I got to
journal_unmap_buffer(), but was too lazy to bother following it all the
way down - I decided that I didn't really care what the low-level FS layer
did, since I had already convinced myself that it obviously must be
dirtying the page somehow, and that matched the symptoms exactly (ie only
the journaling case was impacted, and this was all about the journal).
But perhaps more importantly: regardless of what the low-level filesystem
did at that point, the VM accounting shouldn't care, and should be robust
in the face of a low-level filesystem doing strange and wonderful things.
But thanks for bothering to go through the whole history and figure out
what exactly is up.
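Just to spell out what that ordering means in the truncate path (this is a
paraphrase written from memory, not the literal mm/truncate.c contents, so
don't trust the details):

	truncate_complete_page(mapping, page)
	{
		...
		/* clears PG_dirty, decrements the dirty accounting */
		cancel_dirty_page(page, PAGE_CACHE_SIZE);

		/*
		 * ext3 ends up in mark_buffer_dirty(), which sets PG_dirty
		 * again and re-increments the dirty accounting
		 */
		if (PagePrivate(page))
			do_invalidatepage(page, 0);

		/*
		 * the page goes away still marked dirty, and nothing ever
		 * decrements the accounting for it again - so the dirty
		 * counters only ever go up
		 */
		remove_from_page_cache(page);
		...
	}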
> As try_to_free_buffers got its ext3 hack back in
> ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
> 3e67c0987d7567ad666641164a153dca9a43b11d should be reverted? (Except for
> the accounting fix in cancel_dirty_page, of course).
Yes, I think we have room for cleanups now, and I agree: we ended up
reinstating some questionable code in the VM just because we didn't really
know or understand what was going on in the ext3 journal code.
Of course, it may well be that there is something *else* going on too, but
I do believe that this whole case is what it was all about, and the hacks
end up just (a) making the VM harder to understand (because they cause
non-obvious VM code to work around some very specific filesystem
behaviour) and (b) hiding the _real_ issue - which I think we've now
established, and which the hacks clearly weren't enough to really hide
100% anyway.
However, there's no way I'll play with that right now (I'm planning on an
-rc6 today), but it might be worth it to make a test-cleanup patch for -mm
which does some VM cleanups:
 - don't touch dirty pages in fs/buffer.c (ie undo the meat of commit
   ecdfc9787fe527491baefc22dce8b2dbd5b2908d, but not resurrecting the
   debugging code)

 - remove the calling of "cancel_dirty_page()" entirely from
   "truncate_complete_page()", and let "remove_from_page_cache()" just
   always handle it (and probably just add a "ClearPageDirty()" to match
   the "ClearPageUptodate()") - rough sketch below

 - remove "cancel_dirty_page()" from "truncate_huge_page()", which seems
   to be the exact same issue (ie we should just use the logic in
   remove_from_page_cache()).
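Something like this is what I'd expect the "truncate_complete_page()" side
to end up looking like - totally untested sketch written from memory, so
the exact details may not match the tree:

	static void
	truncate_complete_page(struct address_space *mapping, struct page *page)
	{
		if (page->mapping != mapping)
			return;

		if (PagePrivate(page))
			do_invalidatepage(page, 0);

		/* remove_from_page_cache() now does the dirty accounting */
		remove_from_page_cache(page);
		ClearPageDirty(page);		/* matches the ClearPageUptodate() */
		ClearPageUptodate(page);
		ClearPageMappedToDisk(page);
		page_cache_release(page);	/* pagecache ref */
	}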
at that point "cancel_dirty_page()" literally is only used for what its
name implies, and the only in-tree use of it seems to be NFS for when
the filesystem gets called for ->invalidatepage - which makes tons of
conceptual sense, but I suspect we could drop it from there too, since the
VM layer _will_ cancel the dirtiness at a VM level when it then later
removes it from the page cache.
So we essentially seem to be able to simplify things a bit by getting rid
of a hack in try_to_free_buffers(), and potentially getting rid of
cancel_dirty_page() entirely.
It would imply that we need to do the task_io_account_cancelled_write()
inside "remove_from_page_cache()", but that should be ok (I don't see any
locking issues there).
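Something like this, in other words (again totally untested, and the
locking and helper names are from memory, so treat it as a sketch of the
idea rather than as a patch):

	void remove_from_page_cache(struct page *page)
	{
		struct address_space *mapping = page->mapping;

		BUG_ON(!PageLocked(page));

		write_lock_irq(&mapping->tree_lock);
		__remove_from_page_cache(page);
		write_unlock_irq(&mapping->tree_lock);

		/*
		 * The page is gone from the page cache, so any remaining
		 * dirty state is purely a VM-level thing: drop it and
		 * account the cancelled write here, instead of having the
		 * truncate paths call cancel_dirty_page() themselves.
		 */
		if (TestClearPageDirty(page) &&
		    mapping_cap_account_dirty(mapping)) {
			dec_zone_page_state(page, NR_FILE_DIRTY);
			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
		}
	}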
> On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
> io accounting for cancelled writes always happened if the page was
> dirty, regardless of page->mapping. This was also already true for the
> old test_clear_page_dirty code, and the commit log for
> 8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
> change either, so maybe the "if (account_size)" block should be moved
> out of the "if (mapping && ...)" block?
I think the "if (account_size)" thing was *purely* for me being worried
about hugetlb entries, and I think that's the only thing that passes in a
zero account size.
But hugetlbfs already has BDI_CAP_NO_ACCT_DIRTY set (exactly because we
cannot account for those pages *anyway*), so rather than move the
account_size check outside of the test, I think we could go further: we
could probably remove that test entirely and drop the whole thing.
The thing is, task_io_account_cancelled_write() doesn't make sense on
mappings that don't do dirty accounting, since those mappings are all
special anyway: they don't actually do any real IO, they are all in-ram
things. So I think it should stay inside the
	if (mapping && mapping_cap_account_dirty(mapping))
		..
test, simply because I don't think it makes any conceptual sense outside
of it.
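IOW, roughly this shape (paraphrasing "cancel_dirty_page()" from memory,
with details elided, just to show where the two tests sit):

	void cancel_dirty_page(struct page *page, unsigned int account_size)
	{
		if (TestClearPageDirty(page)) {
			struct address_space *mapping = page->mapping;

			if (mapping && mapping_cap_account_dirty(mapping)) {
				dec_zone_page_state(page, NR_FILE_DIRTY);
				/*
				 * hugetlb is the only caller that passes in
				 * a zero account_size, and hugetlbfs has
				 * BDI_CAP_NO_ACCT_DIRTY set so it never gets
				 * here anyway - which is why the inner test
				 * could probably just go away entirely.
				 */
				if (account_size)
					task_io_account_cancelled_write(account_size);
			}
		}
	}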
Hmm?
But none of this seems really critical - just simple cleanups.
Linus