linux-ext4 - Re: memory leak: data=journal and {collapse,insert,zero}

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20151119094240.GB18581@quack.suse.cz>
Date:	Thu, 19 Nov 2015 10:42:40 +0100
From:	Jan Kara <jack@...e.cz>
To:	Namjae Jeon <namjae.jeon@...sung.com>
Cc:	'Jan Kara' <jack@...e.cz>, 'Theodore Ts'o' <tytso@....edu>,
	linux-ext4@...r.kernel.org
Subject: Re: memory leak: data=journal and {collapse,insert,zero}_range

On Wed 18-11-15 22:36:51, Jan Kara wrote:
> On Tue 17-11-15 13:47:50, Namjae Jeon wrote:
> > > On Mon 09-11-15 14:21:11, Namjae Jeon wrote:
> > > > >
> > > > > On Wed, Oct 21, 2015 at 06:44:10PM +0900, Namjae Jeon wrote:
> > > > > > > Interestingly we're not seeing these memory leaks on the truncate
> > > > > > > path, so I suspect the issue is in how collapse range is clearing
> > > > > > > pages from the page cache, especially pages that were freshly written
> > > > > > > to the journal by the commit but which hadn't yet been writtten to
> > > > > > > disk and then marked as complete so we can allow the relevant
> > > > > > > transaction to be checkpointed.  (Although we're not leaking the
> > > > > > > journal head structures, but only the buffer heads, so the story most
> > > > > > > be a bit more complicated than that.)
> > > > > >
> > > > > > Okay, Thanks for sharing your view and points !!
> > > > > >
> > > > > > Currently I can reproduce memory leak issue without collase/insert/zero range.
> > > > > > conditions like the following.(collase/insert/zero range are disable with -I -C -z option and
> > > add -y
> > > > > option instead of -W)
> > > > > >   1. small size parition(1GB)
> > > > > >   2. run fsx with these options "./fsx -N 30000 -o 128000 -l 500000 -r 4096 -t 512 -w 512 -Z -R
> > > -y -
> > > > > I -C -z testfile"
> > > > > > And same result with generic/091 is showing (buffer_head leak)
> > > > > >
> > > > > > So I am starting to find root-cause base on your points.
> > > > > > I will share the result or the patch.
> > > > >
> > > > > Thanks, that's very interesting data point.  So this makes it appear
> > > > > that the problem *is* probably with how we deal with checkpointing
> > > > > buffers after the pages get discarded using either a truncate or a
> > > > > collapse_range, since the 'y' option causes a lot fsync's, and hence
> > > > > commits, some of which are happening after a truncate command.
> > > > >
> > > > > Thanks for a taking a look at this.  I really appreciate it.
> > > > >
> > > > > Cheers,
> > > >
> > > >
> > > > Hi Ted,
> > > >
> > > > Could you review this patch?
> > > >
> > > > Thanks!
> > > > ---------------------------------------------------------------------------------
> > > > Subject: [PATCH] jbd2: try to free buffers from truncated page after
> > > >  checkpoint
> > > >
> > > > when ext4 is mounted in data=journal mode, and truncate operation
> > > > such as settatr(size), collopse, insert and zero range are used, there are
> > > > are many truncated pages with NULL page->mapping. Such truncated pages
> > > > pile up quickly due to truncate_pagecache on data pages associated with journal.
> > > > As page->mapping is NULL for such truncated pages, they are not freed
> > > > by drop cache(3) or umount. As a result, MemFree in /proc/meminfo decreases
> > > > quickly and active buffer_head slab objects grow in /proc/slabinfo.
> > > > This patch attempts to free buffers from such pages at the end of jbd2
> > > > checkpoint, if pages do not have any busy buffers and NULL mapping.
> > > 
> > > Hum, why such pages didn't get freed by release_buffer_page() call
> > > happening when processing transaction's forget list? Because the idea is
> > > that such pages should be discarded at that point...
> > Hi Jan,
> > 
> > When I checked this code, release_buffer_page is not called
> > when buffer_jbddirty is true. Such buffers with JBD_Dirty are added
> > to new checkpoint.
> > 
> >                 if (buffer_jbddirty(bh)) {
> >                        JBUFFER_TRACE(jh, "add to new checkpointing trans");
> >                        __jbd2_journal_insert_checkpoint(jh, commit_transaction);
> >                        if (is_journal_aborted(journal))
> >                                clear_buffer_jbddirty(bh);
> >                 } else {
> >                        J_ASSERT_BH(bh, !buffer_dirty(bh));
> >                        /*
> >                         * The buffer on BJ_Forget list and not jbddirty means
> >                         * it has been freed by this transaction and hence it
> >                         * could not have been reallocated until this
> >                         * transaction has committed. *BUT* it could be
> >                         * reallocated once we have written all the data to
> >                         * disk and before we process the buffer on BJ_Forget
> >                         * list.
> >                         */
> >                        if (!jh->b_next_transaction)
> >                                try_to_free = 1;
> >                 }
> >                 JBUFFER_TRACE(jh, "refile or unfile buffer");
> >                 __jbd2_journal_refile_buffer(jh);
> >                 jbd_unlock_bh_state(bh);
> > 
> > Next buffer was unfiled by __jbd2_journal_refile_buffer, JBD_Dirty cleared and
> > BH_Dirty was set in same function. Later it must have been written back as
> > BH_Dirty was cleared.
> > And ext4_wait_for_tail_page_commit-> __ext4_journalled_invalidatepage ->
> > journal_unmap_buffer zaps the buffer:
> > 
> >                 if (!buffer_dirty(bh)) {
> >                         /* bdflush has written it.  We can drop it now */
> >                         goto zap_buffer;
> >                 }
> > 
> > next, truncate_pagecache is called, which clears the page mapping.
> > Eventually, remove checkpoint is called, but such page with NULL mapping was
> > not freed. So, I had added release_buffer_page at the end of remove checkpoint
> > to attempt to free such free buffer pages. Please let me know your opinion.
> 
> OK, thanks for the detailed analysis. But when the buffer gets truncated,
> jbd2_journal_invalidatepage() either removes the buffer from the
> transaction (obviously didn't happen here) or it sets buffer_freed and
> buffer_jbddirty should get cleared when processing the BJ_Forget list. So
> why that didn't happen? Can you have a look into what
> jbd2_journal_invalidatepage() did to buffer in that page?
> 
> Generally truncated buffers should not enter checkpoint list since writing
> them is just a waste of disk bandwidth...

I thought a bit more about this and my guess is that
jbd2_journal_invalidatepage() fails to invalidate the partial page and
returns EBUSY. However we should see warnings about that from
ext4_journalled_invalidatepage(). Can you see them? This would perfectly
match Ted's observation that the problem happens only for fallocate
operations but not for truncate which does the dance with
ext4_wait_for_tail_page_commit() but fallocate operations don't do it...

Now if I'm right and this is indeed the path we are hitting, we need to put
more thought into how we handle truncation of partial pages in data=journal
mode.

BTW: The ext4_force_commit() call in fallocate operations is definitely
racy as pages can get dirtied and journalled in the running transaction
anytime while we wait for transaction commit (which I suspect is happening
in your testcase)... So that needs some more thought as well.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html