linux-ext4 - [RFC][PATCH] JBD: release checkpoint journal heads through try_to_release

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20081017.223716.147444348.00960188@stratos.soft.fujitsu.com>
Date:	Fri, 17 Oct 2008 22:37:16 +0900 (JST)
From:	Toshiyuki Okajima <toshi.okajima@...fujitsu.com>
To:	linux-ext4@...r.kernel.org, sct@...hat.com,
	akpm@...ux-foundation.org
Subject: [RFC][PATCH] JBD: release checkpoint journal heads through
 try_to_release_page when the memory is exhausted

Hi.

I found the situation where OOM-Killer happens easily.
I will inform you of it. 
I tried to fix this problem to make OOM-Killer not happen easily as much as 
possible.
As a result, I made a reference patch to fix it. 

Any comments are welcome.
(The comments for making much simpler or epoch-making approach are 
very welcome.)

------------------------------------------------------------------------------

If the following is satisfied, OOM-Killer happens easily.
(1) A quarter of a summation of each total log size of all filesystems which 
   use jbd exceeds the memory size of Normal Zone.
(2) We commit a huge number of data which include many metadata to each 
   filesystem and then we stop committing data to them. 
    For example, a process creates many files whose size are huge and 
   which have a huge number of indirect blocks. Then all processes stop I/O 
   to all filesystems which use jbd.
(3) After (2), we request to get a big size memory.
(NOTE: A oom-killer can happen easily on a system whose architecture is x86. 
Because a x86 system can have only a small Normal Zone of less than 1GB.)

The reason is that jbd does not positively release journal heads(jh-s)
  even if there are many jh-s which can be released.

Releasing jh-s is only executed at the following timing:
- if free log space becomes a quarter of the total log size 
  (log_do_checkpoint())
- if a transaction begins to commit (journal_cleanup_checkpoint_list() 
 which is called by journal_commit_transaction())
(NOTE: A jh-s which corresponds to buffer heads (bh-s) which is a direct block 
      can be released at journal_try_to_free_buffers() which is called 
      by try_to_release_page())   

Therefore,  if we let filesystems do above (2), jh-s remains because 
new transaction isn't generated. 
However, when the system memory is exhausted, try_to_release_page() can be 
called, but it cannot release bh-s which are metadata (indirect blocks 
and so on).  
Because the mapping to the page is owned by a block device not a filesystem 
(ext3).

If the mapping is owned by a block device, try_to_release_page() calls 
try_to_free_buffers(). It can release generic bh, but cannot release the bh 
which is referring by the jh. Because the reference counter of the bh is 
larger than 0.
Therefore it is necessary to release the jh before the bh is released.

To achieve it, I added a new member function into buffer head structure.
The function releases the bh which correspond to a page whose mapping
is block device. And the release target of the bh has private data 
(journal head).
The function resembles journal_try_to_free_buffers().
Then I changed try_to_release_page(), which calls try_to_free_buffers()
after the new function.

As a result, I think it becomes difficult for oom-killer to happen 
than before because try_to_free_buffers() via try_to_release_page() 
which is called when the system memory is exhausted can release bh-s. 

Signed-off-by: Toshiyuki Okajima <toshi.okajima@...fujitsu.com>
---
 fs/buffer.c                 |   23 ++++++++++++++++++++++-
 fs/jbd/journal.c            |    7 +++++++
 fs/jbd/transaction.c        |   39 +++++++++++++++++++++++++++++++++++++++
 include/linux/buffer_head.h |    7 +++++++
 include/linux/jbd.h         |    1 +
 5 files changed, 76 insertions(+), 1 deletion(-) 

diff -Nurp linux-2.6.27-rc8.org/fs/buffer.c linux-2.6.27-rc8/fs/buffer.c
--- linux-2.6.27-rc8.org/fs/buffer.c	2008-09-30 07:24:02.000000000 +0900
+++ linux-2.6.27-rc8/fs/buffer.c	2008-10-17 08:59:37.000000000 +0900
@@ -3111,7 +3111,7 @@ failed:
 	return 0;
 }
 
-int try_to_free_buffers(struct page *page)
+static int __try_to_free_buffers(struct page *page)
 {
 	struct address_space * const mapping = page->mapping;
 	struct buffer_head *buffers_to_free = NULL;
@@ -3158,6 +3158,26 @@ out:
 	}
 	return ret;
 }
+
+int try_to_free_buffers(struct page *page)
+{
+	struct address_space * const mapping = page->mapping;
+	struct buffer_head *head, *bh;
+	const struct buffer_head_operations *bops;
+		
+	if (mapping == NULL) 
+		return __try_to_free_buffers(page);
+	bh = head = page_buffers(page);
+	do {
+		if ((bops = bh->b_ops) != NULL
+			&& !(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)))
+			&& atomic_read(&bh->b_count) > 0)
+			if (bops->release_buffer != NULL)
+				bops->release_buffer(bh);
+		bh = bh->b_this_page;
+	} while (bh != head);
+	return __try_to_free_buffers(page);
+}
 EXPORT_SYMBOL(try_to_free_buffers);
 
 void block_sync_page(struct page *page)
@@ -3234,6 +3254,7 @@ struct buffer_head *alloc_buffer_head(gf
 {
 	struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
 	if (ret) {
+		ret->b_ops = NULL;
 		INIT_LIST_HEAD(&ret->b_assoc_buffers);
 		get_cpu_var(bh_accounting).nr++;
 		recalc_bh_state();
diff -Nurp linux-2.6.27-rc8.org/fs/jbd/journal.c linux-2.6.27-rc8/fs/jbd/journal.c
--- linux-2.6.27-rc8.org/fs/jbd/journal.c	2008-09-30 07:24:02.000000000 +0900
+++ linux-2.6.27-rc8/fs/jbd/journal.c	2008-10-17 08:59:37.000000000 +0900
@@ -79,11 +79,16 @@ EXPORT_SYMBOL(journal_wipe);
 EXPORT_SYMBOL(journal_blocks_per_page);
 EXPORT_SYMBOL(journal_invalidatepage);
 EXPORT_SYMBOL(journal_try_to_free_buffers);
+EXPORT_SYMBOL(journal_try_to_free_one_buffer);
 EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
 
+static const struct buffer_head_operations jbd_bh_ops = {
+	.release_buffer = journal_try_to_free_one_buffer
+};
+
 /*
  * Helper function used to manage commit timeouts
  */
@@ -1749,6 +1754,7 @@ repeat:
 		set_buffer_jbd(bh);
 		bh->b_private = jh;
 		jh->b_bh = bh;
+		bh->b_ops = &jbd_bh_ops;
 		get_bh(bh);
 		BUFFER_TRACE(bh, "added journal_head");
 	}
@@ -1803,6 +1809,7 @@ static void __journal_remove_journal_hea
 						__func__);
 				jbd_free(jh->b_committed_data, bh->b_size);
 			}
+			bh->b_ops = NULL;
 			bh->b_private = NULL;
 			jh->b_bh = NULL;	/* debug, really */
 			clear_buffer_jbd(bh);
diff -Nurp linux-2.6.27-rc8.org/fs/jbd/transaction.c linux-2.6.27-rc8/fs/jbd/transaction.c
--- linux-2.6.27-rc8.org/fs/jbd/transaction.c	2008-09-30 07:24:02.000000000 +0900
+++ linux-2.6.27-rc8/fs/jbd/transaction.c	2008-10-17 09:11:49.000000000 +0900
@@ -1648,6 +1648,45 @@ out:
 	return;
 }
 
+void journal_try_to_free_one_buffer(struct buffer_head *bh)
+{
+	journal_t *journal;
+	struct journal_head *jh;
+	const struct buffer_head_operations *bops;
+
+	jh = journal_grab_journal_head(bh);
+	if (!jh)
+		return;
+	if (!jbd_trylock_bh_state(bh)) {
+		journal_put_journal_head(jh);
+		return;	
+	}
+	if ((bops = bh->b_ops) == NULL || bops->release_buffer == NULL)
+		goto skip;
+	if (buffer_locked(bh) || buffer_dirty(bh))
+		goto skip;
+	if (jh->b_next_transaction != NULL)
+		goto skip;
+	journal = (jh->b_cp_transaction != NULL ? jh->b_cp_transaction->t_journal: NULL);
+	if (journal != NULL) {
+		if (spin_trylock(&journal->j_list_lock)) {
+			if (jh->b_cp_transaction != NULL && jh->b_transaction == NULL) {
+				/* written-back checkpointed metadata buffer */
+				if (jh->b_jlist == BJ_None) {
+					JBUFFER_TRACE(jh, "remove from checkpoint list");
+					__journal_remove_checkpoint(jh);
+					journal_remove_journal_head(bh);
+					__brelse(bh);
+				}
+			}
+			spin_unlock(&journal->j_list_lock);
+		}
+	}
+skip:
+	journal_put_journal_head(jh);
+	jbd_unlock_bh_state(bh);
+}
+
 /*
  * journal_try_to_free_buffers() could race with journal_commit_transaction()
  * The latter might still hold the a count on buffers when inspecting
diff -Nurp linux-2.6.27-rc8.org/include/linux/buffer_head.h linux-2.6.27-rc8/include/linux/buffer_head.h
--- linux-2.6.27-rc8.org/include/linux/buffer_head.h	2008-09-30 07:24:02.000000000 +0900
+++ linux-2.6.27-rc8/include/linux/buffer_head.h	2008-10-17 08:59:37.000000000 +0900
@@ -43,6 +43,10 @@ enum bh_state_bits {
 
 #define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
 
+struct buffer_head_operations {
+	void (*release_buffer)(struct buffer_head*);
+};
+
 struct page;
 struct buffer_head;
 struct address_space;
@@ -73,6 +77,9 @@ struct buffer_head {
 	struct address_space *b_assoc_map;	/* mapping this buffer is
 						   associated with */
 	atomic_t b_count;		/* users using this buffer_head */
+	const struct buffer_head_operations *b_ops; 
+					/* for a private buffer head
+					   to do special work */
 };
 
 /*
diff -Nurp linux-2.6.27-rc8.org/include/linux/jbd.h linux-2.6.27-rc8/include/linux/jbd.h
--- linux-2.6.27-rc8.org/include/linux/jbd.h	2008-09-30 07:24:02.000000000 +0900
+++ linux-2.6.27-rc8/include/linux/jbd.h	2008-10-17 08:59:37.000000000 +0900
@@ -890,6 +890,7 @@ extern void	 journal_sync_buffer (struct
 extern void	 journal_invalidatepage(journal_t *,
 				struct page *, unsigned long);
 extern int	 journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
+extern void	 journal_try_to_free_one_buffer(struct buffer_head *);
 extern int	 journal_stop(handle_t *);
 extern int	 journal_flush (journal_t *);
 extern void	 journal_lock_updates (journal_t *);
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html