Message-Id: <20081020160249.ff41f762.akpm@linux-foundation.org>
Date:	Mon, 20 Oct 2008 16:02:49 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Toshiyuki Okajima <toshi.okajima@...fujitsu.com>
Cc:	linux-ext4@...r.kernel.org, sct@...hat.com
Subject: Re: [RFC][PATCH] JBD: release checkpoint journal heads through
 try_to_release_page when the memory is exhausted

On Fri, 17 Oct 2008 22:37:16 +0900 (JST)
Toshiyuki Okajima <toshi.okajima@...fujitsu.com> wrote:

> Hi.
> 
> I found a situation in which the OOM killer is triggered easily, and I
> would like to report it.
> I have tried to make the OOM killer as unlikely to trigger as possible,
> and as a result I have made a reference patch that fixes the problem.
> 
> Any comments are welcome.
> (Comments proposing a much simpler or fundamentally better approach are
> especially welcome.)
> 
> ------------------------------------------------------------------------------
> 
> The OOM killer is triggered easily when all of the following hold:
> (1) A quarter of the sum of the total log sizes of all filesystems that
>     use jbd exceeds the size of the Normal zone.
> (2) We commit a large amount of data, including a lot of metadata, to
>     each filesystem and then stop committing data to them.
>     For example, a process creates many huge files that contain a large
>     number of indirect blocks, and then all processes stop doing I/O to
>     all filesystems that use jbd.
> (3) After (2), we request a large memory allocation.
> (NOTE: the OOM killer is triggered particularly easily on x86 systems,
> because an x86 system can have only a small Normal zone of less than 1GB.)
> 
> The reason is that jbd does not actively release journal heads (jh-s),
> even when there are many jh-s that could be released.
> 
> jh-s are released only at the following points:
> - when the free log space drops to a quarter of the total log size
>   (log_do_checkpoint())
> - when a transaction begins to commit (journal_cleanup_checkpoint_list(),
>   which is called by journal_commit_transaction())
> (NOTE: a jh whose buffer head (bh) belongs to a direct block can also be
>        released by journal_try_to_free_buffers(), which is called by
>        try_to_release_page())
> 
> Therefore, if the filesystems are put into state (2) above, the jh-s
> remain, because no new transaction is started.
> When system memory is exhausted, try_to_release_page() can be called,
> but it cannot release bh-s that hold metadata (indirect blocks and so
> on), because the mapping that owns those pages belongs to the block
> device, not to the filesystem (ext3).
> 
> If the mapping is owned by a block device, try_to_release_page() calls
> try_to_free_buffers(). That can release an ordinary bh, but not a bh
> that is still referenced by a jh, because the bh's reference count is
> greater than zero.
> Therefore the jh must be released before the bh can be released.
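> 
> For reference, the check that stops this is in drop_buffers() in
> fs/buffer.c. Roughly paraphrased, a bh pinned by a jh counts as busy,
> so the whole page is skipped:
> 
>     /* paraphrased from drop_buffers(): a bh whose b_count is raised,
>      * for example by a journal head, makes the page unreleasable */
>     if (atomic_read(&bh->b_count) || (bh->b_state & BUFFER_BUSY_BITS))
>         goto failed;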
> 
> To achieve this, I added a new member function to the buffer head
> structure. It releases bh-s that belong to a page whose mapping is a
> block device and whose private data is a journal head.
> The function resembles journal_try_to_free_buffers().
> I then changed try_to_release_page() so that it calls the new function
> before try_to_free_buffers(); a rough sketch of the idea follows.
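> 
> Roughly, the idea is the following (b_release is an illustrative name,
> not necessarily the field name used in the patch):
> 
>     /* sketch of the extra step in try_to_release_page(): for a
>      * block-device page, let jbd drop checkpointed journal heads
>      * first, so that the bh reference counts can reach zero and
>      * try_to_free_buffers() can succeed */
>     if (page_has_buffers(page)) {
>         struct buffer_head *bh = page_buffers(page);
> 
>         if (bh->b_release)              /* new member function */
>             bh->b_release(page, gfp_mask);
>     }
>     return try_to_free_buffers(page);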
> 
> As a result, I think the OOM killer becomes less likely to trigger than
> before, because try_to_free_buffers(), reached via try_to_release_page()
> when system memory is exhausted, can now release these bh-s.
> 

OK.

> ---
>  fs/buffer.c                 |   23 ++++++++++++++++++++++-
>  fs/jbd/journal.c            |    7 +++++++
>  fs/jbd/transaction.c        |   39 +++++++++++++++++++++++++++++++++++++++
>  include/linux/buffer_head.h |    7 +++++++
>  include/linux/jbd.h         |    1 +
>  5 files changed, 76 insertions(+), 1 deletion(-) 

The patch is fairly complex, and increasing the buffer_head size can be
rather costly.  An alternative might be to implement a shrinker
callback function for the journal_head slab cache.  Did you consider
this?
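
Something like this, perhaps (a very rough sketch only;
journal_drop_checkpointed_jhs() and journal_nr_freeable_jhs() are
placeholder names, and the shrinker callback signature differs between
kernel versions):

    /* rough sketch of a shrinker for the journal_head slab */
    static int journal_jh_shrink(int nr_to_scan, gfp_t gfp_mask)
    {
            if (nr_to_scan)
                    /* walk the checkpoint lists, dropping journal heads
                     * whose buffers are clean and fully checkpointed */
                    journal_drop_checkpointed_jhs(nr_to_scan);
            /* return an estimate of how many jh-s remain freeable */
            return journal_nr_freeable_jhs();
    }

    static struct shrinker journal_jh_shrinker = {
            .shrink = journal_jh_shrink,
            .seeks  = DEFAULT_SEEKS,
    };

    /* register_shrinker(&journal_jh_shrinker) would be called from
     * journal_init(), and unregister_shrinker() from journal_exit() */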
