linux-kernel - Re: Fwd: Control page reclaim granularity

Message-ID: <20120313025108.GB7125@barrios>
Date:	Tue, 13 Mar 2012 11:51:08 +0900
From:	Minchan Kim <minchan@...nel.org>
To:	Minchan Kim <minchan@...nel.org>,
	Konstantin Khlebnikov <khlebnikov@...nvz.org>,
	linux-mm <linux-mm@...ck.org>,
	linux-kernel <linux-kernel@...r.kernel.org>, riel@...hat.com,
	kosaki.motohiro@...fujitsu.com
Subject: Re: Fwd: Control page reclaim granularity

On Mon, Mar 12, 2012 at 11:15:43PM +0800, Zheng Liu wrote:
> On Mon, Mar 12, 2012 at 10:42:26PM +0900, Minchan Kim wrote:
> > On Mon, Mar 12, 2012 at 04:14:14PM +0800, Zheng Liu wrote:
> > > On 03/12/2012 02:20 PM, Konstantin Khlebnikov wrote:
> > > > Minchan Kim wrote:
> > > >> On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
> > > >>> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
> > > > >>>> I forgot to Cc you.
> > > >>>> Sorry.
> > > >>>>
> > > >>>> ---------- Forwarded message ----------
> > > >>>> From: Minchan Kim<minchan@...nel.org>
> > > >>>> Date: Mon, Mar 12, 2012 at 9:28 AM
> > > >>>> Subject: Re: Control page reclaim granularity
> > > >>>> To: Minchan Kim<minchan@...nel.org>, linux-mm<linux-mm@...ck.org>,
> > > >>>> linux-kernel<linux-kernel@...r.kernel.org>, Konstantin Khlebnikov<
> > > >>>> khlebnikov@...nvz.org>, riel@...hat.com, kosaki.motohiro@...fujitsu.com
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> > > >>>>> Hi Minchan,
> > > >>>>>
> > > > >>>>> Sorry, I forgot to say that I am not subscribed to the linux-mm and
> > > > >>>>> linux-kernel mailing lists.  So please Cc me.
> > > > >>>>>
> > > > >>>>> IMHO, maybe we should rethink how users use mmap(2).  I will
> > > > >>>>> describe the cases I know of in our production system.  They can be
> > > > >>>>> categorized into two cases.  One mmaps all data files into memory
> > > > >>>>> and sometimes uses write(2) to append some data; the other uses
> > > > >>>>> mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files.  In
> > > > >>>>> the second case, the application wants to keep mmaped pages in memory
> > > > >>>>> and let file pages be reclaimed first.  So, IMO, when an application
> > > > >>>>> uses mmap(2) to manipulate files, it is possible to infer that it
> > > > >>>>> wants to keep these mmaped pages in memory and not have them
> > > > >>>>> reclaimed.  At least these pages should not be reclaimed earlier than
> > > > >>>>> file pages.  I think that maybe we can recover that routine and
> > > > >>>>> provide a sysctl parameter to let the user set this ratio between
> > > > >>>>> mmaped pages and file pages.
> > > >>>>
> > > > >>>> I am not convinced that we should handle mapped pages specially.
> > > > >>>> Sometimes someone may use mmap just to avoid the buffer copy of the
> > > > >>>> read system call.
> > > > >>>> So I think we can't be sure that mmaped pages should always win.
> > > > >>>>
> > > > >>>> My suggestion is that it would be better for the user to declare it
> > > > >>>> explicitly.
> > > > >>>> I think we can implement it with the WILLNEED option of madvise and
> > > > >>>> fadvise.
> > > > >>>> The current implementation just does readahead if the page isn't in
> > > > >>>> memory, but I think we can promote the page from inactive to active
> > > > >>>> if it is already in memory.
> > > > >>>>
> > > > >>>> It's clearer, and it wouldn't be affected by changes to the kernel's
> > > > >>>> page reclaim algorithm like this one.
> > > >>>
> > > > >>> Thank you for your advice.  But I still have a question about this
> > > > >>> solution.  If we change the WILLNEED option of madvise(2) and
> > > > >>> fadvise(2), it will cause an inconsistent state for pages that are
> > > > >>> manipulated by madvise(2) and/or fadvise(2).  For example, when I call
> > > > >>> madvise with the WILLNEED flag, some pages will be moved to the active
> > > > >>> list if they are already in memory, and other pages will be read into
> > > > >>> memory and put on the inactive list if they are not in memory.  Then
> > > > >>> the pages on the inactive list may be reclaimed.  So from the
> > > > >>> user's view, it is inconsistent because some pages are in memory and
> > > > >>> some pages are reclaimed.  But actually the user hopes that all of the
> > > > >>> pages can be kept in memory.  IMHO, this inconsistency is weird and
> > > > >>> will puzzle users.
> > > >>
> > > > >> Now the problem is that
> > > > >>
> > > > >> 1. The user wants to keep pages which are used once in a while in memory.
> > > > >> 2. The kernel wants to reclaim them because, from the LRU's point of
> > > > >>     view, they are surely reclaim targets.
> > > > >>
> > > > >> The most desirable approach is that the user uses mlock to guarantee
> > > > >> they stay in memory. But mlock has too much overhead and the user
> > > > >> doesn't want to keep all the pages in memory at once. (I.e., he wants
> > > > >> demand paging when he needs the page.)
> > > > >> Right?
> > > > >>
> > > > >> madvise is just a hint for the kernel, and the kernel doesn't need to
> > > > >> guarantee madvise's behavior.
> > > > >> From that point of view, such inconsistency might not be a big problem.
> > > > >>
> > > > >> The big problem I see now is that the user would have to call
> > > > >> madvise(WILLNEED) periodically, because such activation happens only
> > > > >> once, when the user calls madvise. If the user doesn't use the page
> > > > >> frequently after calling it, the page ends up moving to the inactive
> > > > >> list and could even be reclaimed.
> > > > >> It's not good. :-(
> > > > >>
> > > > >> Okay. How about adding a new VM_WORKINGSET flag?
> > > > >> The reclaimer would then give the page one more round trip in the
> > > > >> active/inactive list when reclaim happens, if the page is referenced.
> > > > >>
> > > > >> Sigh. We have no room for a new VM_FLAG on 32-bit.
> > > >
> > > > It would be nice to mark struct address_space with this flag and export
> > > > AS_UNEVICTABLE somehow.
> > > > Maybe we can reuse file-locking engine for managing these bits =)
> > > 
> > > > Makes sense to me.  We can set this flag in struct address_space and check
> > > > it in page_referenced_file().  If this flag is set, it will be cleared and
> > 
> > > The disadvantage is that we could set the reclaim granularity only per inode.
> > > I want to set it per VMA, not per inode.
> 
> > I don't think this is a disadvantage.  This per-inode reclaim
> > granularity is useful for us.  Actually I have thought about implementing a
> > per-inode memcg to let different sets of files be reclaimed separately.
> > So maybe we can provide both mechanisms and let the user choose which to
> > use.

I don't oppose supporting both mechanisms, but I don't want to provide only the
per-inode approach.

> 
> > 
> > > > the function returns referenced > 1.  Then this page can be promoted to the
> > > > active list.  But I prefer to set/clear this flag in madvise.
> > 
> > > Hmm, my idea is as follows.
> > > If we can set a new VM flag on the VMA or something, the reclaimer can check it
> > > in shrink_[in]active_list and avoid deactivating/reclaiming the page if it sees
> > > that the page is in a VMA marked with the new VM flag and has been referenced
> > > recently at least once.
> > > That means the page gets one more round trip in its list (i.e., the
> > > active/inactive list) rather than activation, so the page becomes less reclaimable.
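
To make the "one more round trip" idea above concrete, here is a toy userspace
model in C. It is only a sketch, not kernel code: struct toy_page,
toy_scan_page() and toy_scans_until_reclaim() are hypothetical names made up
for illustration, and real reclaim (including activation of referenced pages)
is far more involved.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a page; this is NOT a kernel structure. */
struct toy_page {
	bool referenced;    /* touched since the last scan */
	bool in_hinted_vma; /* page belongs to a VMA carrying the proposed flag */
};

enum toy_scan_result { TOY_RECLAIM, TOY_ROTATE };

/*
 * Decide what a scan of the list tail does with one page.
 * A referenced page in a hinted VMA is rotated back onto the same list
 * (one more round trip) instead of being reclaimed or activated.
 * The reference is consumed, so the extra trip is granted only once
 * unless the page is touched again.
 */
enum toy_scan_result toy_scan_page(struct toy_page *pg)
{
	if (pg->in_hinted_vma && pg->referenced) {
		pg->referenced = false;
		return TOY_ROTATE;
	}
	return TOY_RECLAIM;
}

/* Count how many scans a page survives if nobody touches it again. */
int toy_scans_until_reclaim(struct toy_page pg)
{
	int scans = 1;

	while (toy_scan_page(&pg) == TOY_ROTATE)
		scans++;
	return scans;
}
```

In this model a referenced page in a hinted VMA is reclaimed on the second
scan rather than the first, and every other page is reclaimed on the first
scan. The model deliberately ignores what happens to referenced pages outside
hinted VMAs.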
> 
> > Whether the page is given one more round trip or is promoted to the
> > active list, it satisfies our current requirement.  So now the
> > question is which is better.  If we add a new VM flag, as you said
> > before, vma->vm_flags has no room for it on 32-bit.  I have noticed that
> > this topic has been discussed [1] and the result is that vm_flags is
> > still an unsigned long.  So we would need a tricky technique to solve
> > it.  If we add a new flag in struct address_space, it might be easy to
> > implement.

The per-inode approach is good in its own case, but it doesn't work for per-vma
or file-range granularity.
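
For reference, this is how the existing range-based hints look from userspace
today: madvise(2) works on part of a mapping and posix_fadvise(2) works on a
byte range of a file, so neither is tied to the whole inode. This is only a
sketch of the current interfaces on Linux, not of the proposed behavior. The
helper names and the scratch file under /tmp are made up for the example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hint that bytes [off, off + len) of an existing mapping will be needed
 * soon. base + off must be page-aligned. Returns 0 on success, -1 on error.
 */
int hint_mapping_range(void *base, size_t off, size_t len)
{
	return madvise((char *)base + off, len, MADV_WILLNEED);
}

/*
 * Same hint, expressed as a file range. posix_fadvise() returns 0 on
 * success or an errno value directly (it does not set errno).
 */
int hint_file_range(int fd, off_t off, off_t len)
{
	return posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
}

/*
 * Demo: create a scratch file of npages pages, map it, then hint the
 * second half through the mapping and the first half through the fd.
 * Returns 0 if every call succeeded, -1 otherwise.
 */
int demo_range_hints(size_t npages)
{
	char path[] = "/tmp/willneed-XXXXXX"; /* hypothetical scratch file */
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = npages * psz;
	size_t half = (npages / 2) * psz;
	int rc = 0;
	void *p;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path); /* file goes away once fd is closed */

	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}

	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	if (hint_mapping_range(p, half, len - half) != 0)
		rc = -1;
	if (hint_file_range(fd, 0, (off_t)half) != 0)
		rc = -1;

	munmap(p, len);
	close(fd);
	return rc;
}
```

A caller would use something like demo_range_hints(8). Both calls are hints
only, so a return value of 0 does not guarantee that the pages stay resident.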

> 
> 1. http://lkml.indiana.edu/hypermail/linux/kernel/1104.1/00975.html
> 
> Regards,
> Zheng