linux-kernel - Re: [RFC v5 0/3] mm: make swapin readahead to gain more thp performance

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160225231644.GA13782@debian>
Date:	Fri, 26 Feb 2016 01:16:44 +0200
From:	Ebru Akagunduz <ebru.akagunduz@...il.com>
To:	Hugh Dickins <hughd@...gle.com>
Cc:	akpm@...ux-foundation.org, linux-mm@...ck.org,
	kirill.shutemov@...ux.intel.com, n-horiguchi@...jp.nec.com,
	aarcange@...hat.com, riel@...hat.com, iamjoonsoo.kim@....com,
	xiexiuqi@...wei.com, gorcunov@...nvz.org,
	linux-kernel@...r.kernel.org, mgorman@...e.de, rientjes@...gle.com,
	vbabka@...e.cz, aneesh.kumar@...ux.vnet.ibm.com,
	hannes@...xchg.org, mhocko@...e.cz, boaz@...xistor.com,
	raindel@...lanox.com
Subject: Re: [RFC v5 0/3] mm: make swapin readahead to gain more thp
 performance

On Wed, Feb 24, 2016 at 11:36:30PM -0800, Hugh Dickins wrote:
> On Mon, 14 Sep 2015, Andrew Morton wrote:
> > On Mon, 14 Sep 2015 22:31:42 +0300 Ebru Akagunduz <ebru.akagunduz@...il.com> wrote:
> > 
> > > This patch series makes swapin readahead up to a
> > > certain number to gain more thp performance and adds
> > > tracepoint for khugepaged_scan_pmd, collapse_huge_page,
> > > __collapse_huge_page_isolate.
> > 
> > I'll merge this series for testing.  Hopefully Andrea and/or Hugh will
> > find time for a quality think about the issue before 4.3 comes around.
> > 
> > It would be much better if we didn't have that sysfs knob - make the
> > control automatic in some fashion.
> > 
> > If we can't think of a way of doing that then at least let's document
> > max_ptes_swap very carefully.  Explain to our users what it does, why
> > they should care about it, how they should set about determining (ie:
> > measuring) its effect upon their workloads.
> 
> Ebru, I don't know whether you realize, but your THP swapin work has
> been languishing in mmotm for five months now, without getting any
> nearer to Linus's tree.
> 
> That's partly my fault - sorry - for not responding to Andrew's nudge
> above.  But I think you also got caught up in conference, and in the
> end did not get around to answering outstanding issues: please take a
> look at your mailbox from last September, to see what more is needed.
> 
I've seen my patch series in mmotm mails but I thought,
other parts of thp have problem so those are going to be
forwarded to Linus's tree when other parts fixed.

I did not know the file: http://www.ozlabs.org/~akpm/mmotm/series
It shows explicitly the problem of patches.
Thank you for summarizing it below.

> Here's what mmotm's series file says...
> 
> #mm-add-tracepoint-for-scanning-pages.patch+2: Andrea/Hugh review?. 2 Fengguang warnings, one "kernel test robot" oops
> #mm-make-optimistic-check-for-swapin-readahead.patch: TBU (docs)
I've sent doc patch: http://lkml.iu.edu/hypermail/linux/kernel/1509.2/01783.html
> mm-make-optimistic-check-for-swapin-readahead.patch
> mm-make-optimistic-check-for-swapin-readahead-fix-2.patch
> #mm-make-swapin-readahead-to-improve-thp-collapse-rate.patch: Hugh/Kirill want collapse_huge_page() rework
> mm-make-swapin-readahead-to-improve-thp-collapse-rate.patch
> mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix.patch
> mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix-2.patch
> #mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix-3.patch: Ebru to test?
I've tested my whole patch series and could not produce the fault
again. I've also seen Tested-by tag from Sergey so I did not sent
the tag.
> mm-make-swapin-readahead-to-improve-thp-collapse-rate-fix-3.patch
> 
> ...but I think some of that is stale.  There were a few little bugs
> when it first went into mmotm, which Kirill very swiftly fixed up,
> and I don't think it has given anybody any trouble since then.
> 
> But do I want to see this work go in?  Yes and no.  The problem it
> fixes (that although we give out a THP to someone who faults a single
> page of it, after swapout the THP cannot be recovered until they have
> faulted in every page of it) is real and embarrassing; the code is good;
> and I don't mind the max_ptes_swap tunable that concerns Andrew above;
> but Kirill and Vlastimil made important points that still trouble me.
> 
> I can't locate Kirill's mail right now, perhaps I'm misremembering:
> but wasn't he concerned by your __collapse_huge_page_swapin() (likely
> to be allocating many small pages) being called under down_write of
> mmap_sem?  That's usually something we soon regret, and even down_read
> of mmap_sem across many memory allocations would be unfortunate
> (khugepaged used to allocate its THP that way, but we have
> Vlastimil to thank for stopping that in his 8b1645685acf).
> 
> And didn't Vlastimil (9/4/15) make some other unanswered
> observations about the call to __collapse_huge_page_swapin():
> 
> > Hmm it seems rather wasteful to call this when no swap entries were detected.
> > Also it seems pointless to try continue collapsing when we have just only issued
> > async swap-in? What are the chances they would finish in time?
> > 
> > I'm less sure about the relation vs khugepaged_alloc_page(). At this point, we
> > have already succeeded the hugepage allocation. It makes sense not to swap-in if
> > we can't allocate a hugepage. It makes also sense not to allocate a hugepage if
> > we will just issue async swap-ins and then free the hugepage back. Swap-in means
> > disk I/O that's best avoided if not useful. But the reclaim for hugepage
> > allocation might also involve disk I/O. At worst, it could be creating new swap
> > pte's in the very pmd we are scanning... Thoughts?
> 
I did not take enough responsibility, you're right. I should have
asked regarding the patch at least.
> Doesn't this imply that __collapse_huge_page_swapin() will initiate all
> the necessary swapins for a THP, then (given the FAULT_FLAG_ALLOW_RETRY)
> not wait for them to complete, so khugepaged will give up on that extent
> and move on to another; then after another full circuit of all the mms
> it needs to examine, it will arrive back at this extent and build a THP
> from the swapins it arranged last time.
> 
> Which may work well when a system transitions from busy+swappingout
> to idle+swappingin, but isn't that rather a special case?  It feels
> (meaning, I've not measured at all) as if the inbetween busyish case
> will waste a lot of I/O and memory on swapins that have to be discarded
> again before khugepaged has made its sedate way back to slotting them in.
> 
> So I wonder how useful this is in its present form.  The problem being,
> not with your code as such, but the whole nature of khugepaged.  When
> I had to solve a similar problem with recovering huge tmpfs pages (not
> yet posted), I did briefly consider whether to hook in to use khugepaged;
> but rejected that, and have never regretted using a workqueue item for
> the extent instead.  Did Vlastimil (argh, him again!) propose something
> similar to replace khugepaged?  Or should khugepaged fire off workqueue
> items for THP extents needing swapin?
> 
> Hugh