[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <53A7F8DC.8090106@suse.cz>
Date: Mon, 23 Jun 2014 11:52:28 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Zhang Yanfei <zhangyanfei@...fujitsu.com>,
"Kirill A. Shutemov" <kirill@...temov.name>
CC: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>,
David Rientjes <rientjes@...gle.com>,
Minchan Kim <minchan@...nel.org>, Mel Gorman <mgorman@...e.de>,
Joonsoo Kim <iamjoonsoo.kim@....com>,
Michal Nazarewicz <mina86@...a86.com>,
Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
Christoph Lameter <cl@...ux.com>,
Rik van Riel <riel@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when
allocating THP
On 06/23/2014 07:39 AM, Zhang Yanfei wrote:
> Hello
>
> On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote:
>> On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
>>> When allocating huge page for collapsing, khugepaged currently holds mmap_sem
>>> for reading on the mm where collapsing occurs. Afterwards the read lock is
>>> dropped before write lock is taken on the same mmap_sem.
>>>
>>> Holding mmap_sem during whole huge page allocation is therefore useless, the
>>> vma needs to be rechecked after taking the write lock anyway. Furthemore, huge
>>> page allocation might involve a rather long sync compaction, and thus block
>>> any mmap_sem writers and i.e. affect workloads that perform frequent m(un)map
>>> or mprotect oterations.
>>>
>>> This patch simply releases the read lock before allocating a huge page. It
>>> also deletes an outdated comment that assumed vma must be stable, as it was
>>> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
>>> ("mm: thp: khugepaged: add policy for finding target node").
>>
>> There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
>> all. Please, move up_read() outside khugepaged_alloc_page().
>>
Well there's also currently no point in passing several parameters to
khugepaged_alloc_page(). So I could clean it up as well, but I imagine
later we would perhaps reintroduce them back, as I don't think the
current situation is ideal for at least two reasons.
1. If you read commit 9f1b868a13 ("mm: thp: khugepaged: add policy for
finding target node"), it's based on a report where somebody found that
mempolicy is not observed properly when collapsing THP's. But the
'policy' introduced by the commit isn't based on real mempolicy, it
might just under certain conditions results in an interleave, which
happens to be what the reporter was trying.
So ideally, it should be making node allocation decisions based on where
the original 4KB pages are located. For example, allocate a THP only if
all the 4KB pages are on the same node. That would also automatically
obey any policy that has lead to the allocation of those 4KB pages.
And for this, it will need again the parameters and mmap_sem in read
mode. It would be however still a good idea to drop mmap_sem before the
allocation itself, since compaction/reclaim might take some time...
2. (less related) I'd expect khugepaged to first allocate a hugepage and
then scan for collapsing. Yes there's khugepaged_prealloc_page, but that
only does something on !NUMA systems and these are not the future.
Although I don't have the data, I expect allocating a hugepage is a
bigger issue than finding something that could be collapsed. So why scan
for collapsing if in the end I cannot allocate a hugepage? And if I
really cannot find something to collapse, would e.g. caching a single
hugepage per node be a big hit? Also, if there's really nothing to
collapse, then it means khugepaged won't compact. And since khugepaged
is becoming the only source of sync compaction that doesn't give up
easily and tries to e.g. migrate movable pages out of unmovable
pageblocks, this might have bad effects on fragmentation.
I believe this could be done smarter.
> I might be wrong. If we up_read in khugepaged_scan_pmd(), then if we round again
> do the for loop to get the next vma and handle it. Does we do this without holding
> the mmap_sem in any mode?
>
> And if the loop end, we have another up_read in breakouterloop. What if we have
> released the mmap_sem in collapse_huge_page()?
collapse_huge_page() is only called from khugepaged_scan_pmd() in the if
(ret) condition. And khugepaged_scan_mm_slot() has similar if (ret) for
the return value of khugepaged_scan_pmd() to break out of the loop (and
not doing up_read() again). So I think this is correct and moving
up_read from khugepaged_alloc_page() to collapse_huge_page() wouldn't
change this?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists