[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <BF7AF910-28E1-42F2-A5FE-09B9D337C4A4@oracle.com>
Date: Mon, 22 Jan 2018 16:41:11 -0800
From: Henry Willard <henry.willard@...cle.com>
To: Christopher Lameter <cl@...ux.com>
Cc: Mel Gorman <mgorman@...e.de>, akpm@...ux-foundation.org,
kstewart@...uxfoundation.org, zi.yan@...rutgers.edu,
pombredanne@...b.com, aarcange@...hat.com,
gregkh@...uxfoundation.org, aneesh.kumar@...ux.vnet.ibm.com,
kirill.shutemov@...ux.intel.com, jglisse@...hat.com,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: numa: Do not trap faults on shared data section
pages.
> On Jan 19, 2018, at 6:12 PM, Christopher Lameter <cl@...ux.com> wrote:
>
> On Thu, 18 Jan 2018, Henry Willard wrote:
>
>> If MPOL_MF_LAZY were allowed and specified things would not work
>> correctly. change_pte_range() is unaware of and can’t honor the
>> difference between MPOL_MF_MOVE_ALL and MPOL_MF_MOVE.
>
> Not sure how that relates to what I said earlier... Sorry.
Only that CAP_SYS_NICE is not relevant to this patch.
>
>>
>> For the case of auto numa balancing, it may be undesirable for shared
>> pages to be migrated whether they are also copy-on-write or not. The
>> copy-on-write test was added to restrict the effect of the patch to the
>> specific situation we observed. Perhaps I should remove it, I don’t
>> understand why it would be desirable to modify the behavior via sysfs.
>
> I think the most common case of shared pages occurs for pages that contain
> code. In that case a page may be mapped into hundreds if not thousands of
> processes. In particular that is often the case for basic system libraries
> like the c library which may actually be mapped into every binary that is
> running.
That is true, but auto numa balancing skips these and similar pages before it calls change_prot_numa(). They don’t even have to be actually shared to be skipped.
>
> It is very difficult and expensive to unmap these pages from all the
> processes in order to migrate them. So some sort of limit would be useful
> to avoid unnecessary migration attempts. One example would be to forbid
> migrating pages that are mapped in more than 5 processes. Some sysctl know
> would be useful here to set the boundary.
>
> Your patch addresses a special case here by forbidding migration of any
> page mapped by more than a single process (mapcount !=1).
The current patch skips pages that are in copy-on-write VMAs and still shared. These include pages in the program’s data segment that are writable. but have not been written to. Once the pages are modified they are no longer shared and can be migrated. The problem is that in some cases, the pages are never modified and remain shared.
Prior to commit 4b10e7d562c90d0a72f324832c26653947a07381, change_prot_numa() called change_prot_numa_range(), which tested for (page_mapcount(page) != 1) and bailed out for any shared pages. This patch is more selective. A simple test for shared or not seems to be common.
>
> That would mean f.e. that the complete migration of a set of processes
> that rely on sharing data via a memory segment is impossible because those
> shared pages can never be moved.
>
> By setting the limit higher that migration would still be possible.
>
> Maybe we can set that limit by default at 5 and allow a higher setting
> if users have applications that require a higher mapcoun? F.e. a
> common construct is a shepherd task and N worker threads. If those
> tasks each have their own address space and only communicate via
> a shared data segment then one may want to set the limit higher than N
> in order to allow the migration of the group of processes.
This example would be unaffected by this patch, because the patch does not affect explicitly shared memory. A process with the necessary capabilities is still able to migrate all pages.
Henry
Powered by blists - more mailing lists