Date:	Wed, 07 Nov 2012 17:27:12 +0800
From:	Zhouping Liu <zliu@...hat.com>
To:	Mel Gorman <mgorman@...e.de>
CC:	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Ingo Molnar <mingo@...nel.org>, Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Hugh Dickins <hughd@...gle.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>,
	CAI Qian <caiqian@...hat.com>
Subject: Re: [RFC PATCH 00/19] Foundation for automatic NUMA balancing

On 11/06/2012 05:14 PM, Mel Gorman wrote:
> There are currently two competing approaches to implementing support for
> automatically migrating pages to optimise NUMA locality. Performance results
> are available for both, but review highlighted different problems in each.
> They are not compatible with each other even though some of the fundamental
> mechanics should have been the same.
>
> For example, schednuma implements many of its optimisations before the code
> that benefits most from those optimisations is introduced, obscuring what the
> cost of schednuma might be and whether the optimisations could be used
> elsewhere, independent of the series. It also effectively hard-codes PROT_NONE
> to be the hinting fault even though that should be an architecture-specific
> decision.
> On the other hand, it is well integrated and implements all its work in the
> context of the process that benefits from the migration.
>
> autonuma goes straight to kernel threads for marking PTEs pte_numa to
> capture the necessary statistics it depends on. This obscures the cost of
> autonuma in a manner that is difficult to measure and hard to retrofit
> into the context of the process. Some of these costs are in paths that
> scheduler folk are traditionally very wary of making heavier, particularly
> if that cost is difficult to measure. On the other hand, performance
> tests indicate it is the best-performing solution.
>
> As the patch sets do not share any code, it is difficult to incrementally
> develop one to take advantage of the strengths of the other. Many of the
> patches would be code churn that is annoying to review and fairly measuring
> the results would be problematic.
>
> This series addresses part of the integration and sharing problem by
> implementing a foundation that either the policy for schednuma or autonuma
> can be rebased on. The actual policy it implements is a very stupid
> greedy policy called "Migrate On Reference Of pte_numa Node (MORON)".
> While stupid, it can be faster than the vanilla kernel and the expectation
> is that any clever policy should be able to beat MORON. The advantage is
> that it still defines how the policy needs to hook into the core code --
> mostly the scheduler and mempolicy -- so that optimisations (such as native
> THP migration) can be shared between different policy implementations.
>
> This series steals very heavily from both autonuma and schednuma with very
> little original code. In some cases I removed the signed-off-bys because
> the result was too different. I have noted in the changelog where this
> happened but the signed-offs can be restored if the original authors agree.
>
> Patches 1-3 move some vmstat counters so that migrated pages get accounted
> 	for. In the past the primary user of migration was compaction but
> 	if pages are to migrate for NUMA optimisation then the counters
> 	need to be generally useful.
>
> Patch 4 defines an arch-specific PTE bit called _PAGE_NUMA that is used
> 	to trigger faults later in the series. A placement policy is expected
> 	to use these faults to determine if a page should migrate.  On x86,
> 	the bit is the same as _PAGE_PROTNONE but other architectures
> 	may differ.
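>
> 	As a sketch only, and taking the statement above at face value, the
> 	x86 case could be as simple as reusing the existing bit (consult
> 	patch 4 for the exact definition):
>
> 		/* x86: hinting faults piggy-back on the PROT_NONE machinery */
> 		#define _PAGE_NUMA	_PAGE_PROTNONE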
>
> Patches 5-7 define pte_numa, pmd_numa, pte_mknuma, pte_mknonuma and
> 	friends, implement them for x86, handle GUP and preserve
> 	the _PAGE_NUMA bit across THP splits.
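>
> 	Glossing over details, the helpers could look roughly like the
> 	following; treat it as an illustration of the idea rather than the
> 	code in the patches:
>
> 		static inline int pte_numa(pte_t pte)
> 		{
> 			return (pte_flags(pte) &
> 				(_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> 		}
>
> 		static inline pte_t pte_mknuma(pte_t pte)
> 		{
> 			/* Present is cleared so the next access faults */
> 			pte = pte_set_flags(pte, _PAGE_NUMA);
> 			return pte_clear_flags(pte, _PAGE_PRESENT);
> 		}
>
> 		static inline pte_t pte_mknonuma(pte_t pte)
> 		{
> 			pte = pte_clear_flags(pte, _PAGE_NUMA);
> 			return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
> 		}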
>
> Patch 8 creates the fault handler for p[te|md]_numa PTEs and just clears
> 	them again.
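>
> 	In outline, with the refcounting and batching elided, the handler
> 	does little more than this sketch:
>
> 		/* Revalidate under the PTL, then make the PTE present again */
> 		spin_lock(ptl);
> 		if (likely(pte_same(*ptep, pte))) {
> 			pte = pte_mknonuma(pte);
> 			set_pte_at(mm, addr, ptep, pte);
> 			update_mmu_cache(vma, addr, ptep);
> 		}
> 		pte_unmap_unlock(ptep, ptl);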
>
> Patches 9-11 add a migrate-on-fault mode that applications can specifically
> 	ask for. Applications can take advantage of this if they wish. It
> 	also means that if automatic balancing were broken for some workload,
> 	the application could disable the automatic balancing but still
> 	get some of the advantage.
>
> Patch 12 adds migrate_misplaced_page which is responsible for migrating
> 	a page to a new location.
>
> Patch 13 migrates the page on fault if mpol_misplaced() says to do so.
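>
> 	Put together, the fault-side use of patches 12 and 13 amounts to
> 	something like this sketch:
>
> 		page = vm_normal_page(vma, addr, pte);
> 		if (page) {
> 			int target_nid = mpol_misplaced(page, vma, addr);
>
> 			/* -1 means the page is already where it should be */
> 			if (target_nid != -1)
> 				migrate_misplaced_page(page, target_nid);
> 		}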
>
> Patch 14 adds an MPOL_MF_LAZY mempolicy flag that an interested application
> 	can use. On the next reference, the memory should be migrated to the
> 	node that references it.
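>
> 	From userspace, assuming the flag value is picked up from the
> 	series' headers (MPOL_MF_LAZY is not in mainline), the usage would
> 	look something like:
>
> 		#include <numaif.h>
>
> 		/* Bind to node 1 but defer the copy until next reference */
> 		unsigned long nodemask = 1UL << 1;
>
> 		mbind(addr, length, MPOL_BIND, &nodemask,
> 		      sizeof(nodemask) * 8, MPOL_MF_MOVE | MPOL_MF_LAZY);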
>
> Patch 15 sets pte_numa within the context of the scheduler.
>
> Patch 16 adds some vmstats that can be used to approximate the cost of the
> 	scheduling policy in a more fine-grained fashion than looking at
> 	the system CPU usage.
>
> Patch 17 implements the MORON policy.
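>
> 	The whole "policy" can be summarised in a few lines; this is the
> 	idea rather than the literal implementation:
>
> 		/* MORON: on a hinting fault, chase the faulting CPU's node */
> 		int this_nid = numa_node_id();
>
> 		if (page_to_nid(page) != this_nid)
> 			migrate_misplaced_page(page, this_nid);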
>
> Patches 18-19 note that marking pte_numa across the whole address space at
> 	once has a number of disadvantages and instead incrementally update
> 	a limited range of the address space each tick.
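>
> 	Schematically, and with illustrative names (numa_scan_offset,
> 	NUMA_SCAN_SIZE and change_prot_numa are placeholders here, not
> 	necessarily what the patches call them), the per-tick work is:
>
> 		/* Deferred from the scheduler tick; mmap_sem held for read */
> 		unsigned long start = mm->numa_scan_offset;
> 		unsigned long end = min(start + NUMA_SCAN_SIZE, TASK_SIZE);
> 		struct vm_area_struct *vma;
>
> 		for (vma = find_vma(mm, start); vma && vma->vm_start < end;
> 		     vma = vma->vm_next)
> 			change_prot_numa(vma, max(start, vma->vm_start),
> 					 min(end, vma->vm_end));
>
> 		/* Wrap to restart the scan once the end is reached */
> 		mm->numa_scan_offset = (end >= TASK_SIZE) ? 0 : end;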
>
> The obvious next step is to rebase a proper placement policy on top of this
> foundation and compare it to MORON (or any other placement policy). It
> should be possible to share optimisations between different policies to
> allow meaningful comparisons.
>
> For now, I am going to compare this patchset with the most recent posting
> of schednuma and autonuma just to get a feeling for where it stands. I
> only ran the autonuma benchmark and specjbb tests.
>
> The baseline kernel has stat patches 1-3 applied.

Hello Mel,

my two-node machine hit a kernel panic after I applied the patch set (based
on kernel 3.7.0-rc4); please review it:

.....
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.7.0-rc4+ root=UUID=a557cd78-962e-48a2-b606-c77b3d8d22dd console=ttyS0,115200 console=tty0 ro rd.md=0 rd.lvm=0 rd.dm=0 rd.luks=0 init 3 debug earlyprintk=ttyS0,115200 LANG=en_US.UTF-8
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[    0.000000] __ex_table already sorted, skipping sort
[    0.000000] Checking aperture...
[    0.000000] No AGP bridge found
[    0.000000] Memory: 8102020k/10485760k available (6112k kernel code, 2108912k absent, 274828k reserved, 3823k data, 1176k init)
[    0.000000] ------------[ cut here ]------------
[    0.000000] kernel BUG at mm/mempolicy.c:1785!
[    0.000000] invalid opcode: 0000 [#1] SMP
[    0.000000] Modules linked in:
[    0.000000] CPU 0
[    0.000000] Pid: 0, comm: swapper Not tainted 3.7.0-rc4+ #9 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356
[    0.000000] RIP: 0010:[<ffffffff81175b0e>]  [<ffffffff81175b0e>] policy_zonelist+0x1e/0xa0
[    0.000000] RSP: 0000:ffffffff818afe68  EFLAGS: 00010093
[    0.000000] RAX: 0000000000000000 RBX: ffffffff81cbfe00 RCX: 000000000000049d
[    0.000000] RDX: 0000000000000000 RSI: ffffffff81cbfe00 RDI: 0000000000008000
[    0.000000] RBP: ffffffff818afe78 R08: 203a79726f6d654d R09: 0000000000000179
[    0.000000] R10: 303138203a79726f R11: 30312f6b30323032 R12: 0000000000008000
[    0.000000] R13: 0000000000000000 R14: ffffffff818c1420 R15: ffffffff818c1420
[    0.000000] FS:  0000000000000000(0000) GS:ffff88017bc00000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[    0.000000] CR2: 0000000000000000 CR3: 00000000018b9000 CR4: 00000000000006b0
[    0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.000000] Process swapper (pid: 0, threadinfo ffffffff818ae000, task ffffffff818c1420)
[    0.000000] Stack:
[    0.000000]  ffff88027ffbe8c0 ffffffff81cbfe00 ffffffff818afec8 ffffffff81176966
[    0.000000]  0000000000000000 0000000000000030 ffffffff818afef8 0000000000000100
[    0.000000]  ffffffff81a12000 0000000000000000 ffff88027ffbe8c0 000000007b5d69a0
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff81176966>] alloc_pages_current+0xa6/0x170
[    0.000000]  [<ffffffff81137a44>] __get_free_pages+0x14/0x50
[    0.000000]  [<ffffffff819efd9b>] kmem_cache_init+0x53/0x2d2
[    0.000000]  [<ffffffff819caa53>] start_kernel+0x1e0/0x3c7
[    0.000000]  [<ffffffff819ca672>] ? repair_env_string+0x5e/0x5e
[    0.000000]  [<ffffffff819ca356>] x86_64_start_reservations+0x131/0x135
[    0.000000]  [<ffffffff819ca45a>] x86_64_start_kernel+0x100/0x10f
[    0.000000] Code: e4 17 00 48 89 e5 5d c3 0f 1f 44 00 00 e8 cb e2 47 00 55 48 89 e5 53 48 83 ec 08 0f b7 46 04 66 83 f8 01 74 08 66 83 f8 02 74 42 <0f> 0b 89 fb 81 e3 00 00 04 00 f6 46 06 02 75 04 0f bf 56 08 31
[    0.000000] RIP  [<ffffffff81175b0e>] policy_zonelist+0x1e/0xa0
[    0.000000]  RSP <ffffffff818afe68>
[    0.000000] ---[ end trace ce62cfec816bb3fe ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
......

The config file is attached, and no such issue was found in mainline.
Please let me know if you need further info.

Thanks,
Zhouping

