Message-Id: <1416478790-27522-1-git-send-email-mgorman@suse.de>
Date:	Thu, 20 Nov 2014 10:19:40 +0000
From:	Mel Gorman <mgorman@...e.de>
To:	Linux Kernel <linux-kernel@...r.kernel.org>
Cc:	Linux-MM <linux-mm@...ck.org>,
	LinuxPPC-dev <linuxppc-dev@...ts.ozlabs.org>,
	Aneesh Kumar <aneesh.kumar@...ux.vnet.ibm.com>,
	Hugh Dickins <hughd@...gle.com>, Dave Jones <davej@...hat.com>,
	Rik van Riel <riel@...hat.com>,
	Ingo Molnar <mingo@...hat.com>,
	Kirill Shutemov <kirill.shutemov@...ux.intel.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	Benjamin Herrenschmidt <benh@...nel.crashing.org>,
	Paul Mackerras <paulus@...ba.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Mel Gorman <mgorman@...e.de>
Subject: [PATCH 0/10] Replace _PAGE_NUMA with PAGE_NONE protections v2

V1 failed very quickly while running under kvm-tools and a second report
indicated that it happens on bare metal as well. This version survived
an overnight run of trinity running under kvm-tools here but verification
from Sasha would be appreciated.

Changelog since V1
o ppc64 paranoia checks and clarifications			(aneesh)
o Fix trinity regression (hopefully)
o Reduce unnecessary TLB flushes				(mel)

Automatic NUMA balancing depends on being able to protect PTEs to trap a
fault and gather reference locality information. Very broadly speaking it
would mark PTEs as not present and use another bit to distinguish between
NUMA hinting faults and other types of faults. It was universally loved
by everybody and caused no problems whatsoever. That last sentence might
be a lie.
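
As a userspace analogy of that mechanism (illustration only, not code from
this series), mprotect(PROT_NONE) plays the role of the PTE protection
change and a SIGSEGV handler plays the role of the hinting fault handler
that notes the access and then restores access so the load can complete:

/*
 * Userspace analogy of the trap-and-gather mechanism described above.
 * mprotect(PROT_NONE) stands in for the PTE protection change and the
 * SIGSEGV handler stands in for the hinting fault handler.  Illustration
 * only; calling mprotect() from a signal handler is fine for a demo but
 * is not on the async-signal-safe list.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t page_size;

static void hinting_fault(int sig, siginfo_t *si, void *uc)
{
	void *page = (void *)((uintptr_t)si->si_addr & ~(page_size - 1));

	(void)sig;
	(void)uc;

	/* "Gather locality information": just note that a fault happened. */
	write(STDOUT_FILENO, "trapped access\n", 15);

	/* Resolve the fault by restoring the original protections. */
	mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
	struct sigaction sa;
	char *buf;

	page_size = (size_t)sysconf(_SC_PAGESIZE);

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = hinting_fault;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGSEGV, &sa, NULL);

	buf = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	buf[0] = 42;

	/* "Protect" the page so that the next access traps. */
	mprotect(buf, page_size, PROT_NONE);

	/* This access faults, is trapped, resolved and then retried. */
	printf("value after trap and restore: %d\n", buf[0]);

	munmap(buf, page_size);
	return 0;
}

The kernel obviously does not bounce through signals for this; the point is
only that stripping access from a mapping is enough to trap and account the
next reference, which is all the protection-based approach needs.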

This series is very heavily based on patches from Linus and Aneesh to
replace the existing PTE/PMD NUMA helper functions with normal change
protections. I did alter and add parts of it but I consider them relatively
minor contributions. At their suggestion, acked-bys are in there but I've
no problem converting them to Signed-off-by if requested.
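
Very roughly, the shape of the change is as follows. The flag values and
helpers below are invented for illustration and are not the actual patch:
the old code tested a dedicated software bit, while the new code tests for
PROT_NONE-style protections and relies on the VMA to tell a hinting fault
apart from a real protection violation.

/*
 * Simplified sketch of replacing a dedicated _PAGE_NUMA-style bit with
 * PAGE_NONE protections.  Flag values and helper names are made up.
 */
#include <stdbool.h>
#include <stdio.h>

#define PTE_PRESENT	0x1UL
#define PTE_PROTNONE	0x2UL	/* stands in for "accessibility stripped" */
#define PTE_NUMA	0x4UL	/* stands in for the old dedicated bit */

typedef unsigned long pte_t;

/* Old scheme: a dedicated software bit marks NUMA hinting candidates. */
static bool old_pte_numa(pte_t pte)
{
	return pte & PTE_NUMA;
}

/*
 * New scheme: the PTE simply carries PROT_NONE-style protections.  A
 * fault on such a PTE is a NUMA hinting fault only if the VMA says the
 * mapping should be accessible; otherwise it is a real protection fault.
 */
static bool new_numa_hinting_fault(pte_t pte, bool vma_accessible)
{
	return (pte & PTE_PROTNONE) && vma_accessible;
}

int main(void)
{
	pte_t old_style = PTE_PRESENT | PTE_NUMA;
	pte_t new_style = PTE_PROTNONE;

	printf("old helper sees a NUMA pte:         %d\n",
	       old_pte_numa(old_style));
	printf("new check, accessible vma:          %d\n",
	       new_numa_hinting_fault(new_style, true));
	printf("new check, genuinely PROT_NONE vma: %d\n",
	       new_numa_hinting_fault(new_style, false));
	return 0;
}

i.e. once the normal protection bits carry the information, no dedicated
_PAGE_NUMA-style bit has to be found and maintained on each architecture.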

AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh for
that. Trinity under kvm-tool passed, and I ran a few other basic
tests. In most cases I'm leaving out detail as it's not that interesting.

specjbb single JVM: There was negligible performance difference in the
	benchmark itself for short and long runs. However, system activity
	is higher and interrupts are much higher over time -- possibly due
	to TLB flushes. Migrations are also higher. Overall, this is more
	overhead but considering the problems faced with the old approach
	I think we just have to suck it up and find another way of reducing
	the overhead.

specjbb multi JVM: Negligible performance difference to the actual benchmark
	but like the single JVM case, the system overhead is noticeably
	higher.  Again, interrupts are a major factor.

autonumabench: This was all over the place and about all that can be
	reasonably concluded is that it's different but not necessarily
	better or worse.

autonumabench
                                     3.18.0-rc4            3.18.0-rc4
                                        vanilla         protnone-v2r5
User    NUMA01               32806.01 (  0.00%)    20250.67 ( 38.27%)
User    NUMA01_THEADLOCAL    23910.28 (  0.00%)    22734.37 (  4.92%)
User    NUMA02                3176.85 (  0.00%)     3082.68 (  2.96%)
User    NUMA02_SMT            1600.06 (  0.00%)     1547.08 (  3.31%)
System  NUMA01                 719.07 (  0.00%)     1344.39 (-86.96%)
System  NUMA01_THEADLOCAL      916.26 (  0.00%)      180.90 ( 80.26%)
System  NUMA02                  20.92 (  0.00%)       17.34 ( 17.11%)
System  NUMA02_SMT               8.76 (  0.00%)        7.24 ( 17.35%)
Elapsed NUMA01                 728.27 (  0.00%)      519.28 ( 28.70%)
Elapsed NUMA01_THEADLOCAL      589.15 (  0.00%)      554.73 (  5.84%)
Elapsed NUMA02                  81.20 (  0.00%)       81.72 ( -0.64%)
Elapsed NUMA02_SMT              80.49 (  0.00%)       79.58 (  1.13%)
CPU     NUMA01                4603.00 (  0.00%)     4158.00 (  9.67%)
CPU     NUMA01_THEADLOCAL     4213.00 (  0.00%)     4130.00 (  1.97%)
CPU     NUMA02                3937.00 (  0.00%)     3793.00 (  3.66%)
CPU     NUMA02_SMT            1998.00 (  0.00%)     1952.00 (  2.30%)


System CPU usage of NUMA01 is worse but it's an adverse workload on this
machine so I'm reluctant to conclude that it's a problem that matters. On
the other workloads that are sensible on this machine, system CPU usage
is great.  Overall time to complete the benchmark is comparable:

          3.18.0-rc4  3.18.0-rc4
             vanilla protnone-v2r5
User        61493.38    47615.01
System       1665.17     1550.07
Elapsed      1480.79     1236.74

NUMA alloc hit                 4739774     5328362
NUMA alloc miss                      0           0
NUMA interleave hit                  0           0
NUMA alloc local               4664980     5328351
NUMA base PTE updates        556489407   444119981
NUMA huge PMD updates          1086000      866680
NUMA page range updates     1112521407   887860141
NUMA hint faults               1538964     1242142
NUMA hint local faults          835871      814313
NUMA hint local percent             54          65
NUMA pages migrated            7329212    59883854

The NUMA pages migrated figure looks terrible but a graph of the activity
over time shows that the massive spike in migration activity was during
NUMA01. This correlates with high system CPU usage and could simply be
down to bad luck but any modifications that affect that workload would be
related to scan rates and migrations, not the protection mechanism. For
all other workloads, migration activity was comparable.

Overall, headline performance figures are comparable but the overhead
is higher, mostly in interrupts. To some extent, higher overhead from
this approach was anticipated but not to this degree. It's going to be
necessary to reduce this again with a separate series in the future. It's
still worth going ahead with this series though as it's likely to avoid
constant headaches with Xen and is probably easier to maintain.

 arch/powerpc/include/asm/pgtable.h    |  53 ++----------
 arch/powerpc/include/asm/pte-common.h |   5 --
 arch/powerpc/include/asm/pte-hash64.h |   6 --
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   2 +-
 arch/powerpc/mm/copro_fault.c         |   8 +-
 arch/powerpc/mm/fault.c               |  25 ++----
 arch/powerpc/mm/gup.c                 |   4 +-
 arch/powerpc/mm/pgtable.c             |   8 +-
 arch/powerpc/mm/pgtable_64.c          |   3 +-
 arch/x86/include/asm/pgtable.h        |  46 +++++-----
 arch/x86/include/asm/pgtable_64.h     |   5 --
 arch/x86/include/asm/pgtable_types.h  |  41 +--------
 arch/x86/mm/gup.c                     |   4 +-
 include/asm-generic/pgtable.h         | 152 ++--------------------------------
 include/linux/migrate.h               |   4 -
 include/linux/swapops.h               |   2 +-
 include/uapi/linux/mempolicy.h        |   2 +-
 mm/gup.c                              |   8 +-
 mm/huge_memory.c                      |  50 ++++++-----
 mm/memory.c                           |  18 ++--
 mm/mempolicy.c                        |   2 +-
 mm/migrate.c                          |   8 +-
 mm/mprotect.c                         |  48 +++++------
 mm/pgtable-generic.c                  |   2 -
 24 files changed, 131 insertions(+), 375 deletions(-)

-- 
2.1.2
