Message-ID: <20150306002102.GU30405@awork2.anarazel.de>
Date:	Fri, 6 Mar 2015 01:21:02 +0100
From:	Andres Freund <andres@...razel.de>
To:	Vlastimil Babka <vbabka@...e.cz>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Davidlohr Bueso <dave@...olabs.net>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, Hugh Dickins <hughd@...gle.com>,
	Andrea Arcangeli <aarcange@...hat.com>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Rik van Riel <riel@...hat.com>, Mel Gorman <mgorman@...e.de>,
	Michal Hocko <mhocko@...e.cz>,
	Ebru Akagunduz <ebru.akagunduz@...il.com>,
	Alex Thorlton <athorlton@....com>,
	David Rientjes <rientjes@...gle.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	Robert Haas <robertmhaas@...il.com>,
	Josh Berkus <josh@...iodbs.com>
Subject: Re: [RFC 0/6] the big khugepaged redesign

Long mail ahead, sorry for that.

TL;DR: THP is still noticeable, but not nearly as bad.

On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
> That however means the workload is based on hugetlbfs and shouldn't trigger THP
> page fault activity, which is the aim of this patchset. Some more googling made
> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
> patchset should help, but I obviously won't be able to measure this before LSF/MM...

Just as a reference, this is what some of the more extreme profiles
looked like in the past:

>     96.50%    postmaster  [kernel.kallsyms]         [k] _spin_lock_irq
>               |
>               --- _spin_lock_irq
>                  |
>                  |--99.87%-- compact_zone
>                  |          compact_zone_order
>                  |          try_to_compact_pages
>                  |          __alloc_pages_nodemask
>                  |          alloc_pages_vma
>                  |          do_huge_pmd_anonymous_page
>                  |          handle_mm_fault
>                  |          __do_page_fault
>                  |          do_page_fault
>                  |          page_fault
>                  |          0x631d98
>                   --0.13%-- [...]

That specific profile is from a rather old kernel, as you probably
recognize.

> I'm CCing the psql guys from last year LSF/MM - do you have any insight about
> psql performance with THPs enabled/disabled on recent kernels, where e.g.
> compaction is no longer synchronous for THP page faults?

So, I've managed to get a machine upgraded to 3.19. 4 x E5-4620, 256GB
RAM.

First off: It's noticeably harder to trigger problems than it used to
be. But I can still trigger various problems that are much worse with
THP enabled than without.

There seem to be various different bottlenecks; I can get somewhat
different profiles.

In a somewhat artificial workload that tries to simulate what I've seen
trigger the problem at a customer, I can quite easily trigger large
differences between THP=enable and THP=never.  There are two types of
tasks running: one purely OLTP, the other running somewhat more complex
statements that require a fair amount of process-local memory.
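
For anyone reproducing the comparison, here is a minimal sketch (not part
of the original test setup) of checking which THP mode is active. It reads
the standard sysfs knob /sys/kernel/mm/transparent_hugepage/enabled, where
the active mode is the bracketed word. Treating "THP=enable" above as the
"always" mode is an assumption; the helper names are hypothetical.

```python
from pathlib import Path

# Standard sysfs location of the THP mode knob on Linux.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from e.g. 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active mode found in {text!r}")


def current_thp_mode(path: Path = THP_ENABLED) -> str:
    """Read the knob and return the active mode (hypothetical helper)."""
    return parse_thp_mode(path.read_text())


if __name__ == "__main__":
    try:
        print("THP mode:", current_thp_mode())
    except (OSError, ValueError) as exc:
        # Non-Linux systems or kernels built without THP lack this file.
        print("could not determine THP mode:", exc)
```

Switching modes is done by writing one of the listed words to the same
file as root.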

(ignore the absolute numbers for progress, I just waited for somewhat
stable results while doing other stuff)

THP off:
Task 1 solo:
progress: 200.0 s, 391442.0 tps, 0.654 ms lat
progress: 201.0 s, 394816.1 tps, 0.683 ms lat
progress: 202.0 s, 409722.5 tps, 0.625 ms lat
progress: 203.0 s, 384794.9 tps, 0.665 ms lat

combined:
Task 1:
progress: 144.0 s, 25430.4 tps, 10.067 ms lat
progress: 145.0 s, 22260.3 tps, 11.500 ms lat
progress: 146.0 s, 24089.9 tps, 10.627 ms lat
progress: 147.0 s, 25888.8 tps, 9.888 ms lat

Task 2:
progress: 24.4 s, 30.0 tps, 2134.043 ms lat
progress: 26.5 s, 29.8 tps, 2150.487 ms lat
progress: 28.4 s, 29.7 tps, 2151.557 ms lat
progress: 30.4 s, 28.5 tps, 2245.304 ms lat

flat profile:
     6.07%      postgres  postgres            [.] heap_form_minimal_tuple
     4.36%      postgres  postgres            [.] heap_fill_tuple
     4.22%      postgres  postgres            [.] ExecStoreMinimalTuple
     4.11%      postgres  postgres            [.] AllocSetAlloc
     3.97%      postgres  postgres            [.] advance_aggregates
     3.94%      postgres  postgres            [.] advance_transition_function
     3.94%      postgres  postgres            [.] ExecMakeTableFunctionResult
     3.33%      postgres  postgres            [.] heap_compute_data_size
     3.30%      postgres  postgres            [.] MemoryContextReset
     3.28%      postgres  postgres            [.] ExecScan
     3.04%      postgres  postgres            [.] ExecProject
     2.96%      postgres  postgres            [.] generate_series_step_int4
     2.94%      postgres  [kernel.kallsyms]   [k] clear_page_c

(i.e. most of it postgres, cache miss bound)

THP on:
Task 1 solo:
progress: 140.0 s, 390458.1 tps, 0.656 ms lat
progress: 141.0 s, 391174.2 tps, 0.654 ms lat
progress: 142.0 s, 394828.8 tps, 0.648 ms lat
progress: 143.0 s, 398156.2 tps, 0.643 ms lat

combined:
Task 1:
progress: 179.0 s, 23963.1 tps, 10.683 ms lat
progress: 180.0 s, 22712.9 tps, 11.271 ms lat
progress: 181.0 s, 21211.4 tps, 12.069 ms lat
progress: 182.0 s, 23207.8 tps, 11.031 ms lat

Task 2:
progress: 28.2 s, 19.1 tps, 3349.747 ms lat
progress: 31.0 s, 19.8 tps, 3230.589 ms lat
progress: 34.3 s, 21.5 tps, 2979.113 ms lat
progress: 37.4 s, 20.9 tps, 3055.143 ms lat

flat profile:
    21.36%      postgres  [kernel.kallsyms]   [k] pageblock_pfn_to_page
     4.93%      postgres  postgres            [.] ExecStoreMinimalTuple
     4.02%      postgres  postgres            [.] heap_form_minimal_tuple
     3.55%      postgres  [kernel.kallsyms]   [k] clear_page_c
     2.85%      postgres  postgres            [.] heap_fill_tuple
     2.60%      postgres  postgres            [.] ExecMakeTableFunctionResult
     2.57%      postgres  postgres            [.] AllocSetAlloc
     2.44%      postgres  postgres            [.] advance_transition_function
     2.43%      postgres  postgres            [.] generate_series_step_int4

callgraph:
    18.23%      postgres  [kernel.kallsyms]   [k] pageblock_pfn_to_page
                |
                --- pageblock_pfn_to_page
                   |
                   |--99.05%-- isolate_migratepages
                   |          compact_zone
                   |          compact_zone_order
                   |          try_to_compact_pages
                   |          __alloc_pages_direct_compact
                   |          __alloc_pages_nodemask
                   |          alloc_pages_vma
                   |          do_huge_pmd_anonymous_page
                   |          __handle_mm_fault
                   |          handle_mm_fault
                   |          __do_page_fault
                   |          do_page_fault
                   |          page_fault
....
                   |
                    --0.95%-- compact_zone
                              compact_zone_order
                              try_to_compact_pages
                              __alloc_pages_direct_compact
                              __alloc_pages_nodemask
                              alloc_pages_vma
                              do_huge_pmd_anonymous_page
                              __handle_mm_fault
                              handle_mm_fault
                              __do_page_fault
     4.98%      postgres  postgres            [.] ExecStoreMinimalTuple
                |
     4.20%      postgres  postgres            [.] heap_form_minimal_tuple
                |
     3.69%      postgres  [kernel.kallsyms]   [k] clear_page_c
                |
                --- clear_page_c
                   |
                   |--58.89%-- __do_huge_pmd_anonymous_page
                   |          do_huge_pmd_anonymous_page
                   |          __handle_mm_fault
                   |          handle_mm_fault
                   |          __do_page_fault
                   |          do_page_fault
                   |          page_fault

As you can see, THP on/off makes a noticeable difference, especially for
Task 2. Compaction suddenly takes a significant amount of time. But
it's a relatively gradual slowdown, at pretty extreme concurrency. So
I'm pretty happy already.
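
To put rough numbers on that difference, the sketch below averages the
four "combined" progress samples quoted in this mail for each task and
prints the relative change in tps. It is only arithmetic on those few
samples, so it gives a ballpark figure (roughly a 7% drop for Task 1 and
a 31% drop for Task 2), not a statistically meaningful result. The
function and variable names are hypothetical.

```python
import re
from statistics import mean

# Matches progress lines such as "progress: 24.4 s, 30.0 tps, 2134.043 ms lat".
PROGRESS_RE = re.compile(r"progress: [\d.]+ s, ([\d.]+) tps, ([\d.]+) ms lat")


def mean_tps(lines: str) -> float:
    """Average the tps field over all progress lines in the text."""
    return mean(float(m.group(1)) for m in PROGRESS_RE.finditer(lines))


def pct_change(off: float, on: float) -> float:
    """Relative change from THP off to THP on, in percent."""
    return (on - off) / off * 100.0


# "combined" samples copied from the mail.
TASK1_OFF = """
progress: 144.0 s, 25430.4 tps, 10.067 ms lat
progress: 145.0 s, 22260.3 tps, 11.500 ms lat
progress: 146.0 s, 24089.9 tps, 10.627 ms lat
progress: 147.0 s, 25888.8 tps, 9.888 ms lat
"""
TASK1_ON = """
progress: 179.0 s, 23963.1 tps, 10.683 ms lat
progress: 180.0 s, 22712.9 tps, 11.271 ms lat
progress: 181.0 s, 21211.4 tps, 12.069 ms lat
progress: 182.0 s, 23207.8 tps, 11.031 ms lat
"""
TASK2_OFF = """
progress: 24.4 s, 30.0 tps, 2134.043 ms lat
progress: 26.5 s, 29.8 tps, 2150.487 ms lat
progress: 28.4 s, 29.7 tps, 2151.557 ms lat
progress: 30.4 s, 28.5 tps, 2245.304 ms lat
"""
TASK2_ON = """
progress: 28.2 s, 19.1 tps, 3349.747 ms lat
progress: 31.0 s, 19.8 tps, 3230.589 ms lat
progress: 34.3 s, 21.5 tps, 2979.113 ms lat
progress: 37.4 s, 20.9 tps, 3055.143 ms lat
"""

if __name__ == "__main__":
    for name, off, on in (("Task 1", TASK1_OFF, TASK1_ON),
                          ("Task 2", TASK2_OFF, TASK2_ON)):
        change = pct_change(mean_tps(off), mean_tps(on))
        print(f"{name}: {change:+.1f}% tps with THP on")
```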


In the workload tested here, most non-shared allocations are
short-lived. So it's not surprising that it's not worth compacting pages. I do
wonder whether it'd be possible to keep some running statistics about
THP being worthwhile or not.
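
As a userspace approximation only (this is not the in-kernel statistic
wondered about above), one can watch the THP and compaction counters in
/proc/vmstat while the workload runs. The sketch below assumes a kernel
built with THP and compaction support, so that counters such as
thp_fault_alloc, thp_fault_fallback, compact_stall, compact_fail and
compact_success are present; the helper names are hypothetical.

```python
import time
from pathlib import Path

# Counters of interest; availability depends on the kernel configuration.
COUNTERS = ("thp_fault_alloc", "thp_fault_fallback",
            "compact_stall", "compact_fail", "compact_success")


def parse_vmstat(text: str) -> dict:
    """Extract the selected counters from /proc/vmstat-style 'name value' lines."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if name in COUNTERS and value.strip().isdigit():
            stats[name] = int(value)
    return stats


def counter_deltas(before: dict, after: dict) -> dict:
    """Difference of each counter that is present in both snapshots."""
    return {name: after[name] - before[name]
            for name in COUNTERS if name in before and name in after}


if __name__ == "__main__":
    vmstat = Path("/proc/vmstat")
    try:
        first = parse_vmstat(vmstat.read_text())
        time.sleep(1)  # sampling interval, chosen arbitrarily
        second = parse_vmstat(vmstat.read_text())
        print(counter_deltas(first, second))
    except OSError as exc:
        print("could not read /proc/vmstat:", exc)
```

A high compact_stall rate next to few successful huge page allocations
would hint that compaction is costing more than THP gains for that
workload, though these counters are system-wide rather than per-process.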


This is just one workload, and I saw some different profiles while
playing around. But I've already invested more time in this today than I
should have... :)


BTW, parallel process exits with large shared mappings aren't
particularly fun:

    80.09%      postgres  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
                |
                --- _raw_spin_lock_irqsave
                   |
                   |--99.97%-- pagevec_lru_move_fn
                   |          |
                   |          |--65.51%-- activate_page
                   |          |          mark_page_accessed.part.23
                   |          |          mark_page_accessed
                   |          |          zap_pte_range
                   |          |          unmap_page_range
                   |          |          unmap_single_vma
                   |          |          unmap_vmas
                   |          |          exit_mmap
                   |          |          mmput.part.27
                   |          |          mmput
                   |          |          exit_mm
                   |          |          do_exit
                   |          |          do_group_exit
                   |          |          sys_exit_group
                   |          |          system_call_fastpath
                   |          |
                   |           --34.49%-- lru_add_drain_cpu
                   |                     lru_add_drain
                   |                     free_pages_and_swap_cache
                   |                     tlb_flush_mmu_free
                   |                     zap_pte_range
                   |                     unmap_page_range
                   |                     unmap_single_vma
                   |                     unmap_vmas
                   |                     exit_mmap
                   |                     mmput.part.27
                   |                     mmput
                   |                     exit_mm
                   |                     do_exit
                   |                     do_group_exit
                   |                     sys_exit_group
                   |                     system_call_fastpath
                    --0.03%-- [...]

     9.75%      postgres  [kernel.kallsyms]  [k] zap_pte_range
                |
                --- zap_pte_range
                    unmap_page_range
                    unmap_single_vma
                    unmap_vmas
                    exit_mmap
                    mmput.part.27
                    mmput
                    exit_mm
                    do_exit
                    do_group_exit
                    sys_exit_group
                    system_call_fastpath

     1.93%      postgres  [kernel.kallsyms]  [k] release_pages
                |
                --- release_pages
                   |
                   |--77.09%-- free_pages_and_swap_cache
                   |          tlb_flush_mmu_free
                   |          zap_pte_range
                   |          unmap_page_range
                   |          unmap_single_vma
                   |          unmap_vmas
                   |          exit_mmap
                   |          mmput.part.27
                   |          mmput
                   |          exit_mm
                   |          do_exit
                   |          do_group_exit
                   |          sys_exit_group
                   |          system_call_fastpath
                   |
                   |--22.64%-- pagevec_lru_move_fn
                   |          |
                   |          |--63.88%-- activate_page
                   |          |          mark_page_accessed.part.23
                   |          |          mark_page_accessed
                   |          |          zap_pte_range
                   |          |          unmap_page_range
                   |          |          unmap_single_vma
                   |          |          unmap_vmas
                   |          |          exit_mmap
                   |          |          mmput.part.27
                   |          |          mmput
                   |          |          exit_mm
                   |          |          do_exit
                   |          |          do_group_exit
                   |          |          sys_exit_group
                   |          |          system_call_fastpath
                   |          |
                   |           --36.12%-- lru_add_drain_cpu
                   |                     lru_add_drain
                   |                     free_pages_and_swap_cache
                   |                     tlb_flush_mmu_free
                   |                     zap_pte_range
                   |                     unmap_page_range
                   |                     unmap_single_vma
                   |                     unmap_vmas
                   |                     exit_mmap
                   |                     mmput.part.27
                   |                     mmput
                   |                     exit_mm
                   |                     do_exit
                   |                     do_group_exit
                   |                     sys_exit_group
                   |                     system_call_fastpath
                    --0.27%-- [...]

     1.91%      postgres  [kernel.kallsyms]  [k] page_remove_file_rmap
                |
                --- page_remove_file_rmap
                   |
                   |--98.18%-- page_remove_rmap
                   |          zap_pte_range
                   |          unmap_page_range
                   |          unmap_single_vma
                   |          unmap_vmas
                   |          exit_mmap
                   |          mmput.part.27
                   |          mmput
                   |          exit_mm
                   |          do_exit
                   |          do_group_exit
                   |          sys_exit_group
                   |          system_call_fastpath
                   |
                    --1.82%-- zap_pte_range
                              unmap_page_range
                              unmap_single_vma
                              unmap_vmas
                              exit_mmap
                              mmput.part.27
                              mmput
                              exit_mm
                              do_exit
                              do_group_exit
                              sys_exit_group
                              system_call_fastpath



Greetings,

Andres Freund

--
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services