Message-ID: <571565F0.9070203@linaro.org>
Date: Mon, 18 Apr 2016 15:55:44 -0700
From: "Shi, Yang" <yang.shi@...aro.org>
To: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Hugh Dickins <hughd@...gle.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: Dave Hansen <dave.hansen@...el.com>,
Vlastimil Babka <vbabka@...e.cz>,
Christoph Lameter <cl@...two.org>,
Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
Jerome Marchand <jmarchan@...hat.com>,
Sasha Levin <sasha.levin@...cle.com>,
Andres Lagar-Cavilla <andreslc@...gle.com>,
Ning Qu <quning@...il.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages
Hi Kirill,
I finally got some time to look into and try your and Hugh's patches,
and ran into two problems.
1. A quick boot-up test on my ARM64 machine with your v7 tree shows some
unexpected errors:
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:16863: No space left on device
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:16865: No space left on device
Starting DNS forwarder and DHCP server.systemd-journald[285]:
Failed to save stream data /run/systemd/journal/streams/8:16867: No
space left on device
..
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:16869: No space left on device
Starting Postfix Mail Transport Agent...
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:16871: No space left on device
Starting Berkeley Internet Name Domain (DNS)...
Starting Wait for Network to be Configured...
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:2422: No space left on device
[ OK ] Started /etc/rc.local Compatibility.
[FAILED] Failed to start DNS forwarder and DHCP server.
See 'systemctl status dnsmasq.service' for details.
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:2425: No space left on device
[ OK ] Started Serial Getty on ttyS1.
[ OK ] Started Serial Getty on ttyS0.
[ OK ] Started Getty on tty1.
systemd-journald[285]: Failed to save stream data
/run/systemd/journal/streams/8:2433: No space left on device
[FAILED] Failed to start Berkeley Internet Name Domain (DNS).
See 'systemctl status named.service' for details.
The /run directory is mounted as tmpfs.
An x86 boot doesn't hit this error, and Hugh's patches don't have this
problem.
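ENOSPC on tmpfs can come from either the block limit or the inode limit,
so looking at both for /run may help narrow this down. A minimal Python
sketch using os.statvfs(); the helper name fs_usage is only for
illustration and is not part of any patchset:

```python
import os

def fs_usage(path):
    """Return block and inode usage for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_bytes": st.f_bfree * st.f_frsize,
        "total_inodes": st.f_files,
        "free_inodes": st.f_ffree,
    }
```

Calling fs_usage("/run") right after the failures show up would tell
whether the mount is out of blocks or out of inodes.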
2. I ran my THP test (a generated program with a 4MB text section) on both
x86-64 and ARM64 with your and Hugh's patches (linux-next tree). On x86-64
the program execution time dropped by ~12%, which looks very
impressive.
But on ARM64 there is only a ~3% change, and sometimes huge tmpfs
shows even worse numbers than non-hugepage.
Both your and Hugh's patches behave the same way.
Any idea?
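One way to confirm whether huge pages are actually in use during a run is
to compare the huge-page related counters in /proc/meminfo before and
after. A rough Python sketch; the counter names differ between kernel
versions and patchsets, so it simply picks every field whose name
mentions "Huge", and the sample text is made up for illustration:

```python
def huge_counters(meminfo_text):
    """Pick out the counters whose name mentions 'Huge' from meminfo text."""
    counters = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep or "huge" not in name.lower():
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            counters[name.strip()] = int(fields[0])
    return counters

# Made-up sample in the /proc/meminfo format, not real measurements.
SAMPLE = """MemTotal:        8000000 kB
AnonHugePages:      4096 kB
HugePages_Total:       0
Hugepagesize:       2048 kB
"""
```

On a live system the input would be open("/proc/meminfo").read(). Most
values are in kB, while the HugePages_* fields are plain counts.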
Thanks,
Yang
On 4/15/2016 5:23 PM, Kirill A. Shutemov wrote:
> This is probably the last update before the mm summit. Main focus is on
> khugepaged stability.
>
> khugepaged is in more reasonable shape now. I missed quite a few corner
> cases on the first try. I ran this version through LTP, trinity and
> syzkaller without crashes so far.
>
> The patchset is on top of v4.6-rc3 plus Hugh's "easy preliminaries to
> THPagecache" and Ebru's khugepaged swapin patches from the -mm tree.
>
> Git tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v7
>
> == Changelog ==
>
> v7:
> - khugepaged updates:
> + fix page leak/page cache corruption on collapse fail;
> + filter out VMAs not suitable for huge pages due to misaligned vm_pgoff;
> + fix build without CONFIG_SHMEM;
> + drop a few over-protective checks;
> - fix bogus VM_BUG_ON() in __delete_from_page_cache();
>
> v6:
> - experimental collapse support;
> - fix swapout mapped huge pages;
> - fix page leak in faultaround code;
> - fix excessive huge page allocation with huge=within_size;
> - rename VM_NO_THP to VM_NO_KHUGEPAGED;
> - fix condition in hugepage_madvise();
> - accounting reworked again;
>
> v5:
> - add FileHugeMapped to /proc/PID/smaps;
> - make FileHugeMapped in meminfo aligned with other fields;
> - Documentation/vm/transhuge.txt updated;
>
> v4:
> - first four patches were applied to -mm tree;
> - drop pages beyond i_size on split_huge_pages;
> - few small random bugfixes;
>
> v3:
> - huge= mount option now can have values always, within_size, advise and
> never;
> - sysctl handle is replaced with sysfs knob;
> - MADV_HUGEPAGE/MADV_NOHUGEPAGE is now respected on page allocation via
> page fault;
> - mlock() handling has been fixed;
> - bunch of smaller bugfixes and cleanups.
>
> == Design overview ==
>
> Huge pages are allocated by shmem when it's allowed (by mount option) and
> there are no entries for the range in the radix-tree. A huge page is
> represented by HPAGE_PMD_NR entries in the radix-tree.
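
To make the HPAGE_PMD_NR figure concrete, here is a small Python sketch
of how it scales with the base page size. It assumes 8-byte page-table
entries and that one PMD entry maps a full table of base pages, which is
the usual layout on x86-64 and ARM64; pmd_geometry is a name made up for
this illustration:

```python
def pmd_geometry(page_size, entry_size=8):
    """Return (HPAGE_PMD_NR, PMD-mapped size in bytes) for a base page size.

    Assumes one page-table page holds page_size / entry_size entries and
    that a PMD entry maps one full table of base pages.
    """
    nr = page_size // entry_size
    return nr, nr * page_size

for size in (4096, 16384, 65536):
    nr, huge = pmd_geometry(size)
    print(f"{size // 1024}K pages: HPAGE_PMD_NR={nr}, huge page={huge >> 20}MB")
```

Under these assumptions a 4K kernel gets 512 entries per 2MB huge page,
while 16K and 64K kernels get 32MB and 512MB huge pages. Whether that
matters for the ARM64 numbers above depends on the base page size of the
kernel under test, which is not stated in this thread.
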
>
> MM core maps a page with PMD if ->fault() returns a huge page and the VMA
> is suitable for huge pages (size, alignment). There's no need for two
> requests to the filesystem: the filesystem returns a huge page if it can,
> with graceful fallback to small pages otherwise.
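
The "size, alignment" condition can be pictured roughly as below. This is
only a Python illustration of the idea, not the code from the patchset:
it assumes 4K base pages and 2MB huge pages, and vma_can_map_huge is a
hypothetical helper name.

```python
PAGE_SHIFT = 12                     # assumes 4K base pages
HPAGE_PMD_SIZE = 2 << 20            # assumes 2MB PMD-mapped huge pages
HPAGE_PMD_NR = HPAGE_PMD_SIZE >> PAGE_SHIFT

def vma_can_map_huge(vm_start, vm_end, vm_pgoff):
    """True if [vm_start, vm_end) can hold at least one PMD-mapped file page."""
    # Virtual address and file offset must be congruent modulo the huge
    # page size, otherwise no huge page in the file lines up with a PMD.
    if ((vm_start >> PAGE_SHIFT) - vm_pgoff) % HPAGE_PMD_NR != 0:
        return False
    # The VMA must contain at least one fully aligned huge range.
    first = (vm_start + HPAGE_PMD_SIZE - 1) & ~(HPAGE_PMD_SIZE - 1)
    return first + HPAGE_PMD_SIZE <= vm_end
```

For example, a 2MB mapping at 0x200000 with vm_pgoff 0 passes, while the
same mapping with vm_pgoff 1 fails the congruence test, which is the kind
of misaligned vm_pgoff case mentioned in the v7 changelog.
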
>
> As with DAX, split_huge_pmd() is implemented by unmapping the PMD: we can
> re-fault the page with PTEs later.
>
> Basic scheme for split_huge_page() is the same as for anon-THP.
> Few differences:
>
> - File pages are on radix-tree, so we have head->_count offset by
> HPAGE_PMD_NR. The count gets distributed to small pages during split.
>
> - mapping->tree_lock prevents non-lockless access to pages under split
> over radix-tree;
>
> - Lockless access is prevented by setting the head->_count to 0 during
> split, so get_page_unless_zero() would fail;
>
> - After split, some pages can be beyond i_size. We drop them from
> radix-tree.
>
> - We don't set up migration entries, we just unmap pages. This helps
> handle cases when i_size is in the middle of the page: there is no
> need to handle unmapped pages beyond i_size manually.
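
The i_size point in the list above can be sketched as follows. This is a
simplified Python model under the assumption of 4K base pages and 2MB
huge pages; split_by_i_size is a made-up name and the real split path
does much more (refcounts, locking, unmapping):

```python
PAGE_SIZE = 4096        # assumes 4K base pages
HPAGE_PMD_NR = 512      # assumes 2MB huge pages

def split_by_i_size(head_index, i_size):
    """Split one huge page's subpage indices into (kept, dropped)."""
    end_index = (i_size + PAGE_SIZE - 1) // PAGE_SIZE   # first index past EOF
    indices = range(head_index, head_index + HPAGE_PMD_NR)
    kept = [i for i in indices if i < end_index]
    dropped = [i for i in indices if i >= end_index]
    return kept, dropped
```

With a file of three pages plus one byte, the first four small pages stay
in the page cache after the split and the remaining 508 are dropped.
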
>
> COW mappings are handled at PTE level. It's not clear how beneficial
> allocating huge pages on COW faults would be. And it would require some
> code to make them work.
>
> I think at some point we can consider teaching khugepaged to collapse
> pages in COW mappings, but allocating huge on fault is probably overkill.
>
> As with anon THP, we mlock a file huge page only if it is mapped with PMD.
> PTE-mapped THPs are never mlocked. This way we can avoid all sorts of
> scenarios where we can leak an mlocked page.
>
> As with anon THP, we split huge page on swap out.
>
> Truncate and punch hole that cover only part of a THP range are
> implemented by zeroing out that part of the THP.
>
> This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> As we don't really create a hole in this case, lseek(SEEK_HOLE) may have
> inconsistent results depending on which pages happened to be allocated.
> I don't think this will be a problem.
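
A small Python model of the truncate/punch-hole behaviour described
above, assuming 2MB huge pages. The function name and the three action
labels are invented for this sketch and do not correspond to kernel
symbols:

```python
HPAGE_PMD_SIZE = 2 << 20    # assumes 2MB huge pages

def punch_hole_action(hp_start, hole_start, hole_end):
    """Describe what a hole [hole_start, hole_end) does to one huge page."""
    hp_end = hp_start + HPAGE_PMD_SIZE
    lo, hi = max(hp_start, hole_start), min(hp_end, hole_end)
    if lo >= hi:
        return ("untouched", None)
    if lo == hp_start and hi == hp_end:
        return ("drop", None)           # whole huge page can be freed
    return ("zero", (lo - hp_start, hi - hp_start))  # offsets inside the page
```

In the "zero" case the huge page stays allocated with the range cleared,
so a later lseek(SEEK_HOLE) would not see a hole there, whereas the same
request against small pages would have freed them.
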
>
> == Patchset overview ==
>
> [01/29]
> Update documentation on THP vs. mlock. I've posted it separately
> before. It can go in.
>
> [02-04/29]
> Rework fault path and rmap to handle file pmd. Unlike DAX with
> vm_ops->pmd_fault, we don't need to ask filesystem twice -- first
> for huge page and then for small. If ->fault happened to return
> huge page and VMA is suitable for mapping it as huge, we would
> do so.
>
> [05/29]
> Add support for huge file pages in rmap;
>
> [06-15/29]
> Various preparation of THP core for file pages.
>
> [16-20/29]
> Various preparation of MM core for file pages.
>
> [21-24/29]
> And finally, bring huge pages into tmpfs/shmem.
>
> [25/29]
> Wire up madvise() existing hints for file THP.
> We can implement fadvise() later.
>
> [26/29]
> Documentation update.
>
> [27-29/29]
> Extend khugepaged to support shmem/tmpfs.
>
> Hugh Dickins (1):
> shmem: get_unmapped_area align huge page
>
> Kirill A. Shutemov (28):
> thp, mlock: update unevictable-lru.txt
> mm: do not pass mm_struct into handle_mm_fault
> mm: introduce fault_env
> mm: postpone page table allocation until we have page to map
> rmap: support file thp
> mm: introduce do_set_pmd()
> thp, vmstats: add counters for huge file pages
> thp: support file pages in zap_huge_pmd()
> thp: handle file pages in split_huge_pmd()
> thp: handle file COW faults
> thp: skip file huge pmd on copy_huge_pmd()
> thp: prepare change_huge_pmd() for file thp
> thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
> thp: file pages support for split_huge_page()
> thp, mlock: do not mlock PTE-mapped file huge pages
> vmscan: split file huge pages before paging them out
> page-flags: relax policy for PG_mappedtodisk and PG_reclaim
> radix-tree: implement radix_tree_maybe_preload_order()
> filemap: prepare find and delete operations for huge pages
> truncate: handle file thp
> mm, rmap: account shmem thp pages
> shmem: prepare huge= mount option and sysfs knob
> shmem: add huge pages support
> shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
> thp: update Documentation/vm/transhuge.txt
> thp: extract khugepaged from mm/huge_memory.c
> khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
> khugepaged: add support of collapse for tmpfs/shmem pages
>
> Documentation/filesystems/Locking | 10 +-
> Documentation/vm/transhuge.txt | 130 ++-
> Documentation/vm/unevictable-lru.txt | 21 +
> arch/alpha/mm/fault.c | 2 +-
> arch/arc/mm/fault.c | 2 +-
> arch/arm/mm/fault.c | 2 +-
> arch/arm64/mm/fault.c | 2 +-
> arch/avr32/mm/fault.c | 2 +-
> arch/cris/mm/fault.c | 2 +-
> arch/frv/mm/fault.c | 2 +-
> arch/hexagon/mm/vm_fault.c | 2 +-
> arch/ia64/mm/fault.c | 2 +-
> arch/m32r/mm/fault.c | 2 +-
> arch/m68k/mm/fault.c | 2 +-
> arch/metag/mm/fault.c | 2 +-
> arch/microblaze/mm/fault.c | 2 +-
> arch/mips/mm/fault.c | 2 +-
> arch/mn10300/mm/fault.c | 2 +-
> arch/nios2/mm/fault.c | 2 +-
> arch/openrisc/mm/fault.c | 2 +-
> arch/parisc/mm/fault.c | 2 +-
> arch/powerpc/mm/copro_fault.c | 2 +-
> arch/powerpc/mm/fault.c | 2 +-
> arch/s390/mm/fault.c | 2 +-
> arch/score/mm/fault.c | 2 +-
> arch/sh/mm/fault.c | 2 +-
> arch/sparc/mm/fault_32.c | 4 +-
> arch/sparc/mm/fault_64.c | 2 +-
> arch/tile/mm/fault.c | 2 +-
> arch/um/kernel/trap.c | 2 +-
> arch/unicore32/mm/fault.c | 2 +-
> arch/x86/mm/fault.c | 2 +-
> arch/xtensa/mm/fault.c | 2 +-
> drivers/base/node.c | 13 +-
> drivers/char/mem.c | 24 +
> drivers/iommu/amd_iommu_v2.c | 3 +-
> drivers/iommu/intel-svm.c | 2 +-
> fs/proc/meminfo.c | 7 +-
> fs/proc/task_mmu.c | 10 +-
> fs/userfaultfd.c | 22 +-
> include/linux/huge_mm.h | 36 +-
> include/linux/khugepaged.h | 6 +
> include/linux/mm.h | 51 +-
> include/linux/mmzone.h | 4 +-
> include/linux/page-flags.h | 19 +-
> include/linux/radix-tree.h | 1 +
> include/linux/rmap.h | 2 +-
> include/linux/shmem_fs.h | 29 +-
> include/linux/userfaultfd_k.h | 8 +-
> include/linux/vm_event_item.h | 7 +
> include/trace/events/huge_memory.h | 3 +-
> ipc/shm.c | 6 +-
> lib/radix-tree.c | 68 +-
> mm/Makefile | 2 +-
> mm/filemap.c | 226 ++--
> mm/gup.c | 7 +-
> mm/huge_memory.c | 2028 ++++++----------------------------
> mm/internal.h | 4 +-
> mm/khugepaged.c | 1772 +++++++++++++++++++++++++++++
> mm/ksm.c | 5 +-
> mm/memory.c | 859 +++++++-------
> mm/mempolicy.c | 4 +-
> mm/migrate.c | 5 +-
> mm/mmap.c | 26 +-
> mm/nommu.c | 3 +-
> mm/page-writeback.c | 1 +
> mm/page_alloc.c | 21 +
> mm/rmap.c | 78 +-
> mm/shmem.c | 689 ++++++++++--
> mm/swap.c | 2 +
> mm/truncate.c | 22 +-
> mm/util.c | 6 +
> mm/vmscan.c | 6 +
> mm/vmstat.c | 4 +
> 74 files changed, 3919 insertions(+), 2395 deletions(-)
> create mode 100644 mm/khugepaged.c
>