Message-ID: <20161021050118.GR23194@dastard>
Date:   Fri, 21 Oct 2016 16:01:18 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Andi Kleen <ak@...ux.intel.com>
Cc:     "Kirill A. Shutemov" <kirill@...temov.name>,
        Hugh Dickins <hughd@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] shmem: avoid huge pages for small files

On Thu, Oct 20, 2016 at 07:01:16PM -0700, Andi Kleen wrote:
> > Ugh, no, please don't use mount options for file specific behaviours
> > in filesystems like ext4 and XFS. This is exactly the sort of
> > behaviour that should either just work automatically (i.e. be
> > completely controlled by the filesystem) or only be applied to files
> 
> Can you explain what you mean? How would the file system control it?

There's no point in asking for huge pages when populating the page
cache if the file is:

	- significantly smaller than the huge page size
	- largely sparse
	- being randomly accessed in small chunks
	- badly fragmented and so takes hundreds of IO to read/write
	  a huge page
	- able to optimise delayed allocation to match huge page
	  sizes and alignments

These are all constraints the filesystem knows about, but the
application and user don't. None of these aspects can be optimised
sanely by a single threshold, especially when considering the
combination of access patterns vs file layout.
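
As a rough illustration only (this is not kernel code), the sketch below
combines the first four constraints in the list above into one
filesystem-side decision. The FileState fields, the want_huge_page()
name and every cutoff value are assumptions made up for the example;
the point is only that the decision takes several inputs the filesystem
already has, not one global threshold.

```python
from dataclasses import dataclass

HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumed 2MB huge page size

# Illustrative cutoffs only; none of these numbers come from the thread.
MIN_SIZE = HUGE_PAGE_SIZE // 2        # "significantly smaller" than a huge page
MIN_ALLOCATED_FRACTION = 0.5          # below this the file counts as largely sparse
MIN_TYPICAL_IO = 64 * 1024            # below this, access counts as small chunks
MAX_EXTENTS_PER_HUGE_PAGE = 8         # above this, the file counts as badly fragmented


@dataclass
class FileState:
    """Hypothetical per-file facts a filesystem could consult."""
    size: int                   # file size in bytes
    allocated: int              # bytes backed by allocated blocks (the rest is holes)
    typical_io: int             # typical access size observed so far, in bytes
    extents_per_huge_page: int  # extents that must be read to fill one huge page


def want_huge_page(f: FileState) -> bool:
    """Return True if asking the page cache for a huge page looks worthwhile."""
    if f.size < MIN_SIZE:
        return False  # file is much smaller than a huge page
    if f.allocated < f.size * MIN_ALLOCATED_FRACTION:
        return False  # mostly holes
    if f.typical_io < MIN_TYPICAL_IO:
        return False  # random access in small chunks
    if f.extents_per_huge_page > MAX_EXTENTS_PER_HUGE_PAGE:
        return False  # too many IOs needed to fill one huge page
    return True
```

A dense, contiguous, sequentially accessed 64MB file passes every
check, while a 16k file or a heavily fragmented one does not.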

Further, we are moving the IO path to a model where we use extents
for mapping, not blocks.  We're optimising for the fact that modern
filesystems use extents and so massively reduce the number of block
mapping lookup calls we need to do for a given IO.

i.e. instead of doing "get page, map block to page" over and over
again until we've walked over the entire IO range, we're doing
"map extent for entire IO range" once, then iterating "get page"
until we've mapped the entire range.
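
To make the difference concrete, here is a small model (again, not
kernel code) that counts mapping lookups for one IO under each scheme.
The function names and the (start, length) extent representation are
assumptions for the example, and the block size is assumed to equal a
4k page.

```python
PAGE_SIZE = 4096  # assumed 4k pages, with block size equal to page size


def block_lookups(offset, length):
    """Per-block model: one mapping lookup for every page the IO touches."""
    first_page = offset // PAGE_SIZE
    last_page = (offset + length - 1) // PAGE_SIZE
    return last_page - first_page + 1


def extent_lookups(offset, length, extents):
    """Extent model: one mapping lookup for every extent the IO overlaps.

    extents is a sorted list of non-overlapping (start, length) tuples
    in bytes, standing in for what the filesystem would return.
    """
    end = offset + length
    pos = offset
    lookups = 0
    for start, ext_len in extents:
        ext_end = start + ext_len
        if ext_end <= pos:
            continue  # extent lies entirely before the remaining range
        if start >= end:
            break     # extent lies entirely after the IO
        lookups += 1
        pos = ext_end
    return lookups
```

For a 2MB IO at offset 0 over a single 8MB extent, block_lookups()
gives 512 while extent_lookups() gives 1; the per-page "get page" loop
still runs in both models, but the mapping work does not.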

Hence if we have a 2MB IO come in from userspace, and the iomap
returned covers that entire range, it's a no-brainer to ask the
page cache for a huge page instead of iterating 512 times to map all
the 4k pages needed.
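
A minimal sketch of that choice, under the assumption that the
filesystem has handed back a single mapped extent for the IO: request a
huge page wherever a huge-page-aligned chunk of the IO sits entirely
inside the extent, and fall back to 4k pages elsewhere. The function
name and the flat (start, length) extent description are invented for
the example and are not the iomap interface.

```python
PAGE_SIZE = 4096
HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumed 2MB huge page size


def page_cache_requests(io_offset, io_length, ext_start, ext_length):
    """Return the list of page sizes to request for one IO.

    Assumes io_offset and io_length are multiples of PAGE_SIZE and that
    (ext_start, ext_length) describes the one extent mapped for this IO.
    """
    requests = []
    pos = io_offset
    end = io_offset + io_length
    ext_end = ext_start + ext_length
    while pos < end:
        fits_huge = (
            pos % HUGE_PAGE_SIZE == 0           # aligned in the file
            and pos + HUGE_PAGE_SIZE <= end     # IO covers the whole huge page
            and ext_start <= pos                # extent covers it as well
            and pos + HUGE_PAGE_SIZE <= ext_end
        )
        step = HUGE_PAGE_SIZE if fits_huge else PAGE_SIZE
        requests.append(step)
        pos += step
    return requests
```

An aligned 2MB IO over a 2MB extent yields one huge page request; the
same IO over a 1MB extent, or shifted by 4k, yields 512 small ones.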

> > specifically configured with persistent hints to reliably allocate
> > extents in a way that can be easily mapped to huge pages.
> 
> > e.g. on XFS you will need to apply extent size hints to get large
> > page sized/aligned extent allocation to occur, and so this
> 
> It sounds like you're confusing alignment in memory with alignment
> on disk here? I don't see why on disk alignment would be needed
> at all, unless we're talking about DAX here (which is out of 
> scope currently) Kirill's changes are all about making the memory
> access for cached data more efficient, it's not about disk layout
> optimizations.

No, I'm not confusing this with DAX. However, this automatic use
model for huge pages fits straight into DAX as well.  Same
mechanisms, same behaviours, slightly stricter alignment
characteristics. All stuff the filesystem already knows about.

Mount options are, quite frankly, a terrible mechanism for
specifying filesystem policy. Setting up DAX this way was a mistake,
and it's a mount option I plan to remove from XFS once we get nearer
to having DAX feature complete and stabilised. We've already got
on-disk "use DAX for this file" flags in XFS, so we can easily and
cleanly support different methods of accessing PMEM from the same
filesystem.

As such, there is no way we should be considering different
interfaces and methods for configuring the /same functionality/ just
because DAX is enabled or not. It's the /same decision/ that needs
to be made, and the filesystem knows an awful lot more about whether
huge pages can be used efficiently at the time of access than just
about any other actor you can name....

> > persistent extent size hint should trigger the filesystem to use
> > large pages if supported, the hint is correctly sized and aligned,
> > and there are large pages available for allocation.
> 
> That would be ioctls and similar?

You can, but existing filesystem admin tools can already set up
allocation policies without the apps being aware that they even
exist. If you want to use huge page mappings with DAX you'll already
need to do this because of the physical alignment requirements of
DAX.

Further, such techniques are already used by many admins for things
like limiting fragmentation of sparse vm image files. So while you
may not know it, extent size hints and per-file inheritable
attributes are quite widely used already to manage filesystem
behaviour without users or applications even being aware that the
filesystem policies have been modified by the admin...

> That would imply that every application wanting to use large pages
> would need to be especially enabled. That would seem awfully limiting
> to me and needlessly deny benefits to most existing code.

No change to applications will be necessary (see above), though
there's no reason why they couldn't directly use the VFS interfaces to
explicitly ask for such behaviour themselves....

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
