lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 24 Mar 2015 15:54:00 -0700 (PDT)
From:	Hugh Dickins <hughd@...gle.com>
To:	"Kirill A. Shutemov" <kirill@...temov.name>
cc:	Hugh Dickins <hughd@...gle.com>,
	Dave Hansen <dave.hansen@...el.com>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Christoph Lameter <cl@...two.org>,
	Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
	Steve Capper <steve.capper@...aro.org>,
	"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Michal Hocko <mhocko@...e.cz>,
	Jerome Marchand <jmarchan@...hat.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	Jaroslav Kysela <perex@...ex.cz>, Takashi Iwai <tiwai@...e.de>,
	alsa-devel@...a-project.org
Subject: Re: [PATCH 05/16] page-flags: define behavior of FS/IO-related flags
 on compound pages

On Mon, 23 Mar 2015, Kirill A. Shutemov wrote:
> On Sun, Mar 22, 2015 at 05:02:58PM -0700, Hugh Dickins wrote:
> > On Thu, 19 Mar 2015, Kirill A. Shutemov wrote:
> > > On Thu, Mar 19, 2015 at 11:29:52AM -0700, Dave Hansen wrote:
> > > > On 03/19/2015 10:08 AM, Kirill A. Shutemov wrote:
> > > > > The odd exception is PG_dirty: sound uses compound pages and maps them
> > > > > with PTEs. NO_COMPOUND triggers VM_BUG_ON() in set_page_dirty() on
> > > > > handling shared fault. Let's use HEAD for PG_dirty.
> > 
> > It really depends on what you do with PageDirty of the head, when you
> > get to support 4k pagecache with subpages of a huge compound page.
> > 
> > HEAD will be fine, so long as PageDirty on the head means the whole
> > huge page must be written back.  I expect that's what you will choose;
> > but one could consider that if a huge page is only mapped read-only,
> > but a few subpages of it writable, then only the few need be written
> > back, in which case ANY would be more appropriate.  NO_COMPOUND is
> > certainly wrong.
> > 
> > But that does illustrate that I consider this patch series premature:
> > it belongs with your huge pagecache implementation.  You seem to be
> > "tidying up" and adding overhead to things that are fine as they are.
> 
> I agree, it can be ANY too, since we don't use PG_dirty anywhere at the
> moment. My first thought was that it's better to match PG_dirty behaviour
> with LRU-related, but it's not necessary should be the case.

No, yes, we do treat Dirty differently from LRU.

> 
> BTW, do we make any use of PG_dirty on pages with ->mapping == NULL?

No use that I can recall; but I suppose it's possible there's some
driver which does make use of it (if so, then you should choose ANY).

> Should we avoid dirtying them in the first place?

I don't think so: to do so would add more branches in hot paths,
just to avoid a rare case which works fine without them; and
prevent a driver from using it, in the unlikely case that's so.

> 
> > > > Can we get the sound guys to look at this, btw?  It seems like an odd
> > > > thing that we probably don't want to keep around, right?
> > > 
> > > CC: +sound guys
> > 
> > I don't think this is peculiar to sound at all: there are other users
> > of __GFP_COMP in the tree, aren't there?  And although some of them
> > might turn out not to need it any more, I expect most of them still
> > need it for the same reason they did originally.
> 
> I haven't seen any other __GFP_COMP user which get it mapped to user-space
> with PTEs. Do you? Probably I haven't just stepped on it.

I don't know why a driver would use __GFP_COMP if it cannot get mapped
into user-space (except copy-and-paste from a driver that needed it to
a driver that did not): if there's no chance of mapping into userspace,
then an ordinary >0-order allocation is good enough, isn't it?

> 
> ... looking into code a bit more: at least one fb-drivers has compound
> pages mapped with PTEs..

Good, you've saved me from looking for them.  I would expect every
__GFP_COMP allocation to be mappable into user-space, with silly
exceptions.

> 
> > > I'm not sure what is right fix here. At the time adding __GFP_COMP was a
> > > fix: see f3d48f0373c1.
> > 
> > The only thing special about this one, was that I failed to add
> > __GFP_COMP at first.
> > 
> > The purpose of __GFP_COMP is to allow a >0-order page (originally, just
> > a hugetlb page: see 2.5.60) to be mapped into userspace, and parts of it
> > then subjected to get_user_pages (ptrace, futex, direct I/O, infiniband
> > etc), and now even munmap, without destroying the integrity of the
> > underlying >0-order page.
> > 
> > We don't bother with __GFP_COMP when a >0-order page cannot be mapped
> > into userspace (except through /dev/mem or suchlike); we add __GFP_COMP
> > when it might be, to get the right reference counting.
> 
> Wouldn't non-compound >0-order page allocation + split_page() work too?

That works very well for me in huge tmpfs, yes :)

But I think the typical __GFP_COMP-using driver wants one large
contiguous area that it holds as a single piece, without worrying
about the ref-counting implications of when it's mapped into
user-space, then partially unmapped, or accessed via get_user_pages.
It can't risk losing parts of its buffer at the whim of its users.

I expect you're right that drivers could be converted over to
manage their buffers differently, without __GFP_COMP.  But __GFP_COMP
existed already for hugetlbfs, and was easy for drivers to use safely:
the whole being held until the head is freed.  (And split_page() was
added later in history - I think so the surplus tail end of a high
order page could be freed immediately.)

> 
> > It's normal for set_page_dirty() to be called in the course of
> > get_user_pages(), and it's normal for set_page_dirty() to be called
> > when releasing the get_user_pages() references, and it's normal for
> > set_page_dirty() to be called when munmap'ing a pte_dirty().
> > 
> > > 
> > > Other odd part about __GFP_COMP here is that we have ->_mapcount in tail
> > > pages to be used for both: mapcount of the individual page and for gup
> > > pins. __compound_tail_refcounted() doesn't recognize that we don't need
> > > tail page accounting for these pages.
> > 
> > So page->_mapcount of the tails is being used for both their mapcount
> > and their reference count: that's certainly funny, and further reason
> > to pursue your aim of simplifying the way THPs are refcounted.  But
> > not responsible for any actual bug, I think?
> 
> GUP pin would screw up page_mapcount() on these pages. It would affect
> memory stats for the process and probably something else.

Yes, the GUP pin would increment page_mapcount() without an additional
mapping - but can only happen once the page has already been mapped,
so FILE_MAPPED stats unaffected?  I'm not sure; but surely it wouldn't
work as well when unmapped before unpinned, since the unmapping will
see "still mapped" and the unpinning won't do anything with FILE_MAPPED.

Unmapping before unpinning is an uncommon path; but it can't be ignored,
it is the path which demanded __GFP_COMP in the first place.

Looks like extending THP by-mapcount refcounting to other compound pages
was not such a good idea.  But since nobody has noticed, we may not need
a more urgent fix than your simplification of THP refcounting.

> 
> I think we can get __compound_tail_refcounted() ignore these pages by
> checking if page->mapping is NULL.

I forget what's in page->mapping on the THP tails.  Or do you mean
page->mapping of head?  It would be better not to rely on that, I'm
not certain that no driver could set page->mapping of compound head.
There's probably some field or flag on the tails that you could use;
but I don't know that it's needed in a hurry.

> 
> > > Hugh, I tried to ask you about the situation several times (last time on
> > > the summit). Any comments?
> > 
> > I do remember we began a curtailed conversation about this at LSF/MM.
> > I do not remember you asking about it earlier: when was that?
> 
> http://lkml.kernel.org/g/20141217004734.GA23150@node.dhcp.inet.fi

Hmm, curious: never reached me (and I should have seen that on linux-mm
even if not Cc'ed); unless I deleted it by accident, that's not unknown.

And in that you explain as I've said above, so you didn't really need
me anyway.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists