linux-kernel - Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <504d77b87c81b7027157e0c7b5286e17123c59d9.camel@linux.intel.com>
Date:   Mon, 09 Nov 2020 09:44:02 +0100
From:   Thomas Hellström 
        <thomas.hellstrom@...ux.intel.com>
To:     Jason Gunthorpe <jgg@...pe.ca>, Daniel Vetter <daniel@...ll.ch>
Cc:     John Hubbard <jhubbard@...dia.com>,
        Christoph Hellwig <hch@...radead.org>,
        J??r??me Glisse <jglisse@...hat.com>,
        linux-samsung-soc <linux-samsung-soc@...r.kernel.org>,
        Jan Kara <jack@...e.cz>, Pawel Osciak <pawel@...iak.com>,
        KVM list <kvm@...r.kernel.org>,
        Mauro Carvalho Chehab <mchehab@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        DRI Development <dri-devel@...ts.freedesktop.org>,
        Tomasz Figa <tfiga@...omium.org>,
        Linux MM <linux-mm@...ck.org>,
        Kyungmin Park <kyungmin.park@...sung.com>,
        Daniel Vetter <daniel.vetter@...el.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Marek Szyprowski <m.szyprowski@...sung.com>,
        Dan Williams <dan.j.williams@...el.com>,
        Linux ARM <linux-arm-kernel@...ts.infradead.org>,
        "open list:DMA BUFFER SHARING FRAMEWORK" 
        <linux-media@...r.kernel.org>
Subject: Re: [PATCH v5 05/15] mm/frame-vector: Use FOLL_LONGTERM

On Fri, 2020-11-06 at 08:55 -0400, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 11:27:59AM +0100, Daniel Vetter wrote:
> > On Fri, Nov 6, 2020 at 11:01 AM Daniel Vetter <daniel@...ll.ch>
> > wrote:
> > > On Fri, Nov 6, 2020 at 5:08 AM John Hubbard <jhubbard@...dia.com>
> > > wrote:
> > > > On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> > > > > On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter
> > > > > wrote:
> > > > > > > /*
> > > > > > >   * If we can't determine whether or not a pte is
> > > > > > > special, then fail immediately
> > > > > > >   * for ptes. Note, we can still pin HugeTLB and THP as
> > > > > > > these are guaranteed not
> > > > > > >   * to be special.
> > > > > > >   *
> > > > > > >   * For a futex to be placed on a THP tail page,
> > > > > > > get_futex_key requires a
> > > > > > >   * get_user_pages_fast_only implementation that can pin
> > > > > > > pages. Thus it's still
> > > > > > >   * useful to have gup_huge_pmd even if we can't operate
> > > > > > > on ptes.
> > > > > > >   */
> > > > > > 
> > > > > > We support hugepage faults in gpu drivers since recently,
> > > > > > and I'm not
> > > > > > seeing a pud_mkhugespecial anywhere. So not sure this
> > > > > > works, but probably
> > > > > > just me missing something again.
> > > > > 
> > > > > It means ioremap can't create an IO page PUD, it has to be
> > > > > broken up.
> > > > > 
> > > > > Does ioremap even create anything larger than PTEs?
> > > 
> > > gpu drivers also tend to use vmf_insert_pfn* directly, so we can
> > > do
> > > on-demand paging and move buffers around. From what I glanced for
> > > lowest level we to the pte_mkspecial correctly (I think I
> > > convinced
> > > myself that vm_insert_pfn does that), but for pud/pmd levels it
> > > seems
> > > just yolo.
> > 
> > So I dug around a bit more and ttm sets PFN_DEV | PFN_MAP to get
> > past
> > the various pft_t_devmap checks (see e.g.
> > vmf_insert_pfn_pmd_prot()).
> > x86-64 has ARCH_HAS_PTE_DEVMAP, and gup.c seems to handle these
> > specially, but frankly I got totally lost in what this does.
> 
> The fact vmf_insert_pfn_pmd_prot() has all those BUG_ON's to prevent
> putting VM_PFNMAP pages into the page tables seems like a big red
> flag.
> 
> The comment seems to confirm what we are talking about here:
> 
> 	/*
> 	 * If we had pmd_special, we could avoid all these
> restrictions,
> 	 * but we need to be consistent with PTEs and architectures
> that
> 	 * can't support a 'special' bit.
> 	 */
> 
> ie without the ability to mark special we can't block fast gup and
> anyone who does O_DIRECT on these ranges will crash the kernel when
> it
> tries to convert a IO page into a struct page.
> 
> Should be easy enough to directly test?
> 
> Putting non-struct page PTEs into a VMA without setting VM_PFNMAP
> just
> seems horribly wrong to me.

Although core mm special huge-page support is currently quite limited,
some time ago, I extended the pre-existing vma_is_dax() to
vma_is_special_huge():

/**
 * vma_is_special_huge - Are transhuge page-table entries considered
special?
 * @vma: Pointer to the struct vm_area_struct to consider
 *
 * Whether transhuge page-table entries are considered "special"
following
 * the definition in vm_normal_page().
 *
 * Return: true if transhuge page-table entries should be considered
special,
 * false otherwise.
 */
static inline bool vma_is_special_huge(const struct vm_area_struct
*vma)
{
	return vma_is_dax(vma) || (vma->vm_file &&
				   (vma->vm_flags & (VM_PFNMAP |
VM_MIXEDMAP)));
}

meaning that currently all transhuge page-table-entries in a PFNMAP or
MIXEDMAP vma are considered "special". The number of calls to this
function (mainly in the page-splitting code) is quite limited so
replacing it with a more elaborate per-page-table-entry scheme would, I
guess, definitely be possible. Although all functions using it would
need to require a fallback path for architectures not supporting it.

/Thomas



> 
> Jason