Message-ID: <20161031181035.GA7007@node.shutemov.name>
Date: Mon, 31 Oct 2016 21:10:35 +0300
From: "Kirill A. Shutemov" <kirill@...temov.name>
To: Jan Kara <jack@...e.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Theodore Ts'o <tytso@....edu>,
Andreas Dilger <adilger.kernel@...ger.ca>,
Jan Kara <jack@...e.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
Hugh Dickins <hughd@...gle.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Dave Hansen <dave.hansen@...el.com>,
Vlastimil Babka <vbabka@...e.cz>,
Matthew Wilcox <willy@...radead.org>,
Ross Zwisler <ross.zwisler@...ux.intel.com>,
linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
linux-block@...r.kernel.org
Subject: Re: [PATCHv3 15/41] filemap: handle huge pages in
do_generic_file_read()
[ My mail system got broken and the original reply didn't get through. Resent. ]
On Thu, Oct 13, 2016 at 11:33:13AM +0200, Jan Kara wrote:
> On Thu 15-09-16 14:54:57, Kirill A. Shutemov wrote:
> > Most of the work happens on the head page. Only when we need to copy data
> > to userspace do we find the relevant subpage.
> >
> > We are still limited by PAGE_SIZE per iteration. Lifting this limitation
> > would require some more work.
>
> Hum, I'm kind of lost.
The limitation here comes from how copy_page_to_iter() and
copy_page_from_iter() handle highmem: they can only deal with one small
page at a time.
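
To illustrate, per iteration the read loop conceptually has to do something
like this (a rough sketch, not the exact code from the patch; 'index' is the
small-page index within the mapping):

	/*
	 * 'page' is what we found in the page cache for 'index' --
	 * possibly a subpage of a huge page.  copy_page_to_iter()
	 * kmap()s one small page, so it can only be fed up to
	 * PAGE_SIZE - offset bytes per call.
	 */
	page = compound_head(page);
	nr = min_t(unsigned long, PAGE_SIZE - offset,
		   iov_iter_count(iter));
	ret = copy_page_to_iter(page + (index - page->index),
				offset, nr, iter);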
On the write side we also have a problem with assuming small pages: the
write length and the offset within the page are calculated before we know
whether a small or huge page gets allocated. It's not easy to fix; it looks
like it would require changing the ->write_begin() interface to accept
len > PAGE_SIZE.
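
For reference, this is roughly what generic_perform_write() does now -- the
length is clamped to one small page before ->write_begin() has had any
chance to decide what to allocate:

	/* mm/filemap.c, generic_perform_write(), roughly: */
	offset = (pos & (PAGE_SIZE - 1));
	bytes = min_t(unsigned long, PAGE_SIZE - offset,
		      iov_iter_count(i));

	status = a_ops->write_begin(file, mapping, pos, bytes, flags,
				    &page, &fsdata);

with the callback prototype being

	int (*write_begin)(struct file *, struct address_space *mapping,
			   loff_t pos, unsigned len, unsigned flags,
			   struct page **pagep, void **fsdata);

so 'len' (and 'copied' in ->write_end()) would have to be allowed to cover
a whole huge page.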
> Can you point me to some design document / email that would explain some
> high level ideas how are huge pages in page cache supposed to work?
I'll elaborate more in the cover letter of the next revision.
> When are we supposed to operate on the head page and when on subpage?
It's case-by-case. See the explanation above for why we're limited to
PAGE_SIZE here.
> What is protected by the page lock of the head page?
The whole huge page, as with anon pages.
> Do page locks of subpages play any role?
lock_page() on any subpage locks the whole huge page.
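
The locking helpers redirect to the head page, roughly (simplified from
include/linux/pagemap.h):

static inline int trylock_page(struct page *page)
{
	page = compound_head(page);
	return likely(!test_and_set_bit_lock(PG_locked, &page->flags));
}

static inline void lock_page(struct page *page)
{
	might_sleep();
	if (!trylock_page(page))
		__lock_page(page);
}

__lock_page() and unlock_page() do the same compound_head() dance, so
waiters on any subpage all queue on the head page's PG_locked.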
> If I understand right, e.g. pagecache_get_page() will return subpages but
> is it generally safe to operate on subpages individually or do we have
> to be aware that they are part of a huge page?
I tried to make it as transparent as possible: page flag operations are
redirected to the head page where necessary. Things like page_mapping() and
page_to_pgoff() know about huge pages.
Direct access to struct page fields must be avoided for tail pages, as most
of them don't have the meaning you would expect from small pages.
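
The flag redirection is driven by the policies in
include/linux/page-flags.h; simplified:

/* Operate on the head page even if a tail page is passed in. */
#define PF_HEAD(page, enforce)	compound_head(page)

/* Head or small page only; passing a tail page here is a bug. */
#define PF_NO_TAIL(page, enforce) ({					\
		VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page);	\
		compound_head(page); })

PAGEFLAG(Dirty, dirty, PF_HEAD)
PAGEFLAG(Referenced, referenced, PF_HEAD)

so PageDirty()/SetPageDirty()/etc. on a subpage transparently act on the
head page.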
> If I understand the motivation right, it is mostly about being able to mmap
> PMD-sized chunks to userspace. So my naive idea would be that we could just
> implement it by allocating PMD sized chunks of pages when adding pages to
> page cache, we don't even have to read them all unless we come from PMD
> fault path.
Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc} per
huge page, and one common list of buffer heads...
The PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where
handling it otherwise doesn't make sense), and handling it differently for
file-THP would be a nightmare from a maintenance POV.
> Reclaim may need to be aware not to split pages unnecessarily
> but that's about it. So I'd like to understand what's wrong with this
> naive idea and why do filesystems need to be aware that someone wants to
> map in PMD sized chunks...
In addition to flags, THP uses some space in the struct page of tail pages
to encode additional information. See compound_{mapcount,head,dtor,order}
and page_deferred_list().
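
A few of the helpers (roughly as in current mainline) show how those
tail-page fields are used:

static inline struct page *compound_head(struct page *page)
{
	/* Tail pages store a pointer to the head page, with bit 0 set. */
	unsigned long head = READ_ONCE(page->compound_head);

	if (unlikely(head & 1))
		return (struct page *) (head - 1);
	return page;
}

static inline unsigned int compound_order(struct page *page)
{
	/* The order is stored in the first tail page. */
	if (!PageHead(page))
		return 0;
	return page[1].compound_order;
}

static inline atomic_t *compound_mapcount_ptr(struct page *page)
{
	/* The PMD-level mapcount also lives in the first tail page. */
	return &page[1].compound_mapcount;
}

page_deferred_list() similarly reuses fields of the second tail page for
the deferred-split queue.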
--
Kirill A. Shutemov