Message-ID: <alpine.LSU.2.11.1902051109120.9007@eggly.anvils>
Date:   Tue, 5 Feb 2019 12:17:50 -0800 (PST)
From:   Hugh Dickins <hughd@...gle.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
cc:     Hugh Dickins <hughd@...gle.com>, Artem Savkov <asavkov@...hat.com>,
        Baoquan He <bhe@...hat.com>, Qian Cai <cai@....pw>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Michal Hocko <mhocko@...e.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>
Subject: Re: mm: race in put_and_wait_on_page_locked()

On Tue, 5 Feb 2019, Linus Torvalds wrote:
> On Mon, Feb 4, 2019 at 8:43 PM Hugh Dickins <hughd@...gle.com> wrote:
> >
> > Something I shall not be doing, is verifying the correctness of the
> > low-level get_page_unless_zero() versus page_ref_freeze() protocol
> > on arm64 and power - nobody has reported on x86, and I do wonder if
> > there's a barrier missing somewhere, that could manifest in this way -
> > but I'm unlikely to be the one to find that (and also think that any
> > weakness there should have shown up long before now).
> 
> Remind me what the page_ref_freeze() rules even _are_?
> 
> It's a very special thing, setting the page count down to zero if it
> matches the "expected" count.
> 
> Now, if another CPU does a put_page() at that point, that certainly
> will hit the "oops, we dropped the ref to something that was zero".
> 
> So the "expected" count had better be only references we have and own
> 100%, but some of those references aren't really necessarily private
> to our thread.
> 
> For example, what happens if
> 
>  (a) one CPU is doing migration_entry_wait() (counting expected page
> refs etc, before doing page_ref_freeze)

s/migration_entry_wait/migrate_page_move_mapping/

> 
>  (b) another CPU is dirtying a page that was in the swap cache and
> takes a reference to it, but drops it from the swap cache

This is reuse_swap_page() called from do_wp_page(), I presume.

> 
> Note how (b) does not change the refcount on the page at all, because
> it just moves the ref-count from "swap cache entry" to "I own the page
> in my page tables". Which means that when (a) does the "count expected
> count, and match it", it happily matches, and the page_ref_freeze()
> succeeds and makes the page count be zero.
> 
> But now (b) has a private reference to that page, and can drop it, so
> the "freeze" isn't a freeze at all.
> 
> Ok, so clearly the above cannot happen, and there's something I'm
> missing with the freezing. I think we hold the page lock while this is
> going on, which means those two things cannot happen at the same time.
> But maybe there is something else that does the above kind of "move
> page ref from one owner to another"?

You're right that the page lock prevents even getting there (and is
essential whenever mucking around with PageSwapCache), but more to
the point is that the expected_count passed to page_ref_freeze()
does not include any user mapping references (mapcount).

All user mappings (known of at that instant) have been unmapped before
migrate_page_move_mapping() is called, and if any got added since
(difficult without page lock, but I wouldn't assume impossible),
their associated page references are sure to make the page_ref_freeze()
fail (so long as the page refcounting has not been broken).
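To make the expected_count argument concrete, here is a minimal userspace
sketch (not kernel code) using C11 atomics.  The names toy_page,
toy_get_page(), toy_page_ref_freeze() and toy_freeze_demo() are
hypothetical stand-ins, and the counts are made-up numbers rather than
what migrate_page_move_mapping() actually computes.  It assumes the freeze
is a compare-and-exchange from the expected count to zero, so any extra
reference makes it fail:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct page: only the reference count is modelled. */
struct toy_page {
	atomic_int refcount;
};

/* Take an ordinary reference (caller already knows the count is non-zero). */
static void toy_get_page(struct toy_page *page)
{
	atomic_fetch_add(&page->refcount, 1);
}

/* Drop an ordinary reference. */
static void toy_put_page(struct toy_page *page)
{
	atomic_fetch_sub(&page->refcount, 1);
}

/*
 * Assumed model of page_ref_freeze(): set the count to zero only if it
 * is exactly `expected`; otherwise leave it untouched and report failure.
 */
static bool toy_page_ref_freeze(struct toy_page *page, int expected)
{
	return atomic_compare_exchange_strong(&page->refcount, &expected, 0);
}

/* Walk through the "extra reference defeats the freeze" case. */
void toy_freeze_demo(void)
{
	struct toy_page page;
	int expected = 2;	/* hypothetical: cache ref + migrator's ref */

	atomic_init(&page.refcount, expected);

	/* A user mapping appears after the unmap, bringing its own ref. */
	toy_get_page(&page);
	assert(!toy_page_ref_freeze(&page, expected));
	assert(atomic_load(&page.refcount) == 3);	/* unchanged on failure */

	/* Once that reference is gone, the count matches and freeze works. */
	toy_put_page(&page);
	assert(toy_page_ref_freeze(&page, expected));
	assert(atomic_load(&page.refcount) == 0);
}
```

Calling toy_freeze_demo() from a main() exercises both the failing and the
succeeding freeze.  The sketch is single-threaded and only shows the
counting rule; it says nothing about locking.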

reuse_swap_page() is called while holding the page table lock: so
although do_wp_page() cannot quite be said to own the page, it is
making sure that it cannot be racily unmapped at that point.  So
until the pte_unmap_unlock() (by which time it has done its own
get_page()) it can treat the page reference associated with the
user mapping as safe, as if it were its own.  And no racing
page_ref_freeze() could succeed while it's there, page lock or not.

Page lock is more important at the "outer" level of the page
migration protocol: holding together the "atomic" switch from old
to new page with the copying of data and flags from old to new.
And more important with anon (swapless) pages, for which there's no
shared visible cache, so migrate_page_move_mapping() does not even
bother with a page_ref_freeze() (though sometimes I want it to).

> 
> The page_ref_freeze() rules don't seem to be documented anywhere.

I would not enjoy documenting what has to be done at what stage
in the page migration sequence: it has evolved, it is subtle,
and we're grateful just to have working code.

At the inner level (where I worried we might have some barrier problem),
the relation between page_ref_freeze() and get_page_unless_zero():
the best documentation I can think of on page_ref_freeze() indeed does
not mention it as such at all (I think it has gone through renamings):
the big comment Nick wrote above page_cache_get_speculative() in
include/linux/pagemap.h.  And there's a helpful little comment in
include/linux/mm.h get_page() too.
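
The inner-level relation can be sketched in userspace C11 as follows.
This is an assumption-laden model, not the kernel implementation: the
names spec_page, spec_freeze(), spec_unfreeze(), spec_get_unless_zero()
and spec_demo() are hypothetical, and the default sequentially consistent
atomics used here sidestep exactly the barrier question raised above, so
the sketch only shows the intended logic of the protocol:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct page: only the reference count is modelled. */
struct spec_page {
	atomic_int refcount;
};

/* Assumed freeze: count goes expected -> 0, or nothing changes. */
static bool spec_freeze(struct spec_page *page, int expected)
{
	return atomic_compare_exchange_strong(&page->refcount, &expected, 0);
}

/* Assumed unfreeze: the freezer restores a non-zero count. */
static void spec_unfreeze(struct spec_page *page, int count)
{
	atomic_store(&page->refcount, count);
}

/*
 * Assumed speculative get: increment the count unless it is zero.
 * A zero count means "frozen or free", so the caller must back off.
 */
static bool spec_get_unless_zero(struct spec_page *page)
{
	int old = atomic_load(&page->refcount);

	while (old != 0) {
		/* On failure, `old` is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Walk through freeze, a failed speculative get, unfreeze, and a retry. */
void spec_demo(void)
{
	struct spec_page page;

	atomic_init(&page.refcount, 2);

	assert(spec_freeze(&page, 2));		/* count is now 0 */
	assert(!spec_get_unless_zero(&page));	/* lookup must back off */

	spec_unfreeze(&page, 2);
	assert(spec_get_unless_zero(&page));	/* count is now 3 */
	assert(!spec_freeze(&page, 2));		/* extra ref blocks the freeze */
}
```

Calling spec_demo() from a main() runs the sequence.  While the count is
frozen at zero no speculative reference can be taken, and once a
speculative reference has been taken a freeze against the old expected
count fails.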

(Ah, those innocent days, when the word "speculative" filled us
with delight and hope, instead of horror and dread.)

By the way, I won't put this bug out of my mind, until I've done
an audit of PagePrivate: I've a feeling that fs/iomap.c is not the
first place to forget that PagePrivate must be associated with a
page reference (which is not a special demand of page migration:
page reclaim won't work without it either); and put_and_wait_blah
may now expose such incorrectness as crashes.

Hugh
