linux-kernel - Re: [PATCH 24/31] mm/migrate_device: allow pte_offset_map

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <838a5172-f7f2-43db-e990-d38b36b544a2@google.com>
Date:   Tue, 23 May 2023 20:45:45 -0700 (PDT)
From:   Hugh Dickins <hughd@...gle.com>
To:     Alistair Popple <apopple@...dia.com>
cc:     Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...nel.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        David Hildenbrand <david@...hat.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Qi Zheng <zhengqi.arch@...edance.com>,
        Yang Shi <shy828301@...il.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Peter Xu <peterx@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Will Deacon <will@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
        Ralph Campbell <rcampbell@...dia.com>,
        Ira Weiny <ira.weiny@...el.com>,
        Steven Price <steven.price@....com>,
        SeongJae Park <sj@...nel.org>,
        Naoya Horiguchi <naoya.horiguchi@....com>,
        Christophe Leroy <christophe.leroy@...roup.eu>,
        Zack Rusin <zackr@...are.com>, Jason Gunthorpe <jgg@...pe.ca>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Pasha Tatashin <pasha.tatashin@...een.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Minchan Kim <minchan@...nel.org>,
        Christoph Hellwig <hch@...radead.org>,
        Song Liu <song@...nel.org>,
        Thomas Hellstrom <thomas.hellstrom@...ux.intel.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH 24/31] mm/migrate_device: allow pte_offset_map_lock() to
 fail

On Tue, 23 May 2023, Alistair Popple wrote:
> Hugh Dickins <hughd@...gle.com> writes:
> 
> > migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
> > splitting huge zero pmd, and the pmd_none() handling after successfully
> > splitting huge page: those are now managed inside pte_offset_map_lock(),
> > and by "goto again" when it fails.
> >
> > But the skip after unsuccessful split_huge_page() must stay: it avoids an
> > endless loop.  The skip when pmd_bad()?  Remove that: it will be treated
> > as a hole rather than a skip once cleared by pte_offset_map_lock(), but
> > with different timing that would be so anyway; and it's arguably best to
> > leave the pmd_bad() handling centralized there.
> 
> So for a pmd_bad() the sequence would be:
> 
> 1. pte_offset_map_lock() would return NULL and clear the PMD.
> 2. goto again marks the page as a migrating hole,
> 3. In migrate_vma_insert_page() a new PMD is created by pmd_alloc().
> 4. This leads to a new zero page getting mapped for the previously
>    pmd_bad() mapping.

Agreed.

> 
> I'm not entirely sure what the pmd_bad() case is used for but is that
> ok? I understand that previously it was all a matter of timing, but I
> wouldn't rely on the previous code being correct in this regard either.

The pmd_bad() case is for when the pmd table got corrupted (overwritten,
cosmic rays, whatever), and that pmd entry is easily recognized as
nonsense: we try not to crash on it, but user data may have got lost.

My "timing" remark may not be accurate: I seem to be living in the past,
when we had a lot more "pmd_none_or_clear_bad()"s around than today - I
was thinking that any one of them could be racily changing the bad to none.
Though I suppose I am now making my timing remark accurate, by changing
the bad to none more often again.

Since data is liable to be lost anyway (unless the corrupted entry was
actually none before it got corrupted), it doesn't matter greatly what
we do with it (some would definitely prefer a crash, but traditionally
we don't): issue a "pmd bad" message and not get stuck in a loop is
the main thing.

> 
> > migrate_vma_insert_page(): remove comment on the old pte_offset_map()
> > and old locking limitations; remove the pmd_trans_unstable() check and
> > just proceed to pte_offset_map_lock(), aborting when it fails (page has
> > now been charged to memcg, but that's so in other cases, and presumably
> > uncharged later).
> 
> Correct, the non-migrating page will be freed later via put_page() which
> will uncharge the page.

Thanks for confirming, yes, it was more difficult once upon a time,
but nowadays just a matter of reaching the final put_page()

Hugh