linux-kernel - Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage

Message-ID: <20190610235045.GB30991@hori.linux.bs1.fc.nec.co.jp>
Date:   Mon, 10 Jun 2019 23:50:38 +0000
From:   Naoya Horiguchi <n-horiguchi@...jp.nec.com>
To:     Mike Kravetz <mike.kravetz@...cle.com>
CC:     Wanpeng Li <kernellwp@...il.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Punit Agrawal <punit.agrawal@....com>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "Michal Hocko" <mhocko@...nel.org>,
        "Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>,
        Anshuman Khandual <khandual@...ux.vnet.ibm.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        "linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>,
        kvm <kvm@...r.kernel.org>, Paolo Bonzini <pbonzini@...hat.com>,
        Xiao Guangrong <xiaoguangrong@...cent.com>,
        "lidongchen@...cent.com" <lidongchen@...cent.com>,
        "yongkaiwu@...cent.com" <yongkaiwu@...cent.com>
Subject: Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB
 hugepage

On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > Cc Paolo,
> > Hi all,
> > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz <mike.kravetz@...cle.com> wrote:
> >>
> >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> >>> Andrew Morton <akpm@...ux-foundation.org> writes:
> >>>
> >>>> On Thu, 08 Feb 2018 12:30:45 +0000 Punit Agrawal <punit.agrawal@....com> wrote:
> >>>>
> >>>>>>
> >>>>>> So I don't think that the above test result means that errors are properly
> >>>>>> handled, and the proposed patch should help for arm64.
> >>>>>
> >>>>> Although the deviation of pud_huge() avoids a kernel crash, the code
> >>>>> would be easier to maintain and reason about if arm64 helpers are
> >>>>> consistent with expectations by core code.
> >>>>>
> >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> >>>>> it would be helpful if there was a clear expression of semantics for
> >>>>> pud_huge() for various cases. Is there any version that can be used as
> >>>>> reference?
> >>>>
> >>>> Is that an ack or tested-by?
> >>>>
> >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> >>>> but they remain steadfastly in hiding.
> >>>
> >>> Cc'ing linuxppc-dev is always a good idea :)
> >>>
> >>
> >> Thanks Michael,
> >>
> >> I was mostly concerned about use cases for soft/hard offline of huge pages
> >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> >> huge pages, and soft/hard offline support was specifically added for this.
> >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB pages
> >> at PGD level"
> >>
> >> This patch will disable that functionality.  So, at a minimum this is a
> >> 'heads up'.  If there are actual use cases that depend on this, then more
> >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> >> support, I can not tell if there is a real use case or this is just a
> >> 'nice to have'.
> > 
> > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployments, and we
> > have hit the gup_pud_range() panic several times in production environments.
> > Is there any plan to re-enable this and fix the arch code?
> 
> I too am aware of slightly more interest in 1G huge pages.  Suspect that as
> Intel MMU capacity increases to handle more TLB entries there will be more
> and more interest.
> 
> Personally, I am not looking at this issue.  Perhaps Naoya will comment as
> he knows the most about this code.

Thanks for forwarding this to me. It looks like memory error handling
on 1GB hugepages is now needed for real use cases.

> 
> > In addition, https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> > are 1GB hugetlbfs pages, after above PUD panic is fixed,
> > try_to_unmap() which is called in MCA recovery path will mark the PUD
> > hwpoison entry. The guest will vmexit and retry endlessly when
> > accessing any memory in the guest which is backed by this 1GB poisoned
> > hugetlbfs page. We have a plan to split this 1GB hugetlbfs page into 2MB
> > hugetlbfs pages/4KB pages, maybe file remap to a virtual address range
> > which is 2MB/4KB page granularity, also split the KVM MMU 1GB SPTE
> > into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus
> > will be delivered to VM at page fault next time for the offensive
> > SPTE. Is this proposal acceptable?
> 
> I am not sure of the error handling design, but this does sound reasonable.

I agree that that's better.

> That block of code which potentially dissolves a huge page on memory error
> is hard to understand and I'm not sure if that is even the 'normal'
> functionality.  Certainly, we would hate to waste/poison an entire 1G page
> for an error on a small subsection.

Yes, that's not practical, so we first need to establish the code base for
2MB hugetlb splitting and then extend it to 1GB.
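
To make the granularity argument concrete, here is a rough user-space
sketch (not kernel code) of the arithmetic behind such a split: given the
byte offset of an error inside a 1GB hugepage, it finds which 2MB chunk and
which 4KB page within that chunk would be affected. The names SZ_4K, SZ_2M,
SZ_1G, struct poison_loc and locate_poison() are made up for illustration
and are not existing kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Page sizes discussed in this thread (x86-64 values). */
#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/* Hypothetical helper type: where an error sits inside a 1GB hugepage. */
struct poison_loc {
	uint64_t idx_2m;	/* index of the 2MB chunk within the 1GB page */
	uint64_t idx_4k;	/* index of the 4KB page within that 2MB chunk */
};

/*
 * Hypothetical helper: map a byte offset inside a 1GB hugepage to the
 * 2MB chunk and 4KB page that contain it. After a split, only that
 * 4KB page (1 of 262144) would need to be marked as poisoned.
 */
static struct poison_loc locate_poison(uint64_t offset_in_1g)
{
	struct poison_loc loc;

	assert(offset_in_1g < SZ_1G);
	loc.idx_2m = offset_in_1g / SZ_2M;
	loc.idx_4k = (offset_in_1g % SZ_2M) / SZ_4K;
	return loc;
}
```

A 1GB page holds SZ_1G / SZ_2M = 512 chunks of 2MB, and each of those holds
SZ_2M / SZ_4K = 512 pages of 4KB, so poisoning the whole 1GB page for a
single bad 4KB page gives up 262143 good pages.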

Thanks,
Naoya Horiguchi