linux-kernel - Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault

Message-ID: <8c4852e4-8f57-354b-630d-cea8176fc026@intel.com>
Date:   Thu, 8 Jul 2021 09:58:43 -0700
From:   Dave Hansen <dave.hansen@...el.com>
To:     Brijesh Singh <brijesh.singh@....com>, x86@...nel.org,
        linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
        linux-efi@...r.kernel.org, platform-driver-x86@...r.kernel.org,
        linux-coco@...ts.linux.dev, linux-mm@...ck.org,
        linux-crypto@...r.kernel.org
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Joerg Roedel <jroedel@...e.de>,
        Tom Lendacky <thomas.lendacky@....com>,
        "H. Peter Anvin" <hpa@...or.com>, Ard Biesheuvel <ardb@...nel.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Andy Lutomirski <luto@...nel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Sergio Lopez <slp@...hat.com>, Peter Gonda <pgonda@...gle.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        David Rientjes <rientjes@...gle.com>,
        Dov Murik <dovmurik@...ux.ibm.com>,
        Tobin Feldman-Fitzthum <tobin@....com>,
        Borislav Petkov <bp@...en8.de>,
        Michael Roth <michael.roth@....com>,
        Vlastimil Babka <vbabka@...e.cz>, tony.luck@...el.com,
        npmccallum@...hat.com, brijesh.ksingh@...il.com
Subject: Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP
 entry on fault

On 7/8/21 9:48 AM, Brijesh Singh wrote:
> On 7/8/21 10:30 AM, Dave Hansen wrote:
>>> The reason for iterating through the 2MB region is this: if the
>>> faulting address is not assigned in the RMP table, and the page table
>>> walk level is 2MB, then one of the entries within the large page is
>>> the root cause of the fault. Since we don't know which entry, I dump
>>> all the non-zero entries.
>>
>> Logically you can figure this out though, right?  Why throw 511 entries
>> at the console when we *know* they're useless?
> 
> Logically it's going to be tricky to figure out which exact entry caused
> the fault, hence I dump any non-zero entry. I understand it may dump
> some useless entries.

What's tricky about it?

Sure, there's a possibility that more than one entry could contribute to
a fault.  But, you always know *IF* an entry could contribute to a fault.

I'm fine if you run through the logic, don't find a known reason
(specific RMP entry) for the fault, and dump the whole table in that
case.  But, unconditionally polluting the kernel log with noise isn't
very nice for debugging.
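
A minimal sketch of that ordering in C, not the patch's code. The struct
layout, the bit positions and every name below are assumptions made for
illustration; in particular the "in-use" bit position is a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRIES_PER_2M 512 /* 4K RMP entries covered by one 2MB mapping */

/* Hypothetical raw RMP entry: two 64-bit words, layout assumed. */
struct rmp_entry_sketch {
	uint64_t lo;
	uint64_t hi;
};

/* Assumed bit positions, for illustration only. */
#define SKETCH_ASSIGNED (1ULL << 0)
#define SKETCH_IN_USE   (1ULL << 63) /* placeholder position */

/* An entry is a suspect if a bit that can cause an RMP fault is set. */
static int could_contribute(const struct rmp_entry_sketch *e)
{
	return (e->lo & (SKETCH_ASSIGNED | SKETCH_IN_USE)) != 0;
}

static void print_entry(const char *tag, size_t i,
			const struct rmp_entry_sketch *e)
{
	printf("%s RMP[%zu]: %016llx %016llx\n", tag, i,
	       (unsigned long long)e->hi, (unsigned long long)e->lo);
}

/*
 * Pass 1: print only entries that could explain the fault.
 * Pass 2: only if pass 1 found nothing, fall back to every non-zero entry.
 * Returns the number of entries printed.
 */
static size_t dump_rmp_region(const struct rmp_entry_sketch *tbl, size_t n)
{
	size_t i, printed = 0;

	assert(tbl != NULL);

	for (i = 0; i < n; i++) {
		if (could_contribute(&tbl[i])) {
			print_entry("suspect", i, &tbl[i]);
			printed++;
		}
	}
	if (printed)
		return printed;

	for (i = 0; i < n; i++) {
		if (tbl[i].lo || tbl[i].hi) {
			print_entry("raw", i, &tbl[i]);
			printed++;
		}
	}
	return printed;
}
```

With this shape, the full dump only happens when no specific entry
explains the fault, so the common case logs just the suspects.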

>>> There are two cases which we need to consider:
>>>
>>> 1) the faulting page is a guest private page (aka assigned)
>>> 2) the faulting page is a hypervisor page (aka shared)
>>>
>>> We will primarily be seeing #1. In this case, we know it's an assigned
>>> page, and we can decode the fields.
>>>
>>> The #2 will happen in rare conditions,
>>
>> What rare conditions?
> 
> One such condition is the RMP "in-use" bit being set; see patch 20/40.
> After applying that patch we should not see the "in-use" bit set. If we
> run into similar issues, a full RMP dump will greatly help debugging.

OK... so dump the "in-use" bit here if you see it.
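
Something along these lines, as a rough sketch only. The bit position is
a placeholder (this thread does not say where "in-use" lives in the
entry), and the helper name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder: the real position of "in-use" is not given in this thread. */
#define SKETCH_RMP_IN_USE (1ULL << 63)

/*
 * Format one raw RMP entry into buf and call out the in-use bit by name
 * when it is set, instead of leaving the reader to decode raw hex.
 * Returns 1 if the (assumed) in-use bit is set, 0 otherwise.
 */
static int describe_rmp_entry(char *buf, size_t len, uint64_t lo, uint64_t hi)
{
	int in_use = (lo & SKETCH_RMP_IN_USE) != 0;

	assert(buf != NULL && len > 0);

	snprintf(buf, len, "RMP entry: %016llx %016llx%s",
		 (unsigned long long)hi, (unsigned long long)lo,
		 in_use ? " [in-use]" : "");
	return in_use;
}
```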

>>> if it happens, one of the undocumented bits in the RMP entry can
>>> provide us some useful information, hence we dump the raw values.
>> You're saying that there are things that can cause RMP faults that
>> aren't documented?  That's rather nasty for your users, don't you think?
> 
> The "in-use" bit in the RMP entry caught me off guard. The AMD APM does
> say that hardware sets the in-use bit, but it *never* explains in
> detail how to check whether a fault was due to the in-use bit in the RMP
> table. As I said, the documentation folks will be updating the RMP entry
> documentation to cover the in-use bit. I hope we will not see any other
> undocumented surprises; I am keeping my fingers crossed :)

Oh, ok.  That sounds fine.  Documentation is out of date all the time.