
Message-ID: <a029cfe6-7526-44c1-a0db-88d2ad8bb827@arm.com>
Date: Fri, 14 Jun 2024 14:59:09 +0100
From: James Morse <james.morse@....com>
To: Dave Martin <Dave.Martin@....com>,
 Amit Singh Tomar <amitsinght@...vell.com>
Cc: Reinette Chatre <reinette.chatre@...el.com>, x86@...nel.org,
 linux-kernel@...r.kernel.org, Fenghua Yu <fenghua.yu@...el.com>,
 Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
 Borislav Petkov <bp@...en8.de>, H Peter Anvin <hpa@...or.com>,
 Babu Moger <Babu.Moger@....com>, shameerali.kolothum.thodi@...wei.com,
 D Scott Phillips OS <scott@...amperecomputing.com>,
 carl@...amperecomputing.com, lcherian@...vell.com,
 bobo.shaobowang@...wei.com, tan.shaopeng@...itsu.com,
 baolin.wang@...ux.alibaba.com, Jamie Iles <quic_jiles@...cinc.com>,
 Xin Hao <xhao@...ux.alibaba.com>, peternewman@...gle.com,
 dfustini@...libre.com, David Hildenbrand <david@...hat.com>,
 Rex Nie <rex.nie@...uarmicro.com>
Subject: Re: [PATCH v1 28/31] x86/resctrl: Drop __init/__exit on assorted
 symbols

Hi guys,

On 02/05/2024 16:58, Dave Martin wrote:
> On Wed, May 01, 2024 at 09:51:51PM +0530, Amit Singh Tomar wrote:
>>>>> I think James will need to comment on this, but I think that yes, it
>>>>> is probably appropriate to require a reboot.  I think an MPAM error
>>>>> interrupt should only happen if the software did something wrong, so
>>>>> it's a bit like hitting a BUG(): we don't promise that everything works
>>>>> 100% properly until the system is restarted.  Misbehaviour should be
>>>>> contained to MPAM though.

Indeed - all the reasons for the MPAM error interrupt being triggered indicate a software
bug, so re-mounting resctrl with the same buggy code isn't going to fix anything.

>>>> if "resctrl" is nonfunctional in this state, then this comment[1] here does
>>>> *not* make sense.
>>>>
>>>> "restore any modified controls to their reset values."

The MPAM driver goes on to reset all the MPAM hardware to the best of its ability.
This means everything gets set back to 100%, so it's as if MPAM is not implemented.
This is better than throttling the wrong task because an out-of-range PARTID for
${important_task} is using the configuration of ${background_process}...
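
As a rough user-space illustration of that reset behaviour (not the MPAM driver's code), the sketch below models a per-PARTID limit table. The names `mpam_model`, `mpam_model_reset()` and `mpam_model_effective_limit()` are hypothetical, as is the table size, and the modulo aliasing of an out-of-range PARTID is only an assumption chosen to show the failure mode; real hardware may behave differently.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PARTIDS 4u    /* hypothetical number of implemented PARTIDs */
#define MAX_PERCENT 100u  /* "100%" means no restriction at all */

/* Hypothetical model: one resource limit (in percent) per PARTID. */
struct mpam_model {
	unsigned int limit_percent[NUM_PARTIDS];
};

/* Put every control back to 100%, as if MPAM were not implemented. */
static void mpam_model_reset(struct mpam_model *m)
{
	assert(m != NULL);
	for (size_t i = 0; i < NUM_PARTIDS; i++)
		m->limit_percent[i] = MAX_PERCENT;
}

/*
 * ASSUMPTION: an out-of-range PARTID aliases onto an implemented one
 * (modelled here with a modulo). This is what lets one task pick up
 * another task's configuration.
 */
static unsigned int mpam_model_effective_limit(const struct mpam_model *m,
					       unsigned int partid)
{
	assert(m != NULL);
	return m->limit_percent[partid % NUM_PARTIDS];
}
```

In this model, if PARTID 1 belongs to a background process limited to 10%, a task wrongly given PARTID 5 is throttled to 10% as well. After `mpam_model_reset()` both see 100%, so nothing is throttled by mistake.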

>>> Can you clarify what you mean here?
>>
>> What I meant was, What's the rationale behind restoring the modified
>> controls, if user is going to restart the system anyways (in order to use
>> MPAM again),  but later realized that it is needed so that *non* MPAM loads
>> (user may still want to run other things even after MPAM error interrupt)
>> would not have any adverse effect with modified controls.
>>
>> Therefore, taking my statement back.
> 
> Ack: we can't force the system to restart without losing data.  Really,
> the decision about when and whether to attempt a graceful shutdown or
> reboot should be left to userspace.  But until userspace does shut down
> the system, we do our best to behave as if the broken part of the system
> (MPAM) were not present at all.

Dave's point about systemd choking on this is interesting - I'll go experiment with it.

The alternative here is to delete the __exit text completely as it can't be run, and
instead get MPAM's error interrupt to disable the static-keys and return -EIO for every
call into the arch code.
I didn't do this as it's likely to cause extra churn to ensure that every arch helper can
propagate errors back to user-space, and this seemed like a good (re-)use of existing code.
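
To make the shape of that alternative concrete, here is a minimal user-space sketch, not kernel code. A plain `bool` stands in for the static key, and `mpam_error_irq()`, `arch_get_config()`, `arch_set_config()` and `fs_write_schemata()` are hypothetical names invented for illustration. It shows where the churn comes from: every helper needs an int return, and every caller has to pass the error on.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_PARTIDS 4u  /* hypothetical number of implemented PARTIDs */

/* Stand-in for a static key: true while MPAM is considered usable. */
static bool mpam_usable = true;

/* Hypothetical per-PARTID control values, in percent. */
static unsigned int config[NUM_PARTIDS] = { 100, 100, 100, 100 };

/* Hypothetical error-interrupt handler: disable MPAM from here on. */
static void mpam_error_irq(void)
{
	mpam_usable = false;
}

/* Every arch helper checks the flag first and fails with -EIO. */
static int arch_get_config(unsigned int partid, unsigned int *percent)
{
	assert(percent != NULL);
	if (!mpam_usable)
		return -EIO;
	if (partid >= NUM_PARTIDS)
		return -EINVAL;
	*percent = config[partid];
	return 0;
}

static int arch_set_config(unsigned int partid, unsigned int percent)
{
	if (!mpam_usable)
		return -EIO;
	if (partid >= NUM_PARTIDS || percent > 100)
		return -EINVAL;
	config[partid] = percent;
	return 0;
}

/* A filesystem-level caller must propagate the error to user-space. */
static int fs_write_schemata(unsigned int partid, unsigned int percent)
{
	return arch_set_config(partid, percent);
}
```

Before `mpam_error_irq()` runs, `fs_write_schemata(1, 50)` returns 0. Afterwards every helper returns -EIO, which user-space would see as a failed write.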

The third option was to not do anything in MPAM, and just print a message to say bad
things might be happening. Given it's extra work for hardware to detect the error
conditions, I previously assumed no-one would do this, and hardware would just 'go wrong'
instead... but as someone has built this, it would be good to try and react to it.

Thanks,

James