Message-ID: <20200714121955.GA2080@chrisdown.name>
Date: Tue, 14 Jul 2020 13:19:55 +0100
From: Chris Down <chris@...isdown.name>
To: Borislav Petkov <bp@...en8.de>
Cc: linux-kernel@...r.kernel.org, sean.j.christopherson@...el.com,
tony.luck@...el.com, torvalds@...ux-foundation.org, x86@...nel.org,
kernel-team@...com
Subject: Re: [PATCH -v2.1] x86/msr: Filter MSR writes
Hi Borislav,
This is certainly a good idea, but I wonder whether we should be more pragmatic
about the printk ratelimiting while we give userspace time to react and update
their methodologies.
As one example, there is a common MSR hack which is verging on essential if
you're doing thermally intensive work on some recent ThinkPads[0][1], and with
this patch applied it drastically reduces the signal-to-noise ratio in kmsg
(and this is only about five minutes after boot):
% dmesg | wc -l
2963
% dmesg | grep -c 'unrecognized MSR'
2411
That is, even with pr_err_ratelimited, we still end up logging basically every
single write, even though it's the same TGID writing to the same MSRs, and
these messages end up making up >80% of kmsg.
Of course, one can boot with `allow_writes=1` to avoid these messages entirely,
but that has the downside that one doesn't get _any_ notification about these
problems in the first place, and so is much more likely to forget to fix them.
It might be better if this were less binary: still logged, just less often, so
that application developers _do_ have an incentive to improve their current
methods, without us pushing other useful messages out of the kmsg buffer.
This one example isn't the point, of course: I'm sure there are plenty of other
non-ideal-but-pragmatic cases where people are writing to MSRs from userspace
right now, and it will take time for those people to find other solutions.
I completely agree with you that there should be a better solution for these
cases, and that writing to MSRs from userspace is really not a good idea.
However, going from zero to over 80% of dmesg in cases where these MSRs are
repeatedly used seems too fast to me.
Have you considered making the error logging ramp up more gradually by giving
this printk its own, more conservative `struct ratelimit_state`, as we do in
some other places with similar noise concerns? Then we could gradually make the
warnings more aggressive as time goes on, up until the point where we make
allow_writes=0 the default.
Thanks,
Chris
0: Lenovo has supposedly been working on a fix since last year, but there's no news yet.
1: https://github.com/erpalma/throttled