linux-kernel - Re: short-circuit and over-current IRQs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <536ce0f7-b299-f39c-15e1-a7c1151d0fd8@somainline.org>
Date:   Sat, 30 Jan 2021 16:43:46 +0100
From:   AngeloGioacchino Del Regno 
        <angelogioacchino.delregno@...ainline.org>
To:     matti.vaittinen@...rohmeurope.com, Mark Brown <broonie@...nel.org>
Cc:     "lgirdwood@...il.com" <lgirdwood@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: short-circuit and over-current IRQs

Il 29/01/21 10:14, Matti Vaittinen ha scritto:
> 
> On Thu, 2021-01-28 at 12:10 +0000, Mark Brown wrote:
>> On Thu, Jan 28, 2021 at 09:23:08AM +0000, Vaittinen, Matti wrote:
>>> On Wed, 2021-01-27 at 16:32 +0000, Mark Brown wrote:
>>>> Note that the events the API currently has are expected to be for
>>>> the
>>>> actual error conditions, not for the warning ones - indicating
>>>> that
>>>> the
>>>> voltage is out of regulation for example.
>>> I am unsure how to interpret this. What is the criteria of issue
>>> being
>>> an error/warning. When I was talking about warning I meant that the
>>> issue which is detected is unexpected and abnormal (error?) - but
>>> might
>>> still be recoverable (warning?). I understand the regulator
>>> framework
>>> must not signal same events for different purposes - but I don't
>>> really
>>> know what the current events are used for - I am grateful for any
>>> guidance!
>>
>> What the majority of hardware interrupts on is situations where
>> things
>> have already gone out of spec and there are actual problems with the
>> output - for example with current limiting there's often an actual
>> limiter in there so the regulator simply won't supply any more
>> current
>> than is configured.  With a warning everything is still working fine
>> but
>> getting close to not doing so.
>>
>>>> Well, if these things are kicking in the hardware is in serious
>>>> trouble
>>>> anyway so it's unclear what the system would be likely to do in
>>>> software, and also unclear how safe it is to rely on software to
>>>> be
>>>> able
>>>> to take that action given that it let things get into such a bad
>>>> state
>>>> in the first place.
>>> Actually, bear with me but I am unsure why we have these
>>> notifications
>>> if we don't expect SW to be able to do anything? Wouldn't the panic
>>> print be all that is needed then? I think that setups which have
>>> dual
>>
>> You'll notice that there aren't any actual users of this stuff in
>> tree
>> at the minute - people don't generally put much effort into software
>> recovery as they're not expecting to be anywhere near limiting in
>> normal
>> operation.  What I'd expect people to do where they do implement
>> handling is something like shutting down all other supplies on the
>> device, possibly also trying to shut down the system as a
>> whole.  Things
>> more about preventing physical damage rather than being part of the
>> normal operation of the system.
>>
>> For thermal issues systems generally try to apply software limits
>> well
>> before an individual component starts flagging things up with an
>> interrupt, the limits that devices have are generally super high and
>> often there'll be issues at a system level (eg, a case getting
>> unusably
>> hot) earlier and it can take a while for responses to have an impact.
>>
>>> limits (one for initiating potential SW recovery - other for HW to
>>> forcing protection) actually make sense. So does implementing
>>> notifiers
>>> / error statuses for events where SW recovery is potentially
>>> helpful.
>>> But whether the existing event notifications / error flags are
>>> correct
>>> for these is something I can't decide :) Here I ask guidance for
>>> Mark &
>>> others who know what is the idea behind existing error-
>>> flags/events.
>>
>> It's not that we shouldn't implement support for warnings, it's that
>> they're not the common case for hardware and so won't line up with
>> behaviour for other users.

....but anyway, I have no idea what would you do when a warning is
triggered: make a good example, please...

I mean, take for example the "usual display": you are delivering power
to a DriverIC (which is most of the times undocumented), your display
starts drawing more than expected, so... uh.. so what?
You shut it down? You reboot the DrIC?
I can't imagine a DrIC reboot to restore the power draw...

> 
> I have been pondering this - and I guess evaluating the severity of
> these issues belong to board designer's table. (As when to expect
> things be already so wrong that something is broken "ERROR" - or when
> things are approaching the point where things are wrong - but might
> still be recoverable by <insert recovery action here> "WARNING").
> 
> So, what comes to my mind is pushing this decision to a board designer.
> And to me this sounds like device-tree configuration.
> 

That's 100% DT (or ACPI, whatever), since we're in 2020 :P

> For regulator framework that would mean supporting the warning level
> events (and regulator_get_error_flags warning-flags?) too.
> 
> For (my) driver this would mean getting the information from device-
> tree. I didn't yet check the existing DT properties (if any for these
> levels) - but I think we should have something like:
> 
> regulator-over-voltage-protection-microvolt = <>;
> regulator-over-voltage-error-microvolt = <>;
> regulator-over-voltage-warning-microvolt = <>;
> 
> *-protection-* would be point where HW (or driver) shuts down the
> outputs w/o asking questions. (if HW based protection limit can not be
> set but protection can be enabled/disabled then value '1' would mean
> enable, zero would mean disable).
> 

regulator-over-voltage-protection;
^^^ that would look better, since we already have a
"regulator-over-current-protection" property to enable OCP.

Then, my suggestion would be:
regulator-over-voltage-max-microvolt = <uint32>;

And to signal that the hardware will auto-shutdown the rail:
regulator-over-voltage-auto-shutdown;

> *-error-* would be point where driver punts the existing error events
> via notifications allowing consumers to implement what ever is needed
> in order to handle error (which now means that something is likely to
> be broken / out of spec already.
> 

This one would be replaced by the property that I've proposed above

> *-warning-* would be point where driver punts new to-be-invented events
> / error flags and consumers can implement recovery actions assuming the
> HW is still operable but getting to the point where things are going to
> be shut down.
> 

regulator-over-voltage-warn-microvolt = <uint32>;


Less is more, I think. "warn" is shorter, looks nicer to me, but then,
it's not a big deal, that was just a nit.

> The BD9576 would then allow giving either *-error-* or *-warning-*
> properties (not both at least for now) and then selects the flag /
> event based on given property.
> 
> Do you think this makes sense? I hope that having these properties and
> flags/events would help actually utilizing these IRQs from PMICs which
> support them - and also help inventing recovery actions in consumer
> drivers, eventually saving the world for sure :]
> 
> If no one disagrees with this plan, then I know what I am going to do
> next week :]
> 

But then, as I just said, there's already "something" for over-current,
so... if you implement only the OVP stuff, then the implementation would
look "incomplete".
At this point it would be nicer to add another 10 lines of code to
replicate the same on OCP.

Keep in mind, though, that at least the qcom-labibb regulator does *not*
support what you propose: that one has auto-shutdown (OVP) enabled by
default (and *cannot* be disabled), no auto-shutdown on OCP (and no way
to enable it).

So, for labibb then "regulator-over-voltage-auto-shutdown" would be
sort of implicit, so every driver implementing similar hardware would
have to error out in case one property is set (ovp max voltage)but not
the other (auto-shutdown).

In a way, this looks messy.. but it may also make a lot of sense.

> Best Regards
> 	Matti Vaittinen
> 

Yours,
--Angelo