linux-kernel - Re: short-circuit and over-current IRQs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5bf8b75f3a2f9db5fc200a9418ece5dfa2f91ab5.camel@fi.rohmeurope.com>
Date:   Thu, 28 Jan 2021 09:23:08 +0000
From:   "Vaittinen, Matti" <Matti.Vaittinen@...rohmeurope.com>
To:     "angelogioacchino.delregno@...ainline.org" 
        <angelogioacchino.delregno@...ainline.org>,
        "broonie@...nel.org" <broonie@...nel.org>
CC:     "lgirdwood@...il.com" <lgirdwood@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: short-circuit and over-current IRQs


On Wed, 2021-01-27 at 16:32 +0000, Mark Brown wrote:
> On Wed, Jan 27, 2021 at 03:34:46PM +0100, AngeloGioacchino Del Regno
> wrote:
> > Il 27/01/21 13:56, Matti Vaittinen ha scritto:
> > > I can only speak for BD9576MUF - which has two limits for
> > > monitored
> > > entity (temperature/voltage). One limit being 'warning' limit (or
> > > 'detection' as data-sheet says), the other being 'protection'
> > > limit.
> > I would tend to agree with you here, Matti. Also from what I
> > understand,
> > the wanted outcome is software handling a possibly temporary issue
> > with
> > you charging caps, external IC initialization using (expectedly)
> > much
> > more power than needed before stabilizing, and eventually handling
> > other
> > "real" issues for which there is a solution that may not even
> > include
> > disabling the regulator itself, but some other handling on the
> > connected
> > device driver.
> 
> Note that the events the API currently has are expected to be for the
> actual error conditions, not for the warning ones - indicating that
> the
> voltage is out of regulation for example.

I am unsure how to interpret this. What is the criteria of issue being
an error/warning. When I was talking about warning I meant that the
issue which is detected is unexpected and abnormal (error?) - but might
still be recoverable (warning?). I understand the regulator framework
must not signal same events for different purposes - but I don't really
know what the current events are used for - I am grateful for any
guidance!

So, I would like to have an event which the driver can signal to
consumers so that the consumer drivers can try to recover problem W/O
shutting down the SoC.

I try to explain what the BD9576 does and how I see it - I hope you can
then tell me if I am misusing existing events/errors. Sorry for longish
explanation...

The patch I sent did just use existing flags as follows:

REGULATOR_EVENT_UNDER_VOLTAGE:

Sent when PMIC informs regulator output being under the target. Limit
for this is configurable for each output - from roughly at 0.9*(set
Vout) up to 0.99*(set Vout). The detection can also be disabled.

I think that if this is enabled for system, then this should be
regarded as "error condition" - but HW does not shut down. Furthermore,
BD9576 does NOT have "protection" limit for output dropping. It does
have a protection limit for regulator input dropping though - and PMIC
will forcibly shut down if voltage input of a regulator drops below
certain value.

What action the system can take when output voltage drops depends on
system and is not really up to the regulator driver to decide. But is
this notification correct? (I thought so just judging the name). What I
thought is that it would be nice to inform consumers about the detected
under-voltage (as HW supports this). In addition to this notification I
added a 'state flag' in driver so that if someone calls the
regulator_get_error we do return REGULATOR_ERROR_UNDER_VOLTAGE. Are
these correct flags?

REGULATOR_EVENT_REGULATION_OUT:

Current patch send this when PMIC informs regulator output being OVER
the set voltage. PMIC supports similar limit configuration / feature
disabling as with UVD. The over voltage limit can be set roughly from
1.01*(set Vout) to 1.1*(set Vout).

PMIC has also protection limits for regulator outputs so that if output
is roughly 1.2*(set Vout) the PMIC will shut down.

Again, I think this is an error condition but what action system
can/should take if output voltage is exceeding the limit is not known
by me.

I used same regulator_get_error logic and
REGULATOR_ERROR_REGULATION_OUT is returned by regulator_get_error when
condition is not cleared.

Again - I would appreciate knowledge about if this is correct flag.

REGULATOR_EVENT_OVER_TEMP:

Sent when PMIC gives us "thermal warning" IRQ. The PMIC has thermal
shut-down point 175 C when it will immediately cut the power from
system to protect it from damaging. The thermal warning IRQ is firing
at 30 C lower temperature and HW performs no other action here. AFAIR
the temperature limit can not be adjusted.

I think this is again an error condition - kind of last resort for the
system to do something before PMIC shuts it down. I don't think this is
intended to be any 'normal temperature regulation' feature. What system
can do here (turn off some HW? Reduce clocks? maxx fans? Shut down
graphichs block and send warning to display using TCON image overlay?
Forcibly close the panorama windows to cut sunlight?) depends on
system.

Here I also added REGULATOR_ERROR_OVER_TEMP to be returned by
regulator_get_error.

Finally, for one of the outputs we have "over current warning"
interrupt and over current protection. The limit for OCW is
configurable (data-sheet speaks of milliVOLTS here so maybe shunt is
used?) and it can also be disabled. Protection is again not
configurable and will shut down the outputs immediately.

I think it is again up to system designer to decide whether the OCW
interrupt should be used to inform error condition. I used
REGULATOR_EVENT_OVER_CURRENT and REGULATOR_ERROR_OVER_CURRENT
respectively. 


>   If you're supporting warning
> notifications as well you'll want to add more events (or possibly
> another flag to munge in with the existing events to indicate that
> it's
> a warning rather than an error).

As I said, I need some help here to decide if these events are what
existing notification flags are intended to be used for or if I should
add new notification(s)/flags? I think that from user perspective it
boils down to question if there is some different recovery actions to
be taken? In the case of BD9576MUF I don't see any other meaningfull
error events because, as I said, when the "protection" level is reached
the IRQs can't really be handled (because power from all outputs is
shut. Unless another processor which is not being powered by the BD9576
is controlling it - and I don't think this would be the case here).

> 
> > Though, Mark: for example, on qcom labibb, there's "PBS" that is
> > killing
> > the regulators on short-circuit condition and as you see, handling
> > that
> > is a little trickier compared to the over-current one, where there
> > is no
> > such auto-magic-thing...
> > .... I wouldn't know if it'd be a good idea to have a system like
> > qcom's
> > PBS everywhere.
> > For the sake of protecting HW "paranoidly" though.. maybe :))
> 
> Well, if these things are kicking in the hardware is in serious
> trouble
> anyway so it's unclear what the system would be likely to do in
> software, and also unclear how safe it is to rely on software to be
> able
> to take that action given that it let things get into such a bad
> state
> in the first place.

Actually, bear with me but I am unsure why we have these notifications
if we don't expect SW to be able to do anything? Wouldn't the panic
print be all that is needed then? I think that setups which have dual
limits (one for initiating potential SW recovery - other for HW to
forcing protection) actually make sense. So does implementing notifiers
/ error statuses for events where SW recovery is potentially helpful.
But whether the existing event notifications / error flags are correct
for these is something I can't decide :) Here I ask guidance for Mark &
others who know what is the idea behind existing error-flags/events.

Best Regards
	Matti Vaittinen