[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5089e197-5ee5-8775-501c-201546a0cc1a@roeck-us.net>
Date: Wed, 6 Sep 2017 19:19:36 -0700
From: Guenter Roeck <linux@...ck-us.net>
To: Andrew Jeffery <andrew@...id.au>
Cc: linux-hwmon@...r.kernel.org, jdelvare@...e.com, corbet@....net,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
openbmc@...ts.ozlabs.org, joel@....id.au
Subject: Re: [PATCH] hwmon: pmbus: Make reg check and clear faults functions
return errors
On 09/06/2017 04:32 PM, Andrew Jeffery wrote:
> On Wed, 2017-09-06 at 15:51 -0700, Guenter Roeck wrote:
>> On Wed, Sep 06, 2017 at 10:23:37AM +1000, Andrew Jeffery wrote:
>>> On Tue, 2017-09-05 at 10:00 -0700, Guenter Roeck wrote:
>>>> On Tue, Sep 05, 2017 at 05:01:32PM +1000, Andrew Jeffery wrote:
>>>>> Some functions exposed by pmbus core conflated errors that occurred when
>>>>> setting the page to access with errors that occurred when accessing
>>>>> registers in a page. In some cases, this caused legitimate errors to be
>>>>> hidden under the guise of the register not being supported.
>>>>>
>>>>
>>>> I'll have to look into this. If I recall correctly, the reason for not returning
>>>> errors from the "clear faults" command was that some early chips did not support
>>>> it, or did not support it on all pages. Those chips would now always fail.
>>>
>>> Yes, that is a concern.
>>>
>>> However, shouldn't the lack of support for CLEAR_FAULTS be a described
>>> property of the chip or page?
>>>
>>> Regardless, the issue here is the PAGE command is sometimes failing
>>> (more below). This means we won't have issued a CLEAR_FAULTS against
>>> the page even if the page supports CLEAR_FAULTS. That to me is
>>> something that should be propagated back up the call chain, as faults
>>> that can be cleared will not have been.
>>>
>>
>> We could also consider
>> - retry
>
> I guess that leads to the usual retries problem: How many? Oneshot? More?
>
> My expectation is that oneshot would be enough, but we'd still need to consider
> what to do if that wasn't successful.
>
> Does the retry policy need to be in the kernel? I guess if it's not we would
> need all operations in the path to the failure to be idempotent.
>
>> - Add a marker indicating that faults (specifically command errors)
>> are unreliable after clearing faults failed
>
> What kind of marker were you thinking? Something in dmesg? If it's anything
> else we'd probably want something to respond to the condition, which nothing
> would do currently.
>
>>
>> If we can't reliably clear faults, all bets are off.
>
> Yeah, it's a messy problem. However even if we resolve our issues in hardware
> (i.e. it's discovered that it is a design failure or something), I don't think
> we should drop a resolution to the problem highlighted by this patch.
>
>>
>>>>
>>>> On a higher level, I am not sure if it is a good idea to return an error
>>>> from a function intended to clear faults (and nothing else). That seems
>>>> counterproductive.
>>>>
>>>> Is this a problem you have actually observed ?
>>>
>>> Unfortunately yes. I was trying to reproduce some issues that we are
>>> seeing in userspace but wasn't having much luck (getting -EIO on
>>> reading a hwmon attribute). I ended up instrumenting our I2C bus
>>> driver, and found that the problem was more prevalent than what the
>>> read failures in userspace were indicating. The paths that were
>>> triggering these unreported conditions all traced through the check
>>> register and clear fault functions, which on analysis were swallowing
>>> the error.
>>>
>>>> If so, what is the chip ?
>>>
>>> Well, the chip that I'm *communicating* with is the Maxim MAX31785
>>> Intelligent Fan Controller. As to whether that's what is *causing* the
>>> PAGE command to fail is still up in the air. I would not be surprised
>>> if we have other issues in the hardware design.
>>>
>>> I haven't sent a follow-up to the series introducing the MAX31785
>>> driver because I've been chasing down communication issues. I felt it
>>> was important to understand whether the device has quirks that need to
>>> be worked around in the driver, or if our hardware design has bigger
>>> problems that should be handled in other ways. I've been in touch with
>>> Maxim who have asserted that some of the problems we're seeing cannot
>>> be caused by their chip.
>>>
>>
>> Guess I need to dig up my eval board and see if I can reproduce the problem.
>> Seems you are saying that the problem is always seen when issuing a sequence
>> of "clear faults" commands on multiple pages ?
>
> Yeah. We're also seeing bad behaviour under other command sequences as well,
> which lead to this hack of a work-around patch[1].
>
> I'd be very interested in the results of testing against the eval board. I
> don't have access to one and it seems Maxim have discontinued them.
>
> [1] https://patchwork.kernel.org/patch/9876083/
>
>>
>> The MAX31785 is microcode based, so I would not be entirely surprised if
>> it sometimes fails to reply if a sequence of <set page, clear faults>
>> commands is executed.
>
> Have you seen this behaviour in other microcode-based chips?
>
ltc2978 (commit e04d1ce9bbb) and most of the Zilker Labs chips (zl6100.c).
You could try the approach I used in the zl6100 driver: Add a minimum time
between accesses to the chip. I would probably no longer use udelay(), though,
but usleeep_range(), if I were to implement the code today.
>>
>> If possible, you might try reducing the i2c clock. I have not seen that with
>> Maxim chips, but some other PMBus chips don't operate reliably at 400 KHz.
>
> It's already running at 100kHz, which is the maximum clock rate the device is
> specified for. I've even underclocked the bus to 75kHz and still observed
> issues. Again, I wouldn't be surprised if the problem is something other than
> the Maxim chip, the bus that we have it on has a complex topology. I'm working
> with hardware people internally to try to isolate the problem.
>
If you can, you might try something in between. Early TI chips (ucd9000 variants,
some sold as OEM chips to third parties) had a problem with both 100kHz and 400kHz,
but worked fine between about 150 kHz and 350 kHz, for whatever reason. Of course,
maybe that was a signal problem on our boards at the time (the i2c signal routing
was quite complex).
Guenter
> Cheers,
>
> Andrew
>
>>
>> Guenter
>>
>>>>
>>>>>>> Signed-off-by: Andrew Jeffery <andrew@...id.au>
>>>>>
>>>>> ---
>>>>> Documentation/hwmon/pmbus-core | 12 ++--
>>>>> drivers/hwmon/pmbus/pmbus.h | 6 +-
>>>>> drivers/hwmon/pmbus/pmbus_core.c | 115 +++++++++++++++++++++++++++++----------
>>>>> 3 files changed, 95 insertions(+), 38 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/hwmon/pmbus-core b/Documentation/hwmon/pmbus-core
>>>>> index 8ed10e9ddfb5..3e9f41bb756f 100644
>>>>> --- a/Documentation/hwmon/pmbus-core
>>>>> +++ b/Documentation/hwmon/pmbus-core
>>>>> @@ -218,17 +218,17 @@ Specifically, it provides the following information.
>>>>> This function calls the device specific write_byte function if defined.
>>>>> Therefore, it must _not_ be called from that function.
>>>>>
>>>>> - bool pmbus_check_byte_register(struct i2c_client *client, int page, int reg);
>>>>> + int pmbus_check_byte_register(struct i2c_client *client, int page, int reg);
>>>>>
>>>>> - Check if byte register exists. Return true if the register exists, false
>>>>> - otherwise.
>>>>> + Check if byte register exists. Returns 1 if the register exists, 0 if it does
>>>>> + not, and less than zero on an unexpected error.
>>>>> This function calls the device specific write_byte function if defined to
>>>>> obtain the chip status. Therefore, it must _not_ be called from that function.
>>>>>
>>>>> - bool pmbus_check_word_register(struct i2c_client *client, int page, int reg);
>>>>> + int pmbus_check_word_register(struct i2c_client *client, int page, int reg);
>>>>>
>>>>> - Check if word register exists. Return true if the register exists, false
>>>>> - otherwise.
>>>>> + Check if word register exists. Returns 1 if the register exists, 0 if it does
>>>>> + not, and less than zero on an unexpected error.
>>>>> This function calls the device specific write_byte function if defined to
>>>>> obtain the chip status. Therefore, it must _not_ be called from that function.
>>>>>
>>>>> diff --git a/drivers/hwmon/pmbus/pmbus.h b/drivers/hwmon/pmbus/pmbus.h
>>>>> index bfcb13bae34b..c53032a04a6f 100644
>>>>> --- a/drivers/hwmon/pmbus/pmbus.h
>>>>> +++ b/drivers/hwmon/pmbus/pmbus.h
>>>>> @@ -413,9 +413,9 @@ int pmbus_write_byte_data(struct i2c_client *client, int page, u8 reg,
>>>>>>> u8 value);
>>>>>
>>>>> int pmbus_update_byte_data(struct i2c_client *client, int page, u8 reg,
>>>>>>> u8 mask, u8 value);
>>>>>
>>>>> -void pmbus_clear_faults(struct i2c_client *client);
>>>>> -bool pmbus_check_byte_register(struct i2c_client *client, int page, int reg);
>>>>> -bool pmbus_check_word_register(struct i2c_client *client, int page, int reg);
>>>>> +int pmbus_clear_faults(struct i2c_client *client);
>>>>> +int pmbus_check_byte_register(struct i2c_client *client, int page, int reg);
>>>>> +int pmbus_check_word_register(struct i2c_client *client, int page, int reg);
>>>>> int pmbus_do_probe(struct i2c_client *client, const struct i2c_device_id *id,
>>>>>>> struct pmbus_driver_info *info);
>>>>>
>>>>> int pmbus_do_remove(struct i2c_client *client);
>>>>> diff --git a/drivers/hwmon/pmbus/pmbus_core.c b/drivers/hwmon/pmbus/pmbus_core.c
>>>>> index f1eff6b6c798..153700e35431 100644
>>>>> --- a/drivers/hwmon/pmbus/pmbus_core.c
>>>>> +++ b/drivers/hwmon/pmbus/pmbus_core.c
>>>>> @@ -304,18 +304,24 @@ static int _pmbus_read_byte_data(struct i2c_client *client, int page, int reg)
>>>>>>> return pmbus_read_byte_data(client, page, reg);
>>>>>
>>>>> }
>>>>>
>>>>> -static void pmbus_clear_fault_page(struct i2c_client *client, int page)
>>>>> +static int pmbus_clear_fault_page(struct i2c_client *client, int page)
>>>>> {
>>>>>>>>>>>>> - _pmbus_write_byte(client, page, PMBUS_CLEAR_FAULTS);
>>>>>>> + return _pmbus_write_byte(client, page, PMBUS_CLEAR_FAULTS);
>>>>>
>>>>> }
>>>>>
>>>>> -void pmbus_clear_faults(struct i2c_client *client)
>>>>> +int pmbus_clear_faults(struct i2c_client *client)
>>>>> {
>>>>>>>>>>>>> struct pmbus_data *data = i2c_get_clientdata(client);
>>>>>>>>>>>>> + int rv;
>>>>>>> int i;
>>>>>
>>>>>
>>>>>>>>>>>>> - for (i = 0; i < data->info->pages; i++)
>>>>>>>>>>>>> - pmbus_clear_fault_page(client, i);
>>>>>>>>>>>>> + for (i = 0; i < data->info->pages; i++) {
>>>>>>>>>>>>> + rv = pmbus_clear_fault_page(client, i);
>>>>>>>>>>>>> + if (rv)
>>>>>>>>>>>>> + return rv;
>>>>>>> + }
>>>>>
>>>>> +
>>>>>>> + return 0;
>>>>>
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(pmbus_clear_faults);
>>>>>
>>>>> @@ -333,28 +339,45 @@ static int pmbus_check_status_cml(struct i2c_client *client)
>>>>>>> return 0;
>>>>>
>>>>> }
>>>>>
>>>>> -static bool pmbus_check_register(struct i2c_client *client,
>>>>> +static int pmbus_check_register(struct i2c_client *client,
>>>>>>>>>>>>> int (*func)(struct i2c_client *client,
>>>>>>>>>>>>> int page, int reg),
>>>>>>> int page, int reg)
>>>>>
>>>>> {
>>>>>>>>>>>>> + struct pmbus_data *data;
>>>>>>>>>>>>> + int check;
>>>>>>>>>>>>> int rv;
>>>>>>> - struct pmbus_data *data = i2c_get_clientdata(client);
>>>>>
>>>>>
>>>>>>>>>>>>> - rv = func(client, page, reg);
>>>>>>>>>>>>> - if (rv >= 0 && !(data->flags & PMBUS_SKIP_STATUS_CHECK))
>>>>>>>>>>>>> - rv = pmbus_check_status_cml(client);
>>>>>>>>>>>>> - pmbus_clear_fault_page(client, -1);
>>>>>>>>>>>>> - return rv >= 0;
>>>>>>> + data = i2c_get_clientdata(client);
>>>>>
>>>>> +
>>>>>>>>>>>>> + /*
>>>>>>>>>>>>> + * pmbus_set_page() guards transactions on the requested page matching
>>>>>>>>>>>>> + * the current page. This may be done in the execution of func(), but
>>>>>>>>>>>>> + * at that point a set-page error is conflated with accessing a
>>>>>>>>>>>>> + * non-existent register.
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> + rv = pmbus_set_page(client, page);
>>>>>>>>>>>>> + if (rv < 0)
>>>>>>> + return rv;
>>>>>
>>>>> +
>>>>>>>>>>>>> + check = func(client, page, reg);
>>>>>>>>>>>>> + if (check >= 0 && !(data->flags & PMBUS_SKIP_STATUS_CHECK))
>>>>>>> + check = pmbus_check_status_cml(client);
>>>>>
>>>>> +
>>>>>>>>>>>>> + rv = pmbus_clear_fault_page(client, -1);
>>>>>>>>>>>>> + if (rv < 0)
>>>>>>> + return rv;
>>>>>
>>>>> +
>>>>>>> + return check >= 0;
>>>>>
>>>>> }
>>>>>
>>>>> -bool pmbus_check_byte_register(struct i2c_client *client, int page, int reg)
>>>>> +int pmbus_check_byte_register(struct i2c_client *client, int page, int reg)
>>>>> {
>>>>>>> return pmbus_check_register(client, _pmbus_read_byte_data, page, reg);
>>>>>
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(pmbus_check_byte_register);
>>>>>
>>>>> -bool pmbus_check_word_register(struct i2c_client *client, int page, int reg)
>>>>> +int pmbus_check_word_register(struct i2c_client *client, int page, int reg)
>>>>> {
>>>>>>> return pmbus_check_register(client, _pmbus_read_word_data, page, reg);
>>>>>
>>>>> }
>>>>> @@ -390,7 +413,7 @@ static struct pmbus_data *pmbus_update_device(struct device *dev)
>>>>>
>>>>>>>>>>>>> mutex_lock(&data->update_lock);
>>>>>>>>>>>>> if (time_after(jiffies, data->last_updated + HZ) || !data->valid) {
>>>>>>>>>>>>> - int i, j;
>>>>>>> + int i, j, ret;
>>>>>
>>>>>
>>>>>>>>>>>>> for (i = 0; i < info->pages; i++) {
>>>>>>> data->status[PB_STATUS_BASE + i]
>>>>>
>>>>> @@ -424,7 +447,13 @@ static struct pmbus_data *pmbus_update_device(struct device *dev)
>>>>>>>>>>>>> sensor->page,
>>>>>>>>>>>>> sensor->reg);
>>>>>>>>>>>>> }
>>>>>>> - pmbus_clear_faults(client);
>>>>>
>>>>> +
>>>>>>>>>>>>> + ret = pmbus_clear_faults(client);
>>>>>>>>>>>>> + if (ret < 0) {
>>>>>>>>>>>>> + mutex_unlock(&data->update_lock);
>>>>>>>>>>>>> + return ERR_PTR(ret);
>>>>>>> + }
>>>>>
>>>>> +
>>>>>>>>>>>>> data->last_updated = jiffies;
>>>>>>>>>>>>> data->valid = 1;
>>>>>>> }
>>>>>
>>>>> @@ -754,6 +783,9 @@ static ssize_t pmbus_show_boolean(struct device *dev,
>>>>>>>>>>>>> struct pmbus_data *data = pmbus_update_device(dev);
>>>>>>> int val;
>>>>>
>>>>>
>>>>>>>>>>>>> + if (IS_ERR(data))
>>>>>>> + return PTR_ERR(data);
>>>>>
>>>>> +
>>>>>>>>>>>>> val = pmbus_get_boolean(data, boolean, attr->index);
>>>>>>>>>>>>> if (val < 0)
>>>>>>> return val;
>>>>>
>>>>> @@ -766,6 +798,9 @@ static ssize_t pmbus_show_sensor(struct device *dev,
>>>>>>>>>>>>> struct pmbus_data *data = pmbus_update_device(dev);
>>>>>>> struct pmbus_sensor *sensor = to_pmbus_sensor(devattr);
>>>>>
>>>>>
>>>>>>>>>>>>> + if (IS_ERR(data))
>>>>>>> + return PTR_ERR(data);
>>>>>
>>>>> +
>>>>>>>>>>>>> if (sensor->data < 0)
>>>>>>> return sensor->data;
>>>>>
>>>>>
>>>>> @@ -995,7 +1030,11 @@ static int pmbus_add_limit_attrs(struct i2c_client *client,
>>>>>>> struct pmbus_sensor *curr;
>>>>>
>>>>>
>>>>>>>>>>>>> for (i = 0; i < nlimit; i++) {
>>>>>>>>>>>>> - if (pmbus_check_word_register(client, page, l->reg)) {
>>>>>>>>>>>>> + ret = pmbus_check_word_register(client, page, l->reg);
>>>>>>>>>>>>> + if (ret < 0)
>>>>>>> + return ret;
>>>>>
>>>>> +
>>>>>>>>>>>>> + if (ret) {
>>>>>>>>>>>>> curr = pmbus_add_sensor(data, name, l->attr, index,
>>>>>>>>>>>>> page, l->reg, attr->class,
>>>>>>> attr->update || l->update,
>>>>>
>>>>> @@ -1041,6 +1080,8 @@ static int pmbus_add_sensor_attrs_one(struct i2c_client *client,
>>>>>>>>>>>>> if (!base)
>>>>>>>>>>>>> return -ENOMEM;
>>>>>>>>>>>>> if (attr->sfunc) {
>>>>>>> + int check;
>>>>>
>>>>> +
>>>>>>>>>>>>> ret = pmbus_add_limit_attrs(client, data, info, name,
>>>>>>>>>>>>> index, page, base, attr);
>>>>>>> if (ret < 0)
>>>>>
>>>>> @@ -1050,9 +1091,13 @@ static int pmbus_add_sensor_attrs_one(struct i2c_client *client,
>>>>>>>>>>>>> * alarm attributes, if there is a global alarm bit, and if
>>>>>>>>>>>>> * the generic status register for this page is accessible.
>>>>>>>>>>>>> */
>>>>>>>>>>>>> - if (!ret && attr->gbit &&
>>>>>>>>>>>>> - pmbus_check_byte_register(client, page,
>>>>>>> - data->status_register)) {
>>>>>
>>>>> +
>>>>>>>>>>>>> + check = pmbus_check_byte_register(client, page,
>>>>>>>>>>>>> + data->status_register);
>>>>>>>>>>>>> + if (check < 0)
>>>>>>> + return check;
>>>>>
>>>>> +
>>>>>>>>>>>>> + if (!ret && attr->gbit && check) {
>>>>>>>>>>>>> ret = pmbus_add_boolean(data, name, "alarm", index,
>>>>>>>>>>>>> NULL, NULL,
>>>>>>> PB_STATUS_BASE + page,
>>>>>
>>>>> @@ -1604,8 +1649,12 @@ static int pmbus_add_fan_attributes(struct i2c_client *client,
>>>>>>>>>>>>> if (!(info->func[page] & pmbus_fan_flags[f]))
>>>>>>> break;
>>>>>
>>>>>
>>>>>>>>>>>>> - if (!pmbus_check_word_register(client, page,
>>>>>>>>>>>>> - pmbus_fan_registers[f]))
>>>>>>>>>>>>> + ret = pmbus_check_word_register(client, page,
>>>>>>>>>>>>> + pmbus_fan_registers[f]);
>>>>>>>>>>>>> + if (ret < 0)
>>>>>>> + return ret;
>>>>>
>>>>> +
>>>>>>>>>>>>> + if (!ret)
>>>>>>> break;
>>>>>
>>>>>
>>>>>>> /*
>>>>>
>>>>> @@ -1628,9 +1677,13 @@ static int pmbus_add_fan_attributes(struct i2c_client *client,
>>>>>>>>>>>>> * Each fan status register covers multiple fans,
>>>>>>>>>>>>> * so we have to do some magic.
>>>>>>>>>>>>> */
>>>>>>>>>>>>> - if ((info->func[page] & pmbus_fan_status_flags[f]) &&
>>>>>>>>>>>>> - pmbus_check_byte_register(client,
>>>>>>>>>>>>> - page, pmbus_fan_status_registers[f])) {
>>>>>>>>>>>>> + ret = pmbus_check_byte_register(client, page,
>>>>>>>>>>>>> + pmbus_fan_status_registers[f]);
>>>>>>>>>>>>> + if (ret < 0)
>>>>>>> + return ret;
>>>>>
>>>>> +
>>>>>>>>>>>>> + if ((info->func[page] & pmbus_fan_status_flags[f])
>>>>>>>>>>>>> + && ret) {
>>>>>>> int base;
>>>>>
>>>>>
>>>>>>>>> if (f > 1) /* fan 3, 4 */
>>>>>
>>>>> @@ -1696,10 +1749,13 @@ static int pmbus_identify_common(struct i2c_client *client,
>>>>>>> struct pmbus_data *data, int page)
>>>>>
>>>>> {
>>>>>>>>>>>>> int vout_mode = -1;
>>>>>>> + int rv;
>>>>>
>>>>>
>>>>>>>>>>>>> - if (pmbus_check_byte_register(client, page, PMBUS_VOUT_MODE))
>>>>>>>>>>>>> + rv = pmbus_check_byte_register(client, page, PMBUS_VOUT_MODE);
>>>>>>>>>>>>> + if (rv == 1)
>>>>>>>>>>>>> vout_mode = _pmbus_read_byte_data(client, page,
>>>>>>> PMBUS_VOUT_MODE);
>>>>>
>>>>> +
>>>>>>>>>>>>> if (vout_mode >= 0 && vout_mode != 0xff) {
>>>>>>>>>>>>> /*
>>>>>>> * Not all chips support the VOUT_MODE command,
>>>>>
>>>>> @@ -1725,8 +1781,7 @@ static int pmbus_identify_common(struct i2c_client *client,
>>>>>>>>>>>>> }
>>>>>>> }
>>>>>
>>>>>
>>>>>>>>>>>>> - pmbus_clear_fault_page(client, page);
>>>>>>>>>>>>> - return 0;
>>>>>>> + return pmbus_clear_fault_page(client, page);
>>>>>
>>>>> }
>>>>>
>>>>> static int pmbus_init_common(struct i2c_client *client, struct pmbus_data *data,
>>>>> @@ -1756,7 +1811,9 @@ static int pmbus_init_common(struct i2c_client *client, struct pmbus_data *data,
>>>>>>>>>>>>> if (ret >= 0 && (ret & PB_CAPABILITY_ERROR_CHECK))
>>>>>>> client->flags |= I2C_CLIENT_PEC;
>>>>>
>>>>>
>>>>>>>>>>>>> - pmbus_clear_faults(client);
>>>>>>>>>>>>> + ret = pmbus_clear_faults(client);
>>>>>>>>>>>>> + if (ret < 0)
>>>>>>> + return ret;
>>>>>
>>>>>
>>>>>>>>>>>>> if (info->identify) {
>>>>>>> ret = (*info->identify)(client, info);
>>>>>
>>>>> --
>>>>> 2.11.0
>>>>>
>>
>>
Powered by blists - more mailing lists