linux-kernel - Re: [PATCH 1/2] i2c_imc: New driver for Intel's iMC, found on LGA2011 chips

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <54FC6D2B.3060307@roeck-us.net>
Date:	Sun, 08 Mar 2015 08:39:23 -0700
From:	Guenter Roeck <linux@...ck-us.net>
To:	Andy Lutomirski <luto@...capital.net>
CC:	Jean Delvare <khali@...ux-fr.org>,
	Boaz Harrosh <boaz@...xistor.com>,
	One Thousand Gnomes <gnomes@...rguk.ukuu.org.uk>,
	Rui Wang <ruiv.wang@...il.com>,
	Alun Evans <alun@...gerous.net>,
	Robert Elliott <Elliott@...com>,
	"linux-i2c@...r.kernel.org" <linux-i2c@...r.kernel.org>,
	Wolfram Sang <wsa@...-dreams.de>,
	Mauro Carvalho Chehab <m.chehab@...sung.com>,
	Tony Luck <tony.luck@...el.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] i2c_imc: New driver for Intel's iMC, found on LGA2011
 chips

On 03/08/2015 07:03 AM, Andy Lutomirski wrote:
> On Mar 7, 2015 6:39 AM, "Guenter Roeck" <linux@...ck-us.net> wrote:
>>
>> On 03/06/2015 06:50 PM, Andy Lutomirski wrote:
>>>
>>> Sandy Bridge Xeon and Extreme chips have integrated memory
>>> controllers with (rather limited) onboard SMBUS masters.  This
>>> driver gives access to the bus.
>>>
>>> There are various groups working on standardizing a way to arbitrate
>>> access to the bus between the OS, SMM firmware, a BMC, hardware
>>> thermal control, etc.  In the mean time, running this driver is
>>> unsafe except under special circumstances.  Nonetheless, this driver
>>> has real users.
>>>
>>> As a compromise, the driver will refuse to load unless
>>> i2c_imc.allow_unsafe_access=Y.  When safe access becomes available,
>>> we can leave this option as a way for legacy users to run the
>>> driver, and we'll allow the driver to load by default if safe bus
>>> access is available.
>>>
>>> Signed-off-by: Andy Lutomirski <luto@...capital.net>
>>> ---
[ ... ]

>>> +
>>> +       if (imc_wait_not_busy(priv, chan, &stat) != 0) {
>>> +               /* Timeout.  TODO: Reset the controller? */
>>> +               ret = -EIO;
>>
>>
>> timeout -> -ETIMEDOUT ?
>
> OK
>
Actually, I just realized that imc_wait_not_busy returns a valid error code.
Given that, some static analysis checkers (and now me) will ask you
why you don't just use the error code from imc_wait_not_busy.
This applies to other calls to the same function as well.

>>
>>
>>> +               dev_err(&priv->pci_dev->dev, "controller is wedged\n");
>>
>>
>> If this happens, it will presumably happen all the time and the message will
>> pollute the log. Is the message really necessary ?
>
> I'd rather log something to help diagnose.  Would rate-limiting it be okay?
>
It would still pollute the log because it doesn't happen that often.
A message once a second still fills the log.

If it is for diagnose/debugging, why not dev_dbg ?

>>
>>
>>> +               goto out_release;
>>> +       }
>>> +
>>> +       /*
>>> +        * Be paranoid: try to detect races.  This will only detect races
>>> +        * against BIOS, not against hardware.  (I've never seen this happen.)
>>> +        */
>>> +       pci_read_config_dword(priv->pci_dev, SMBCMD(chan), &final_cmd);
>>> +       pci_read_config_dword(priv->pci_dev, SMBCNTL(chan), &final_cntl);
>>> +       if (((cmd ^ final_cmd) & SMBCMD_OUR_BITS) ||
>>> +           ((cntl ^ final_cntl) & SMBCNTL_OUR_BITS)) {
>>> +               WARN(1, "iMC SMBUS raced against firmware");
>>> +               dev_emerg(&priv->pci_dev->dev,
>>
>>
>> Is a stack trace and dev_emerg really warranted here ?
>>
>
> If this happens, something's very wrong and the user should stop using
> the driver.  We could potentially write the wrong address, and, if we
> manage to screw up thermal management, we could potentially corrupt
> data for to an inappropriate refresh interval.
>
> IOW, I want to hear about it if this happens.
>
Ok, that explains the WARN. Still not an "emergency", though.

>>
>>> +                         "Access to channel %d raced: cmd 0x%08X->0x%08X, cntl 0x%08X->0x%08X\n",
>>> +                         chan, cmd, final_cmd, cntl, final_cntl);
>>> +               atomic_set(&imc_raced, 1);
>>> +               ret = -EIO;
>>> +               goto out_release;
>>> +       }
>>> +
>>> +       if (stat & SMBSTAT_SBE) {
>>> +               /*
>>> +                * Clear the error to re-enable TSOD polling.  The docs say
>>> +                * that, as long as SBE is set, TSOD polling won't happen.
>>> +                * The docs also say that writing zero to this bit (which is
>>> +                * the only writable bit in the whole register) will clear
>>> +                * the error.  Empirically, writing 0 does not clear SBE, but
>>> +                * it's probably still good to do the write in compliance with
>>> +                * the spec.  (TSOD polling still happens and seems to
>>> +                * clear SBE on its own.)
>>> +                */
>>> +               pci_write_config_dword(priv->pci_dev, SMBSTAT(chan), 0);
>>> +               ret = -ENXIO;
>>> +               goto out_release;
>>> +       }
>>> +
>>> +       if (read_write == I2C_SMBUS_READ) {
>>> +               if (stat & SMBSTAT_RDO) {
>>> +                       /*
>>> +                        * The iMC SMBUS controller thinks of SMBUS
>>> +                        * words as being big-endian (MSB first).
>>> +                        * Linux treats them as little-endian, so we need
>>> +                        * to swap them.
>>> +                        *
>>> +                        * Note: the controller will often (always?) set
>>> +                        * WOD here.  This is probably a bug.
>>> +                        */
>>> +                       if (size == I2C_SMBUS_WORD_DATA)
>>> +                               data->word = swab16(stat & SMBSTAT_RDATA_MASK);
>>> +                       else
>>> +                               data->byte = stat & 0xFF;
>>> +                       ret = 0;
>>
>>
>> ret is already guaranteed to be 0 here.
>>
>>
>>> +               } else {
>>> +                       dev_err(&priv->pci_dev->dev,
>>> +                               "Unexpected read status 0x%08X\n", stat);
>>> +                       ret = -EIO;
>>> +               }
>>> +       } else {
>>> +               if (stat & SMBSTAT_WOD) {
>>> +                       /* Same bug (?) as in the read case. */
>>> +                       ret = 0;
>>
>>
>> ret is already 0, so only the else case is really needed.
>>
>
> I wanted to keep the success and failure paths in the same order in
> both the read and write cases.  I'll remove the unnecessary
> assignment, though.
>
Coding style suggests

	if (!(stat & SMBSTAT_RDO)) {
		dev_err();
		ret - -EIO;
		goto out_release;
	}

and

	if (!(stat & SMBSTAT_WOD)) {
		dev_err();
		ret = -EIO;
		goto out_release;
	}

which would improve readability here a lot since it would reduce
indentation level by one.

On a side note, I am a bit confused about the note "same bug as in the read case".
Do you want to say that RDO is sometimes/often set in the write case ?
If so, it might make more sense to just say it.

[ ... ]

>>> +static void imc_free_channel(struct imc_priv *priv, int i)
>>> +{
>>> +       struct imc_channel *ch = &priv->channels[i];
>>> +
>>> +       /* This can recurse into imc_smbus_xfer. */
>>
>>
>> So ?
>
> It needs to happen before mutex_destroy.  I improved the comment.
>
Seems to me obvious that a mutex would be destroyed last in cleanup.

>>
>>
>>> +       i2c_del_adapter(&ch->adapter);
>>> +
>>> +       mutex_destroy(&ch->mutex);
>>> +}
>>> +
>>> +static struct pci_dev *imc_get_related_device(struct pci_bus *bus, unsigned int devfn, u16 devid)
>>> +{
>>> +       struct pci_dev *ret = pci_get_slot(bus, devfn);
>>
>>
>> ret is a bit unusual as variable name for a pointer. dev, maybe ?
>>
>>
>>> +       if (!ret)
>>> +               return NULL;
>>> +       if (ret->vendor != PCI_VENDOR_ID_INTEL || ret->device != devid) {
>>> +               pci_dev_put(ret);
>>> +               return NULL;
>>> +       }
>>> +       return ret;
>>> +}
>>> +
>>> +static int imc_probe(struct pci_dev *dev, const struct pci_device_id *id)
>>> +{
>>> +       int i, err;
>>> +       struct imc_priv *priv;
>>> +       struct pci_dev *sad;  /* System Address Decoder */
>>> +       u32 sad_control;
>>> +
>>> +       /* Paranoia: the datasheet says this is always at 15.0 */
>>> +       if (dev->devfn != PCI_DEVFN(15, 0))
>>> +               return -ENODEV;
>>> +
>>> +       /*
>>> +        * The socket number is hidden away on a different PCI device.
>>> +        * There's another copy at devfn 11.0 offset 0x40, and an even
>>> +        * less convincing copy at 5.0 0x140.  The actual APICID register
>>> +        * is "not used ... and is still implemented in hardware because
>>> +        * of FUD".
>>> +        *
>>> +        * In principle we could double-check that the socket matches
>>> +        * the numa_node from SRAT, but this is probably not worth it.
>>> +        */
>>> +       sad = imc_get_related_device(dev->bus, PCI_DEVFN(13, 6),
>>> +                                    PCI_DEVICE_ID_INTEL_SBRIDGE_BR);
>>> +       if (!sad)
>>> +               return -ENODEV;
>>> +       pci_read_config_dword(sad, SAD_CONTROL, &sad_control);
>>> +       pci_dev_put(sad);
>>> +
>>> +       priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>>
>>
>> devm_kzalloc() ?
>>
>
> Done.
>
>>
>>> +       if (!priv)
>>> +               return -ENOMEM;
>>> +       priv->pci_dev = dev;
>>> +
>>> +       pci_set_drvdata(dev, priv);
>>> +
>>> +       for (i = 0; i < 2; i++) {
>>> +               int j;
>>> +               err = imc_init_channel(priv, i, sad_control & 0x7);
>>> +               if (err) {
>>> +                       for (j = 0; j < i; j++)
>>
>>
>>                          if (i)
>>                                  imc_free_channel(priv, 0);
>>
>> would be a bit simpler and accomplish the same.
>
> I want to be ready for future hardware that might support more than
> two channels.
>
Not my call to make, but I am a bit wary of future hardware support which may
never materialize. I prefer writing code liks this for the current use case.
The time to optimize the code for the future hardware is if and when the future
hardware materializes.

In general, I am also in favor of the guidance in the coding style document,
which suggests to have a single error exit and handle any necessary cleanup there.
In this case, it could be

		if (err)
			goto exit_cleanup;
...
exit_cleanup:
	for (i--; i >= 0; i--)
		imc_free_channel(priv, i);
exit_free:

...

>>> +       }
>>> +
>>> +       pr_warn("using this driver is dangerous unless your firmware is specifically designed for it; use at your own risk\n");
>>
>>
>> Seems to me this is a bit noisy. User should already know.
>
> I think I'm willing to mildly annoy the smallish number of legitimate
> allow_unsafe_access users to help scare away all the people who like
> shiny decode-dimms toys and enable this because some forum told them
> to.  I could be convinced otherwise, though.
>
Not my call to make ... you'll have to convince the maintainer.

Anyway, I wonder if it would make sense to use acpi_check_resource_conflict()
to check if there is a resource conflict with the BIOS instead of all these
warnings, and if that would answer the concerns about unsynchronized access.
 From looking into the datasheet, I don't really see the difference to the
i2c-i801 driver and other drivers where chip access might conflict with
BIOS / ACPI access. I may have missed some discussion, though, so maybe that
has been discussed already and doesn't work in this case.

> One other question: from my reading of the spec, it should be possible to
> augment this driver to expose a temporate sensor subdevice that shows
> recent cached temperatures from HW DIMM measurements.  They would be
> redundant with the jc42 outputs, but it would be safe to use them even on
> systems without safe SMBUS arbitration.  Should I do that as a followup
> later on?
>

Without thinking too much about it, this should be a separate driver,
and I think it might actually be more valuable (since less risky)
than this entire patch set.

Thanks,
Guenter

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/