Message-ID: <DM4PR84MB13737BE099BF599DF83617DBFDBF9@DM4PR84MB1373.NAMPRD84.PROD.OUTLOOK.COM>
Date: Wed, 15 Mar 2023 22:41:28 +0000
From: "Seymour, Shane M" <shane.seymour@....com>
To: Greg KH <gregkh@...uxfoundation.org>
CC: "Martin K. Petersen" <martin.petersen@...cle.com>,
"jejb@...ux.ibm.com" <jejb@...ux.ibm.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>
Subject: RE: [PATCH for-next] scsi: Implement host state statistics
> On Wed, Mar 15, 2023 at 06:08:19AM +0000, Seymour, Shane M wrote:
> > The following patch implements host state statistics via sysfs. The intent
> > is to allow user space to see the state changes and be able to report when
> > a host changes state. The files do not separate out the time spent into
> > each state but only into three:
>
> Why does userspace care about these things at all? What tool needs them
> and what can userspace do with the information?
>
In enterprise setups you may have a significant number of LUNs presented to a
system (100s to 1000s) via a single HBA (usually via FC). Having an HBA go
into error handling causes issues. Every time a host goes into SCSI EH all
I/O to that host is blocked until SCSI EH completes. That means waiting for
every I/O to either complete or time out before starting any recovery
processing.
At this time there is no way for anything outside of the kernel to know if an
HBA is having any issues. The cause of those issues can vary significantly;
just two examples:
1) Storage end point issues
2) SAN issues (e.g. laser transmit power at any point in the SAN)
My experience with downstream distros is that nobody seems to notice the
noise that SCSI EH produces (LUN, device, bus, host resets) and we see it
when we get a vmcore and have to try and work out what caused an I/O hang.
I wanted to be more proactive in warning users that they have a potential
storage issue they need to look at. It won't help when you have a sudden
massive issue but if you have an issue that is slowly getting worse over
a period of time you will at least get some warning.
> >
> > A (GPLv2) program called hostmond will be released in a few months that
> > will monitor these interfaces and report (local host only via syslog(3C))
> > when hosts change state.
>
> We kind of need to see this before the kernel changes can be accepted
> for obvious reasons, what is preventing that from happening now?
If you don't mind, I'll answer this in my reply to James' email soon since
he commented about this.
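Purely as an illustration of the kind of user space monitoring described
above, and not the hostmond code itself (which has not been released), a
minimal sketch could read a one-line sysfs attribute and log via syslog(3)
when its value changes. The function names read_attr() and
check_host_state() are hypothetical, and the attribute path is an assumption
supplied by the caller (for example the existing
/sys/class/scsi_host/host0/state file); the files added by the patch are not
named here.

```c
#include <stdio.h>
#include <string.h>
#include <syslog.h>

/*
 * Read a single-line sysfs attribute into buf and strip the trailing
 * newline. Returns 0 on success, -1 on failure.
 */
int read_attr(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Compare the current value of the attribute at path (hypothetical,
 * chosen by the caller) with the previously seen value in prev.
 * An empty prev means "nothing seen yet": the value is recorded
 * without logging. Returns 1 (and logs) on a change, 0 if unchanged,
 * -1 on a read error. prev is updated in place.
 */
int check_host_state(const char *path, char *prev, size_t len)
{
	char cur[64];

	if (read_attr(path, cur, sizeof(cur)) < 0)
		return -1;
	if (prev[0] == '\0') {
		snprintf(prev, len, "%s", cur);
		return 0;
	}
	if (strcmp(prev, cur) == 0)
		return 0;
	syslog(LOG_WARNING, "%s: state changed from %s to %s",
	       path, prev, cur);
	snprintf(prev, len, "%s", cur);
	return 1;
}
```

A monitor would call check_host_state() periodically for each host. Polling
like this can miss a short trip through SCSI EH between two reads, which is
part of why cumulative statistics exported by the kernel are more useful
than sampling the current state.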
>
> Please always use sysfs_emit() instead of the crazy scnprintf() for
> sysfs entries.
No problem, I can make that change.
>
> u32 is a kernel type, not uint32_t please, but I don't know what the
> scsi layer is used to.
No problem, I can make that change.
>
> thanks,
>
> greg k-h
Thank you for your willingness to provide feedback.
Shane