Message-ID: <DM4PR84MB137383C5042F6554636774B6FDBC9@DM4PR84MB1373.NAMPRD84.PROD.OUTLOOK.COM>
Date: Thu, 16 Mar 2023 00:15:15 +0000
From: "Seymour, Shane M" <shane.seymour@....com>
To: "jejb@...ux.ibm.com" <jejb@...ux.ibm.com>,
Greg KH <gregkh@...uxfoundation.org>
CC: "Martin K. Petersen" <martin.petersen@...cle.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-api@...r.kernel.org" <linux-api@...r.kernel.org>,
"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>
Subject: RE: [PATCH for-next] scsi: Implement host state statistics
> On Wed, 2023-03-15 at 07:36 +0100, Greg KH wrote:
> > On Wed, Mar 15, 2023 at 06:08:19AM +0000, Seymour, Shane M wrote:
> > > The following patch implements host state statistics via sysfs. The
> > > intent is to allow user space to see the state changes and be able
> > > to report when a host changes state. The files do not separate out
> > > the time spent into each state but only into three:
> >
> > Why does userspace care about these things at all?
>
> This is the most important question: Why are times spent in various
> states and transition counts important? Is this some kind of
> predictive failure system, or is it simply logging? If it's logging,
> wouldn't you get better information if we output state changes as they
> occur then they'd appear as timestamped entries in the syslog from
> which all these statistics could be deduced?
Hi James,
I had to write something to read the statistics to ensure that what was
being provided was sane and usable. Currently the program does:
1) Logging of state changes (with a count and what the current state is).
2) Logging a percentage of time spent in recovery over the last interval
(default 10 minutes) if that percentage is increasing.
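As a rough illustration of item 2, the per-interval percentage could be
derived from two samples of a cumulative recovery-time counter. This is
only a sketch: the Sample type, its field names and the millisecond
units are assumptions made for the example, not the interface the patch
exposes.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Both fields are hypothetical; units are assumed to be milliseconds.
    wall_ms: int      # monotonic time at which the sample was taken
    recovery_ms: int  # cumulative time the host has spent in recovery


def recovery_percent(prev: Sample, cur: Sample) -> float:
    """Percentage of the interval between two samples spent in recovery."""
    elapsed = cur.wall_ms - prev.wall_ms
    if elapsed <= 0:
        return 0.0
    # Guard against a counter that went backwards (e.g. host re-added).
    spent = max(0, cur.recovery_ms - prev.recovery_ms)
    return min(100.0, 100.0 * spent / elapsed)


def should_log(prev_pct: float, cur_pct: float) -> bool:
    """Log only when the percentage is increasing between intervals."""
    return cur_pct > prev_pct
```

With a 10 minute interval (600000 ms) and 30000 ms of that spent in
recovery, recovery_percent() gives 5.0.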
I do plan on implementing the following in the near future:
1) Keeping statistical information in memory (for at least):
a) Hourly for the last 96 hours
b) Daily for the last 90 days
2) Analysing that data hourly and daily to determine if there is a
trend that is increasing or decreasing in terms of the count and the
time spent (if any) in recovery. That is, are things getting better,
worse, or staying the same?
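One possible shape for that analysis, purely as a sketch: keep
fixed-size windows in memory and classify the trend from the sign of a
least-squares slope. The slope method, the tolerance value and all of
the names below are assumptions chosen for illustration; the actual
analysis may well end up different.

```python
from collections import deque

HOURLY_SLOTS = 96  # hourly values for the last 96 hours
DAILY_SLOTS = 90   # daily values for the last 90 days

# Oldest entries fall off automatically once a window is full.
hourly = deque(maxlen=HOURLY_SLOTS)
daily = deque(maxlen=DAILY_SLOTS)


def trend_slope(values) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def classify(values, tolerance: float = 1e-9) -> str:
    """Return 'worse', 'better' or 'same' for a window of recovery values."""
    slope = trend_slope(values)
    if slope > tolerance:
        return "worse"   # count or time in recovery is trending up
    if slope < -tolerance:
        return "better"  # count or time in recovery is trending down
    return "same"
```

The same classify() call could be applied to either the state change
count or the time spent in recovery, once per hour on the hourly window
and once per day on the daily window.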
My end goal is to provide at least some warning that there may be a
storage issue and whether it appears to be getting worse. I want the
user space program to do more than just log messages about state
changes.
Your idea about outputting state changes is interesting, but I can see
several drawbacks. The first is that if you use syslog you don't
really have any idea where the messages will end up. Different distros
have different destinations (e.g. messages vs syslog vs systemd
journal), and you can configure the syslog daemon so that the messages
always end up on a different system.
There will be issues handling those files as well. You need to cope with
log file rotation, how many copies of old messages/syslog files are kept
when rotated, whether they are compressed (and reading them when they
are), whether any are missing, and how far to go back if there are a lot
of old messages/syslog files. I think you would need to look at them all
to determine which files were relevant and needed to be processed.
Having said that, none of those issues is insurmountable, but together
they make it hard to do the analysis I want to implement on the data.
The variability in the quantity of available data (how many
messages/syslog files you have) over a period of time is a challenge.
>
> > What tool needs them and what can userspace do with the
> > information?
> > >
> [...]
> > > A (GPLv2) program called hostmond will be released in a few months
> > > that will monitor these interfaces and report (local host only via
> > > syslog(3C)) when hosts change state.
> >
> > We kind of need to see this before the kernel changes can be accepted
> > for obvious reasons, what is preventing that from happening now?
>
> I don't think that's a requirement. The whole point of sysfs is it's
> user readable, so we don't need a tool to make use of its entries. On
> the other hand if this tool can help elucidate the use case for these
> statistics, then publishing it now would be useful to help everyone
> else understand why this is useful.
At the moment, the main use of the existing code would be to make it
easier to work out how to read the statistics from the sysfs files.
If the feedback is to wait until I've fully implemented the user space
program with the analysis component and made it available, I'm more than
happy to do that.
Thanks
Shane
>
> James