linux-kernel - Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors


Message-ID: <ZGT6sTOtk+WY3aYt@bhelgaas>
Date:   Wed, 17 May 2023 11:02:57 -0500
From:   Bjorn Helgaas <helgaas@...nel.org>
To:     Grant Grundler <grundler@...omium.org>
Cc:     Rajat Jain <rajatja@...omium.org>,
        Rajat Khandelwal <rajat.khandelwal@...ux.intel.com>,
        linux-pci@...r.kernel.org,
        Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
        linux-kernel@...r.kernel.org,
        Oliver O'Halloran <oohall@...il.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the
 correctable errors

On Fri, Apr 07, 2023 at 04:46:03PM -0700, Grant Grundler wrote:
> On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > On Fri, Apr 07, 2023 at 11:53:27AM -0700, Grant Grundler wrote:
> > > On Thu, Apr 6, 2023 at 12:50 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
> > > > On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote:
> > > > > From: Rajat Khandelwal <rajat.khandelwal@...ux.intel.com>
> > > > >
> > > > > There are many instances where correctable errors tend to inundate
> > > > > the message buffer. We observe such instances during thunderbolt PCIe
> > > > > tunneling.
> > > ...
> >
> > > > >               if (info->severity == AER_CORRECTABLE)
> > > > > -                     pci_info(dev, "   [%2d] %-22s%s\n", i, errmsg,
> > > > > -                             info->first_error == i ? " (First)" : "");
> > > > > +                     pci_info_ratelimited(dev, "   [%2d] %-22s%s\n", i, errmsg,
> > > > > +                                          info->first_error == i ? " (First)" : "");
> > > >
> > > > I don't think this is going to reliably work the way we want.  We have
> > > > a bunch of pci_info_ratelimited() calls, and each caller has its own
> > > > ratelimit_state data.  Unless we call pci_info_ratelimited() exactly
> > > > the same number of times for each error, the ratelimit counters will
> > > > get out of sync and we'll end up printing fragments from error A mixed
> > > > with fragments from error B.
> > >
> > > Ok - what I'm reading between the lines here is that the output should
> > > be emitted in one step, not in multiple pci_info_ratelimited() calls.
> > > If the code built an output string (using snprintf()), and then called
> > > pci_info_ratelimited() exactly once at the bottom, would that be
> > > sufficient?
> > >
> > > > I think we need to explicitly manage the ratelimiting ourselves,
> > > > similar to print_hmi_event_info() or print_extlog_rcd().  Then we can
> > > > have a *single* ratelimit_state, and we can check it once to determine
> > > > whether to log this correctable error.
> > >
> > > Is the rate limiting per call location or per device? From above, I
> > > understood rate limiting is "per call location".  If the code only
> > > has one call location, it should achieve the same goal, right?
> >
> > Rate-limiting is per call location, so yes, if we only have one call
> > location, that would solve it.  It would also have the nice property
> > that all the output would be atomic so it wouldn't get mixed with
> > other stuff, and it might encourage us to be a little less wordy in
> > the output.
> >
> 
> +1 to all of those reasons. Especially reducing the number of lines output.
> 
> I'm going to be out for the next week. If someone else (Rajat Khandelwal
> maybe?) wants to rework this to use one call location, it should be fairly
> straightforward. If not, I'll tackle this when I'm back (in 2 weeks
> essentially).

Ping?  Really hoping to merge this for v6.5.

Bjorn
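
For illustration only, the "single ratelimit_state, checked once per
error" idea discussed above can be modeled in user space roughly as
below.  This is a sketch under assumptions, not the patch that was
proposed or merged: struct rl_state, rl_allow() and report_correctable()
are made-up names, time is a plain caller-supplied counter, and the
message index is simply the array position.  In the kernel the
corresponding facility is struct ratelimit_state with __ratelimit().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical user-space stand-in for the kernel's struct ratelimit_state. */
struct rl_state {
	long interval;	/* window length, in caller-supplied time units */
	int burst;	/* reports allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* reports allowed so far in this window */
	int missed;	/* reports suppressed so far in this window */
};

/* Return 1 if a report at time 'now' may be logged, 0 if it is suppressed. */
static int rl_allow(struct rl_state *rs, long now)
{
	assert(rs->interval > 0 && rs->burst > 0);

	if (now - rs->begin >= rs->interval) {
		/* Window expired: start a new one and reset the counters. */
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return 1;
	}
	rs->missed++;
	return 0;
}

/*
 * Report one correctable error.  The shared gate is consulted exactly
 * once; after that either every line of the report is written or none
 * is, so fragments of different errors cannot get out of step.
 * Returns the number of lines written.
 */
static int report_correctable(struct rl_state *rs, long now, FILE *out,
			      const char *dev, const char *const *msgs, int n)
{
	int i, lines = 0;

	if (!rl_allow(rs, now))
		return 0;

	fprintf(out, "%s: correctable error\n", dev);
	lines++;
	for (i = 0; i < n; i++) {
		fprintf(out, "%s:    [%2d] %s\n", dev, i, msgs[i]);
		lines++;
	}
	return lines;
}
```

With a zero-initialized state such as { .interval = 10, .burst = 2 },
the first two reports in a window are written in full and later ones
are dropped whole until the window rolls over.  Contrast this with one
private gate per print call site, where each line of a report is
admitted or dropped independently.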