lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Wed, 11 Mar 2020 18:58:46 +0000
From:   "Nath, Arindam" <Arindam.Nath@....com>
To:     Logan Gunthorpe <logang@...tatee.com>,
        "Mehta, Sanju" <Sanju.Mehta@....com>,
        "jdmason@...zu.us" <jdmason@...zu.us>,
        "dave.jiang@...el.com" <dave.jiang@...el.com>,
        "allenbh@...il.com" <allenbh@...il.com>,
        "S-k, Shyam-sundar" <Shyam-sundar.S-k@....com>
CC:     "linux-ntb@...glegroups.com" <linux-ntb@...glegroups.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2 2/5] ntb_perf: send command in response to EAGAIN

> -----Original Message-----
> From: Logan Gunthorpe <logang@...tatee.com>
> Sent: Thursday, March 12, 2020 00:17
> To: Nath, Arindam <Arindam.Nath@....com>; Mehta, Sanju
> <Sanju.Mehta@....com>; jdmason@...zu.us; dave.jiang@...el.com;
> allenbh@...il.com; S-k, Shyam-sundar <Shyam-sundar.S-k@....com>
> Cc: linux-ntb@...glegroups.com; linux-kernel@...r.kernel.org
> Subject: Re: [PATCH v2 2/5] ntb_perf: send command in response to EAGAIN
> 
> 
> 
> On 2020-03-11 12:11 p.m., Nath, Arindam wrote:
> >> -----Original Message-----
> >> From: Logan Gunthorpe <logang@...tatee.com>
> >> Sent: Wednesday, March 11, 2020 03:01
> >> To: Mehta, Sanju <Sanju.Mehta@....com>; jdmason@...zu.us;
> >> dave.jiang@...el.com; allenbh@...il.com; Nath, Arindam
> >> <Arindam.Nath@....com>; S-k, Shyam-sundar <Shyam-sundar.S-
> >> k@....com>
> >> Cc: linux-ntb@...glegroups.com; linux-kernel@...r.kernel.org
> >> Subject: Re: [PATCH v2 2/5] ntb_perf: send command in response to
> EAGAIN
> >>
> >>
> >>
> >> On 2020-03-10 2:54 p.m., Sanjay R Mehta wrote:
> >>> From: Arindam Nath <arindam.nath@....com>
> >>>
> >>> perf_spad_cmd_send() and perf_msg_cmd_send() return
> >>> -EAGAIN after trying to send commands for a maximum
> >>> of MSG_TRIES re-tries. But currently there is no
> >>> handling for this error. These functions are invoked
> >>> from perf_service_work() through function pointers,
> >>> so rather than simply call these functions is not
> >>> enough. We need to make sure to invoke them again in
> >>> case of -EAGAIN. Since peer status bits were cleared
> >>> before calling these functions, we set the same status
> >>> bits before queueing the work again for later invocation.
> >>> This way we simply won't go ahead and initialize the
> >>> XLAT registers wrongfully in case sending the very first
> >>> command itself fails.
> >>
> >> So what happens if there's an actual non-recoverable error that causes
> >> perf_msg_cmd_send() to fail? Are you proposing it just requeues high
> >> priority work forever?
> >
> > The intent of the patch is to handle -EAGAIN, since the error code is
> > an indication that we need to try again later. Currently there is a very
> > small time frame during which ntb_perf should be loaded on both sides
> > (primary and secondary) to have XLAT registers configured correctly.
> > Failing that the code will still fall through without properly initializing the
> > XLAT registers and there is no indication of that either until we have
> > actually tried to perform 'echo 0 > /sys/kernel/debug/.../run'.
> >
> > With the changes proposed in this patch, we do not have to depend
> > on whether the drivers at both ends are loaded within a fixed time
> > duration. So we can simply load the driver at one side, and at a later
> > time load the driver on the other, and still the XLAT registers would
> > be set up correctly.
> >
> > Looking at perf_spad_cmd_send() and perf_msg_cmd_send(), if the
> > concern is that ntb_peer_spad_read()/ntb_msg_read_sts() fail because
> > of some non-recoverable error and we still schedule a high priority
> > work, that is a valid concern. But isn't it still better than simply falling
> > through and initializing XLAT register with incorrect values?
> 
> I don't think it's ever acceptable to get into an infinite loop.
> Especially when you're running on the system's high priority work queue...
> 
> At the very least schedule a delayed work item to try again in some
> number of seconds or something. Essentially just have more retires,
> perhaps with longer delays in between.

Sounds like a good idea. Thanks for the suggestion.

Arindam

> 
> Falling through and continuing with the wrong values is certainly wrong.
> I didn't notice that. If an error occurs, it shouldn't continue, it
> should print an error to dmesg and stop.
> 
> >
> >>
> >> I never really reviewed this stuff properly but it looks like it has a
> >> bunch of problems. Using the high priority work queue for some low
> >> priority setup work seems wrong, at the very least. The spad and msg
> >> send loops also look like they have a bunch of inter-host race condition
> >> problems as well. Yikes.
> >
> > I am not very sure what the design considerations were when having
> > a high priority work queue, but perhaps we can all have a discussion
> > on this.
> 
> I'd change it. Seems completely wrong to me.
> 
> Logan

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ