netdev - RE: [PATCH net V4 2/2] igb: offer a PTP Hardware Clock instead of the timecompare method

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Mon, 26 Mar 2012 20:46:13 +0000
From:	"Keller, Jacob E" <jacob.e.keller@...el.com>
To:	chetan loke <loke.chetan@...il.com>
CC:	Richard Cochran <richardcochran@...il.com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"e1000-devel@...ts.sourceforge.net" 
	<e1000-devel@...ts.sourceforge.net>,
	"Kirsher, Jeffrey T" <jeffrey.t.kirsher@...el.com>,
	"Ronciak, John" <john.ronciak@...el.com>,
	"john.stultz@...aro.org" <john.stultz@...aro.org>,
	"tglx@...utronix.de" <tglx@...utronix.de>
Subject: RE: [PATCH net V4 2/2] igb: offer a PTP Hardware Clock instead of
 the timecompare method


> -----Original Message-----
> From: chetan loke [mailto:loke.chetan@...il.com]
> Sent: Monday, March 26, 2012 12:32 PM
> To: Keller, Jacob E
> Cc: Richard Cochran; netdev@...r.kernel.org; e1000-
> devel@...ts.sourceforge.net; Kirsher, Jeffrey T; Ronciak, John;
> john.stultz@...aro.org; tglx@...utronix.de
> Subject: Re: [PATCH net V4 2/2] igb: offer a PTP Hardware Clock instead of the
> timecompare method
> 
> On Mon, Mar 26, 2012 at 2:56 PM, Keller, Jacob E <jacob.e.keller@...el.com>
> wrote:
> 
> > I am curious how you see a kernel thread resolving the Tx/Rx issue? or
> > is the kernel thread being used by gettime? I don't believe we can
> > wait for the Tx/Rx
> 
> tx, rx path and kernel thread will use seq_lock_isave(tmreg_lock).
> 
> A new 'u64 cached_ns' will be introduced.
> 
> tx and rx path can update 'cached_ns' when they read the NIC counter.
>
Tx and rx paths don't read the SYSTIM registers. When hardware (other
than the 82580 which could use a different software implementation)
receives a PTP packet, it is time stamped *by the hardware* and the
result is stored in the RXSTAMP registers (analog for rx/tx). The
hardware will not timestamp again until after the register is read.
(ignore 82580, it is different). Then the driver will read this register.

The clincher for why we have so much issues here: the SYSTIM registers
are *not* 64bit nanosecond values. They are 64 bits wide (82580 is 40
bits + 32 bits of sub precision or 72 bits wide) but with a fixed point
system. This system allows for defining how much room for adjustment
you want but sacrifices the total amount that can be stored as nanoseconds.

The timecounter structure keeps track of a 64bit field of nanoseconds,
and is based on the cyclecounter. (By knowing how fast the cyclecounter
should be running at 100% optimum circumstances.)

In the tx/rx routine, we convert the cycle timestamps which has an
overflow rate anywhere from once a minute to a few hours. This
conversion uses last updated value of the timecounter, but does not
modify/update the counter. However, it has to lock the structure in
case an update occurs during the cyc2time function call, otherwise you
can corrupt the values. There is no problem using seqlock here. (as
long as we retry). I would take the write lock here, so that it doesn't
have to retry, although you could take the read version. Retrying is
necessary because otherwise there is a window where wildly inaccurate
nanosecond values could be obtained.

The issue with gettime is in the SYSTIM registers. Gettime could also
use cyc2time, and allow the background worktask to be the only one updating
the timecounter's last read value. (this only has to happen at least twice
every wraparound value, so for the 10gig 82599 about once every 30 seconds).
But the lock is also protecting against corruption of the SYSTIM register
reads. (The lock does 2 different things. We could add a separate lock,
but I am not convinced it is worth it).

We can't use timecounter_read inside a read portion of the seqlock because
it has side effects. Reading SYSTIM registers has side effects also
because we have to read them sequentially (lo, then hi, cannot intersperse
two sets of lo and hi reads).

I wrote a patch which shows how you could use just seqlocks, but it has
the issue mentioned above with the SYSTIM registers. These only get used
in the gettime code and the watchdog task that updates the timecounter.
We could implement a basic spinlock around that also, and that would mean
the seqlock would only be around the timecounter structure. This does add
the complication of two locks. :(

I feel like this approach is best, considering that each lock does precisely
one thing, and rarely need to be acquired at the same time. (the only place
being inside timecounter_read, and possibly in cyclecounter_update for the ixgbe)

> A kernel-thread should be scheduled periodically, will read the NIC counter
> and update cached_ns. It will use a _trylock variant. If the lock fails then
> its a hint that 'cached_ns' is getting updated somewhere, so just refresh the
> timer. Periodic update is needed to handle idle/bursty link conditions because
> the Rx/tx path may not run that often.
> 
> gettime uses read_seq_lock and reads 'cached_ns'. I mean we could also do a
> atomic_read(?)

As above cached_ns would not work. Timecounter structure isn't ticking on its
own. It converts the SYSTIM values to nanoseconds and keeps track of overflows.
(thereby allowing it to create nanoseconds since the epoch as a 64bit value,
which is what the user software expects). Atomic reads don't work because the
atomic values only have around 24bits of actual data. (we would need at least 64bits)

>
> 
> > path, because if we take too long the software sees it as a dropped
> timestamp.
> 
> What code-path is this?
>
For transmit only, inside the user software, it sends a packet then waits
for a copy of that packet on the receive errqueue which is copied by the kernel
and put there with the TX timestamp. If the driver takes too long to report the
timestamp then the user software times out and reports a time stamp as missing
even though it would later be returned by the driver. It is not an issue with
the RX timestamps.

> 
> > How would we only allow one app? Any app with permissions could call the
> ioctls.
> > I do agree that having too many ioctls is a problem. Even in cases
> > where PTP is
> 
> We don't. So this is what I'm thinking. Ideally speaking only 1 app should be
> adjusting the host-clock and 1 app per NIC if you adjust NIC's clock. Ignore
> the host-clock app. Once you enforce the rate-limit, driver will return -EBUSY
> if you exceed it. Too many EBUSYs will provide a hint on user-side.
>

This might be possible, but how would you code the solution? Software wouldn't
have any idea how often things are being run. Maybe have a timeout value that you
check for determining whether the last call came too soon?

> 
> Chetan