linux-kernel - RFC: revert request for cpuidle patches e11538d1 and 69a37bea

Message-ID: <20130726173306.GB17985@jeder.rdu.redhat.com>
Date:	Fri, 26 Jul 2013 13:33:06 -0400
From:	Jeremy Eder <jeder@...hat.com>
To:	linux-kernel@...r.kernel.org
Cc:	rafael.j.wysocki@...el.com, riel@...hat.com,
	youquan.song@...el.com, paulmck@...ux.vnet.ibm.com,
	daniel.lezcano@...aro.org, arjan@...ux.intel.com,
	len.brown@...el.com
Subject: RFC:  revert request for cpuidle patches e11538d1 and 69a37bea

Hello,

We believe we've identified a particular commit to the cpuidle code that
seems to be impacting the performance of a variety of workloads.  The
simplest way to reproduce is the netperf TCP_RR test, so we're using that,
on a pair of Sandy Bridge based servers.  We also have data from a large
database setup where performance is also measurably/positively impacted,
though that test data isn't easily shareable.

Included below are test results from 3 test kernels:

kernel       reverts
-----------------------------------------------------------
1) vanilla   upstream (no reverts)

2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c

3) test      reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
                     e11538d1f03914eb92af5a1a378375c05ae8520c

In summary, netperf TCP_RR numbers improve by approximately 4% after
reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4.  When
69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency never
seems to get above 40%.  Taking that patch out gets C0 near 100% quite
often, and performance increases.

The below data are histograms representing the %c0 residency @ 1-second
sample rates (using turbostat), while under netperf test.

- If you look at the first 4 histograms, you can see %c0 residency almost
  entirely in the 30-40% and 40-50% bins.
- The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4,
  shows %c0 reaching the 80-90% and 90-100% bins.
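
For anyone who wants to produce comparable histograms from their own
turbostat output, here is a minimal sketch of the binning.  It is not the
script used for the data in this message; the function names and the
sample values are hypothetical, and it assumes the %c0 column has already
been extracted into a list of floats, one per 1-second sample.

```python
def c0_histogram(samples, bin_width=10.0, max_pct=100.0):
    """Count %c0 samples into fixed-width bins covering 0..max_pct."""
    nbins = int(max_pct / bin_width)
    counts = [0] * nbins
    for s in samples:
        # A sample of exactly max_pct belongs in the last bin.
        idx = min(int(s // bin_width), nbins - 1)
        counts[idx] += 1
    return counts


def format_histogram(counts, bin_width=10.0):
    """Render one line per bin: range, sample count, and a bar of asterisks."""
    lines = []
    for i, n in enumerate(counts):
        lo, hi = i * bin_width, (i + 1) * bin_width
        lines.append(f"{lo:10.4f} - {hi:10.4f} [{n:6d}]: {'*' * n}")
    return "\n".join(lines)


# Hypothetical %c0 samples, for illustration only (not measured data).
samples = [3.2, 35.1, 36.4, 38.9, 41.0, 99.7, 100.0]
print(format_histogram(c0_histogram(samples)))
```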

Below each kernel name are netperf TCP_RR trans/s numbers for the
particular kernel that can be disclosed publicly, comparing the 3 test
kernels.  We ran a 4th test with the vanilla kernel where we've also set
/dev/cpu_dma_latency=0 to show overall impact boosting single-threaded
TCP_RR performance over 11% above baseline.
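
For reference, the "c0 lock" is held through the kernel's PM QoS
interface: a process writes a 32-bit latency value (in microseconds) to
/dev/cpu_dma_latency, and the request stays in force only while the file
descriptor remains open.  Below is a minimal sketch of that, not the
tooling we used for these runs.  The helper name cpu_dma_latency() and its
path parameter are hypothetical; opening the device normally requires
root.

```python
import contextlib
import os
import struct


@contextlib.contextmanager
def cpu_dma_latency(latency_us=0, path="/dev/cpu_dma_latency"):
    """Hold a PM QoS latency request for the duration of the with-block."""
    fd = os.open(path, os.O_WRONLY)
    try:
        # The device accepts the value as a native-endian signed 32-bit integer.
        os.write(fd, struct.pack("=i", latency_us))
        yield fd
    finally:
        # Closing the descriptor drops the request.
        os.close(fd)


# Usage (as root), keeping the request alive while the benchmark runs:
#
#     with cpu_dma_latency(0):
#         run_benchmark()   # hypothetical placeholder for the workload
```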

3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):  
TCP_RR trans/s 54323.78
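
The "over 11%" figure can be checked against the two trans/s numbers
quoted in this message (54323.78 with the C0 lock, 48192.47 for the
vanilla kernel with no reverts).  A small sketch of the arithmetic; the
helper name pct_change() is just for illustration.

```python
def pct_change(baseline, value):
    """Relative change of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


vanilla = 48192.47   # 3.10-rc2 vanilla, no reverts
c0_lock = 54323.78   # 3.10-rc2 vanilla with /dev/cpu_dma_latency=0

print(f"{pct_change(vanilla, c0_lock):.1f}%")   # about 12.7%
```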

-----------------------------------------------------------
3.10-rc2 vanilla RX (no reverts)
TCP_RR trans/s 48192.47

Receiver %c0 
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     0]: 
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [    59]: ***********************************************************
   40.0000 -    50.0000 [     1]: *
   50.0000 -    60.0000 [     0]: 
   60.0000 -    70.0000 [     0]: 
   70.0000 -    80.0000 [     0]: 
   80.0000 -    90.0000 [     0]: 
   90.0000 -   100.0000 [     0]: 

Sender %c0
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     0]: 
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [    11]: ***********
   40.0000 -    50.0000 [    49]: *************************************************
   50.0000 -    60.0000 [     0]: 
   60.0000 -    70.0000 [     0]: 
   70.0000 -    80.0000 [     0]: 
   80.0000 -    90.0000 [     0]: 
   90.0000 -   100.0000 [     0]: 

-----------------------------------------------------------
3.10-rc2 perfteam2 RX (reverts commit
e11538d1f03914eb92af5a1a378375c05ae8520c)
TCP_RR trans/s 49698.69

Receiver %c0 
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     1]: *
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [    59]: ***********************************************************
   40.0000 -    50.0000 [     0]: 
   50.0000 -    60.0000 [     0]: 
   60.0000 -    70.0000 [     0]: 
   70.0000 -    80.0000 [     0]: 
   80.0000 -    90.0000 [     0]: 
   90.0000 -   100.0000 [     0]: 

Sender %c0 
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     0]: 
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [     2]: **
   40.0000 -    50.0000 [    58]: **********************************************************
   50.0000 -    60.0000 [     0]: 
   60.0000 -    70.0000 [     0]: 
   70.0000 -    80.0000 [     0]: 
   80.0000 -    90.0000 [     0]: 
   90.0000 -   100.0000 [     0]: 

-----------------------------------------------------------
3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 and
e11538d1f03914eb92af5a1a378375c05ae8520c)
TCP_RR trans/s 47766.95

Receiver %c0
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     1]: *
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [    27]: ***************************
   40.0000 -    50.0000 [     2]: **
   50.0000 -    60.0000 [     0]: 
   60.0000 -    70.0000 [     2]: **
   70.0000 -    80.0000 [     0]: 
   80.0000 -    90.0000 [     0]: 
   90.0000 -   100.0000 [    28]: ****************************

Sender %c0
    0.0000 -    10.0000 [     1]: *
   10.0000 -    20.0000 [     0]: 
   20.0000 -    30.0000 [     0]: 
   30.0000 -    40.0000 [    11]: ***********
   40.0000 -    50.0000 [     0]: 
   50.0000 -    60.0000 [     1]: *
   60.0000 -    70.0000 [     0]: 
   70.0000 -    80.0000 [     3]: ***
   80.0000 -    90.0000 [     7]: *******
   90.0000 -   100.0000 [    38]: **************************************

These results demonstrate gaining back the tendency of the CPU to stay in
more responsive, performant C-states (and thus yield measurably better
performance), by reverting commit 69a37beabf1f0a6705c08e879bdd5d82ff6486c4.

While taking into account the changing landscape with regard to CPU
governors, and both P- and C-states, we think that a single thread should
still be able to achieve maximum performance.  With the current upstream
code base, workloads with a low number of "hot" threads are not able to
achieve maximum performance "out of the box".

Also recently, Intel's LAD has posted upstream performance results that
include an interesting column in their table of results.  See upstream
commit 0a4db187a999, column #3 within the "Performance numbers" table.  It
seems known, even within Intel, that the deeper C-states incur a cost too
high to bear, as they've explicitly tested restricting the CPU to the
shallower C-states (C0 and C1).

-- Jeremy Eder