Open Source and information security mailing list archives
Date: Wed, 16 Oct 2013 18:42:08 -0700
From: Eric Dumazet <eric.dumazet@...il.com>
To: Neil Horman <nhorman@...driver.com>
Cc: Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
	sebastien.dugue@...l.net, Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, "H. Peter Anvin" <hpa@...or.com>,
	x86@...nel.org
Subject: Re: [PATCH] x86: Run checksumming in parallel accross multiple alu's

On Wed, 2013-10-16 at 20:34 -0400, Neil Horman wrote:

> So I went to reproduce these results, but was unable to (due to the fact that I
> only have a pretty jittery network to do testing across at the moment with
> these devices). So instead I figured that I would go back to just doing
> measurements with the module that I cobbled together (operating under the
> assumption that it would give me accurate, relatively jitter-free results; I've
> attached the module code for reference below). My results show slightly
> different behavior:
>
> Base results runs:
> 89417240
> 85170397
> 85208407
> 89422794
> 91645494
> 103655144
> 86063791
> 75647774
> 83502921
> 85847372
> AVG = 875 ns
>
> Prefetch only runs:
> 70962849
> 77555099
> 81898170
> 68249290
> 72636538
> 83039294
> 78561494
> 83393369
> 85317556
> 79570951
> AVG = 781 ns
>
> Parallel addition only runs:
> 42024233
> 44313064
> 48304416
> 64762297
> 42994259
> 41811628
> 55654282
> 64892958
> 55125582
> 42456403
> AVG = 510 ns
>
> Both prefetch and parallel addition:
> 41329930
> 40689195
> 61106622
> 46332422
> 49398117
> 52525171
> 49517101
> 61311153
> 43691814
> 49043084
> AVG = 494 ns
>
> For reference, each of the above large numbers is the number of nanoseconds
> taken to compute the checksum of a 4 KB buffer 100000 times. To get my average
> results, I ran the test in a loop 10 times, averaged them, and divided by
> 100000.
>
> Based on these, prefetching is obviously a good improvement, but not as good
> as parallel execution, and the winner by far is doing both.
>
> Thoughts?
> Neil

Your benchmark uses a single 4K page, so data is _super_ hot in cpu caches.
(prefetch should give no speedups; I am surprised it makes any difference)

Try now with 32 huge pages, to get 64 MBytes of working set.

Because in reality we never csum_partial() data that is in cpu cache.
(Unless the NIC preloaded the data into cpu cache before sending the
interrupt)

Really, if Sebastien got a speedup, it means that something fishy was
going on, like:

- A copy of data into some area of memory, prefilling cpu caches
- csum_partial() done while data is hot in cache.

This is exactly a "should not happen" scenario, because the csum in this
case should happen _while_ doing the copy, for 0 ns.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
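[For readers following the thread: the "parallel addition" variant being measured breaks the single serial dependency chain of additions so that a superscalar core can retire several adds per cycle. The patch itself does this in x86 assembly with ADC chains; the C sketch below only illustrates the multiple-accumulator idea and deliberately ignores the ones'-complement carry handling a real csum_partial() needs.]

```c
/* Illustration of the multiple-accumulator idea behind the patch (a C
 * sketch, not the actual x86 csum_partial implementation): two independent
 * accumulator chains let consecutive additions issue on separate ALUs. */
#include <stdint.h>
#include <stddef.h>

uint64_t sum_parallel(const uint64_t *p, size_t nwords)
{
    uint64_t s0 = 0, s1 = 0;
    size_t i;

    /* s0 and s1 carry no data dependency on each other, so the two
     * additions per iteration can execute in parallel. */
    for (i = 0; i + 1 < nwords; i += 2) {
        s0 += p[i];
        s1 += p[i + 1];
    }
    if (i < nwords)            /* odd trailing word */
        s0 += p[i];

    return s0 + s1;   /* ignores carries, unlike a real ones'-complement sum */
}
```

To test Eric's cache-cold point, the same routine would be run over a working set far larger than the last-level cache (his suggestion: 32 huge pages, 64 MBytes) rather than over one resident 4K page.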