linux-kernel - Re: [PATCH 0/6] crypto: SHA512 multibuffer implementation


Message-ID: <20160628083711.GB15985@gondor.apana.org.au>
Date:	Tue, 28 Jun 2016 16:37:11 +0800
From:	Herbert Xu <herbert@...dor.apana.org.au>
To:	Megha Dey <megha.dey@...el.com>
Cc:	tim.c.chen@...ux.intel.com, davem@...emloft.net,
	linux-crypto@...r.kernel.org, linux-kernel@...r.kernel.org,
	fenghua.yu@...el.com, Megha Dey <megha.dey@...ux.intel.com>
Subject: Re: [PATCH 0/6] crypto: SHA512 multibuffer implementation

On Mon, Jun 27, 2016 at 10:20:03AM -0700, Megha Dey wrote:
> From: Megha Dey <megha.dey@...ux.intel.com>
> 
> In this patch series, we introduce the multi-buffer crypto algorithm on
> x86_64 and apply it to SHA512 hash computation.  The multi-buffer technique
> takes advantage of the 8 data lanes in the AVX2 registers and allows
> computation to be performed on data from multiple jobs in parallel.
> This allows us to parallelize computations when data inter-dependency in
> a single crypto job prevents us from fully parallelizing our computations.
> The algorithm can be extended to other hashing and encryption schemes
> in the future.
> 
> On multi-buffer SHA512 computation with AVX2, we see throughput increase
> up to 2x over the existing x86_64 single buffer AVX2 algorithm.
> 
> The multi-buffer crypto algorithm is described in the following paper:
> Processing Multiple Buffers in Parallel to Increase Performance on
> Intel® Architecture Processors
> http://www.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html
> 
> The outline of the algorithm is sketched below:
> Any driver requesting the crypto service will place an async
> crypto request on the workqueue.  The multi-buffer crypto daemon will
> pull requests from the work queue and put each request in an empty data lane
> for multi-buffer crypto computation.  When all the empty lanes are filled,
> computation will commence on the jobs in parallel and the job with the
> shortest remaining buffer will get completed and be returned.  To prevent
> prolonged stalls when no new jobs are arriving, we will flush a crypto
> job if it has not been completed after a maximum allowable delay.
> 
> The multi-buffer algorithm necessitates mapping multiple scatter gather
> buffers to linear addresses simultaneously. The crypto daemon may need
> to sleep and yield the cpu to work on something else from time to time.
> We changed the code to avoid kmap_atomic for scatter-gather buffer
> mapping and instead take advantage of the fact that we can directly
> translate the buffer's address to its linear address on x86_64.
> To accommodate the fragmented nature of scatter-gather, we will keep
> submitting the next scatter-buffer fragment for a job for multi-buffer
> computation until a job is completed and no more buffer fragments remain.
> At that time we will pull a new job to fill the now empty data slot.
> We call a get_completed_job function to check whether there are other
> jobs that have been completed when we have no new job arrival,
> to prevent extraneous delay in returning any completed jobs.
> 
> The multi-buffer algorithm should be used for cases where crypto job
> submissions arrive at a reasonably high rate.  For a low crypto job
> submission rate, this algorithm will not be beneficial.  The reason is
> that at a low rate we do not fill the data lanes before the maximum
> allowable latency expires, so we end up flushing the jobs instead of
> processing them with all the data lanes full.  We miss the benefit of
> parallel computation and add delay to the processing of the crypto job
> at the same time.
> Some tuning of the maximum latency parameter may be needed to get the
> best performance.
> 
> Also added is a new mode in the tcrypt module to calculate the speed of the
> sha512_mb algorithm.
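
The lane scheduling outlined in the quoted cover letter can be sketched as a
small Python model.  This is only an illustration under stated assumptions,
not the code from the patch series: the names Job, LaneManager, submit, flush
and utilization are hypothetical, buffers are counted in abstract blocks, time
is an integer tick count, and the lane count and maximum delay are plain
parameters.  No SIMD is involved; the model only tracks which lanes hold work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    remaining: int  # blocks still to be hashed
    arrival: int    # submission time, in arbitrary ticks


class LaneManager:
    """Toy model of multi-buffer lane scheduling (hypothetical names)."""

    def __init__(self, num_lanes: int = 4, max_delay: int = 10):
        self.lanes: List[Optional[Job]] = [None] * num_lanes
        self.max_delay = max_delay
        self.lane_slots = 0  # lane-blocks spent, idle lanes included
        self.busy_slots = 0  # lane-blocks that did useful work

    def _run_until_one_completes(self) -> List[Job]:
        # Advance every occupied lane by the shortest remaining length,
        # so at least one job finishes and frees its lane.
        active = [job for job in self.lanes if job is not None]
        if not active:
            return []
        step = min(job.remaining for job in active)
        done = []
        for i, job in enumerate(self.lanes):
            if job is None:
                continue
            job.remaining -= step
            if job.remaining == 0:
                done.append(job)
                self.lanes[i] = None
        self.lane_slots += step * len(self.lanes)
        self.busy_slots += step * len(active)
        return done

    def submit(self, job: Job) -> List[Job]:
        # A lane is always free here: a full set of lanes is run
        # immediately, which completes at least one job.
        self.lanes[self.lanes.index(None)] = job
        if None not in self.lanes:
            return self._run_until_one_completes()
        return []

    def flush(self, now: int) -> List[Job]:
        # Run partially filled lanes while any job has waited too long.
        done: List[Job] = []
        while any(j is not None and now - j.arrival >= self.max_delay
                  for j in self.lanes):
            done += self._run_until_one_completes()
        return done

    def utilization(self) -> float:
        return self.busy_slots / self.lane_slots if self.lane_slots else 0.0


mgr = LaneManager(num_lanes=4, max_delay=10)
for name, blocks in [("a", 3), ("b", 5), ("c", 2)]:
    mgr.submit(Job(name, blocks, arrival=0))      # lanes not yet full
full = mgr.submit(Job("d", 4, arrival=0))         # fourth job fills the lanes
print([j.name for j in full])                     # ['c']
flushed = mgr.flush(now=10)                       # no new arrivals: flush
print([j.name for j in flushed])                  # ['a', 'd', 'b']
print(mgr.utilization())                          # 0.7
```

In this run the first pass uses all four lanes, while the flush works with
three, then two, then one occupied lane, so only 14 of the 20 lane-blocks do
useful work.  That is the low-submission-rate effect the cover letter warns
about: flushed jobs pay the wait and still leave lanes idle.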

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@...dor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt