lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140106094523.GA4602@gmail.com>
Date:	Mon, 6 Jan 2014 10:45:23 +0100
From:	Andrea Mazzoleni <amadvance@...il.com>
To:	NeilBrown <neilb@...e.de>
Cc:	Andrea Mazzoleni <amadvance@...il.com>,
	linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
	linux-btrfs@...r.kernel.org, clm@...com, jbacik@...com
Subject: Re: [RFC] lib: raid: New RAID library supporting up to six parities

Hi Neil,

Thanks for your feedback. In the meantime I went further in developing and
I've just sent version 2 of the patch, that contains a preliminary btrfs
modification to use the new interface.

Please use this one for any kind of review because it contains a modification
of the interface to match better the btrfs use.
I'll now try to do something similar to the async_tx layer and to improve the
code documentation as you recommended.

Anyway, a good entry point to understand the code is to start from the
include/linux/raid/raid.h file. It contains the functions that external
modules should call with a complete description of them.

There is raid_par() used to compute parity, and raid_rec() to recover damaged
blocks. These two functions replace all the old xor_blocks and raid6 calls.

And there is the raid_sort() you mention. It's an helper function that can be
used to ensure that the blocks indexes are passed at the raid interface in
proper order. In existing code I saw that the indexes are often sorted before
calling raid6, with something like:

	if (faila > failb) {
		int tmp = failb;
		failb = faila;
		faila = tmp;
	}

To do the same with up to six failures, it's now required some kind of sort
function.

Ciao,
Andrea

On 01/06, NeilBrown wrote:
> On Fri,  3 Jan 2014 10:47:16 +0100 Andrea Mazzoleni <amadvance@...il.com>
> wrote:
> 
> > This patch adds a new lib/raid directory, containing a new RAID support
> > based on a Cauchy matrix working for up to six parities, and backward
> > compatible with the existing RAID6 support.
> 
> Hi Andrea,
>  thanks for this.  I suspect it is probably a good idea to include this as we
>  are very likely to get multi-parity RAID at some stage and having this will
>  be a good first step.
>  I don't really feel motivated to review all this code though... I'm hoping
>  that someone(s) else might can could comment.
> 
>  Also, while there are some good comments in there, I feel there could be
>  more :-)
>  e.g: what function(s) do I actually call to create or check the extra parity
>  blocks? I'm sure I could find out by reading the code but I'd rather read
>  some documentation.
>  And what is "raid_sort()" all about ... I didn't expect to find a sort
>  function in there (that can sort an array of length up-to 6).
> 
>  Also though would ultimately need to be accessible via the 'async_tx'
>  interface so that hardware-offload could be supported.  So I would need to
>  know that the design will fit nicely into that interface.
>  Providing the interface can certainly come later, but being sure that the
>  current code a good match would help now.
> 
>  So: anyone out there keen to review this code at give me a
>     Reviewed-by: 
>  ??
> 
> Thanks,
> NeilBrown
> 
> 
> 
> > 
> > It was developed for kernel 3.13-rc4, but it should work with any other
> > version because it's mostly formed of new files. The only modification
> > is about adding a new CONFIG_RAID_CAUCHY option in the "lib" configuration
> > section.
> > 
> > The main interface is defined in include/linux/raid/raid.h and provides
> > easy to use functions able to generate parities and to recover data.
> > This interface is different than the one provided by the RAID6 library,
> > because with more parities the number of recovering cases grows exponentially
> > and it's not feasible to have a different function for each one.
> > 
> > The library provides fast implementations using SSE2 and SSSE3 for x86/x64
> > and a portable C implementation working everythere.
> > If the RAID6 library is enabled in the kernel, its functionality is also used
> > to maintain the existing level of performance for the first two parities in
> > all the supported architectures.
> > 
> > At startup the module runs a very fast self test (about 1ms) to ensure that
> > the used functions are correct.
> > I verified that the module builds, loads and passes the self test in the x86,
> > and x64 architectures. I expects no problems in other platforms, but at now
> > they are not really tested. I only simulated similar conditions.
> > 
> > In the lib/raid/test directory are present also some user mode test programs:
> > selftest - Runs the same selftest executed at the module startup.
> > fulltest - Runs a more extensive test that checks all the built-in functions.
> > speetest - Runs a speed test in memory.
> > 
> > As a reference, in my icore7 2.7GHz the speedtest program reports:
> > 
> > ...
> > Speed test using 16 data buffers of 4096 bytes, for a total of 64 KiB.
> > Memory blocks have a displacement of 64 bytes to improve cache performance.
> > The reported value is the aggregate bandwidth of all data blocks in MiB/s,
> > not counting parity blocks.
> > 
> > Memory write speed using the C memset() function:
> >   memset   33518
> > 
> > RAID functions used for computing the parity:
> >             int8   int32   int64    sse2   sse2e   ssse3  ssse3e
> >     par1           11762   21450   44621
> >     par2            3520    6176   18100   20338
> >     par3     848                                    8009    9210
> >     par4     659                                    6518    7303
> >     par5     531                                    4931    5363
> >     par6     430                                    4069    4471
> > 
> > RAID functions used for recovering:
> >             int8   ssse3
> >     rec1     591    1126
> >     rec2     272     456
> >     rec3      80     305
> >     rec4      49     216
> >     rec5      34     151
> > ...
> > 
> > Legend:
> > parX functions to generate X parities
> > recX functions to recover X data blocks
> > int8 implemention based on 8 bits arithmetics
> > int32 implemention based on 32 bits arithmetics
> > int64 implemention based on 64 bits arithmetics
> > sse2 implemention based on SSE2
> > sse2e implemention based on SSE2 with 16 registers (x64)
> > ssse3 implemention based on SSSE3
> > ssse3e implemention based on SSSE3 with 16 registers (x64)
> > 
> > Signed-off-by: Andrea Mazzoleni <amadvance@...il.com>
> > ---
> > This a bunch of new code, and some context could help to understand what's
> > going on.
> > 
> > This is a port of the RAID engine that I'm currently using in my hobby project
> > called SnapRAID. This project was initially based on the Linux RAID6 library,
> > and then limited to support only two parities like the Linux kernel itself.
> > 
> > How to extend the RAID6 logic to support more parities was not so obvious,
> > and it was also a recurring argument in the linux-raid mailing list in the
> > past years. Unfortunately, with no general solution ever proposed.
> > 
> > It took some time, but finally I found a way to do it keeping backward
> > compatibility with RAID6, using a specially built Cauchy matrix.
> > 
> > After implementing this support for my hobby project I got this solution also
> > discussed in the linux-raid/linux-btrfs mailing list in November in the thread
> > "Triple parity and beyond":
> > 
> > http://thread.gmane.org/gmane.comp.file-systems.btrfs/30159
> > 
> > This patch is the implementation of such discussion, done porting my existing
> > code to the kernel environment.
> > 
> > The code compiles without warnings with gcc -Wall -Wextra, with the clang
> > analyzer, and it runs cleany with valgrind.
> > The checkpatch.pl robot still report some issues. The errors are all false
> > positives that I was not able to workaround. The remaining warnings are
> > intentionally not fixed because that would make the code more difficult to read,
> > or are others false positive.
> > 
> > Please let me know what do you think. Any kind of feedback is welcome.
> > 
> > Thanks,
> > Andrea
> > 
> >  include/linux/raid/raid.h |   86 +++
> >  lib/Kconfig               |   12 +
> >  lib/Makefile              |    1 +
> >  lib/raid/Makefile         |   14 +
> >  lib/raid/cpu.h            |   44 ++
> >  lib/raid/gf.h             |  109 ++++
> >  lib/raid/int.c            |  567 ++++++++++++++++
> >  lib/raid/internal.h       |  146 +++++
> >  lib/raid/mktables.c       |  338 ++++++++++
> >  lib/raid/module.c         |  437 +++++++++++++
> >  lib/raid/raid.c           |  425 ++++++++++++
> >  lib/raid/sort.c           |   72 +++
> >  lib/raid/test/Makefile    |   33 +
> >  lib/raid/test/combo.h     |  155 +++++
> >  lib/raid/test/fulltest.c  |   76 +++
> >  lib/raid/test/memory.c    |   79 +++
> >  lib/raid/test/memory.h    |   74 +++
> >  lib/raid/test/selftest.c  |   41 ++
> >  lib/raid/test/speedtest.c |  567 ++++++++++++++++
> >  lib/raid/test/test.c      |  314 +++++++++
> >  lib/raid/test/test.h      |   59 ++
> >  lib/raid/test/usermode.h  |   83 +++
> >  lib/raid/test/xor.c       |   41 ++
> >  lib/raid/x86.c            | 1569 +++++++++++++++++++++++++++++++++++++++++++++
> >  24 files changed, 5342 insertions(+)
> >  create mode 100644 include/linux/raid/raid.h
> >  create mode 100644 lib/raid/Makefile
> >  create mode 100644 lib/raid/cpu.h
> >  create mode 100644 lib/raid/gf.h
> >  create mode 100644 lib/raid/int.c
> >  create mode 100644 lib/raid/internal.h
> >  create mode 100644 lib/raid/mktables.c
> >  create mode 100644 lib/raid/module.c
> >  create mode 100644 lib/raid/raid.c
> >  create mode 100644 lib/raid/sort.c
> >  create mode 100644 lib/raid/test/Makefile
> >  create mode 100644 lib/raid/test/combo.h
> >  create mode 100644 lib/raid/test/fulltest.c
> >  create mode 100644 lib/raid/test/memory.c
> >  create mode 100644 lib/raid/test/memory.h
> >  create mode 100644 lib/raid/test/selftest.c
> >  create mode 100644 lib/raid/test/speedtest.c
> >  create mode 100644 lib/raid/test/test.c
> >  create mode 100644 lib/raid/test/test.h
> >  create mode 100644 lib/raid/test/usermode.h
> >  create mode 100644 lib/raid/test/xor.c
> >  create mode 100644 lib/raid/x86.c
> > 
> > diff --git a/include/linux/raid/raid.h b/include/linux/raid/raid.h
> > new file mode 100644
> > index 0000000..9b5e441
> > --- /dev/null
> > +++ b/include/linux/raid/raid.h
> > @@ -0,0 +1,86 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#ifndef __RAID_H
> > +#define __RAID_H
> > +
> > +#ifdef __KERNEL__ /* to build the user mode test */
> > +#include <linux/types.h>
> > +#endif
> > +
> > +/**
> > + * Max number of parity disks supported.
> > + */
> > +#define RAID_PARITY_MAX 6
> > +
> > +/**
> > + * Maximum number of data disks supported.
> > + */
> > +#define RAID_DATA_MAX 251
> > +
> > +/**
> > + * Computes the parity.
> > + *
> > + * @nd Number of data blocks.
> > + * @np Number of parities blocks to compute.
> > + * @size Size of the blocks pointed by @v. It must be a multipler of 64.
> > + * @v Vector of pointers to the blocks of data and parity.
> > + *   It has (@nd + @np) elements. The starting elements are the blocks for
> > + *   data, following with the parity blocks.
> > + *   Each blocks has @size bytes.
> > + */
> > +void raid_par(int nd, int np, size_t size, void **v);
> > +
> > +/**
> > + * Recovers failures in data and parity blocks.
> > + *
> > + * All the data and parity blocks marked as bad in the @id and @ip vector
> > + * are recovered and recomputed.
> > + *
> > + * The parities blocks to use for recovering are automatically selected from
> > + * the ones NOT present in the @ip vector.
> > + *
> > + * Ensure to have @nrd + @nrp <= @np, otherwise recovering is not possible.
> > + *
> > + * @nrd Number of failed data blocks to recover.
> > + * @id[] Vector of @nrd indexes of the data blocks to recover.
> > + *   The indexes start from 0. They must be in order.
> > + * @nrp Number of failed parity blocks to recover.
> > + * @ip[] Vector of @nrp indexes of the parity blocks to recover.
> > + *   The indexes start from 0. They must be in order.
> > + *   All the parities not specified here are assumed correct, and they are
> > + *   not recomputed.
> > + * @nd Number of data blocks.
> > + * @np Number of parity blocks.
> > + * @size Size of the blocks pointed by @v. It must be a multipler of 64.
> > + * @v Vector of pointers to the blocks of data and parity.
> > + *   It has (@nd + @np) elements. The starting elements are the blocks
> > + *   for data, following with the parity blocks.
> > + *   Each blocks has @size bytes.
> > + */
> > +void raid_rec(int nrd, int *id, int nrp, int *ip, int nd, int np, size_t size, void **v);
> > +
> > +/**
> > + * Sorts a small vector of integers.
> > + *
> > + * If you have block indexes not in order, you can use this function to sort
> > + * them before callign raid_rec().
> > + *
> > + * @n Number of integers. No more than RAID_PARITY_MAX.
> > + * @v Vector of integers.
> > + */
> > +void raid_sort(int n, int *v);
> > +
> > +#endif
> > +
> > diff --git a/lib/Kconfig b/lib/Kconfig
> > index 991c98b..a77ffbe 100644
> > --- a/lib/Kconfig
> > +++ b/lib/Kconfig
> > @@ -10,6 +10,18 @@ menu "Library routines"
> >  config RAID6_PQ
> >  	tristate
> >  
> > +config RAID_CAUCHY
> > +	tristate "RAID Cauchy functions"
> > +	help
> > +	  This option enables the RAID parity library based on a Cauchy matrix
> > +	  that supports up to six parities, and it's compatible with the
> > +	  existing RAID6 support.
> > +	  This library provides optimized functions for architectures with
> > +	  SSSE3 support.
> > +	  If the RAID6 module is enabled, it's automatically used to
> > +	  maintain the same performance level in all the architectures.
> > +	  Module will be called raid_cauchy.
> > +
> >  config BITREVERSE
> >  	tristate
> >  
> > diff --git a/lib/Makefile b/lib/Makefile
> > index a459c31..8b76716 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -79,6 +79,7 @@ obj-$(CONFIG_LZ4HC_COMPRESS) += lz4/
> >  obj-$(CONFIG_LZ4_DECOMPRESS) += lz4/
> >  obj-$(CONFIG_XZ_DEC) += xz/
> >  obj-$(CONFIG_RAID6_PQ) += raid6/
> > +obj-$(CONFIG_RAID_CAUCHY) += raid/
> >  
> >  lib-$(CONFIG_DECOMPRESS_GZIP) += decompress_inflate.o
> >  lib-$(CONFIG_DECOMPRESS_BZIP2) += decompress_bunzip2.o
> > diff --git a/lib/raid/Makefile b/lib/raid/Makefile
> > new file mode 100644
> > index 0000000..b8c1df5
> > --- /dev/null
> > +++ b/lib/raid/Makefile
> > @@ -0,0 +1,14 @@
> > +obj-$(CONFIG_RAID_CAUCHY) += raid_cauchy.o
> > +
> > +raid_cauchy-y	+= module.o raid.o tables.o int.o
> > +
> > +raid_cauchy-$(CONFIG_X86) += x86.o
> > +
> > +hostprogs-y	+= mktables
> > +
> > +quiet_cmd_mktable = TABLE   $@
> > +      cmd_mktable = $(obj)/mktables > $@ || ( rm -f $@ && exit 1 )
> > +
> > +targets += tables.c
> > +$(obj)/tables.c: $(obj)/mktables FORCE
> > +	$(call if_changed,mktable)
> > diff --git a/lib/raid/cpu.h b/lib/raid/cpu.h
> > new file mode 100644
> > index 0000000..3202fa9
> > --- /dev/null
> > +++ b/lib/raid/cpu.h
> > @@ -0,0 +1,44 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#ifndef __RAID_CPU_H
> > +#define __RAID_CPU_H
> > +
> > +#ifdef CONFIG_X86
> > +static inline int raid_cpu_has_sse2(void)
> > +{
> > +	return boot_cpu_has(X86_FEATURE_XMM2);
> > +}
> > +
> > +static inline int raid_cpu_has_ssse3(void)
> > +{
> > +	/* checks also for SSE2 */
> > +	/* likely it's implicit but it doesn't harm */
> > +	return boot_cpu_has(X86_FEATURE_XMM2)
> > +		&& boot_cpu_has(X86_FEATURE_SSSE3);
> > +}
> > +
> > +static inline int raid_cpu_has_avx2(void)
> > +{
> > +	/* checks also for SSE2 and SSSE3 */
> > +	/* likely it's implicit but it doesn't harm */
> > +	return boot_cpu_has(X86_FEATURE_XMM2)
> > +		&& boot_cpu_has(X86_FEATURE_SSSE3)
> > +		&& boot_cpu_has(X86_FEATURE_AVX)
> > +		&& boot_cpu_has(X86_FEATURE_AVX2);
> > +}
> > +#endif
> > +
> > +#endif
> > +
> > diff --git a/lib/raid/gf.h b/lib/raid/gf.h
> > new file mode 100644
> > index 0000000..f444e63
> > --- /dev/null
> > +++ b/lib/raid/gf.h
> > @@ -0,0 +1,109 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#ifndef __RAID_GF_H
> > +#define __RAID_GF_H
> > +
> > +/*
> > + * Galois field operations.
> > + *
> > + * Basic range checks are implemented using BUG_ON().
> > + */
> > +
> > +/*
> > + * GF a*b.
> > + */
> > +static __always_inline uint8_t mul(uint8_t a, uint8_t b)
> > +{
> > +	return gfmul[a][b];
> > +}
> > +
> > +/*
> > + * GF 1/a.
> > + * Not defined for a == 0.
> > + */
> > +static __always_inline uint8_t inv(uint8_t v)
> > +{
> > +	BUG_ON(v == 0); /* division by zero */
> > +
> > +	return gfinv[v];
> > +}
> > +
> > +/*
> > + * GF 2^a.
> > + */
> > +static __always_inline uint8_t pow2(int v)
> > +{
> > +	BUG_ON(v < 0 || v > 254); /* invalid exponent */
> > +
> > +	return gfexp[v];
> > +}
> > +
> > +/*
> > + * Gets the multiplication table for a specified value.
> > + */
> > +static __always_inline const uint8_t *table(uint8_t v)
> > +{
> > +	return gfmul[v];
> > +}
> > +
> > +/*
> > + * Gets the generator matrix coefficient for parity 'p' and disk 'd'.
> > + */
> > +static __always_inline uint8_t A(int p, int d)
> > +{
> > +	return gfgen[p][d];
> > +}
> > +
> > +/*
> > + * Dereference as uint8_t
> > + */
> > +#define v_8(p) (*(uint8_t *)&(p))
> > +
> > +/*
> > + * Dereference as uint32_t
> > + */
> > +#define v_32(p) (*(uint32_t *)&(p))
> > +
> > +/*
> > + * Dereference as uint64_t
> > + */
> > +#define v_64(p) (*(uint64_t *)&(p))
> > +
> > +/*
> > + * Multiply each byte of a uint32 by 2 in the GF(2^8).
> > + */
> > +static __always_inline uint32_t x2_32(uint32_t v)
> > +{
> > +	uint32_t mask = v & 0x80808080U;
> > +	mask = (mask << 1) - (mask >> 7);
> > +	v = (v << 1) & 0xfefefefeU;
> > +	v ^= mask & 0x1d1d1d1dU;
> > +	return v;
> > +}
> > +
> > +/*
> > + * Multiply each byte of a uint64 by 2 in the GF(2^8).
> > + */
> > +static __always_inline uint64_t x2_64(uint64_t v)
> > +{
> > +	uint64_t mask = v & 0x8080808080808080ULL;
> > +	mask = (mask << 1) - (mask >> 7);
> > +	v = (v << 1) & 0xfefefefefefefefeULL;
> > +	v ^= mask & 0x1d1d1d1d1d1d1d1dULL;
> > +	return v;
> > +}
> > +
> > +#endif
> > +
> > diff --git a/lib/raid/int.c b/lib/raid/int.c
> > new file mode 100644
> > index 0000000..cd1e147
> > --- /dev/null
> > +++ b/lib/raid/int.c
> > @@ -0,0 +1,567 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include "internal.h"
> > +#include "gf.h"
> > +
> > +/*
> > + * PAR1 (RAID5 with xor) 32bit C implementation
> > + */
> > +void raid_par1_int32(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	uint32_t p0;
> > +	uint32_t p1;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +
> > +	for (i = 0; i < size; i += 8) {
> > +		p0 = v_32(v[l][i]);
> > +		p1 = v_32(v[l][i+4]);
> > +		for (d = l-1; d >= 0; --d) {
> > +			p0 ^= v_32(v[d][i]);
> > +			p1 ^= v_32(v[d][i+4]);
> > +		}
> > +		v_32(p[i]) = p0;
> > +		v_32(p[i+4]) = p1;
> > +	}
> > +}
> > +
> > +/*
> > + * PAR1 (RAID5 with xor) 64bit C implementation
> > + */
> > +void raid_par1_int64(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	uint64_t p0;
> > +	uint64_t p1;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +
> > +	for (i = 0; i < size; i += 16) {
> > +		p0 = v_64(v[l][i]);
> > +		p1 = v_64(v[l][i+8]);
> > +		for (d = l-1; d >= 0; --d) {
> > +			p0 ^= v_64(v[d][i]);
> > +			p1 ^= v_64(v[d][i+8]);
> > +		}
> > +		v_64(p[i]) = p0;
> > +		v_64(p[i+8]) = p1;
> > +	}
> > +}
> > +
> > +/*
> > + * PAR2 (RAID6 with powers of 2) 32bit C implementation
> > + */
> > +void raid_par2_int32(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	uint32_t d0, q0, p0;
> > +	uint32_t d1, q1, p1;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +
> > +	for (i = 0; i < size; i += 8) {
> > +		q0 = p0 = v_32(v[l][i]);
> > +		q1 = p1 = v_32(v[l][i+4]);
> > +		for (d = l-1; d >= 0; --d) {
> > +			d0 = v_32(v[d][i]);
> > +			d1 = v_32(v[d][i+4]);
> > +
> > +			p0 ^= d0;
> > +			p1 ^= d1;
> > +
> > +			q0 = x2_32(q0);
> > +			q1 = x2_32(q1);
> > +
> > +			q0 ^= d0;
> > +			q1 ^= d1;
> > +		}
> > +		v_32(p[i]) = p0;
> > +		v_32(p[i+4]) = p1;
> > +		v_32(q[i]) = q0;
> > +		v_32(q[i+4]) = q1;
> > +	}
> > +}
> > +
> > +/*
> > + * PAR2 (RAID6 with powers of 2) 64bit C implementation
> > + */
> > +void raid_par2_int64(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	uint64_t d0, q0, p0;
> > +	uint64_t d1, q1, p1;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +
> > +	for (i = 0; i < size; i += 16) {
> > +		q0 = p0 = v_64(v[l][i]);
> > +		q1 = p1 = v_64(v[l][i+8]);
> > +		for (d = l-1; d >= 0; --d) {
> > +			d0 = v_64(v[d][i]);
> > +			d1 = v_64(v[d][i+8]);
> > +
> > +			p0 ^= d0;
> > +			p1 ^= d1;
> > +
> > +			q0 = x2_64(q0);
> > +			q1 = x2_64(q1);
> > +
> > +			q0 ^= d0;
> > +			q1 ^= d1;
> > +		}
> > +		v_64(p[i]) = p0;
> > +		v_64(p[i+8]) = p1;
> > +		v_64(q[i]) = q0;
> > +		v_64(q[i+8]) = q1;
> > +	}
> > +}
> > +
> > +/*
> > + * PAR3 (triple parity with Cauchy matrix) 8bit C implementation
> > + *
> > + * Note that instead of a generic multiplication table, likely resulting
> > + * in multiple cache misses, a precomputed table could be used.
> > + * But this is only a kind of reference function, and we are not really
> > + * interested in speed.
> > + */
> > +void raid_par3_int8(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	uint8_t d0, r0, q0, p0;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +
> > +	for (i = 0; i < size; i += 1) {
> > +		p0 = q0 = r0 = 0;
> > +		for (d = l; d > 0; --d) {
> > +			d0 = v_8(v[d][i]);
> > +
> > +			p0 ^= d0;
> > +			q0 ^= gfmul[d0][gfgen[1][d]];
> > +			r0 ^= gfmul[d0][gfgen[2][d]];
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		d0 = v_8(v[0][i]);
> > +
> > +		p0 ^= d0;
> > +		q0 ^= d0;
> > +		r0 ^= d0;
> > +
> > +		v_8(p[i]) = p0;
> > +		v_8(q[i]) = q0;
> > +		v_8(r[i]) = r0;
> > +	}
> > +}
> > +
> > +/*
> > + * PAR4 (quad parity with Cauchy matrix) 8bit C implementation
> > + *
> > + * Note that instead of a generic multiplication table, likely resulting
> > + * in multiple cache misses, a precomputed table could be used.
> > + * But this is only a kind of reference function, and we are not really
> > + * interested in speed.
> > + */
> > +void raid_par4_int8(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	uint8_t *s;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	uint8_t d0, s0, r0, q0, p0;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +	s = v[nd+3];
> > +
> > +	for (i = 0; i < size; i += 1) {
> > +		p0 = q0 = r0 = s0 = 0;
> > +		for (d = l; d > 0; --d) {
> > +			d0 = v_8(v[d][i]);
> > +
> > +			p0 ^= d0;
> > +			q0 ^= gfmul[d0][gfgen[1][d]];
> > +			r0 ^= gfmul[d0][gfgen[2][d]];
> > +			s0 ^= gfmul[d0][gfgen[3][d]];
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		d0 = v_8(v[0][i]);
> > +
> > +		p0 ^= d0;
> > +		q0 ^= d0;
> > +		r0 ^= d0;
> > +		s0 ^= d0;
> > +
> > +		v_8(p[i]) = p0;
> > +		v_8(q[i]) = q0;
> > +		v_8(r[i]) = r0;
> > +		v_8(s[i]) = s0;
> > +	}
> > +}
> > +
> > +/*
> > + * PAR5 (penta parity with Cauchy matrix) 8bit C implementation
> > + *
> > + * Note that instead of a generic multiplication table, likely resulting
> > + * in multiple cache misses, a precomputed table could be used.
> > + * But this is only a kind of reference function, and we are not really
> > + * interested in speed.
> > + */
> > +void raid_par5_int8(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	uint8_t *s;
> > +	uint8_t *t;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	uint8_t d0, t0, s0, r0, q0, p0;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +	s = v[nd+3];
> > +	t = v[nd+4];
> > +
> > +	for (i = 0; i < size; i += 1) {
> > +		p0 = q0 = r0 = s0 = t0 = 0;
> > +		for (d = l; d > 0; --d) {
> > +			d0 = v_8(v[d][i]);
> > +
> > +			p0 ^= d0;
> > +			q0 ^= gfmul[d0][gfgen[1][d]];
> > +			r0 ^= gfmul[d0][gfgen[2][d]];
> > +			s0 ^= gfmul[d0][gfgen[3][d]];
> > +			t0 ^= gfmul[d0][gfgen[4][d]];
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		d0 = v_8(v[0][i]);
> > +
> > +		p0 ^= d0;
> > +		q0 ^= d0;
> > +		r0 ^= d0;
> > +		s0 ^= d0;
> > +		t0 ^= d0;
> > +
> > +		v_8(p[i]) = p0;
> > +		v_8(q[i]) = q0;
> > +		v_8(r[i]) = r0;
> > +		v_8(s[i]) = s0;
> > +		v_8(t[i]) = t0;
> > +	}
> > +}
> > +
> > +/*
> > + * PAR6 (hexa parity with Cauchy matrix) 8bit C implementation
> > + *
> > + * Note that instead of a generic multiplication table, likely resulting
> > + * in multiple cache misses, a precomputed table could be used.
> > + * But this is only a kind of reference function, and we are not really
> > + * interested in speed.
> > + */
> > +void raid_par6_int8(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	uint8_t *s;
> > +	uint8_t *t;
> > +	uint8_t *u;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	uint8_t d0, u0, t0, s0, r0, q0, p0;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +	s = v[nd+3];
> > +	t = v[nd+4];
> > +	u = v[nd+5];
> > +
> > +	for (i = 0; i < size; i += 1) {
> > +		p0 = q0 = r0 = s0 = t0 = u0 = 0;
> > +		for (d = l; d > 0; --d) {
> > +			d0 = v_8(v[d][i]);
> > +
> > +			p0 ^= d0;
> > +			q0 ^= gfmul[d0][gfgen[1][d]];
> > +			r0 ^= gfmul[d0][gfgen[2][d]];
> > +			s0 ^= gfmul[d0][gfgen[3][d]];
> > +			t0 ^= gfmul[d0][gfgen[4][d]];
> > +			u0 ^= gfmul[d0][gfgen[5][d]];
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		d0 = v_8(v[0][i]);
> > +
> > +		p0 ^= d0;
> > +		q0 ^= d0;
> > +		r0 ^= d0;
> > +		s0 ^= d0;
> > +		t0 ^= d0;
> > +		u0 ^= d0;
> > +
> > +		v_8(p[i]) = p0;
> > +		v_8(q[i]) = q0;
> > +		v_8(r[i]) = r0;
> > +		v_8(s[i]) = s0;
> > +		v_8(t[i]) = t0;
> > +		v_8(u[i]) = u0;
> > +	}
> > +}
> > +
> > +/*
> > + * Recover failure of one data block at index id[0] using parity at index
> > + * ip[0] for any RAID level.
> > + *
> > + * Starting from the equation:
> > + *
> > + * Pd = A[ip[0],id[0]] * Dx
> > + *
> > + * and solving we get:
> > + *
> > + * Dx = A[ip[0],id[0]]^-1 * Pd
> > + */
> > +void raid_rec1_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *pa;
> > +	const uint8_t *T;
> > +	uint8_t G;
> > +	uint8_t V;
> > +	size_t i;
> > +
> > +	(void)nr; /* unused, it's always 1 */
> > +
> > +	/* if it's RAID5 uses the faster function */
> > +	if (ip[0] == 0) {
> > +		raid_rec1_par1(id, nd, size, vv);
> > +		return;
> > +	}
> > +
> > +#ifdef RAID_USE_RAID6_PQ
> > +	/* if it's RAID6 recovering with Q uses the faster function */
> > +	if (ip[0] == 1) {
> > +		raid6_datap_recov(nd + 2, size, id[0], vv);
> > +		return;
> > +	}
> > +#endif
> > +
> > +	/* setup the coefficients matrix */
> > +	G = A(ip[0], id[0]);
> > +
> > +	/* invert it to solve the system of linear equations */
> > +	V = inv(G);
> > +
> > +	/* get multiplication tables */
> > +	T = table(V);
> > +
> > +	/* compute delta parity */
> > +	raid_delta_gen(1, id, ip, nd, size, vv);
> > +
> > +	p = v[nd+ip[0]];
> > +	pa = v[id[0]];
> > +
> > +	for (i = 0; i < size; ++i) {
> > +		/* delta */
> > +		uint8_t Pd = p[i] ^ pa[i];
> > +
> > +		/* reconstruct */
> > +		pa[i] = T[Pd];
> > +	}
> > +}
> > +
> > +/*
> > + * Recover failure of two data blocks at indexes id[0],id[1] using parity at
> > + * indexes ip[0],ip[1] for any RAID level.
> > + *
> > + * Starting from the equations:
> > + *
> > + * Pd = A[ip[0],id[0]] * Dx + A[ip[0],id[1]] * Dy
> > + * Qd = A[ip[1],id[0]] * Dx + A[ip[1],id[1]] * Dy
> > + *
> > + * we solve inverting the coefficients matrix.
> > + */
> > +void raid_rec2_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *pa;
> > +	uint8_t *q;
> > +	uint8_t *qa;
> > +	const int N = 2;
> > +	const uint8_t *T[N][N];
> > +	uint8_t G[N*N];
> > +	uint8_t V[N*N];
> > +	size_t i;
> > +	int j, k;
> > +
> > +	(void)nr; /* unused, it's always 2 */
> > +
> > +	/* if it's RAID6 recovering with P and Q uses the faster function */
> > +	if (ip[0] == 0 && ip[1] == 1) {
> > +#ifdef RAID_USE_RAID6_PQ
> > +		raid6_2data_recov(nd + 2, size, id[0], id[1], vv);
> > +#else
> > +		raid_rec2_par2(id, ip, nd, size, vv);
> > +#endif
> > +		return;
> > +	}
> > +
> > +	/* setup the coefficients matrix */
> > +	for (j = 0; j < N; ++j)
> > +		for (k = 0; k < N; ++k)
> > +			G[j*N+k] = A(ip[j], id[k]);
> > +
> > +	/* invert it to solve the system of linear equations */
> > +	raid_invert(G, V, N);
> > +
> > +	/* get multiplication tables */
> > +	for (j = 0; j < N; ++j)
> > +		for (k = 0; k < N; ++k)
> > +			T[j][k] = table(V[j*N+k]);
> > +
> > +	/* compute delta parity */
> > +	raid_delta_gen(2, id, ip, nd, size, vv);
> > +
> > +	p = v[nd+ip[0]];
> > +	q = v[nd+ip[1]];
> > +	pa = v[id[0]];
> > +	qa = v[id[1]];
> > +
> > +	for (i = 0; i < size; ++i) {
> > +		/* delta */
> > +		uint8_t Pd = p[i] ^ pa[i];
> > +		uint8_t Qd = q[i] ^ qa[i];
> > +
> > +		/* reconstruct */
> > +		pa[i] = T[0][0][Pd] ^ T[0][1][Qd];
> > +		qa[i] = T[1][0][Pd] ^ T[1][1][Qd];
> > +	}
> > +}
> > +
> > +/*
> > + * Recover failure of N data blocks at indexes id[N] using parity at indexes
> > + * ip[N] for any RAID level.
> > + *
> > + * Starting from the N equations, with 0<=i<N :
> > + *
> > + * PD[i] = sum(A[ip[i],id[j]] * D[i]) 0<=j<N
> > + *
> > + * we solve inverting the coefficients matrix.
> > + *
> > + * Note that referring at previous equations you have:
> > + * PD[0] = Pd, PD[1] = Qd, PD[2] = Rd, ...
> > + * D[0] = Dx, D[1] = Dy, D[2] = Dz, ...
> > + */
> > +void raid_recX_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p[RAID_PARITY_MAX];
> > +	uint8_t *pa[RAID_PARITY_MAX];
> > +	const uint8_t *T[RAID_PARITY_MAX][RAID_PARITY_MAX];
> > +	uint8_t G[RAID_PARITY_MAX*RAID_PARITY_MAX];
> > +	uint8_t V[RAID_PARITY_MAX*RAID_PARITY_MAX];
> > +	size_t i;
> > +	int j, k;
> > +
> > +	/* setup the coefficients matrix */
> > +	for (j = 0; j < nr; ++j)
> > +		for (k = 0; k < nr; ++k)
> > +			G[j*nr+k] = A(ip[j], id[k]);
> > +
> > +	/* invert it to solve the system of linear equations */
> > +	raid_invert(G, V, nr);
> > +
> > +	/* get multiplication tables */
> > +	for (j = 0; j < nr; ++j)
> > +		for (k = 0; k < nr; ++k)
> > +			T[j][k] = table(V[j*nr+k]);
> > +
> > +	/* compute delta parity */
> > +	raid_delta_gen(nr, id, ip, nd, size, vv);
> > +
> > +	for (j = 0; j < nr; ++j) {
> > +		p[j] = v[nd+ip[j]];
> > +		pa[j] = v[id[j]];
> > +	}
> > +
> > +	for (i = 0; i < size; ++i) {
> > +		uint8_t PD[RAID_PARITY_MAX];
> > +
> > +		/* delta */
> > +		for (j = 0; j < nr; ++j)
> > +			PD[j] = p[j][i] ^ pa[j][i];
> > +
> > +		/* reconstruct */
> > +		for (j = 0; j < nr; ++j) {
> > +			uint8_t b = 0;
> > +			for (k = 0; k < nr; ++k)
> > +				b ^= T[j][k][PD[k]];
> > +			pa[j][i] = b;
> > +		}
> > +	}
> > +}
> > +
> > diff --git a/lib/raid/internal.h b/lib/raid/internal.h
> > new file mode 100644
> > index 0000000..7d84458
> > --- /dev/null
> > +++ b/lib/raid/internal.h
> > @@ -0,0 +1,146 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#ifndef __RAID_INTERNAL_H
> > +#define __RAID_INTERNAL_H
> > +
> > +/*
> > + * Includes anything required for compatibility.
> > + */
> > +#ifdef __KERNEL__ /* to build the user mode test */
> > +
> > +#include <linux/module.h>
> > +#include <linux/kconfig.h> /* for IS_* macros */
> > +#include <linux/export.h> /* for EXPORT_SYMBOL/EXPORT_SYMBOL_GPL */
> > +#include <linux/bug.h> /* for BUG_ON */
> > +#include <linux/gfp.h> /* for __get_free_pages */
> > +#include <linux/raid/xor.h> /* for xor_blocks */
> > +
> > +#ifdef CONFIG_X86
> > +#include <asm/i387.h> /* for kernel_fpu_begin/end() */
> > +#endif
> > +
> > +/* if we can use the RAID6 library */
> > +#if IS_BUILTIN(CONFIG_RAID6_PQ) \
> > +	|| (IS_MODULE(CONFIG_RAID6_PQ) && IS_MODULE(CONFIG_RAID6_CAUCHY))
> > +#define RAID_USE_RAID6_PQ 1
> > +#include <linux/raid/pq.h> /* for tables/functions */
> > +#endif
> > +
> > +#else
> > +
> > +#include "test/usermode.h"
> > +
> > +#endif
> > +
> > +/*
> > + * Include the main header.
> > + */
> > +#include <linux/raid/raid.h>
> > +
> > +/*
> > + * Internal functions.
> > + *
> > + * These are intented to provide access for testing.
> > + */
> > +void raid_init(void);
> > +int raid_selftest(void);
> > +int raid_speedtest(void);
> > +void raid_par_ref(int nd, int np, size_t size, void **vv);
> > +void raid_invert(uint8_t *M, uint8_t *V, int n);
> > +void raid_delta_gen(int nr, int *id, int *ip, int nd, size_t size, void **v);
> > +void raid_rec1_par1(int *id, int nd, size_t size, void **v);
> > +void raid_rec2_par2(int *id, int *ip, int nd, size_t size, void **vv);
> > +void raid_par1_xorblocks(int nd, size_t size, void **v);
> > +void raid_par1_int32(int nd, size_t size, void **vv);
> > +void raid_par1_int64(int nd, size_t size, void **vv);
> > +void raid_par1_sse2(int nd, size_t size, void **vv);
> > +void raid_par2_raid6(int nd, size_t size, void **vv);
> > +void raid_par2_int32(int nd, size_t size, void **vv);
> > +void raid_par2_int64(int nd, size_t size, void **vv);
> > +void raid_par2_sse2(int nd, size_t size, void **vv);
> > +void raid_par2_sse2ext(int nd, size_t size, void **vv);
> > +void raid_par3_int8(int nd, size_t size, void **vv);
> > +void raid_par3_ssse3(int nd, size_t size, void **vv);
> > +void raid_par3_ssse3ext(int nd, size_t size, void **vv);
> > +void raid_par4_int8(int nd, size_t size, void **vv);
> > +void raid_par4_ssse3(int nd, size_t size, void **vv);
> > +void raid_par4_ssse3ext(int nd, size_t size, void **vv);
> > +void raid_par5_int8(int nd, size_t size, void **vv);
> > +void raid_par5_ssse3(int nd, size_t size, void **vv);
> > +void raid_par5_ssse3ext(int nd, size_t size, void **vv);
> > +void raid_par6_int8(int nd, size_t size, void **vv);
> > +void raid_par6_ssse3(int nd, size_t size, void **vv);
> > +void raid_par6_ssse3ext(int nd, size_t size, void **vv);
> > +void raid_rec1_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv);
> > +void raid_rec2_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv);
> > +void raid_recX_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv);
> > +void raid_rec1_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv);
> > +void raid_rec2_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv);
> > +void raid_recX_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv);
> > +
> > +/*
> > + * Internal forwarders.
> > + */
> > +extern void (*raid_par_ptr[RAID_PARITY_MAX])(
> > +	int nd, size_t size, void **vv);
> > +extern void (*raid_rec_ptr[RAID_PARITY_MAX])(
> > +	int nr, int *id, int *ip, int nd, size_t size, void **vv);
> > +
> > +/*
> > + * Tables.
> > + *
> > + * Uses RAID6 tables if available, otherwise the ones in tables.c.
> > + */
> > +#ifdef RAID_USE_RAID6_PQ
> > +#define gfmul raid6_gfmul
> > +#define gfinv raid6_gfinv
> > +#define gfexp raid6_gfexp
> > +#else
> > +extern const uint8_t raid_gfmul[256][256] __aligned(256);
> > +extern const uint8_t raid_gfexp[256] __aligned(256);
> > +extern const uint8_t raid_gfinv[256] __aligned(256);
> > +#define gfmul raid_gfmul
> > +#define gfexp raid_gfexp
> > +#define gfinv raid_gfinv
> > +#endif
> > +
> > +extern const uint8_t raid_gfcauchy[6][256] __aligned(256);
> > +extern const uint8_t raid_gfcauchypshufb[251][4][2][16] __aligned(256);
> > +extern const uint8_t raid_gfmulpshufb[256][2][16] __aligned(256);
> > +#define gfgen raid_gfcauchy
> > +#define gfgenpshufb raid_gfcauchypshufb
> > +#define gfmulpshufb raid_gfmulpshufb
> > +
> > +/*
> > + * Assembler blocks.
> > + */
> > +#ifdef __KERNEL__ /* to build the user mode test */
> > +#define asm_begin() \
> > +	kernel_fpu_begin()
> > +
> > +#define asm_end() \
> > +	do { \
> > +		asm volatile("sfence" : : : "memory"); \
> > +		kernel_fpu_end(); \
> > +	} while (0)
> > +#else
> > +#define asm_begin() \
> > +	do { } while (0)
> > +#define asm_end() \
> > +	asm volatile("sfence" : : : "memory")
> > +#endif
> > +
> > +#endif
> > +
> > diff --git a/lib/raid/mktables.c b/lib/raid/mktables.c
> > new file mode 100644
> > index 0000000..9c8e0e0
> > --- /dev/null
> > +++ b/lib/raid/mktables.c
> > @@ -0,0 +1,338 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include <stdio.h>
> > +#include <stdint.h>
> > +
> > +/**
> > + * Multiplication in GF(2^8).
> > + */
> > +static uint8_t gfmul(uint8_t a, uint8_t b)
> > +{
> > +	uint8_t v;
> > +
> > +	v = 0;
> > +	while (b)  {
> > +		if ((b & 1) != 0)
> > +			v ^= a;
> > +
> > +		if ((a & 0x80) != 0) {
> > +			a <<= 1;
> > +			a ^= 0x1d;
> > +		} else {
> > +			a <<= 1;
> > +		}
> > +
> > +		b >>= 1;
> > +	}
> > +
> > +	return v;
> > +}
> > +
> > +/**
> > + * Inversion table in GF(2^8).
> > + */
> > +uint8_t gfinv[256];
> > +
> > +/**
> > + * Number of parity.
> > + * This is the number of rows of the generation matrix.
> > + */
> > +#define PARITY 6
> > +
> > +/**
> > + * Number of disks.
> > + * This is the number of columns of the generation matrix.
> > + */
> > +#define DISK (257-PARITY)
> > +
> > +/**
> > + * Setup the Cauchy matrix used to generate the parity.
> > + */
> > +static void set_cauchy(uint8_t *matrix)
> > +{
> > +	int i, j;
> > +	uint8_t inv_x, y;
> > +
> > +	/*
> > +	 * First row is formed by all 1.
> > +	 *
> > +	 * This is an Extended Cauchy matrix built from a Cauchy matrix
> > +	 * adding the first row of all 1.
> > +	 */
> > +	for (i = 0; i < DISK; ++i)
> > +		matrix[0*DISK+i] = 1;
> > +
> > +	/*
> > +	 * Second row is formed by power of 2^i.
> > +	 *
> > +	 * This is the first row of the Cauchy matrix.
> > +	 *
> > +	 * Each element of the Cauchy matrix is in the form 1/(xi+yj)
> > +	 * where all xi, and yj must be different.
> > +	 *
> > +	 * Choosing xi = 2^-i and y0 = 0, we obtain for the first row:
> > +	 *
> > +	 * 1/(xi+y0) = 1/(2^-i + 0) = 2^i
> > +	 *
> > +	 * with 2^-i != 0 for any i
> > +	 */
> > +	inv_x = 1;
> > +	for (i = 0; i < DISK; ++i) {
> > +		matrix[1*DISK+i] = inv_x;
> > +		inv_x = gfmul(2, inv_x);
> > +	}
> > +
> > +	/*
> > +	 * Next rows of the Cauchy matrix.
> > +	 *
> > +	 * Continue forming the Cauchy matrix with yj = 2^j obtaining :
> > +	 *
> > +	 * 1/(xi+yj) = 1/(2^-i + 2^j)
> > +	 *
> > +	 * with xi != yj for any i,j with i>=0,j>=1,i+j<255
> > +	 */
> > +	y = 2;
> > +	for (j = 0; j < PARITY-2; ++j) {
> > +		inv_x = 1;
> > +		for (i = 0; i < DISK; ++i) {
> > +			uint8_t x = gfinv[inv_x];
> > +			matrix[(j+2)*DISK+i] = gfinv[y ^ x];
> > +			inv_x = gfmul(2, inv_x);
> > +		}
> > +
> > +		y = gfmul(2, y);
> > +	}
> > +
> > +	/*
> > +	 * Adjusts the matrix multipling each row for
> > +	 * the inverse of the first element in the row.
> > +	 *
> > +	 * This operation doesn't invalidate the property that all the square
> > +	 * submatrices are not singular.
> > +	 */
> > +	for (j = 0; j < PARITY-2; ++j) {
> > +		uint8_t f = gfinv[matrix[(j+2)*DISK]];
> > +
> > +		for (i = 0; i < DISK; ++i)
> > +			matrix[(j+2)*DISK+i] = gfmul(matrix[(j+2)*DISK+i], f);
> > +	}
> > +}
> > +
> > +/**
> > + * Next power of 2.
> > + */
> > +static unsigned np(unsigned v)
> > +{
> > +	--v;
> > +	v |= v >> 1;
> > +	v |= v >> 2;
> > +	v |= v >> 4;
> > +	v |= v >> 8;
> > +	v |= v >> 16;
> > +	++v;
> > +
> > +	return v;
> > +}
> > +
> > +int main(void)
> > +{
> > +	uint8_t v;
> > +	int i, j, k, p;
> > +	uint8_t matrix[PARITY * 256];
> > +
> > +	printf("/*\n");
> > +	printf(" * Copyright (C) 2013 Andrea Mazzoleni\n");
> > +	printf(" *\n");
> > +	printf(" * This program is free software: you can redistribute it and/or modify\n");
> > +	printf(" * it under the terms of the GNU General Public License as published by\n");
> > +	printf(" * the Free Software Foundation, either version 2 of the License, or\n");
> > +	printf(" * (at your option) any later version.\n");
> > +	printf(" *\n");
> > +	printf(" * This program is distributed in the hope that it will be useful,\n");
> > +	printf(" * but WITHOUT ANY WARRANTY; without even the implied warranty of\n");
> > +	printf(" * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n");
> > +	printf(" * GNU General Public License for more details.\n");
> > +	printf(" */\n");
> > +	printf("\n");
> > +
> > +	printf("#include \"internal.h\"\n");
> > +	printf("\n");
> > +
> > +	/* a*b */
> > +	printf("#ifndef RAID_USE_RAID6_PQ\n");
> > +	printf("const uint8_t __aligned(256) raid_gfmul[256][256] =\n");
> > +	printf("{\n");
> > +	for (i = 0; i < 256; ++i) {
> > +		printf("\t{\n");
> > +		for (j = 0; j < 256; ++j) {
> > +			if (j % 8 == 0)
> > +				printf("\t\t");
> > +			v = gfmul(i, j);
> > +			if (v == 1)
> > +				gfinv[i] = j;
> > +			printf("0x%02x,", (unsigned)v);
> > +			if (j % 8 == 7)
> > +				printf("\n");
> > +			else
> > +				printf(" ");
> > +		}
> > +		printf("\t},\n");
> > +	}
> > +	printf("};\n");
> > +	printf("EXPORT_SYMBOL(raid_gfmul);\n");
> > +	printf("#endif\n");
> > +	printf("\n");
> > +
> > +	/* 2^a */
> > +	printf("#ifndef RAID_USE_RAID6_PQ\n");
> > +	printf("const uint8_t __aligned(256) raid_gfexp[256] =\n");
> > +	printf("{\n");
> > +	v = 1;
> > +	for (i = 0; i < 256; ++i) {
> > +		if (i % 8 == 0)
> > +			printf("\t");
> > +		printf("0x%02x,", v);
> > +		v = gfmul(v, 2);
> > +		if (i % 8 == 7)
> > +			printf("\n");
> > +		else
> > +			printf(" ");
> > +	}
> > +	printf("};\n");
> > +	printf("EXPORT_SYMBOL(raid_gfexp);\n");
> > +	printf("#endif\n");
> > +	printf("\n");
> > +
> > +	/* 1/a */
> > +	printf("#ifndef RAID_USE_RAID6_PQ\n");
> > +	printf("const uint8_t __aligned(256) raid_gfinv[256] =\n");
> > +	printf("{\n");
> > +	printf("\t/* note that the first element is not significative */\n");
> > +	for (i = 0; i < 256; ++i) {
> > +		if (i % 8 == 0)
> > +			printf("\t");
> > +		if (i == 0)
> > +			v = 0;
> > +		else
> > +			v = gfinv[i];
> > +		printf("0x%02x,", v);
> > +		if (i % 8 == 7)
> > +			printf("\n");
> > +		else
> > +			printf(" ");
> > +	}
> > +	printf("};\n");
> > +	printf("EXPORT_SYMBOL(raid_gfinv);\n");
> > +	printf("#endif\n");
> > +	printf("\n");
> > +
> > +	/* cauchy matrix */
> > +	set_cauchy(matrix);
> > +
> > +	printf("/**\n");
> > +	printf(" * Cauchy matrix used to generate parity.\n");
> > +	printf(" * This matrix is valid for up to %u parity with %u data disks.\n", PARITY, DISK);
> > +	printf(" *\n");
> > +	for (p = 0; p < PARITY; ++p) {
> > +		printf(" * ");
> > +		for (i = 0; i < DISK; ++i)
> > +			printf("%02x ", matrix[p*DISK+i]);
> > +		printf("\n");
> > +	}
> > +	printf(" */\n");
> > +	printf("const uint8_t __aligned(256) raid_gfcauchy[%u][256] =\n", PARITY);
> > +	printf("{\n");
> > +	for (p = 0; p < PARITY; ++p) {
> > +		printf("\t{\n");
> > +		for (i = 0; i < DISK; ++i) {
> > +			if (i % 8 == 0)
> > +				printf("\t\t");
> > +			printf("0x%02x,", matrix[p*DISK+i]);
> > +			if (i % 8 == 7)
> > +				printf("\n");
> > +			else
> > +				printf(" ");
> > +		}
> > +		printf("\n\t},\n");
> > +	}
> > +	printf("};\n");
> > +	printf("EXPORT_SYMBOL(raid_gfcauchy);\n");
> > +	printf("\n");
> > +
> > +	printf("#ifdef CONFIG_X86\n");
> > +	printf("/**\n");
> > +	printf(" * PSHUFB tables for the Cauchy matrix.\n");
> > +	printf(" *\n");
> > +	printf(" * Indexes are [DISK][PARITY - 2][LH].\n");
> > +	printf(" * Where DISK is from 0 to %u, PARITY from 2 to %u, LH from 0 to 1.\n", DISK - 1, PARITY - 1);
> > +	printf(" */\n");
> > +	printf("const uint8_t __aligned(256) raid_gfcauchypshufb[%u][%u][2][16] =\n", DISK, np(PARITY - 2));
> > +	printf("{\n");
> > +	for (i = 0; i < DISK; ++i) {
> > +		printf("\t{\n");
> > +		for (p = 2; p < PARITY; ++p) {
> > +			printf("\t\t{\n");
> > +			for (j = 0; j < 2; ++j) {
> > +				printf("\t\t\t{ ");
> > +				for (k = 0; k < 16; ++k) {
> > +					v = gfmul(matrix[p*DISK+i], k);
> > +					if (j == 1)
> > +						v = gfmul(v, 16);
> > +					printf("0x%02x", (unsigned)v);
> > +					if (k != 15)
> > +						printf(", ");
> > +				}
> > +				printf(" },\n");
> > +			}
> > +			printf("\t\t},\n");
> > +		}
> > +		printf("\t},\n");
> > +	}
> > +	printf("};\n");
> > +	printf("EXPORT_SYMBOL(raid_gfcauchypshufb);\n");
> > +	printf("#endif\n\n");
> > +
> > +	printf("#ifdef CONFIG_X86\n");
> > +	printf("/**\n");
> > +	printf(" * PSHUFB tables for generic multiplication.\n");
> > +	printf(" *\n");
> > +	printf(" * Indexes are [MULTIPLER][LH].\n");
> > +	printf(" * Where MULTIPLER is from 0 to 255, LH from 0 to 1.\n");
> > +	printf(" */\n");
> > +	printf("const uint8_t __aligned(256) raid_gfmulpshufb[256][2][16] =\n");
> > +	printf("{\n");
> > +	for (i = 0; i < 256; ++i) {
> > +		printf("\t{\n");
> > +		for (j = 0; j < 2; ++j) {
> > +			printf("\t\t{ ");
> > +			for (k = 0; k < 16; ++k) {
> > +				v = gfmul(i, k);
> > +				if (j == 1)
> > +					v = gfmul(v, 16);
> > +				printf("0x%02x", (unsigned)v);
> > +				if (k != 15)
> > +					printf(", ");
> > +			}
> > +			printf(" },\n");
> > +		}
> > +		printf("\t},\n");
> > +	}
> > +	printf("};\n");
> > +	printf("EXPORT_SYMBOL(raid_gfmulpshufb);\n");
> > +	printf("#endif\n\n");
> > +
> > +	return 0;
> > +}
> > +
> > diff --git a/lib/raid/module.c b/lib/raid/module.c
> > new file mode 100644
> > index 0000000..3a4fba3
> > --- /dev/null
> > +++ b/lib/raid/module.c
> > @@ -0,0 +1,437 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include "internal.h"
> > +#include "cpu.h"
> > +
> > +/*
> > + * Initializes and selects the best algorithm.
> > + */
> > +void raid_init(void)
> > +{
> > +	/* setup parity functions */
> > +	if (sizeof(void *) == 8) {
> > +		raid_par_ptr[0] = raid_par1_int64;
> > +		raid_par_ptr[1] = raid_par2_int64;
> > +	} else {
> > +		raid_par_ptr[0] = raid_par1_int32;
> > +		raid_par_ptr[1] = raid_par2_int32;
> > +	}
> > +	raid_par_ptr[2] = raid_par3_int8;
> > +	raid_par_ptr[3] = raid_par4_int8;
> > +	raid_par_ptr[4] = raid_par5_int8;
> > +	raid_par_ptr[5] = raid_par6_int8;
> > +
> > +	/* in kernel mode use the optimized xor_blocks() */
> > +#ifdef __KERNEL__
> > +	raid_par_ptr[0] = raid_par1_xorblocks;
> > +#endif
> > +	/* if RAID6 is present, use it */
> > +#ifdef RAID_USE_RAID6_PQ
> > +	raid_par_ptr[1] = raid_par2_raid6;
> > +#endif
> > +
> > +	/* optimized SSE2 functions */
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_sse2()) {
> > +		raid_par_ptr[0] = raid_par1_sse2;
> > +		raid_par_ptr[1] = raid_par2_sse2;
> > +#ifdef CONFIG_X86_64
> > +		raid_par_ptr[1] = raid_par2_sse2ext;
> > +#endif
> > +	}
> > +#endif
> > +
> > +	/* optimized SSSE3 functions */
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		raid_par_ptr[2] = raid_par3_ssse3;
> > +		raid_par_ptr[3] = raid_par4_ssse3;
> > +		raid_par_ptr[4] = raid_par5_ssse3;
> > +		raid_par_ptr[5] = raid_par6_ssse3;
> > +#ifdef CONFIG_X86_64
> > +		raid_par_ptr[2] = raid_par3_ssse3ext;
> > +		raid_par_ptr[3] = raid_par4_ssse3ext;
> > +		raid_par_ptr[4] = raid_par5_ssse3ext;
> > +		raid_par_ptr[5] = raid_par6_ssse3ext;
> > +#endif
> > +	}
> > +#endif
> > +
> > +	/* setup recovering functions */
> > +	raid_rec_ptr[0] = raid_rec1_int8;
> > +	raid_rec_ptr[1] = raid_rec2_int8;
> > +	raid_rec_ptr[2] = raid_recX_int8;
> > +	raid_rec_ptr[3] = raid_recX_int8;
> > +	raid_rec_ptr[4] = raid_recX_int8;
> > +	raid_rec_ptr[5] = raid_recX_int8;
> > +
> > +	/* optimized SSSE3 functions */
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		raid_rec_ptr[0] = raid_rec1_ssse3;
> > +		raid_rec_ptr[1] = raid_rec2_ssse3;
> > +		raid_rec_ptr[2] = raid_recX_ssse3;
> > +		raid_rec_ptr[3] = raid_recX_ssse3;
> > +		raid_rec_ptr[4] = raid_recX_ssse3;
> > +		raid_rec_ptr[5] = raid_recX_ssse3;
> > +	}
> > +#endif
> > +}
> > +
> > +/*
> > + * Refence parity computation.
> > + */
> > +void raid_par_ref(int nd, int np, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	size_t i;
> > +
> > +	for (i = 0; i < size; ++i) {
> > +		uint8_t p[RAID_PARITY_MAX];
> > +		int j, d;
> > +
> > +		for (j = 0; j < np; ++j)
> > +			p[j] = 0;
> > +
> > +		for (d = 0; d < nd; ++d) {
> > +			uint8_t b = v[d][i];
> > +
> > +			for (j = 0; j < np; ++j)
> > +				p[j] ^= gfmul[b][gfgen[j][d]];
> > +		}
> > +
> > +		for (j = 0; j < np; ++j)
> > +			v[nd + j][i] = p[j];
> > +	}
> > +}
> > +
> > +/*
> > + * Size of the blocks to test.
> > + */
> > +#define TEST_SIZE PAGE_SIZE
> > +
> > +/*
> > + * Number of data blocks to test.
> > + */
> > +#define TEST_COUNT (65536 / TEST_SIZE)
> > +
> > +/*
> > + * Period for the speed test.
> > + */
> > +#ifdef __KERNEL__ /* to build the user mode test */
> > +#define TEST_PERIOD 16
> > +#else
> > +#define TEST_PERIOD 512 /* more time in usermode */
> > +#endif
> > +
> > +/*
> > + * Parity generation test.
> > + */
> > +static int raid_test_par(int nd, int np, size_t size, void **v, void **ref)
> > +{
> > +	int i;
> > +	void *t[TEST_COUNT + RAID_PARITY_MAX];
> > +
> > +	/* setup data */
> > +	for (i = 0; i < nd; ++i)
> > +		t[i] = ref[i];
> > +
> > +	/* setup parity */
> > +	for (i = 0; i < np; ++i)
> > +		t[nd+i] = v[nd+i];
> > +
> > +	raid_par(nd, np, size, t);
> > +
> > +	/* compare parity */
> > +	for (i = 0; i < np; ++i) {
> > +		if (memcmp(t[nd+i], ref[nd+i], size) != 0) {
> > +			pr_err("raid: Self test failed!\n");
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Recovering test.
> > + */
> > +static int raid_test_rec(int nrd, int *id, int nrp, int *ip, int nd, int np, size_t size, void **v, void **ref)
> > +{
> > +	int i, j;
> > +	void *t[TEST_COUNT + RAID_PARITY_MAX];
> > +
> > +	/* setup data */
> > +	for (i = 0, j = 0; i < nd; ++i) {
> > +		if (j < nrd && id[j] == i) {
> > +			t[i] = v[i];
> > +			++j;
> > +		} else {
> > +			t[i] = ref[i];
> > +		}
> > +	}
> > +
> > +	/* setup parity */
> > +	for (i = 0, j = 0; i < np; ++i) {
> > +		if (j < nrp && ip[j] == i) {
> > +			t[nd+i] = v[nd+i];
> > +			++j;
> > +		} else {
> > +			t[nd+i] = ref[nd+i];
> > +		}
> > +	}
> > +
> > +	raid_rec(nrd, id, nrp, ip, nd, np, size, t);
> > +
> > +	/* compare all data and parity */
> > +	for (i = 0; i < nd+np; ++i) {
> > +		if (t[i] != ref[i] && memcmp(t[i], ref[i], size) != 0) {
> > +			pr_err("raid: Self test failed!\n");
> > +			return -EINVAL;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Basic functionality self test.
> > + */
> > +int raid_selftest(void)
> > +{
> > +	const int nd = TEST_COUNT;
> > +	const size_t size = TEST_SIZE;
> > +	const int nv = nd + RAID_PARITY_MAX * 2;
> > +	int order;
> > +	uint8_t *pages;
> > +	void *v[nd + RAID_PARITY_MAX * 2];
> > +	void *ref[nd + RAID_PARITY_MAX];
> > +	int id[RAID_PARITY_MAX];
> > +	int ip[RAID_PARITY_MAX];
> > +	int i, np;
> > +	int ret = 0;
> > +
> > +	/* ensure to have enough space for data */
> > +	BUG_ON(nd * size > 65536);
> > +
> > +	/* allocates pages for data and parity */
> > +	order = get_order(nv * size);
> > +	pages = (void *)__get_free_pages(GFP_KERNEL, order);
> > +	if (!pages) {
> > +		pr_err("raid: No memory available.\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	/* setup working vector */
> > +	for (i = 0; i < nv; ++i)
> > +		v[i] = pages + size * i;
> > +
> > +	/* use the multiplication table as data */
> > +	for (i = 0; i < nd; ++i)
> > +		ref[i] = ((uint8_t *)gfmul) + size * i;
> > +
> > +	/* setup reference parity */
> > +	for (i = 0; i < RAID_PARITY_MAX; ++i)
> > +		ref[nd+i] = v[nd+RAID_PARITY_MAX+i];
> > +
> > +	/* compute reference parity */
> > +	raid_par_ref(nd, RAID_PARITY_MAX, size, ref);
> > +
> > +	/* test for each parity level */
> > +	for (np = 1; np <= RAID_PARITY_MAX; ++np) {
> > +		/* test parity generation */
> > +		ret = raid_test_par(nd, np, size, v, ref);
> > +		if (ret != 0)
> > +			goto bail;
> > +
> > +		/* test recovering with full broken data disks */
> > +		for (i = 0; i < np; ++i)
> > +			id[i] = nd - np + i;
> > +
> > +		ret = raid_test_rec(np, id, 0, ip, nd, np, size, v, ref);
> > +		if (ret != 0)
> > +			goto bail;
> > +
> > +		/* test recovering with half broken data and ending parity */
> > +		for (i = 0; i < np / 2; ++i)
> > +			id[i] = i;
> > +
> > +		for (i = 0; i < (np + 1) / 2; ++i)
> > +			ip[i] = np - np / 2 + i;
> > +
> > +		ret = raid_test_rec(np / 2, id, np / 2, ip, nd, np, size, v, ref);
> > +		if (ret != 0)
> > +			goto bail;
> > +
> > +		/* test recovering with half broken data and leading parity */
> > +		for (i = 0; i < np / 2; ++i)
> > +			id[i] = i;
> > +
> > +		for (i = 0; i < (np + 1) / 2; ++i)
> > +			ip[i] = i;
> > +
> > +		ret = raid_test_rec(np / 2, id, np / 2, ip, nd, np, size, v, ref);
> > +		if (ret != 0)
> > +			goto bail;
> > +	}
> > +
> > +bail:
> > +	free_pages((unsigned long)pages, order);
> > +
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Test the speed of a single function.
> > + */
> > +static void raid_test_speed(
> > +	void (*func)(int nd, size_t size, void **vv),
> > +	const char *tag, const char *imp,
> > +	void **vv)
> > +{
> > +	unsigned count;
> > +	unsigned long j_start, j_stop;
> > +	unsigned long speed;
> > +
> > +	count = 0;
> > +
> > +	preempt_disable();
> > +
> > +	j_start = jiffies;
> > +	while (jiffies == j_start)
> > +		cpu_relax();
> > +
> > +	j_start = jiffies;
> > +	j_stop = j_start + TEST_PERIOD;
> > +	while (time_before(jiffies, j_stop)) {
> > +		func(TEST_COUNT, TEST_SIZE, vv);
> > +		++count;
> > +	}
> > +
> > +	preempt_enable();
> > +
> > +	speed = count * HZ / (TEST_PERIOD * 1024 * 1024 / (TEST_SIZE * TEST_COUNT));
> > +
> > +	pr_info("raid: %-4s %-6s %5ld MB/s\n", tag, imp, speed);
> > +}
> > +
> > +/*
> > + * Basic speed test.
> > + */
> > +int raid_speedtest(void)
> > +{
> > +	const int nd = TEST_COUNT;
> > +	const size_t size = TEST_SIZE;
> > +	int order;
> > +	uint8_t *pages;
> > +	void *v[nd + RAID_PARITY_MAX];
> > +	int i;
> > +
> > +	/* ensure to have enough space for data */
> > +	BUG_ON(nd * size > 65536);
> > +
> > +	/* allocates pages for parity */
> > +	order = get_order(RAID_PARITY_MAX * size);
> > +	pages = (void *)__get_free_pages(GFP_KERNEL, order);
> > +	if (!pages) {
> > +		pr_err("raid: No memory available.\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	/* use the multiplication table as data */
> > +	for (i = 0; i < nd; ++i)
> > +		v[i] = ((uint8_t *)gfmul) + size * i;
> > +
> > +	/* setup parity */
> > +	for (i = 0; i < RAID_PARITY_MAX; ++i)
> > +		v[nd+i] = pages + size * i;
> > +
> > +	raid_test_speed(raid_par1_int32, "par1", "int32", v);
> > +	raid_test_speed(raid_par2_int32, "par2", "int32", v);
> > +	raid_test_speed(raid_par1_int64, "par1", "int64", v);
> > +	raid_test_speed(raid_par2_int64, "par2", "int64", v);
> > +	raid_test_speed(raid_par3_int8, "par3", "int8", v);
> > +	raid_test_speed(raid_par4_int8, "par4", "int8", v);
> > +	raid_test_speed(raid_par5_int8, "par5", "int8", v);
> > +	raid_test_speed(raid_par6_int8, "par6", "int8", v);
> > +#ifdef __KERNEL__
> > +	raid_test_speed(raid_par1_xorblocks, "par1", "xor", v);
> > +#endif
> > +#ifdef RAID_USE_RAID6_PQ
> > +	raid_test_speed(raid_par2_raid6, "par2", "raid6", v);
> > +#endif
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_sse2()) {
> > +		raid_test_speed(raid_par1_sse2, "par1", "sse2", v);
> > +		raid_test_speed(raid_par2_sse2, "par2", "sse2", v);
> > +	}
> > +	if (raid_cpu_has_ssse3()) {
> > +		raid_test_speed(raid_par3_ssse3, "par3", "ssse3", v);
> > +		raid_test_speed(raid_par4_ssse3, "par4", "ssse3", v);
> > +		raid_test_speed(raid_par5_ssse3, "par5", "ssse3", v);
> > +		raid_test_speed(raid_par6_ssse3, "par6", "ssse3", v);
> > +#ifdef CONFIG_X86_64
> > +		raid_test_speed(raid_par2_sse2ext, "par2", "sse2e", v);
> > +		raid_test_speed(raid_par3_ssse3ext, "par3", "ssse3e", v);
> > +		raid_test_speed(raid_par4_ssse3ext, "par4", "ssse3e", v);
> > +		raid_test_speed(raid_par5_ssse3ext, "par5", "ssse3e", v);
> > +		raid_test_speed(raid_par6_ssse3ext, "par6", "ssse3e", v);
> > +#endif
> > +	}
> > +#endif
> > +
> > +	free_pages((unsigned long)pages, order);
> > +
> > +	return 0;
> > +}
> > +
> > +#ifdef __KERNEL__ /* to build the user mode test */
> > +static int speedtest;
> > +
> > +int __init raid_cauchy_init(void)
> > +{
> > +	int ret;
> > +
> > +	raid_init();
> > +
> > +	ret = raid_selftest();
> > +	if (ret != 0)
> > +		return ret;
> > +
> > +#ifdef RAID_USE_RAID6_PQ
> > +	pr_info("raid: Self test passed (using raid6)\n");
> > +#else
> > +	pr_info("raid: Self test passed\n");
> > +#endif
> > +
> > +	if (speedtest)
> > +		raid_speedtest();
> > +
> > +	return 0;
> > +}
> > +
> > +static void raid_cauchy_exit(void)
> > +{
> > +}
> > +
> > +subsys_initcall(raid_cauchy_init);
> > +module_exit(raid_cauchy_exit);
> > +module_param(speedtest, int, 0);
> > +MODULE_PARM_DESC(speedtest, "Runs a startup speed test");
> > +MODULE_AUTHOR("Andrea Mazzoleni <amadvance@...il.com>");
> > +MODULE_LICENSE("GPL");
> > +MODULE_DESCRIPTION("RAID Cauchy functions");
> > +#endif
> > +
> > diff --git a/lib/raid/raid.c b/lib/raid/raid.c
> > new file mode 100644
> > index 0000000..918cb67
> > --- /dev/null
> > +++ b/lib/raid/raid.c
> > @@ -0,0 +1,425 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include "internal.h"
> > +#include "gf.h"
> > +
> > +/*
> > + * This is a RAID implementation working in the Galois Field GF(2^8) with
> > + * the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (285 decimal), and
> > + * supporting up to six parity levels.
> > + *
> > + * For RAID5 and RAID6 it works as as described in the H. Peter Anvin's
> > + * paper "The mathematics of RAID-6" [1]. Please refer to this paper for a
> > + * complete explanation.
> > + *
> > + * To support triple parity, it was first evaluated and then dropped, an
> > + * extension of the same approach, with additional parity coefficients set
> > + * as powers of 2^-1, with equations:
> > + *
> > + * P = sum(Di)
> > + * Q = sum(2^i * Di)
> > + * R = sum(2^-i * Di) with 0<=i<N
> > + *
> > + * This approach works well for triple parity and it's very efficient,
> > + * because we can implement very fast parallel multiplications and
> > + * divisions by 2 in GF(2^8).
> > + *
> > + * It's also similar at the approach used by ZFS RAIDZ3, with the
> > + * difference that ZFS uses powers of 4 instead of 2^-1.
> > + *
> > + * Unfortunately it doesn't work beyond triple parity, because whatever
> > + * value we choose to generate the power coefficients to compute other
> > + * parities, the resulting equations are not solvable for some
> > + * combinations of missing disks.
> > + *
> > + * This is expected, because the Vandermonde matrix used to compute the
> > + * parity has no guarantee to have all submatrices not singular
> > + * [2, Chap 11, Problem 7] and this is a requirement to have
> > + * a MDS (Maximum Distance Separable) code [2, Chap 11, Theorem 8].
> > + *
> > + * To overcome this limitation, we use a Cauchy matrix [3][4] to compute
> > + * the parity. A Cauchy matrix has the property to have all the square
> > + * submatrices not singular, resulting in always solvable equations,
> > + * for any combination of missing disks.
> > + *
> > + * The problem of this approach is that it requires the use of
> > + * generic multiplications, and not only by 2 or 2^-1, potentially
> > + * affecting badly the performance.
> > + *
> > + * Hopefully there is a method to implement parallel multiplications
> > + * using SSSE3 instructions [1][5]. Method competitive with the
> > + * computation of triple parity using power coefficients.
> > + *
> > + * Another important property of the Cauchy matrix is that we can setup
> > + * the first two rows with coeffients equal at the RAID5 and RAID6 approach
> > + * decribed, resulting in a compatible extension, and requiring SSSE3
> > + * instructions only if triple parity or beyond is used.
> > + *
> > + * The matrix is also adjusted, multipling each row by a constant factor
> > + * to make the first column of all 1, to optimize the computation for
> > + * the first disk.
> > + *
> > + * This results in the matrix A[row,col] defined as:
> > + *
> > + * 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01...
> > + * 01 02 04 08 10 20 40 80 1d 3a 74 e8 cd 87 13 26 4c 98 2d 5a b4 75...
> > + * 01 f5 d2 c4 9a 71 f1 7f fc 87 c1 c6 19 2f 40 55 3d ba 53 04 9c 61...
> > + * 01 bb a6 d7 c7 07 ce 82 4a 2f a5 9b b6 60 f1 ad e7 f4 06 d2 df 2e...
> > + * 01 97 7f 9c 7c 18 bd a2 58 1a da 74 70 a3 e5 47 29 07 f5 80 23 e9...
> > + * 01 2b 3f cf 73 2c d6 ed cb 74 15 78 8a c1 17 c9 89 68 21 ab 76 3b...
> > + *
> > + * This matrix supports 6 level of parity, one for each row, for up to 251
> > + * data disks, one for each column, with all the 377,342,351,231 square
> > + * submatrices not singular, verified also with brute-force.
> > + *
> > + * This matrix can be extended to support any number of parities, just
> > + * adding additional rows, and removing one column for each new row.
> > + * (see mktables.c for more details in how the matrix is generated)
> > + *
> > + * In details, parity is computed as:
> > + *
> > + * P = sum(Di)
> > + * Q = sum(2^i *  Di)
> > + * R = sum(A[2,i] * Di)
> > + * S = sum(A[3,i] * Di)
> > + * T = sum(A[4,i] * Di)
> > + * U = sum(A[5,i] * Di) with 0<=i<N
> > + *
> > + * To recover from a failure of six disks at indexes x,y,z,h,v,w,
> > + * with 0<=x<y<z<h<v<w<N, we compute the parity of the available N-6
> > + * disks as:
> > + *
> > + * Pa = sum(Di)
> > + * Qa = sum(2^i * Di)
> > + * Ra = sum(A[2,i] * Di)
> > + * Sa = sum(A[3,i] * Di)
> > + * Ta = sum(A[4,i] * Di)
> > + * Ua = sum(A[5,i] * Di) with 0<=i<N,i!=x,i!=y,i!=z,i!=h,i!=v,i!=w.
> > + *
> > + * And if we define:
> > + *
> > + * Pd = Pa + P
> > + * Qd = Qa + Q
> > + * Rd = Ra + R
> > + * Sd = Sa + S
> > + * Td = Ta + T
> > + * Ud = Ua + U
> > + *
> > + * we can sum these two sets of equations, obtaining:
> > + *
> > + * Pd =          Dx +          Dy +          Dz +          Dh +          Dv +          Dw
> > + * Qd =    2^x * Dx +    2^y * Dy +    2^z * Dz +    2^h * Dh +    2^v * Dv +    2^w * Dw
> > + * Rd = A[2,x] * Dx + A[2,y] * Dy + A[2,z] * Dz + A[2,h] * Dh + A[2,v] * Dv + A[2,w] * Dw
> > + * Sd = A[3,x] * Dx + A[3,y] * Dy + A[3,z] * Dz + A[3,h] * Dh + A[3,v] * Dv + A[3,w] * Dw
> > + * Td = A[4,x] * Dx + A[4,y] * Dy + A[4,z] * Dz + A[4,h] * Dh + A[4,v] * Dv + A[4,w] * Dw
> > + * Ud = A[5,x] * Dx + A[5,y] * Dy + A[5,z] * Dz + A[5,h] * Dh + A[5,v] * Dv + A[5,w] * Dw
> > + *
> > + * A linear system always solvable because the coefficients matrix is
> > + * always not singular due the properties of the matrix A[].
> > + *
> > + * Resulting speed in x64, with 16 data disks, using a stripe of 4 KiB,
> > + * for a Core i7-3740QM CPU @ 2.7GHz is:
> > + *
> > + *           int8   int32   int64    sse2   sse2e   ssse3  ssse3e
> > + *   par1           11469   21579   44743
> > + *   par2            3474    6176   17930   20435
> > + *   par3     850                                    7908    9069
> > + *   par4     647                                    6357    7159
> > + *   par5     527                                    5041    5412
> > + *   par6     432                                    4094    4470
> > + *
> > + * Values are in MiB/s of data processed, not counting generated parity.
> > + *
> > + * References:
> > + * [1] Anvin, "The mathematics of RAID-6", 2004
> > + * [2] MacWilliams, Sloane, "The Theory of Error-Correcting Codes", 1977
> > + * [3] Blomer, "An XOR-Based Erasure-Resilient Coding Scheme", 1995
> > + * [4] Roth, "Introduction to Coding Theory", 2006
> > + * [5] Plank, "Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions", 2013
> > + */
> > +
> > +#ifdef __KERNEL__ /* to build the user mode test */
> > +/**
> > + * Buffer filled with 0 used in recovering.
> > + */
> > +static uint8_t raid_zero_block[PAGE_SIZE] __aligned(256);
> > +#else
> > +extern uint8_t raid_zero_block[] __aligned(256);
> > +#endif
> > +
> > +/*
> > + * PAR1 (RAID5 with xor) implementation using the kernel xor_blocks()
> > + * function.
> > + */
> > +void raid_par1_xorblocks(int nd, size_t size, void **v)
> > +{
> > +	int i;
> > +
> > +	/* copy the first block */
> > +	memcpy(v[nd], v[0], size);
> > +
> > +	i = 1;
> > +	while (i < nd) {
> > +		int run = nd - i;
> > +
> > +		/* xor_blocks supports no more than MAX_XOR_BLOCKS blocks */
> > +		if (run > MAX_XOR_BLOCKS)
> > +			run = MAX_XOR_BLOCKS;
> > +
> > +		xor_blocks(run, size, v[nd], v + i);
> > +
> > +		i += run;
> > +	}
> > +}
> > +
> > +#ifdef RAID_USE_RAID6_PQ
> > +/**
> > + * PAR2 (RAID6 with powers of 2) implementation using raid6 library.
> > + */
> > +void raid_par2_raid6(int nd, size_t size, void **vv)
> > +{
> > +	raid6_call.gen_syndrome(nd + 2, size, vv);
> > +}
> > +#endif
> > +
> > +/* internal forwarder */
> > +void (*raid_par_ptr[RAID_PARITY_MAX])(int nd, size_t size, void **vv);
> > +
> > +void raid_par(int nd, int np, size_t size, void **v)
> > +{
> > +	BUG_ON(np < 1 || np > RAID_PARITY_MAX);
> > +	BUG_ON(size % 64 != 0);
> > +
> > +	raid_par_ptr[np - 1](nd, size, v);
> > +}
> > +EXPORT_SYMBOL_GPL(raid_par);
> > +
> > +/**
> > + * Inverts the square matrix M of size nxn into V.
> > + * We use Gauss elimination to invert.
> > + */
> > +void raid_invert(uint8_t *M, uint8_t *V, int n)
> > +{
> > +	int i, j, k;
> > +
> > +	/* set the identity matrix in V */
> > +	for (i = 0; i < n; ++i)
> > +		for (j = 0; j < n; ++j)
> > +			V[i*n+j] = i == j;
> > +
> > +	/* for each element in the diagonal */
> > +	for (k = 0; k < n; ++k) {
> > +		uint8_t f;
> > +
> > +		/* the diagonal element cannot be 0 because */
> > +		/* we are inverting matrices with all the square submatrices */
> > +		/* not singular */
> > +		BUG_ON(M[k*n+k] == 0);
> > +
> > +		/* make the diagonal element to be 1 */
> > +		f = inv(M[k*n+k]);
> > +		for (j = 0; j < n; ++j) {
> > +			M[k*n+j] = mul(f, M[k*n+j]);
> > +			V[k*n+j] = mul(f, V[k*n+j]);
> > +		}
> > +
> > +		/* make all the elements over and under the diagonal to be 0 */
> > +		for (i = 0; i < n; ++i) {
> > +			if (i == k)
> > +				continue;
> > +			f = M[i*n+k];
> > +			for (j = 0; j < n; ++j) {
> > +				M[i*n+j] ^= mul(f, M[k*n+j]);
> > +				V[i*n+j] ^= mul(f, V[k*n+j]);
> > +			}
> > +		}
> > +	}
> > +}
> > +
> > +/**
> > + * Computes the parity without the missing data blocks
> > + * and store it in the buffers of such data blocks.
> > + *
> > + * This is the parity expressed as Pa,Qa,Ra,Sa,Ta,Ua
> > + * in the equations.
> > + *
> > + * Note that all the other parities not in the ip[] vector
> > + * are destroyed.
> > + */
> > +void raid_delta_gen(int nr, int *id, int *ip, int nd, size_t size, void **v)
> > +{
> > +	void *p[RAID_PARITY_MAX];
> > +	void *pa[RAID_PARITY_MAX];
> > +	int i;
> > +
> > +	for (i = 0; i < nr; ++i) {
> > +		/* keep a copy of the parity buffer */
> > +		p[i] = v[nd+ip[i]];
> > +
> > +		/* buffer for missing data blocks */
> > +		pa[i] = v[id[i]];
> > +
> > +		/* set at zero the missing data blocks */
> > +		v[id[i]] = raid_zero_block;
> > +
> > +		/* compute the parity over the missing data blocks */
> > +		v[nd+ip[i]] = pa[i];
> > +	}
> > +
> > +	/* recompute the minimal parity required */
> > +	raid_par(nd, ip[nr - 1] + 1, size, v);
> > +
> > +	for (i = 0; i < nr; ++i) {
> > +		/* restore disk buffers as before */
> > +		v[id[i]] = pa[i];
> > +
> > +		/* restore parity buffers as before */
> > +		v[nd+ip[i]] = p[i];
> > +	}
> > +}
> > +
> > +/**
> > + * Recover failure of one data block for PAR1.
> > + *
> > + * Starting from the equation:
> > + *
> > + * Pd = Dx
> > + *
> > + * and solving we get:
> > + *
> > + * Dx = Pd
> > + */
> > +void raid_rec1_par1(int *id, int nd, size_t size, void **v)
> > +{
> > +	void *p;
> > +	void *pa;
> > +
> > +	/* for PAR1 we can directly compute the missing block */
> > +	/* and we don't need to use the zero buffer */
> > +	p = v[nd];
> > +	pa = v[id[0]];
> > +
> > +	/* use the parity as missing data block */
> > +	v[id[0]] = p;
> > +
> > +	/* compute the parity over the missing data block */
> > +	v[nd] = pa;
> > +
> > +	/* compute */
> > +	raid_par(nd, 1, size, v);
> > +
> > +	/* restore as before */
> > +	v[id[0]] = pa;
> > +	v[nd] = p;
> > +}
> > +
> > +/**
> > + * Recover failure of two data blocks for PAR2.
> > + *
> > + * Starting from the equations:
> > + *
> > + * Pd = Dx + Dy
> > + * Qd = 2^id[0] * Dx + 2^id[1] * Dy
> > + *
> > + * and solving we get:
> > + *
> > + *               1                     2^(-id[0])
> > + * Dy = ------------------- * Pd + ------------------- * Qd
> > + *      2^(id[1]-id[0]) + 1        2^(id[1]-id[0]) + 1
> > + *
> > + * Dx = Dy + Pd
> > + *
> > + * with conditions:
> > + *
> > + * 2^id[0] != 0
> > + * 2^(id[1]-id[0]) + 1 != 0
> > + *
> > + * That are always satisfied for any 0<=id[0]<id[1]<255.
> > + */
> > +void raid_rec2_par2(int *id, int *ip, int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	size_t i;
> > +	uint8_t *p;
> > +	uint8_t *pa;
> > +	uint8_t *q;
> > +	uint8_t *qa;
> > +	const uint8_t *T[2];
> > +
> > +	/* get multiplication tables */
> > +	T[0] = table(inv(pow2(id[1]-id[0]) ^ 1));
> > +	T[1] = table(inv(pow2(id[0]) ^ pow2(id[1])));
> > +
> > +	/* compute delta parity */
> > +	raid_delta_gen(2, id, ip, nd, size, vv);
> > +
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	pa = v[id[0]];
> > +	qa = v[id[1]];
> > +
> > +	for (i = 0; i < size; ++i) {
> > +		/* delta */
> > +		uint8_t Pd = p[i] ^ pa[i];
> > +		uint8_t Qd = q[i] ^ qa[i];
> > +
> > +		/* reconstruct */
> > +		uint8_t Dy = T[0][Pd] ^ T[1][Qd];
> > +		uint8_t Dx = Pd ^ Dy;
> > +
> > +		/* set */
> > +		pa[i] = Dx;
> > +		qa[i] = Dy;
> > +	}
> > +}
> > +
> > +/* internal forwarder */
> > +void (*raid_rec_ptr[RAID_PARITY_MAX])(
> > +	int nr, int *id, int *ip, int nd, size_t size, void **vv);
> > +
> > +void raid_rec(int nrd, int *id, int nrp, int *ip, int nd, int np, size_t size, void **v)
> > +{
> > +	BUG_ON(nrd > nd);
> > +	BUG_ON(nrd + nrp > np);
> > +	BUG_ON(size % 64 != 0);
> > +	BUG_ON(size > PAGE_SIZE);
> > +
> > +	/* if failed data is present */
> > +	if (nrd != 0) {
> > +		int iu[RAID_PARITY_MAX];
> > +		int i, j, k;
> > +
> > +		/* setup the vector of parities to use */
> > +		for (i = 0, j = 0, k = 0; i < np; ++i) {
> > +			if (j < nrp && ip[j] == i) {
> > +				/* this parity has to be recovered */
> > +				++j;
> > +			} else {
> > +				/* this parity is used for recovering */
> > +				iu[k] = i;
> > +				++k;
> > +			}
> > +		}
> > +
> > +		/* recover the data, and limit the parity use to needed ones */
> > +		raid_rec_ptr[nrd - 1](nrd, id, iu, nd, size, v);
> > +	}
> > +
> > +	/* recompute all the parities up to the last bad one */
> > +	if (nrp != 0)
> > +		raid_par(nd, ip[nrp - 1] + 1, size, v);
> > +}
> > +EXPORT_SYMBOL_GPL(raid_rec);
> > +
> > diff --git a/lib/raid/sort.c b/lib/raid/sort.c
> > new file mode 100644
> > index 0000000..8afac72
> > --- /dev/null
> > +++ b/lib/raid/sort.c
> > @@ -0,0 +1,72 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include "internal.h"
> > +
> > +#define RAID_SWAP(a, b) \
> > +	do { \
> > +		if (v[a] > v[b]) { \
> > +			int t = v[a]; \
> > +			v[a] = v[b]; \
> > +			v[b] = t; \
> > +		} \
> > +	} while (0)
> > +
> > +void raid_sort(int n, int *v)
> > +{
> > +	/* sorting networks generated with Batcher's Merge-Exchange */
> > +	switch (n) {
> > +	case 2:
> > +		RAID_SWAP(0, 1);
> > +		break;
> > +	case 3:
> > +		RAID_SWAP(0, 2);
> > +		RAID_SWAP(0, 1);
> > +		RAID_SWAP(1, 2);
> > +		break;
> > +	case 4:
> > +		RAID_SWAP(0, 2);
> > +		RAID_SWAP(1, 3);
> > +		RAID_SWAP(0, 1);
> > +		RAID_SWAP(2, 3);
> > +		RAID_SWAP(1, 2);
> > +		break;
> > +	case 5:
> > +		RAID_SWAP(0, 4);
> > +		RAID_SWAP(0, 2);
> > +		RAID_SWAP(1, 3);
> > +		RAID_SWAP(2, 4);
> > +		RAID_SWAP(0, 1);
> > +		RAID_SWAP(2, 3);
> > +		RAID_SWAP(1, 4);
> > +		RAID_SWAP(1, 2);
> > +		RAID_SWAP(3, 4);
> > +		break;
> > +	case 6:
> > +		RAID_SWAP(0, 4);
> > +		RAID_SWAP(1, 5);
> > +		RAID_SWAP(0, 2);
> > +		RAID_SWAP(1, 3);
> > +		RAID_SWAP(2, 4);
> > +		RAID_SWAP(3, 5);
> > +		RAID_SWAP(0, 1);
> > +		RAID_SWAP(2, 3);
> > +		RAID_SWAP(4, 5);
> > +		RAID_SWAP(1, 4);
> > +		RAID_SWAP(1, 2);
> > +		RAID_SWAP(3, 4);
> > +		break;
> > +	}
> > +}
> > +
> > diff --git a/lib/raid/test/Makefile b/lib/raid/test/Makefile
> > new file mode 100644
> > index 0000000..04e8e1e
> > --- /dev/null
> > +++ b/lib/raid/test/Makefile
> > @@ -0,0 +1,33 @@
> > +#
> > +# This is a simple Makefile to test some of the RAID code
> > +# from userspace.
> > +#
> > +
> > +CC  = gcc
> > +CFLAGS = -I.. -I../../../include -Wall -Wextra -g -O2
> > +LD = ld
> > +OBJS = raid.o int.o x86.o tables.o memory.o test.o sort.o module.o xor.o
> > +
> > +.c.o:
> > +	$(CC) $(CFLAGS) -c -o $@ $<
> > +
> > +%.c: ../%.c
> > +	cp -f $< $@
> > +
> > +all: fulltest speedtest selftest
> > +
> > +fulltest: $(OBJS) fulltest.o
> > +	$(CC) $(CFLAGS) -o fulltest $^
> > +
> > +speedtest: $(OBJS) speedtest.o
> > +	$(CC) $(CFLAGS) -o speedtest $^
> > +
> > +selftest: $(OBJS) selftest.o
> > +	$(CC) $(CFLAGS) -o selftest $^
> > +
> > +tables.c: mktables
> > +	./mktables > tables.c
> > +
> > +clean:
> > +	rm -f *.o mktables.c mktables tables.c fulltest speedtest selftest
> > +
> > diff --git a/lib/raid/test/combo.h b/lib/raid/test/combo.h
> > new file mode 100644
> > index 0000000..31530a2
> > --- /dev/null
> > +++ b/lib/raid/test/combo.h
> > @@ -0,0 +1,155 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#ifndef __RAID_COMBO_H
> > +#define __RAID_COMBO_H
> > +
> > +#include <assert.h>
> > +
> > +/**
> > + * Get the first permutation with repetition of r of n elements.
> > + *
> > + * Typical use is with permutation_next() in the form :
> > + *
> > + * int i[R];
> > + * permutation_first(R, N, i);
> > + * do {
> > + *    code using i[0], i[1], ..., i[R-1]
> > + * } while (permutation_next(R, N, i));
> > + *
> > + * It's equivalent at the code :
> > + *
> > + * for(i[0]=0;i[0]<N;++i[0])
> > + *     for(i[1]=0;i[1]<N;++i[1])
> > + *        ...
> > + *            for(i[R-2]=0;i[R-2]<N;++i[R-2])
> > + *                for(i[R-1]=0;i[R-1]<N;++i[R-1])
> > + *                    code using i[0], i[1], ..., i[R-1]
> > + */
> > +static inline void permutation_first(int r, int n, int *c)
> > +{
> > +	int i;
> > +
> > +	(void)n; /* unused, but kept for clarity */
> > +	assert(0 < r && r <= n);
> > +
> > +	for (i = 0; i < r; ++i)
> > +		c[i] = 0;
> > +}
> > +
> > +/**
> > + * Get the next permutation with repetition of r of n elements.
> > + * Return ==0 when finished.
> > + */
> > +static inline int permutation_next(int r, int n, int *c)
> > +{
> > +	int i = r - 1; /* present position */
> > +
> > +recurse:
> > +	/* next element at position i */
> > +	++c[i];
> > +
> > +	/* if the position has reached the max */
> > +	if (c[i] >= n) {
> > +
> > +		/* if we are at the first level, we have finished */
> > +		if (i == 0)
> > +			return 0;
> > +
> > +		/* increase the previous position */
> > +		--i;
> > +		goto recurse;
> > +	}
> > +
> > +	++i;
> > +
> > +	/* initialize all the next positions, if any */
> > +	while (i < r) {
> > +		c[i] = 0;
> > +		++i;
> > +	}
> > +
> > +	return 1;
> > +}
> > +
> > +/**
> > + * Get the first combination without repetition of r of n elements.
> > + *
> > + * Typical use is with combination_next() in the form :
> > + *
> > + * int i[R];
> > + * combination_first(R, N, i);
> > + * do {
> > + *    code using i[0], i[1], ..., i[R-1]
> > + * } while (combination_next(R, N, i));
> > + *
> > + * It's equivalent at the code :
> > + *
> > + * for(i[0]=0;i[0]<N-(R-1);++i[0])
> > + *     for(i[1]=i[0]+1;i[1]<N-(R-2);++i[1])
> > + *        ...
> > + *            for(i[R-2]=i[R-3]+1;i[R-2]<N-1;++i[R-2])
> > + *                for(i[R-1]=i[R-2]+1;i[R-1]<N;++i[R-1])
> > + *                    code using i[0], i[1], ..., i[R-1]
> > + */
> > +static inline void combination_first(int r, int n, int *c)
> > +{
> > +	int i;
> > +
> > +	(void)n; /* unused, but kept for clarity */
> > +	assert(0 < r && r <= n);
> > +
> > +	for (i = 0; i < r; ++i)
> > +		c[i] = i;
> > +}
> > +
> > +/**
> > + * Get the next combination without repetition of r of n elements.
> > + * Return ==0 when finished.
> > + */
> > +static inline int combination_next(int r, int n, int *c)
> > +{
> > +	int i = r - 1; /* present position */
> > +	int h = n; /* high limit for this position */
> > +
> > +recurse:
> > +	/* next element at position i */
> > +	++c[i];
> > +
> > +	/* if the position has reached the max */
> > +	if (c[i] >= h) {
> > +
> > +		/* if we are at the first level, we have finished */
> > +		if (i == 0)
> > +			return 0;
> > +
> > +		/* increase the previous position */
> > +		--i;
> > +		--h;
> > +		goto recurse;
> > +	}
> > +
> > +	++i;
> > +
> > +	/* initialize all the next positions, if any */
> > +	while (i < r) {
> > +		/* each position start at the next value of the previous one */
> > +		c[i] = c[i-1] + 1;
> > +		++i;
> > +	}
> > +
> > +	return 1;
> > +}
> > +#endif
> > +
> > diff --git a/lib/raid/test/fulltest.c b/lib/raid/test/fulltest.c
> > new file mode 100644
> > index 0000000..fa7c3dd
> > --- /dev/null
> > +++ b/lib/raid/test/fulltest.c
> > @@ -0,0 +1,76 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include "internal.h"
> > +#include "test.h"
> > +#include "cpu.h"
> > +
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +
> > +/*
> > + * Size of the blocks to test.
> > + */
> > +#define TEST_SIZE 256
> > +
> > +uint8_t raid_zero_block[TEST_SIZE] __aligned(256);
> > +
> > +int main(void)
> > +{
> > +	raid_init();
> > +
> > +	printf("RAID Cauchy test suite\n\n");
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_sse2())
> > +		printf("Including x86 SSE2 functions\n");
> > +	if (raid_cpu_has_ssse3())
> > +		printf("Including x86 SSSE3 functions\n");
> > +#endif
> > +#ifdef CONFIG_X86_64
> > +	printf("Including x64 extended SSE register set\n");
> > +#endif
> > +
> > +	printf("\nPlease wait about 60 seconds...\n\n");
> > +
> > +	printf("Test sorting...\n");
> > +	if (raid_test_sort() != 0) {
> > +		printf("FAILED!\n");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +	printf("Test combinations/permutations...\n");
> > +	if (raid_test_combo() != 0) {
> > +		printf("FAILED!\n");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +	printf("Test parity generation with %u data disks...\n", RAID_DATA_MAX);
> > +	if (raid_test_par(RAID_DATA_MAX, TEST_SIZE) != 0) {
> > +		printf("FAILED!\n");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +	printf("Test parity generation with 1 data disk...\n");
> > +	if (raid_test_par(1, TEST_SIZE) != 0) {
> > +		printf("FAILED!\n");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +	printf("Test recovering with all combinations of 32 data and 6 parity blocks...\n");
> > +	if (raid_test_rec(32, TEST_SIZE) != 0) {
> > +		printf("FAILED!\n");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +
> > +	printf("OK\n");
> > +	return 0;
> > +}
> > +
> > diff --git a/lib/raid/test/memory.c b/lib/raid/test/memory.c
> > new file mode 100644
> > index 0000000..6807ee4
> > --- /dev/null
> > +++ b/lib/raid/test/memory.c
> > @@ -0,0 +1,79 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include "internal.h"
> > +#include "memory.h"
> > +
> > +void *raid_malloc_align(size_t size, void **freeptr)
> > +{
> > +	unsigned char *ptr;
> > +	uintptr_t offset;
> > +
> > +	ptr = malloc(size + RAID_MALLOC_ALIGN);
> > +	if (!ptr)
> > +		return 0;
> > +
> > +	*freeptr = ptr;
> > +
> > +	offset = ((uintptr_t)ptr) % RAID_MALLOC_ALIGN;
> > +
> > +	if (offset != 0)
> > +		ptr += RAID_MALLOC_ALIGN - offset;
> > +
> > +	return ptr;
> > +}
> > +
> > +void **raid_malloc_vector(int nd, int n, size_t size, void **freeptr)
> > +{
> > +	void **v;
> > +	unsigned char *va;
> > +	int i;
> > +
> > +	v = malloc(n * sizeof(void *));
> > +	if (!v)
> > +		return 0;
> > +
> > +	va = raid_malloc_align(n * (size + RAID_MALLOC_DISPLACEMENT), freeptr);
> > +	if (!va) {
> > +		free(v);
> > +		return 0;
> > +	}
> > +
> > +	for (i = 0; i < n; ++i) {
> > +		v[i] = va;
> > +		va += size + RAID_MALLOC_DISPLACEMENT;
> > +	}
> > +
> > +	/* reverse order of the data blocks */
> > +	/* because they are usually accessed from the last one */
> > +	for (i = 0; i < nd/2; ++i) {
> > +		void *ptr = v[i];
> > +		v[i] = v[nd - 1 - i];
> > +		v[nd - 1 - i] = ptr;
> > +	}
> > +
> > +	return v;
> > +}
> > +
> > +void raid_mrand_vector(int n, size_t size, void **vv)
> > +{
> > +	unsigned char **v = (unsigned char **)vv;
> > +	int i;
> > +	size_t j;
> > +
> > +	for (i = 0; i < n; ++i)
> > +		for (j = 0; j < size; ++j)
> > +			v[i][j] = rand();
> > +}
> > +
> > diff --git a/lib/raid/test/memory.h b/lib/raid/test/memory.h
> > new file mode 100644
> > index 0000000..b5c4e9a
> > --- /dev/null
> > +++ b/lib/raid/test/memory.h
> > @@ -0,0 +1,74 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#ifndef __RAID_MEMORY_H
> > +#define __RAID_MEMORY_H
> > +
> > +/**
> > + * Memory alignment provided by raid_malloc_align().
> > + *
> > + * It should guarantee good cache performance everywhere.
> > + */
> > +#define RAID_MALLOC_ALIGN 256
> > +
> > +/**
> > + * Memory displacement to avoid cache collisions on contiguous blocks,
> > + * used by raid_malloc_vector().
> > + *
> > + * When allocating a sequence of blocks with a size of power of 2,
> > + * there is the risk that the start of each block is mapped into the same
> > + * cache line or prefetching prediction, resulting in collisions if you
> > + * access all the blocks in parallel, from the start to the end.
> > + *
> > + * The selected value was choosen empirically with some speed tests
> > + * with 16 data buffers of 4 KB.
> > + *
> > + * These are the results with no displacement:
> > + *
> > + *            int8   int32   int64    sse2   sse2e   ssse3  ssse3e
> > + *    par1            6940   13971   29824
> > + *    par2            2530    4675   14840   16485
> > + *    par3     490                                    6859    7710
> > + *
> > + * These are the results with displacement resulting in improvments
> > + * in the order of 20% or more:
> > + *
> > + *            int8   int32   int64    sse2   sse2e   ssse3  ssse3e
> > + *    par1           11762   21450   44621
> > + *    par2            3520    6176   18100   20338
> > + *    par3     848                                    8009    9210
> > + *
> > + */
> > +#define RAID_MALLOC_DISPLACEMENT 64
> > +
> > +/**
> > + * Aligned malloc.
> > + */
> > +void *raid_malloc_align(size_t size, void **freeptr);
> > +
> > +/**
> > + * Aligned vector allocation.
> > + * Returns a vector of @n pointers, each one pointing to a block of
> > + * the specified @size.
> > + * The first @nd elements are reversed in order.
> > + */
> > +void **raid_malloc_vector(int nd, int n, size_t size, void **freeptr);
> > +
> > +/**
> > + * Fills the memory vector with random data.
> > + */
> > +void raid_mrand_vector(int n, size_t size, void **vv);
> > +
> > +#endif
> > +
> > diff --git a/lib/raid/test/selftest.c b/lib/raid/test/selftest.c
> > new file mode 100644
> > index 0000000..c10db29
> > --- /dev/null
> > +++ b/lib/raid/test/selftest.c
> > @@ -0,0 +1,41 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include "internal.h"
> > +#include "cpu.h"
> > +
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +
> > +uint8_t raid_zero_block[PAGE_SIZE] __aligned(256);
> > +
> > +int main(void)
> > +{
> > +	raid_init();
> > +
> > +	printf("RAID Cauchy selftest\n\n");
> > +
> > +	printf("Self test...\n");
> > +	if (raid_selftest() != 0) {
> > +		printf("FAILED!\n");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +	printf("OK\n\n");
> > +
> > +	printf("Speed test...\n");
> > +	raid_speedtest();
> > +
> > +	return 0;
> > +}
> > +
> > diff --git a/lib/raid/test/speedtest.c b/lib/raid/test/speedtest.c
> > new file mode 100644
> > index 0000000..d3784c3
> > --- /dev/null
> > +++ b/lib/raid/test/speedtest.c
> > @@ -0,0 +1,567 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include "internal.h"
> > +#include "memory.h"
> > +#include "cpu.h"
> > +
> > +#include <sys/time.h>
> > +#include <stdio.h>
> > +#include <inttypes.h>
> > +
> > +/*
> > + * Size of the blocks to test.
> > + */
> > +#define TEST_SIZE PAGE_SIZE
> > +
> > +/*
> > + * Number of data blocks to test.
> > + */
> > +#define TEST_COUNT (65536 / TEST_SIZE)
> > +
> > +uint8_t raid_zero_block[TEST_SIZE] __aligned(256);
> > +
> > +/**
> > + * Differential us of two timeval.
> > + */
> > +static int64_t diffgettimeofday(struct timeval *start, struct timeval *stop)
> > +{
> > +	int64_t d;
> > +
> > +	d = 1000000LL * (stop->tv_sec - start->tv_sec);
> > +	d += stop->tv_usec - start->tv_usec;
> > +
> > +	return d;
> > +}
> > +
> > +/**
> > + * Start time measurement.
> > + */
> > +#define SPEED_START \
> > +	count = 0; \
> > +	gettimeofday(&start, 0); \
> > +	do { \
> > +		for (i = 0; i < delta; ++i)
> > +
> > +/**
> > + * Stop time measurement.
> > + */
> > +#define SPEED_STOP \
> > +		count += delta; \
> > +		gettimeofday(&stop, 0); \
> > +	} while (diffgettimeofday(&start, &stop) < 1000000LL); \
> > +	ds = size * (int64_t)count * nd; \
> > +	dt = diffgettimeofday(&start, &stop);
> > +
> > +void speed(void)
> > +{
> > +	struct timeval start;
> > +	struct timeval stop;
> > +	int64_t ds;
> > +	int64_t dt;
> > +	int i, j;
> > +	int id[RAID_PARITY_MAX];
> > +	int ip[RAID_PARITY_MAX];
> > +	int count;
> > +	int delta = 10;
> > +	int size = TEST_SIZE;
> > +	int nd = TEST_COUNT;
> > +	int nv;
> > +	void *v_alloc;
> > +	void **v;
> > +
> > +	nv = nd + RAID_PARITY_MAX;
> > +
> > +	v = raid_malloc_vector(nd, nv, size, &v_alloc);
> > +
> > +	/* initialize disks with fixed data */
> > +	for (i = 0; i < nd; ++i)
> > +		memset(v[i], i, size);
> > +
> > +	/* basic disks and parity mapping */
> > +	for (i = 0; i < RAID_PARITY_MAX; ++i) {
> > +		id[i] = i;
> > +		ip[i] = i;
> > +	}
> > +
> > +	printf("Speed test using %u data buffers of %u bytes, for a total of %u KiB.\n", nd, size, nd * size / 1024);
> > +	printf("Memory blocks have a displacement of %u bytes to improve cache performance.\n", RAID_MALLOC_DISPLACEMENT);
> > +	printf("The reported value is the aggregate bandwidth of all data blocks in MiB/s,\n");
> > +	printf("not counting parity blocks.\n");
> > +	printf("\n");
> > +
> > +	printf("Memory write speed using the C memset() function:\n");
> > +	printf("%8s", "memset");
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		for (j = 0; j < nd; ++j)
> > +			memset(v[j], j, size);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	printf("\n");
> > +	printf("\n");
> > +
> > +	/* RAID table */
> > +	printf("RAID functions used for computing the parity:\n");
> > +	printf("%8s", "");
> > +	printf("%8s", "int8");
> > +	printf("%8s", "int32");
> > +	printf("%8s", "int64");
> > +#ifdef CONFIG_X86
> > +	printf("%8s", "sse2");
> > +#ifdef CONFIG_X86_64
> > +	printf("%8s", "sse2e");
> > +#endif
> > +	printf("%8s", "ssse3");
> > +#ifdef CONFIG_X86_64
> > +	printf("%8s", "ssse3e");
> > +#endif
> > +#endif
> > +	printf("\n");
> > +
> > +	/* PAR1 */
> > +	printf("%8s", "par1");
> > +	fflush(stdout);
> > +
> > +	printf("%8s", "");
> > +
> > +	SPEED_START {
> > +		raid_par1_int32(nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		raid_par1_int64(nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_sse2()) {
> > +		SPEED_START {
> > +			raid_par1_sse2(nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +		fflush(stdout);
> > +	}
> > +#endif
> > +	printf("\n");
> > +
> > +	/* PAR2 */
> > +	printf("%8s", "par2");
> > +	fflush(stdout);
> > +
> > +	printf("%8s", "");
> > +
> > +	SPEED_START {
> > +		raid_par2_int32(nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		raid_par2_int64(nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_sse2()) {
> > +		SPEED_START {
> > +			raid_par2_sse2(nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +		fflush(stdout);
> > +
> > +#ifdef CONFIG_X86_64
> > +		SPEED_START {
> > +			raid_par2_sse2ext(nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +		fflush(stdout);
> > +#endif
> > +	}
> > +#endif
> > +	printf("\n");
> > +
> > +	/* PAR3 */
> > +	printf("%8s", "par3");
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		raid_par3_int8(nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +	printf("%8s", "");
> > +	printf("%8s", "");
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_sse2()) {
> > +		printf("%8s", "");
> > +
> > +#ifdef CONFIG_X86_64
> > +		printf("%8s", "");
> > +#endif
> > +	}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		SPEED_START {
> > +			raid_par3_ssse3(nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +		fflush(stdout);
> > +
> > +#ifdef CONFIG_X86_64
> > +		SPEED_START {
> > +			raid_par3_ssse3ext(nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +		fflush(stdout);
> > +#endif
> > +	}
> > +#endif
> > +	printf("\n");
> > +
> > +	/* PAR4 */
> > +	printf("%8s", "par4");
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		raid_par4_int8(nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +	printf("%8s", "");
> > +	printf("%8s", "");
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_sse2()) {
> > +		printf("%8s", "");
> > +
> > +#ifdef CONFIG_X86_64
> > +		printf("%8s", "");
> > +#endif
> > +	}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		SPEED_START {
> > +			raid_par4_ssse3(nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +		fflush(stdout);
> > +
> > +#ifdef CONFIG_X86_64
> > +		SPEED_START {
> > +			raid_par4_ssse3ext(nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +		fflush(stdout);
> > +#endif
> > +	}
> > +#endif
> > +	printf("\n");
> > +
> > +	/* PAR5 */
> > +	printf("%8s", "par5");
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		raid_par5_int8(nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +	printf("%8s", "");
> > +	printf("%8s", "");
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_sse2()) {
> > +		printf("%8s", "");
> > +
> > +#ifdef CONFIG_X86_64
> > +		printf("%8s", "");
> > +#endif
> > +	}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		SPEED_START {
> > +			raid_par5_ssse3(nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +		fflush(stdout);
> > +
> > +#ifdef CONFIG_X86_64
> > +		SPEED_START {
> > +			raid_par5_ssse3ext(nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +		fflush(stdout);
> > +#endif
> > +	}
> > +#endif
> > +	printf("\n");
> > +
> > +	/* PAR6 */
> > +	printf("%8s", "par6");
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		raid_par6_int8(nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +	printf("%8s", "");
> > +	printf("%8s", "");
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_sse2()) {
> > +		printf("%8s", "");
> > +
> > +#ifdef CONFIG_X86_64
> > +		printf("%8s", "");
> > +#endif
> > +	}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		SPEED_START {
> > +			raid_par6_ssse3(nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +		fflush(stdout);
> > +
> > +#ifdef CONFIG_X86_64
> > +		SPEED_START {
> > +			raid_par6_ssse3ext(nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +		fflush(stdout);
> > +#endif
> > +	}
> > +#endif
> > +	printf("\n");
> > +	printf("\n");
> > +
> > +	/* recover table */
> > +	printf("RAID functions used for recovering:\n");
> > +	printf("%8s", "");
> > +	printf("%8s", "int8");
> > +#ifdef CONFIG_X86
> > +	printf("%8s", "ssse3");
> > +#endif
> > +	printf("\n");
> > +
> > +	printf("%8s", "rec1");
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		for (j = 0; j < nd; ++j)
> > +			/* +1 to avoid PAR1 optimized case */
> > +			raid_rec1_int8(1, id, ip + 1, nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		SPEED_START {
> > +			for (j = 0; j < nd; ++j)
> > +				/* +1 to avoid PAR1 optimized case */
> > +				raid_rec1_ssse3(1, id, ip + 1, nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +	}
> > +#endif
> > +	printf("\n");
> > +
> > +	printf("%8s", "rec2");
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		for (j = 0; j < nd; ++j)
> > +			/* +1 to avoid PAR2 optimized case */
> > +			raid_rec2_int8(2, id, ip + 1, nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		SPEED_START {
> > +			for (j = 0; j < nd; ++j)
> > +				/* +1 to avoid PAR2 optimized case */
> > +				raid_rec2_ssse3(2, id, ip + 1, nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +	}
> > +#endif
> > +	printf("\n");
> > +
> > +	printf("%8s", "rec3");
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		for (j = 0; j < nd; ++j)
> > +			raid_recX_int8(3, id, ip, nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		SPEED_START {
> > +			for (j = 0; j < nd; ++j)
> > +				raid_recX_ssse3(3, id, ip, nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +	}
> > +#endif
> > +	printf("\n");
> > +
> > +	printf("%8s", "rec4");
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		for (j = 0; j < nd; ++j)
> > +			raid_recX_int8(4, id, ip, nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		SPEED_START {
> > +			for (j = 0; j < nd; ++j)
> > +				raid_recX_ssse3(4, id, ip, nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +	}
> > +#endif
> > +	printf("\n");
> > +
> > +	printf("%8s", "rec5");
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		for (j = 0; j < nd; ++j)
> > +			raid_recX_int8(5, id, ip, nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		SPEED_START {
> > +			for (j = 0; j < nd; ++j)
> > +				raid_recX_ssse3(5, id, ip, nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +	}
> > +#endif
> > +	printf("\n");
> > +
> > +	printf("%8s", "rec6");
> > +	fflush(stdout);
> > +
> > +	SPEED_START {
> > +		for (j = 0; j < nd; ++j)
> > +			raid_recX_int8(6, id, ip, nd, size, v);
> > +	} SPEED_STOP
> > +
> > +	printf("%8"PRIu64, ds / dt);
> > +	fflush(stdout);
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		SPEED_START {
> > +			for (j = 0; j < nd; ++j)
> > +				raid_recX_ssse3(6, id, ip, nd, size, v);
> > +		} SPEED_STOP
> > +
> > +		printf("%8"PRIu64, ds / dt);
> > +	}
> > +#endif
> > +	printf("\n");
> > +	printf("\n");
> > +
> > +	free(v_alloc);
> > +	free(v);
> > +}
> > +
> > +int main(void)
> > +{
> > +	raid_init();
> > +
> > +	printf("RAID Cauchy speed test\n\n");
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_sse2())
> > +		printf("Including x86 SSE2 functions\n");
> > +	if (raid_cpu_has_ssse3())
> > +		printf("Including x86 SSSE3 functions\n");
> > +#endif
> > +#ifdef CONFIG_X86_64
> > +	printf("Including x64 extended SSE register set\n");
> > +#endif
> > +
> > +	printf("\nPlease wait about 30 seconds...\n\n");
> > +
> > +	speed();
> > +
> > +	return 0;
> > +}
> > +
> > diff --git a/lib/raid/test/test.c b/lib/raid/test/test.c
> > new file mode 100644
> > index 0000000..69deacb
> > --- /dev/null
> > +++ b/lib/raid/test/test.c
> > @@ -0,0 +1,314 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include "internal.h"
> > +#include "cpu.h"
> > +#include "combo.h"
> > +#include "memory.h"
> > +
> > +/**
> > + * Binomial coefficient of n over r.
> > + */
> > +static int ibc(int n, int r)
> > +{
> > +	if (r == 0 || n == r)
> > +		return 1;
> > +	else
> > +		return ibc(n - 1, r - 1) + ibc(n - 1, r);
> > +}
> > +
> > +/**
> > + * Power n ^ r;
> > + */
> > +static int ipow(int n, int r)
> > +{
> > +	int v = 1;
> > +	while (r) {
> > +		v *= n;
> > +		--r;
> > +	}
> > +	return v;
> > +}
> > +
> > +int raid_test_combo(void)
> > +{
> > +	int r;
> > +	int count;
> > +	int p[RAID_PARITY_MAX];
> > +
> > +	for (r = 1; r <= RAID_PARITY_MAX; ++r) {
> > +		/* count combination (r of RAID_PARITY_MAX) elements */
> > +		count = 0;
> > +		combination_first(r, RAID_PARITY_MAX, p);
> > +
> > +		do {
> > +			++count;
> > +		} while (combination_next(r, RAID_PARITY_MAX, p));
> > +
> > +		if (count != ibc(RAID_PARITY_MAX, r))
> > +			return -1;
> > +	}
> > +
> > +	for (r = 1; r <= RAID_PARITY_MAX; ++r) {
> > +		/* count permutation (r of RAID_PARITY_MAX) elements */
> > +		count = 0;
> > +		permutation_first(r, RAID_PARITY_MAX, p);
> > +
> > +		do {
> > +			++count;
> > +		} while (permutation_next(r, RAID_PARITY_MAX, p));
> > +
> > +		if (count != ipow(RAID_PARITY_MAX, r))
> > +			return -1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int raid_test_sort(void)
> > +{
> > +	int p[RAID_PARITY_MAX];
> > +	int r;
> > +
> > +	for (r = 1; r <= RAID_PARITY_MAX; ++r) {
> > +		permutation_first(r, RAID_PARITY_MAX, p);
> > +		do {
> > +			int i[RAID_PARITY_MAX];
> > +			int j;
> > +
> > +			/* make a copy */
> > +			for (j = 0; j < r; ++j)
> > +				i[j] = p[j];
> > +
> > +			raid_sort(r, i);
> > +
> > +			/* check order */
> > +			for (j = 1; j < r; ++j)
> > +				if (i[j-1] > i[j])
> > +					return -1;
> > +		} while (permutation_next(r, RAID_PARITY_MAX, p));
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int raid_test_rec(int nd, size_t size)
> > +{
> > +	void *v_alloc;
> > +	void **v;
> > +	void **data;
> > +	void **parity;
> > +	void **test;
> > +	void *data_save[RAID_PARITY_MAX];
> > +	void *parity_save[RAID_PARITY_MAX];
> > +	void *waste;
> > +	int nv;
> > +	int id[RAID_PARITY_MAX];
> > +	int ip[RAID_PARITY_MAX];
> > +	int i;
> > +	int j;
> > +	int nr;
> > +	void (*f[RAID_PARITY_MAX][4])(
> > +		int nr, int *id, int *ip, int nd, size_t size, void **vbuf);
> > +	int nf[RAID_PARITY_MAX];
> > +	int np;
> > +
> > +	np = RAID_PARITY_MAX;
> > +
> > +	nv = nd + np * 2 + 1;
> > +
> > +	v = raid_malloc_vector(nd, nv, size, &v_alloc);
> > +	if (!v)
> > +		return -1;
> > +
> > +	data = v;
> > +	parity = v + nd;
> > +	test = v + nd + np;
> > +
> > +	for (i = 0; i < np; ++i)
> > +		parity_save[i] = parity[i];
> > +
> > +	waste = v[nv-1];
> > +
> > +	/* fill data disk with random */
> > +	raid_mrand_vector(nd, size, v);
> > +
> > +	/* setup recov functions */
> > +	for (i = 0; i < np; ++i) {
> > +		nf[i] = 0;
> > +		if (i == 0) {
> > +			f[i][nf[i]++] = raid_rec1_int8;
> > +#ifdef CONFIG_X86
> > +			if (raid_cpu_has_ssse3())
> > +				f[i][nf[i]++] = raid_rec1_ssse3;
> > +#endif
> > +		} else if (i == 1) {
> > +			f[i][nf[i]++] = raid_rec2_int8;
> > +#ifdef CONFIG_X86
> > +			if (raid_cpu_has_ssse3())
> > +				f[i][nf[i]++] = raid_rec2_ssse3;
> > +#endif
> > +		} else {
> > +			f[i][nf[i]++] = raid_recX_int8;
> > +#ifdef CONFIG_X86
> > +			if (raid_cpu_has_ssse3())
> > +				f[i][nf[i]++] = raid_recX_ssse3;
> > +#endif
> > +		}
> > +	}
> > +
> > +	/* compute the parity */
> > +	raid_par_ref(nd, np, size, v);
> > +
> > +	/* set all the parity to the waste v */
> > +	for (i = 0; i < np; ++i)
> > +		parity[i] = waste;
> > +
> > +	/* all parity levels */
> > +	for (nr = 1; nr <= np; ++nr) {
> > +		/* all combinations (nr of nd) disks */
> > +		combination_first(nr, nd, id);
> > +		do {
> > +			/* all combinations (nr of np) parities */
> > +			combination_first(nr, np, ip);
> > +			do {
> > +				/* for each recover function */
> > +				for (j = 0; j < nf[nr-1]; ++j) {
> > +					/* set */
> > +					for (i = 0; i < nr; ++i) {
> > +						/* remove the missing data */
> > +						data_save[i] = data[id[i]];
> > +						data[id[i]] = test[i];
> > +						/* set the parity to use */
> > +						parity[ip[i]] = parity_save[ip[i]];
> > +					}
> > +
> > +					/* recover */
> > +					f[nr-1][j](nr, id, ip, nd, size, v);
> > +
> > +					/* check */
> > +					for (i = 0; i < nr; ++i)
> > +						if (memcmp(test[i], data_save[i], size) != 0)
> > +							goto bail;
> > +
> > +					/* restore */
> > +					for (i = 0; i < nr; ++i) {
> > +						/* restore the data */
> > +						data[id[i]] = data_save[i];
> > +						/* restore the parity */
> > +						parity[ip[i]] = waste;
> > +					}
> > +				}
> > +			} while (combination_next(nr, np, ip));
> > +		} while (combination_next(nr, nd, id));
> > +	}
> > +
> > +	free(v_alloc);
> > +	free(v);
> > +	return 0;
> > +
> > +bail:
> > +	free(v_alloc);
> > +	free(v);
> > +	return -1;
> > +}
> > +
> > +int raid_test_par(int nd, size_t size)
> > +{
> > +	void *v_alloc;
> > +	void **v;
> > +	int nv;
> > +	int i, j;
> > +	void (*f[64])(int nd, size_t size, void **vbuf);
> > +	int nf;
> > +	int np;
> > +
> > +	np = RAID_PARITY_MAX;
> > +
> > +	nv = nd + np * 2;
> > +
> > +	v = raid_malloc_vector(nd, nv, size, &v_alloc);
> > +	if (!v)
> > +		return -1;
> > +
> > +	/* fill with random */
> > +	raid_mrand_vector(nv, size, v);
> > +
> > +	/* compute the parity */
> > +	raid_par_ref(nd, np, size, v);
> > +
> > +	/* copy in back buffers */
> > +	for (i = 0; i < np; ++i)
> > +		memcpy(v[nd + np + i], v[nd + i], size);
> > +
> > +	/* load all the available functions */
> > +	nf = 0;
> > +
> > +	f[nf++] = raid_par1_xorblocks;
> > +	f[nf++] = raid_par1_int32;
> > +	f[nf++] = raid_par1_int64;
> > +	f[nf++] = raid_par2_int32;
> > +	f[nf++] = raid_par2_int64;
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_sse2()) {
> > +		f[nf++] = raid_par1_sse2;
> > +		f[nf++] = raid_par2_sse2;
> > +#ifdef CONFIG_X86_64
> > +		f[nf++] = raid_par2_sse2ext;
> > +#endif
> > +	}
> > +#endif
> > +
> > +	f[nf++] = raid_par3_int8;
> > +	f[nf++] = raid_par4_int8;
> > +	f[nf++] = raid_par5_int8;
> > +	f[nf++] = raid_par6_int8;
> > +
> > +#ifdef CONFIG_X86
> > +	if (raid_cpu_has_ssse3()) {
> > +		f[nf++] = raid_par3_ssse3;
> > +		f[nf++] = raid_par4_ssse3;
> > +		f[nf++] = raid_par5_ssse3;
> > +		f[nf++] = raid_par6_ssse3;
> > +#ifdef CONFIG_X86_64
> > +		f[nf++] = raid_par3_ssse3ext;
> > +		f[nf++] = raid_par4_ssse3ext;
> > +		f[nf++] = raid_par5_ssse3ext;
> > +		f[nf++] = raid_par6_ssse3ext;
> > +#endif
> > +	}
> > +#endif
> > +
> > +	/* check all the functions */
> > +	for (j = 0; j < nf; ++j) {
> > +		/* compute parity */
> > +		f[j](nd, size, v);
> > +
> > +		/* check it */
> > +		for (i = 0; i < np; ++i)
> > +			if (memcmp(v[nd + np + i], v[nd + i], size) != 0)
> > +				goto bail;
> > +	}
> > +
> > +	free(v_alloc);
> > +	free(v);
> > +	return 0;
> > +
> > +bail:
> > +	free(v_alloc);
> > +	free(v);
> > +	return -1;
> > +}
> > +
> > diff --git a/lib/raid/test/test.h b/lib/raid/test/test.h
> > new file mode 100644
> > index 0000000..67684fe
> > --- /dev/null
> > +++ b/lib/raid/test/test.h
> > @@ -0,0 +1,59 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#ifndef __RAID_TEST_H
> > +#define __RAID_TEST_H
> > +
> > +/**
> > + * Tests sorting functions.
> > + *
> > + * Test raid_sort() with all the possible combinations of elements to sort.
> > + *
> > + * Returns 0 on success.
> > + */
> > +int raid_test_sort(void);
> > +
> > +/**
> > + * Tests combination functions.
> > + *
> > + * Tests combination_first() and combination_next() for all the parity levels.
> > + *
> > + * Returns 0 on success.
> > + */
> > +int raid_test_combo(void);
> > +
> > +/**
> > + * Tests recovering functions.
> > + *
> > + * All the recovering functions are tested with all the combinations
> > + * of failing disks and recovering parities.
> > + *
> > + * Take care that the test time grows exponentially with the number of disks.
> > + *
> > + * Returns 0 on success.
> > + */
> > +int raid_test_rec(int nd, size_t size);
> > +
> > +/**
> > + * Tests parity generation functions.
> > + *
> > + * All the parity generation functions are tested with the specified
> > + * number of disks.
> > + *
> > + * Returns 0 on success.
> > + */
> > +int raid_test_par(int nd, size_t size);
> > +
> > +#endif
> > +
> > diff --git a/lib/raid/test/usermode.h b/lib/raid/test/usermode.h
> > new file mode 100644
> > index 0000000..0f73cec
> > --- /dev/null
> > +++ b/lib/raid/test/usermode.h
> > @@ -0,0 +1,83 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#ifndef __RAID_USERMODE_H
> > +#define __RAID_USERMODE_H
> > +
> > +/*
> > + * Compatibility layer for user mode applications.
> > + */
> > +#include <stdlib.h>
> > +#include <stdint.h>
> > +#include <assert.h>
> > +#include <string.h>
> > +#include <malloc.h>
> > +#include <errno.h>
> > +#include <sys/time.h>
> > +
> > +#define pr_err printf
> > +#define pr_info printf
> > +#define __aligned(a) __attribute__((aligned(a)))
> > +#define PAGE_SIZE 4096
> > +#define EXPORT_SYMBOL_GPL(a) int dummy_##a
> > +#define EXPORT_SYMBOL(a) int dummy_##a
> > +#if defined(__i386__)
> > +#define CONFIG_X86 1
> > +#define CONFIG_X86_32 1
> > +#endif
> > +#if defined(__x86_64__)
> > +#define CONFIG_X86 1
> > +#define CONFIG_X86_64 1
> > +#endif
> > +#define BUG_ON(a) assert(!(a))
> > +#define MAX_XOR_BLOCKS 1
> > +void xor_blocks(unsigned int count, unsigned int bytes, void *dest, void **srcs);
> > +#define GFP_KERNEL 0
> > +#define get_order(x) 5
> > +#define __get_free_pages(x, order) memalign(PAGE_SIZE, PAGE_SIZE * (1 << order))
> > +#define free_pages(x, order) free((void *)x)
> > +#define preempt_disable() do { } while (0)
> > +#define preempt_enable() do { } while (0)
> > +#define cpu_relax() do { } while (0)
> > +#define HZ 1000
> > +#define jiffies get_jiffies()
> > +static inline unsigned long get_jiffies(void)
> > +{
> > +	struct timeval t;
> > +	gettimeofday(&t, 0);
> > +	return t.tv_sec * 1000 + t.tv_usec / 1000;
> > +}
> > +#define time_before(x, y) ((x) < (y))
> > +
> > +#ifdef CONFIG_X86
> > +#define X86_FEATURE_XMM2 (0*32+26)
> > +#define X86_FEATURE_SSSE3 (4*32+9)
> > +#define X86_FEATURE_AVX (4*32+28)
> > +#define X86_FEATURE_AVX2 (9*32+5)
> > +
> > +static inline int boot_cpu_has(int flag)
> > +{
> > +	uint32_t eax, ebx, ecx, edx;
> > +
> > +	eax = (flag & 0x100) ? 7 : (flag & 0x20) ? 0x80000001 : 1;
> > +	ecx = 0;
> > +
> > +	asm volatile("cpuid" : "+a" (eax), "=b" (ebx), "=d" (edx), "+c" (ecx));
> > +
> > +	return ((flag & 0x100 ? ebx : (flag & 0x80) ? ecx : edx) >> (flag & 31)) & 1;
> > +}
> > +#endif /* CONFIG_X86 */
> > +
> > +#endif
> > +
> > diff --git a/lib/raid/test/xor.c b/lib/raid/test/xor.c
> > new file mode 100644
> > index 0000000..2d68636
> > --- /dev/null
> > +++ b/lib/raid/test/xor.c
> > @@ -0,0 +1,41 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include "internal.h"
> > +
> > +/**
> > + * Implementation of the kernel xor_blocks().
> > + */
> > +void xor_blocks(unsigned int count, unsigned int bytes, void *dest, void **srcs)
> > +{
> > +	uint32_t *p1 = dest;
> > +	uint32_t *p2 = srcs[0];
> > +	long lines = bytes / (sizeof(uint32_t)) / 8;
> > +
> > +	BUG_ON(count != 1);
> > +
> > +	do {
> > +		p1[0] ^= p2[0];
> > +		p1[1] ^= p2[1];
> > +		p1[2] ^= p2[2];
> > +		p1[3] ^= p2[3];
> > +		p1[4] ^= p2[4];
> > +		p1[5] ^= p2[5];
> > +		p1[6] ^= p2[6];
> > +		p1[7] ^= p2[7];
> > +		p1 += 8;
> > +		p2 += 8;
> > +	} while (--lines > 0);
> > +}
> > +
> > diff --git a/lib/raid/x86.c b/lib/raid/x86.c
> > new file mode 100644
> > index 0000000..fac8a67
> > --- /dev/null
> > +++ b/lib/raid/x86.c
> > @@ -0,0 +1,1569 @@
> > +/*
> > + * Copyright (C) 2013 Andrea Mazzoleni
> > + *
> > + * This program is free software: you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation, either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include "internal.h"
> > +#include "gf.h"
> > +
> > +#ifdef CONFIG_X86
> > +/*
> > + * PAR1 (RAID5 with xor) SSE2 implementation
> > + *
> > + * Note that we don't have the corresponding x64 sse2ext function using more
> > + * registers because processing a block of 64 bytes already fills
> > + * the typical cache block, and processing 128 bytes doesn't increase
> > + * performance.
> > + */
> > +void raid_par1_sse2(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +
> > +	asm_begin();
> > +
> > +	for (i = 0; i < size; i += 64) {
> > +		asm volatile("movdqa %0,%%xmm0" : : "m" (v[l][i]));
> > +		asm volatile("movdqa %0,%%xmm1" : : "m" (v[l][i+16]));
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (v[l][i+32]));
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (v[l][i+48]));
> > +		for (d = l-1; d >= 0; --d) {
> > +			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
> > +			asm volatile("movdqa %0,%%xmm5" : : "m" (v[d][i+16]));
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (v[d][i+32]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (v[d][i+48]));
> > +			asm volatile("pxor %xmm4,%xmm0");
> > +			asm volatile("pxor %xmm5,%xmm1");
> > +			asm volatile("pxor %xmm6,%xmm2");
> > +			asm volatile("pxor %xmm7,%xmm3");
> > +		}
> > +		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
> > +		asm volatile("movntdq %%xmm1,%0" : "=m" (p[i+16]));
> > +		asm volatile("movntdq %%xmm2,%0" : "=m" (p[i+32]));
> > +		asm volatile("movntdq %%xmm3,%0" : "=m" (p[i+48]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +static const struct gfconst16 {
> > +	uint8_t poly[16];
> > +	uint8_t low4[16];
> > +} gfconst16  __aligned(32) = {
> > +	{ 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d },
> > +	{ 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f },
> > +};
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +/*
> > + * PAR2 (RAID6 with powers of 2) SSE2 implementation
> > + */
> > +void raid_par2_sse2(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +
> > +	asm_begin();
> > +
> > +	asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
> > +
> > +	for (i = 0; i < size; i += 32) {
> > +		asm volatile("movdqa %0,%%xmm0" : : "m" (v[l][i]));
> > +		asm volatile("movdqa %0,%%xmm1" : : "m" (v[l][i+16]));
> > +		asm volatile("movdqa %xmm0,%xmm2");
> > +		asm volatile("movdqa %xmm1,%xmm3");
> > +		for (d = l-1; d >= 0; --d) {
> > +			asm volatile("pxor %xmm4,%xmm4");
> > +			asm volatile("pxor %xmm5,%xmm5");
> > +			asm volatile("pcmpgtb %xmm2,%xmm4");
> > +			asm volatile("pcmpgtb %xmm3,%xmm5");
> > +			asm volatile("paddb %xmm2,%xmm2");
> > +			asm volatile("paddb %xmm3,%xmm3");
> > +			asm volatile("pand %xmm7,%xmm4");
> > +			asm volatile("pand %xmm7,%xmm5");
> > +			asm volatile("pxor %xmm4,%xmm2");
> > +			asm volatile("pxor %xmm5,%xmm3");
> > +
> > +			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
> > +			asm volatile("movdqa %0,%%xmm5" : : "m" (v[d][i+16]));
> > +			asm volatile("pxor %xmm4,%xmm0");
> > +			asm volatile("pxor %xmm5,%xmm1");
> > +			asm volatile("pxor %xmm4,%xmm2");
> > +			asm volatile("pxor %xmm5,%xmm3");
> > +		}
> > +		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
> > +		asm volatile("movntdq %%xmm1,%0" : "=m" (p[i+16]));
> > +		asm volatile("movntdq %%xmm2,%0" : "=m" (q[i]));
> > +		asm volatile("movntdq %%xmm3,%0" : "=m" (q[i+16]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86_64
> > +/*
> > + * PAR2 (RAID6 with powers of 2) SSE2 implementation
> > + *
> > + * Note that it uses 16 registers, meaning that x64 is required.
> > + */
> > +void raid_par2_sse2ext(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +
> > +	asm_begin();
> > +
> > +	asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.poly[0]));
> > +
> > +	for (i = 0; i < size; i += 64) {
> > +		asm volatile("movdqa %0,%%xmm0" : : "m" (v[l][i]));
> > +		asm volatile("movdqa %0,%%xmm1" : : "m" (v[l][i+16]));
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (v[l][i+32]));
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (v[l][i+48]));
> > +		asm volatile("movdqa %xmm0,%xmm4");
> > +		asm volatile("movdqa %xmm1,%xmm5");
> > +		asm volatile("movdqa %xmm2,%xmm6");
> > +		asm volatile("movdqa %xmm3,%xmm7");
> > +		for (d = l-1; d >= 0; --d) {
> > +			asm volatile("pxor %xmm8,%xmm8");
> > +			asm volatile("pxor %xmm9,%xmm9");
> > +			asm volatile("pxor %xmm10,%xmm10");
> > +			asm volatile("pxor %xmm11,%xmm11");
> > +			asm volatile("pcmpgtb %xmm4,%xmm8");
> > +			asm volatile("pcmpgtb %xmm5,%xmm9");
> > +			asm volatile("pcmpgtb %xmm6,%xmm10");
> > +			asm volatile("pcmpgtb %xmm7,%xmm11");
> > +			asm volatile("paddb %xmm4,%xmm4");
> > +			asm volatile("paddb %xmm5,%xmm5");
> > +			asm volatile("paddb %xmm6,%xmm6");
> > +			asm volatile("paddb %xmm7,%xmm7");
> > +			asm volatile("pand %xmm15,%xmm8");
> > +			asm volatile("pand %xmm15,%xmm9");
> > +			asm volatile("pand %xmm15,%xmm10");
> > +			asm volatile("pand %xmm15,%xmm11");
> > +			asm volatile("pxor %xmm8,%xmm4");
> > +			asm volatile("pxor %xmm9,%xmm5");
> > +			asm volatile("pxor %xmm10,%xmm6");
> > +			asm volatile("pxor %xmm11,%xmm7");
> > +
> > +			asm volatile("movdqa %0,%%xmm8" : : "m" (v[d][i]));
> > +			asm volatile("movdqa %0,%%xmm9" : : "m" (v[d][i+16]));
> > +			asm volatile("movdqa %0,%%xmm10" : : "m" (v[d][i+32]));
> > +			asm volatile("movdqa %0,%%xmm11" : : "m" (v[d][i+48]));
> > +			asm volatile("pxor %xmm8,%xmm0");
> > +			asm volatile("pxor %xmm9,%xmm1");
> > +			asm volatile("pxor %xmm10,%xmm2");
> > +			asm volatile("pxor %xmm11,%xmm3");
> > +			asm volatile("pxor %xmm8,%xmm4");
> > +			asm volatile("pxor %xmm9,%xmm5");
> > +			asm volatile("pxor %xmm10,%xmm6");
> > +			asm volatile("pxor %xmm11,%xmm7");
> > +		}
> > +		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
> > +		asm volatile("movntdq %%xmm1,%0" : "=m" (p[i+16]));
> > +		asm volatile("movntdq %%xmm2,%0" : "=m" (p[i+32]));
> > +		asm volatile("movntdq %%xmm3,%0" : "=m" (p[i+48]));
> > +		asm volatile("movntdq %%xmm4,%0" : "=m" (q[i]));
> > +		asm volatile("movntdq %%xmm5,%0" : "=m" (q[i+16]));
> > +		asm volatile("movntdq %%xmm6,%0" : "=m" (q[i+32]));
> > +		asm volatile("movntdq %%xmm7,%0" : "=m" (q[i+48]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +/*
> > + * PAR3 (triple parity with Cauchy matrix) SSSE3 implementation
> > + */
> > +void raid_par3_ssse3(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +
> > +	/* special case with only one data disk */
> > +	if (l == 0) {
> > +		for (i = 0; i < 3; ++i)
> > +			memcpy(v[1+i], v[0], size);
> > +		return;
> > +	}
> > +
> > +	asm_begin();
> > +
> > +	/* generic case with at least two data disks */
> > +	asm volatile("movdqa %0,%%xmm3" : : "m" (gfconst16.poly[0]));
> > +	asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
> > +
> > +	for (i = 0; i < size; i += 16) {
> > +		/* last disk without the by two multiplication */
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
> > +
> > +		asm volatile("movdqa %xmm4,%xmm0");
> > +		asm volatile("movdqa %xmm4,%xmm1");
> > +
> > +		asm volatile("movdqa %xmm4,%xmm5");
> > +		asm volatile("psrlw  $4,%xmm5");
> > +		asm volatile("pand   %xmm7,%xmm4");
> > +		asm volatile("pand   %xmm7,%xmm5");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
> > +		asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[l][0][1][0]));
> > +		asm volatile("pshufb %xmm4,%xmm2");
> > +		asm volatile("pshufb %xmm5,%xmm6");
> > +		asm volatile("pxor   %xmm6,%xmm2");
> > +
> > +		/* intermediate disks */
> > +		for (d = l-1; d > 0; --d) {
> > +			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
> > +
> > +			asm volatile("pxor %xmm5,%xmm5");
> > +			asm volatile("pcmpgtb %xmm1,%xmm5");
> > +			asm volatile("paddb %xmm1,%xmm1");
> > +			asm volatile("pand %xmm3,%xmm5");
> > +			asm volatile("pxor %xmm5,%xmm1");
> > +
> > +			asm volatile("pxor %xmm4,%xmm0");
> > +			asm volatile("pxor %xmm4,%xmm1");
> > +
> > +			asm volatile("movdqa %xmm4,%xmm5");
> > +			asm volatile("psrlw  $4,%xmm5");
> > +			asm volatile("pand   %xmm7,%xmm4");
> > +			asm volatile("pand   %xmm7,%xmm5");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pxor   %xmm6,%xmm2");
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][1][0]));
> > +			asm volatile("pshufb %xmm5,%xmm6");
> > +			asm volatile("pxor   %xmm6,%xmm2");
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
> > +
> > +		asm volatile("pxor %xmm5,%xmm5");
> > +		asm volatile("pcmpgtb %xmm1,%xmm5");
> > +		asm volatile("paddb %xmm1,%xmm1");
> > +		asm volatile("pand %xmm3,%xmm5");
> > +		asm volatile("pxor %xmm5,%xmm1");
> > +
> > +		asm volatile("pxor %xmm4,%xmm0");
> > +		asm volatile("pxor %xmm4,%xmm1");
> > +		asm volatile("pxor %xmm4,%xmm2");
> > +
> > +		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
> > +		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
> > +		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86_64
> > +/*
> > + * PAR3 (triple parity with Cauchy matrix) SSSE3 implementation
> > + *
> > + * Note that it uses 16 registers, meaning that x64 is required.
> > + */
> > +void raid_par3_ssse3ext(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +
> > +	/* special case with only one data disk */
> > +	if (l == 0) {
> > +		for (i = 0; i < 3; ++i)
> > +			memcpy(v[1+i], v[0], size);
> > +		return;
> > +	}
> > +
> > +	asm_begin();
> > +
> > +	/* generic case with at least two data disks */
> > +	asm volatile("movdqa %0,%%xmm3" : : "m" (gfconst16.poly[0]));
> > +	asm volatile("movdqa %0,%%xmm11" : : "m" (gfconst16.low4[0]));
> > +
> > +	for (i = 0; i < size; i += 32) {
> > +		/* last disk without the by two multiplication */
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
> > +		asm volatile("movdqa %0,%%xmm12" : : "m" (v[l][i+16]));
> > +
> > +		asm volatile("movdqa %xmm4,%xmm0");
> > +		asm volatile("movdqa %xmm4,%xmm1");
> > +		asm volatile("movdqa %xmm12,%xmm8");
> > +		asm volatile("movdqa %xmm12,%xmm9");
> > +
> > +		asm volatile("movdqa %xmm4,%xmm5");
> > +		asm volatile("movdqa %xmm12,%xmm13");
> > +		asm volatile("psrlw  $4,%xmm5");
> > +		asm volatile("psrlw  $4,%xmm13");
> > +		asm volatile("pand   %xmm11,%xmm4");
> > +		asm volatile("pand   %xmm11,%xmm12");
> > +		asm volatile("pand   %xmm11,%xmm5");
> > +		asm volatile("pand   %xmm11,%xmm13");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][0][1][0]));
> > +		asm volatile("movdqa %xmm2,%xmm10");
> > +		asm volatile("movdqa %xmm7,%xmm15");
> > +		asm volatile("pshufb %xmm4,%xmm2");
> > +		asm volatile("pshufb %xmm12,%xmm10");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pshufb %xmm13,%xmm15");
> > +		asm volatile("pxor   %xmm7,%xmm2");
> > +		asm volatile("pxor   %xmm15,%xmm10");
> > +
> > +		/* intermediate disks */
> > +		for (d = l-1; d > 0; --d) {
> > +			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
> > +			asm volatile("movdqa %0,%%xmm12" : : "m" (v[d][i+16]));
> > +
> > +			asm volatile("pxor %xmm5,%xmm5");
> > +			asm volatile("pxor %xmm13,%xmm13");
> > +			asm volatile("pcmpgtb %xmm1,%xmm5");
> > +			asm volatile("pcmpgtb %xmm9,%xmm13");
> > +			asm volatile("paddb %xmm1,%xmm1");
> > +			asm volatile("paddb %xmm9,%xmm9");
> > +			asm volatile("pand %xmm3,%xmm5");
> > +			asm volatile("pand %xmm3,%xmm13");
> > +			asm volatile("pxor %xmm5,%xmm1");
> > +			asm volatile("pxor %xmm13,%xmm9");
> > +
> > +			asm volatile("pxor %xmm4,%xmm0");
> > +			asm volatile("pxor %xmm4,%xmm1");
> > +			asm volatile("pxor %xmm12,%xmm8");
> > +			asm volatile("pxor %xmm12,%xmm9");
> > +
> > +			asm volatile("movdqa %xmm4,%xmm5");
> > +			asm volatile("movdqa %xmm12,%xmm13");
> > +			asm volatile("psrlw  $4,%xmm5");
> > +			asm volatile("psrlw  $4,%xmm13");
> > +			asm volatile("pand   %xmm11,%xmm4");
> > +			asm volatile("pand   %xmm11,%xmm12");
> > +			asm volatile("pand   %xmm11,%xmm5");
> > +			asm volatile("pand   %xmm11,%xmm13");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][0][1][0]));
> > +			asm volatile("movdqa %xmm6,%xmm14");
> > +			asm volatile("movdqa %xmm7,%xmm15");
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm12,%xmm14");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pshufb %xmm13,%xmm15");
> > +			asm volatile("pxor   %xmm6,%xmm2");
> > +			asm volatile("pxor   %xmm14,%xmm10");
> > +			asm volatile("pxor   %xmm7,%xmm2");
> > +			asm volatile("pxor   %xmm15,%xmm10");
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
> > +		asm volatile("movdqa %0,%%xmm12" : : "m" (v[0][i+16]));
> > +
> > +		asm volatile("pxor %xmm5,%xmm5");
> > +		asm volatile("pxor %xmm13,%xmm13");
> > +		asm volatile("pcmpgtb %xmm1,%xmm5");
> > +		asm volatile("pcmpgtb %xmm9,%xmm13");
> > +		asm volatile("paddb %xmm1,%xmm1");
> > +		asm volatile("paddb %xmm9,%xmm9");
> > +		asm volatile("pand %xmm3,%xmm5");
> > +		asm volatile("pand %xmm3,%xmm13");
> > +		asm volatile("pxor %xmm5,%xmm1");
> > +		asm volatile("pxor %xmm13,%xmm9");
> > +
> > +		asm volatile("pxor %xmm4,%xmm0");
> > +		asm volatile("pxor %xmm4,%xmm1");
> > +		asm volatile("pxor %xmm4,%xmm2");
> > +		asm volatile("pxor %xmm12,%xmm8");
> > +		asm volatile("pxor %xmm12,%xmm9");
> > +		asm volatile("pxor %xmm12,%xmm10");
> > +
> > +		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
> > +		asm volatile("movntdq %%xmm8,%0" : "=m" (p[i+16]));
> > +		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
> > +		asm volatile("movntdq %%xmm9,%0" : "=m" (q[i+16]));
> > +		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
> > +		asm volatile("movntdq %%xmm10,%0" : "=m" (r[i+16]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +/*
> > + * PAR4 (quad parity with Cauchy matrix) SSSE3 implementation
> > + */
> > +void raid_par4_ssse3(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	uint8_t *s;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +	s = v[nd+3];
> > +
> > +	/* special case with only one data disk */
> > +	if (l == 0) {
> > +		for (i = 0; i < 4; ++i)
> > +			memcpy(v[1+i], v[0], size);
> > +		return;
> > +	}
> > +
> > +	asm_begin();
> > +
> > +	/* generic case with at least two data disks */
> > +	for (i = 0; i < size; i += 16) {
> > +		/* last disk without the by two multiplication */
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
> > +
> > +		asm volatile("movdqa %xmm4,%xmm0");
> > +		asm volatile("movdqa %xmm4,%xmm1");
> > +
> > +		asm volatile("movdqa %xmm4,%xmm5");
> > +		asm volatile("psrlw  $4,%xmm5");
> > +		asm volatile("pand   %xmm7,%xmm4");
> > +		asm volatile("pand   %xmm7,%xmm5");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][0][1][0]));
> > +		asm volatile("pshufb %xmm4,%xmm2");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pxor   %xmm7,%xmm2");
> > +
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][1][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][1][1][0]));
> > +		asm volatile("pshufb %xmm4,%xmm3");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pxor   %xmm7,%xmm3");
> > +
> > +		/* intermediate disks */
> > +		for (d = l-1; d > 0; --d) {
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
> > +			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
> > +
> > +			asm volatile("pxor %xmm5,%xmm5");
> > +			asm volatile("pcmpgtb %xmm1,%xmm5");
> > +			asm volatile("paddb %xmm1,%xmm1");
> > +			asm volatile("pand %xmm7,%xmm5");
> > +			asm volatile("pxor %xmm5,%xmm1");
> > +
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
> > +
> > +			asm volatile("pxor %xmm4,%xmm0");
> > +			asm volatile("pxor %xmm4,%xmm1");
> > +
> > +			asm volatile("movdqa %xmm4,%xmm5");
> > +			asm volatile("psrlw  $4,%xmm5");
> > +			asm volatile("pand   %xmm7,%xmm4");
> > +			asm volatile("pand   %xmm7,%xmm5");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][0][1][0]));
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pxor   %xmm6,%xmm2");
> > +			asm volatile("pxor   %xmm7,%xmm2");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][1][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][1][1][0]));
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pxor   %xmm6,%xmm3");
> > +			asm volatile("pxor   %xmm7,%xmm3");
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
> > +
> > +		asm volatile("pxor %xmm5,%xmm5");
> > +		asm volatile("pcmpgtb %xmm1,%xmm5");
> > +		asm volatile("paddb %xmm1,%xmm1");
> > +		asm volatile("pand %xmm7,%xmm5");
> > +		asm volatile("pxor %xmm5,%xmm1");
> > +
> > +		asm volatile("pxor %xmm4,%xmm0");
> > +		asm volatile("pxor %xmm4,%xmm1");
> > +		asm volatile("pxor %xmm4,%xmm2");
> > +		asm volatile("pxor %xmm4,%xmm3");
> > +
> > +		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
> > +		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
> > +		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
> > +		asm volatile("movntdq %%xmm3,%0" : "=m" (s[i]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86_64
> > +/*
> > + * PAR4 (quad parity with Cauchy matrix) SSSE3 implementation
> > + *
> > + * Note that it uses 16 registers, meaning that x64 is required.
> > + */
> > +void raid_par4_ssse3ext(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	uint8_t *s;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +	s = v[nd+3];
> > +
> > +	/* special case with only one data disk */
> > +	if (l == 0) {
> > +		for (i = 0; i < 4; ++i)
> > +			memcpy(v[1+i], v[0], size);
> > +		return;
> > +	}
> > +
> > +	asm_begin();
> > +
> > +	/* generic case with at least two data disks */
> > +	for (i = 0; i < size; i += 32) {
> > +		/* last disk without the by two multiplication */
> > +		asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.low4[0]));
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
> > +		asm volatile("movdqa %0,%%xmm12" : : "m" (v[l][i+16]));
> > +
> > +		asm volatile("movdqa %xmm4,%xmm0");
> > +		asm volatile("movdqa %xmm4,%xmm1");
> > +		asm volatile("movdqa %xmm12,%xmm8");
> > +		asm volatile("movdqa %xmm12,%xmm9");
> > +
> > +		asm volatile("movdqa %xmm4,%xmm5");
> > +		asm volatile("movdqa %xmm12,%xmm13");
> > +		asm volatile("psrlw  $4,%xmm5");
> > +		asm volatile("psrlw  $4,%xmm13");
> > +		asm volatile("pand   %xmm15,%xmm4");
> > +		asm volatile("pand   %xmm15,%xmm12");
> > +		asm volatile("pand   %xmm15,%xmm5");
> > +		asm volatile("pand   %xmm15,%xmm13");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][0][1][0]));
> > +		asm volatile("movdqa %xmm2,%xmm10");
> > +		asm volatile("movdqa %xmm7,%xmm15");
> > +		asm volatile("pshufb %xmm4,%xmm2");
> > +		asm volatile("pshufb %xmm12,%xmm10");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pshufb %xmm13,%xmm15");
> > +		asm volatile("pxor   %xmm7,%xmm2");
> > +		asm volatile("pxor   %xmm15,%xmm10");
> > +
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][1][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][1][1][0]));
> > +		asm volatile("movdqa %xmm3,%xmm11");
> > +		asm volatile("movdqa %xmm7,%xmm15");
> > +		asm volatile("pshufb %xmm4,%xmm3");
> > +		asm volatile("pshufb %xmm12,%xmm11");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pshufb %xmm13,%xmm15");
> > +		asm volatile("pxor   %xmm7,%xmm3");
> > +		asm volatile("pxor   %xmm15,%xmm11");
> > +
> > +		/* intermediate disks */
> > +		for (d = l-1; d > 0; --d) {
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
> > +			asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.low4[0]));
> > +			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
> > +			asm volatile("movdqa %0,%%xmm12" : : "m" (v[d][i+16]));
> > +
> > +			asm volatile("pxor %xmm5,%xmm5");
> > +			asm volatile("pxor %xmm13,%xmm13");
> > +			asm volatile("pcmpgtb %xmm1,%xmm5");
> > +			asm volatile("pcmpgtb %xmm9,%xmm13");
> > +			asm volatile("paddb %xmm1,%xmm1");
> > +			asm volatile("paddb %xmm9,%xmm9");
> > +			asm volatile("pand %xmm7,%xmm5");
> > +			asm volatile("pand %xmm7,%xmm13");
> > +			asm volatile("pxor %xmm5,%xmm1");
> > +			asm volatile("pxor %xmm13,%xmm9");
> > +
> > +			asm volatile("pxor %xmm4,%xmm0");
> > +			asm volatile("pxor %xmm4,%xmm1");
> > +			asm volatile("pxor %xmm12,%xmm8");
> > +			asm volatile("pxor %xmm12,%xmm9");
> > +
> > +			asm volatile("movdqa %xmm4,%xmm5");
> > +			asm volatile("movdqa %xmm12,%xmm13");
> > +			asm volatile("psrlw  $4,%xmm5");
> > +			asm volatile("psrlw  $4,%xmm13");
> > +			asm volatile("pand   %xmm15,%xmm4");
> > +			asm volatile("pand   %xmm15,%xmm12");
> > +			asm volatile("pand   %xmm15,%xmm5");
> > +			asm volatile("pand   %xmm15,%xmm13");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][0][1][0]));
> > +			asm volatile("movdqa %xmm6,%xmm14");
> > +			asm volatile("movdqa %xmm7,%xmm15");
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm12,%xmm14");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pshufb %xmm13,%xmm15");
> > +			asm volatile("pxor   %xmm6,%xmm2");
> > +			asm volatile("pxor   %xmm14,%xmm10");
> > +			asm volatile("pxor   %xmm7,%xmm2");
> > +			asm volatile("pxor   %xmm15,%xmm10");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][1][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][1][1][0]));
> > +			asm volatile("movdqa %xmm6,%xmm14");
> > +			asm volatile("movdqa %xmm7,%xmm15");
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm12,%xmm14");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pshufb %xmm13,%xmm15");
> > +			asm volatile("pxor   %xmm6,%xmm3");
> > +			asm volatile("pxor   %xmm14,%xmm11");
> > +			asm volatile("pxor   %xmm7,%xmm3");
> > +			asm volatile("pxor   %xmm15,%xmm11");
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
> > +		asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.low4[0]));
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
> > +		asm volatile("movdqa %0,%%xmm12" : : "m" (v[0][i+16]));
> > +
> > +		asm volatile("pxor %xmm5,%xmm5");
> > +		asm volatile("pxor %xmm13,%xmm13");
> > +		asm volatile("pcmpgtb %xmm1,%xmm5");
> > +		asm volatile("pcmpgtb %xmm9,%xmm13");
> > +		asm volatile("paddb %xmm1,%xmm1");
> > +		asm volatile("paddb %xmm9,%xmm9");
> > +		asm volatile("pand %xmm7,%xmm5");
> > +		asm volatile("pand %xmm7,%xmm13");
> > +		asm volatile("pxor %xmm5,%xmm1");
> > +		asm volatile("pxor %xmm13,%xmm9");
> > +
> > +		asm volatile("pxor %xmm4,%xmm0");
> > +		asm volatile("pxor %xmm4,%xmm1");
> > +		asm volatile("pxor %xmm4,%xmm2");
> > +		asm volatile("pxor %xmm4,%xmm3");
> > +		asm volatile("pxor %xmm12,%xmm8");
> > +		asm volatile("pxor %xmm12,%xmm9");
> > +		asm volatile("pxor %xmm12,%xmm10");
> > +		asm volatile("pxor %xmm12,%xmm11");
> > +
> > +		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
> > +		asm volatile("movntdq %%xmm8,%0" : "=m" (p[i+16]));
> > +		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
> > +		asm volatile("movntdq %%xmm9,%0" : "=m" (q[i+16]));
> > +		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
> > +		asm volatile("movntdq %%xmm10,%0" : "=m" (r[i+16]));
> > +		asm volatile("movntdq %%xmm3,%0" : "=m" (s[i]));
> > +		asm volatile("movntdq %%xmm11,%0" : "=m" (s[i+16]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +/*
> > + * PAR5 (penta parity with Cauchy matrix) SSSE3 implementation
> > + */
> > +void raid_par5_ssse3(int nd, size_t size, void **vv)
> > +/* ensures that stack is aligned at 16 bytes because we allocate SSE registers in it */
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	uint8_t *s;
> > +	uint8_t *t;
> > +	int d, l;
> > +	size_t i;
> > +	uint8_t p0[16] __aligned(16);
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +	s = v[nd+3];
> > +	t = v[nd+4];
> > +
> > +	/* special case with only one data disk */
> > +	if (l == 0) {
> > +		for (i = 0; i < 5; ++i)
> > +			memcpy(v[1+i], v[0], size);
> > +		return;
> > +	}
> > +
> > +	asm_begin();
> > +
> > +	/* generic case with at least two data disks */
> > +	for (i = 0; i < size; i += 16) {
> > +		/* last disk without the by two multiplication */
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
> > +
> > +		asm volatile("movdqa %xmm4,%xmm0");
> > +		asm volatile("movdqa %%xmm4,%0" : "=m" (p0[0]));
> > +
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
> > +		asm volatile("movdqa %xmm4,%xmm5");
> > +		asm volatile("psrlw  $4,%xmm5");
> > +		asm volatile("pand   %xmm7,%xmm4");
> > +		asm volatile("pand   %xmm7,%xmm5");
> > +
> > +		asm volatile("movdqa %0,%%xmm1" : : "m" (gfgenpshufb[l][0][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][0][1][0]));
> > +		asm volatile("pshufb %xmm4,%xmm1");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pxor   %xmm7,%xmm1");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][1][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][1][1][0]));
> > +		asm volatile("pshufb %xmm4,%xmm2");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pxor   %xmm7,%xmm2");
> > +
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][2][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][2][1][0]));
> > +		asm volatile("pshufb %xmm4,%xmm3");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pxor   %xmm7,%xmm3");
> > +
> > +		/* intermediate disks */
> > +		for (d = l-1; d > 0; --d) {
> > +			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (p0[0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
> > +
> > +			asm volatile("pxor %xmm5,%xmm5");
> > +			asm volatile("pcmpgtb %xmm0,%xmm5");
> > +			asm volatile("paddb %xmm0,%xmm0");
> > +			asm volatile("pand %xmm7,%xmm5");
> > +			asm volatile("pxor %xmm5,%xmm0");
> > +
> > +			asm volatile("pxor %xmm4,%xmm0");
> > +			asm volatile("pxor %xmm4,%xmm6");
> > +			asm volatile("movdqa %%xmm6,%0" : "=m" (p0[0]));
> > +
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
> > +			asm volatile("movdqa %xmm4,%xmm5");
> > +			asm volatile("psrlw  $4,%xmm5");
> > +			asm volatile("pand   %xmm7,%xmm4");
> > +			asm volatile("pand   %xmm7,%xmm5");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][0][1][0]));
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pxor   %xmm6,%xmm1");
> > +			asm volatile("pxor   %xmm7,%xmm1");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][1][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][1][1][0]));
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pxor   %xmm6,%xmm2");
> > +			asm volatile("pxor   %xmm7,%xmm2");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][2][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][2][1][0]));
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pxor   %xmm6,%xmm3");
> > +			asm volatile("pxor   %xmm7,%xmm3");
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
> > +		asm volatile("movdqa %0,%%xmm6" : : "m" (p0[0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
> > +
> > +		asm volatile("pxor %xmm5,%xmm5");
> > +		asm volatile("pcmpgtb %xmm0,%xmm5");
> > +		asm volatile("paddb %xmm0,%xmm0");
> > +		asm volatile("pand %xmm7,%xmm5");
> > +		asm volatile("pxor %xmm5,%xmm0");
> > +
> > +		asm volatile("pxor %xmm4,%xmm0");
> > +		asm volatile("pxor %xmm4,%xmm1");
> > +		asm volatile("pxor %xmm4,%xmm2");
> > +		asm volatile("pxor %xmm4,%xmm3");
> > +		asm volatile("pxor %xmm4,%xmm6");
> > +
> > +		asm volatile("movntdq %%xmm6,%0" : "=m" (p[i]));
> > +		asm volatile("movntdq %%xmm0,%0" : "=m" (q[i]));
> > +		asm volatile("movntdq %%xmm1,%0" : "=m" (r[i]));
> > +		asm volatile("movntdq %%xmm2,%0" : "=m" (s[i]));
> > +		asm volatile("movntdq %%xmm3,%0" : "=m" (t[i]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86_64
> > +/*
> > + * PAR5 (penta parity with Cauchy matrix) SSSE3 implementation
> > + *
> > + * Note that it uses 16 registers, meaning that x64 is required.
> > + */
> > +void raid_par5_ssse3ext(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	uint8_t *s;
> > +	uint8_t *t;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +	s = v[nd+3];
> > +	t = v[nd+4];
> > +
> > +	/* special case with only one data disk */
> > +	if (l == 0) {
> > +		for (i = 0; i < 5; ++i)
> > +			memcpy(v[1+i], v[0], size);
> > +		return;
> > +	}
> > +
> > +	asm_begin();
> > +
> > +	/* generic case with at least two data disks */
> > +	asm volatile("movdqa %0,%%xmm14" : : "m" (gfconst16.poly[0]));
> > +	asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.low4[0]));
> > +
> > +	for (i = 0; i < size; i += 16) {
> > +		/* last disk without the by two multiplication */
> > +		asm volatile("movdqa %0,%%xmm10" : : "m" (v[l][i]));
> > +
> > +		asm volatile("movdqa %xmm10,%xmm0");
> > +		asm volatile("movdqa %xmm10,%xmm1");
> > +
> > +		asm volatile("movdqa %xmm10,%xmm11");
> > +		asm volatile("psrlw  $4,%xmm11");
> > +		asm volatile("pand   %xmm15,%xmm10");
> > +		asm volatile("pand   %xmm15,%xmm11");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
> > +		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][0][1][0]));
> > +		asm volatile("pshufb %xmm10,%xmm2");
> > +		asm volatile("pshufb %xmm11,%xmm13");
> > +		asm volatile("pxor   %xmm13,%xmm2");
> > +
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][1][0][0]));
> > +		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][1][1][0]));
> > +		asm volatile("pshufb %xmm10,%xmm3");
> > +		asm volatile("pshufb %xmm11,%xmm13");
> > +		asm volatile("pxor   %xmm13,%xmm3");
> > +
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (gfgenpshufb[l][2][0][0]));
> > +		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][2][1][0]));
> > +		asm volatile("pshufb %xmm10,%xmm4");
> > +		asm volatile("pshufb %xmm11,%xmm13");
> > +		asm volatile("pxor   %xmm13,%xmm4");
> > +
> > +		/* intermediate disks */
> > +		for (d = l-1; d > 0; --d) {
> > +			asm volatile("movdqa %0,%%xmm10" : : "m" (v[d][i]));
> > +
> > +			asm volatile("pxor %xmm11,%xmm11");
> > +			asm volatile("pcmpgtb %xmm1,%xmm11");
> > +			asm volatile("paddb %xmm1,%xmm1");
> > +			asm volatile("pand %xmm14,%xmm11");
> > +			asm volatile("pxor %xmm11,%xmm1");
> > +
> > +			asm volatile("pxor %xmm10,%xmm0");
> > +			asm volatile("pxor %xmm10,%xmm1");
> > +
> > +			asm volatile("movdqa %xmm10,%xmm11");
> > +			asm volatile("psrlw  $4,%xmm11");
> > +			asm volatile("pand   %xmm15,%xmm10");
> > +			asm volatile("pand   %xmm15,%xmm11");
> > +
> > +			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][0][0][0]));
> > +			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][0][1][0]));
> > +			asm volatile("pshufb %xmm10,%xmm12");
> > +			asm volatile("pshufb %xmm11,%xmm13");
> > +			asm volatile("pxor   %xmm12,%xmm2");
> > +			asm volatile("pxor   %xmm13,%xmm2");
> > +
> > +			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][1][0][0]));
> > +			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][1][1][0]));
> > +			asm volatile("pshufb %xmm10,%xmm12");
> > +			asm volatile("pshufb %xmm11,%xmm13");
> > +			asm volatile("pxor   %xmm12,%xmm3");
> > +			asm volatile("pxor   %xmm13,%xmm3");
> > +
> > +			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][2][0][0]));
> > +			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][2][1][0]));
> > +			asm volatile("pshufb %xmm10,%xmm12");
> > +			asm volatile("pshufb %xmm11,%xmm13");
> > +			asm volatile("pxor   %xmm12,%xmm4");
> > +			asm volatile("pxor   %xmm13,%xmm4");
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		asm volatile("movdqa %0,%%xmm10" : : "m" (v[0][i]));
> > +
> > +		asm volatile("pxor %xmm11,%xmm11");
> > +		asm volatile("pcmpgtb %xmm1,%xmm11");
> > +		asm volatile("paddb %xmm1,%xmm1");
> > +		asm volatile("pand %xmm14,%xmm11");
> > +		asm volatile("pxor %xmm11,%xmm1");
> > +
> > +		asm volatile("pxor %xmm10,%xmm0");
> > +		asm volatile("pxor %xmm10,%xmm1");
> > +		asm volatile("pxor %xmm10,%xmm2");
> > +		asm volatile("pxor %xmm10,%xmm3");
> > +		asm volatile("pxor %xmm10,%xmm4");
> > +
> > +		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
> > +		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
> > +		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
> > +		asm volatile("movntdq %%xmm3,%0" : "=m" (s[i]));
> > +		asm volatile("movntdq %%xmm4,%0" : "=m" (t[i]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +/*
> > + * PAR6 (hexa parity with Cauchy matrix) SSSE3 implementation
> > + */
> > +void raid_par6_ssse3(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	uint8_t *s;
> > +	uint8_t *t;
> > +	uint8_t *u;
> > +	int d, l;
> > +	size_t i;
> > +	uint8_t p0[16] __aligned(16);
> > +	uint8_t q0[16] __aligned(16);
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +	s = v[nd+3];
> > +	t = v[nd+4];
> > +	u = v[nd+5];
> > +
> > +	/* special case with only one data disk */
> > +	if (l == 0) {
> > +		for (i = 0; i < 6; ++i)
> > +			memcpy(v[1+i], v[0], size);
> > +		return;
> > +	}
> > +
> > +	asm_begin();
> > +
> > +	/* generic case with at least two data disks */
> > +	for (i = 0; i < size; i += 16) {
> > +		/* last disk without the by two multiplication */
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
> > +
> > +		asm volatile("movdqa %%xmm4,%0" : "=m" (p0[0]));
> > +		asm volatile("movdqa %%xmm4,%0" : "=m" (q0[0]));
> > +
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
> > +		asm volatile("movdqa %xmm4,%xmm5");
> > +		asm volatile("psrlw  $4,%xmm5");
> > +		asm volatile("pand   %xmm7,%xmm4");
> > +		asm volatile("pand   %xmm7,%xmm5");
> > +
> > +		asm volatile("movdqa %0,%%xmm0" : : "m" (gfgenpshufb[l][0][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][0][1][0]));
> > +		asm volatile("pshufb %xmm4,%xmm0");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pxor   %xmm7,%xmm0");
> > +
> > +		asm volatile("movdqa %0,%%xmm1" : : "m" (gfgenpshufb[l][1][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][1][1][0]));
> > +		asm volatile("pshufb %xmm4,%xmm1");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pxor   %xmm7,%xmm1");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][2][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][2][1][0]));
> > +		asm volatile("pshufb %xmm4,%xmm2");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pxor   %xmm7,%xmm2");
> > +
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][3][0][0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][3][1][0]));
> > +		asm volatile("pshufb %xmm4,%xmm3");
> > +		asm volatile("pshufb %xmm5,%xmm7");
> > +		asm volatile("pxor   %xmm7,%xmm3");
> > +
> > +		/* intermediate disks */
> > +		for (d = l-1; d > 0; --d) {
> > +			asm volatile("movdqa %0,%%xmm5" : : "m" (p0[0]));
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (q0[0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
> > +
> > +			asm volatile("pxor %xmm4,%xmm4");
> > +			asm volatile("pcmpgtb %xmm6,%xmm4");
> > +			asm volatile("paddb %xmm6,%xmm6");
> > +			asm volatile("pand %xmm7,%xmm4");
> > +			asm volatile("pxor %xmm4,%xmm6");
> > +
> > +			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
> > +
> > +			asm volatile("pxor %xmm4,%xmm5");
> > +			asm volatile("pxor %xmm4,%xmm6");
> > +			asm volatile("movdqa %%xmm5,%0" : "=m" (p0[0]));
> > +			asm volatile("movdqa %%xmm6,%0" : "=m" (q0[0]));
> > +
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
> > +			asm volatile("movdqa %xmm4,%xmm5");
> > +			asm volatile("psrlw  $4,%xmm5");
> > +			asm volatile("pand   %xmm7,%xmm4");
> > +			asm volatile("pand   %xmm7,%xmm5");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][0][1][0]));
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pxor   %xmm6,%xmm0");
> > +			asm volatile("pxor   %xmm7,%xmm0");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][1][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][1][1][0]));
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pxor   %xmm6,%xmm1");
> > +			asm volatile("pxor   %xmm7,%xmm1");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][2][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][2][1][0]));
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pxor   %xmm6,%xmm2");
> > +			asm volatile("pxor   %xmm7,%xmm2");
> > +
> > +			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][3][0][0]));
> > +			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][3][1][0]));
> > +			asm volatile("pshufb %xmm4,%xmm6");
> > +			asm volatile("pshufb %xmm5,%xmm7");
> > +			asm volatile("pxor   %xmm6,%xmm3");
> > +			asm volatile("pxor   %xmm7,%xmm3");
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		asm volatile("movdqa %0,%%xmm5" : : "m" (p0[0]));
> > +		asm volatile("movdqa %0,%%xmm6" : : "m" (q0[0]));
> > +		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
> > +
> > +		asm volatile("pxor %xmm4,%xmm4");
> > +		asm volatile("pcmpgtb %xmm6,%xmm4");
> > +		asm volatile("paddb %xmm6,%xmm6");
> > +		asm volatile("pand %xmm7,%xmm4");
> > +		asm volatile("pxor %xmm4,%xmm6");
> > +
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
> > +		asm volatile("pxor %xmm4,%xmm0");
> > +		asm volatile("pxor %xmm4,%xmm1");
> > +		asm volatile("pxor %xmm4,%xmm2");
> > +		asm volatile("pxor %xmm4,%xmm3");
> > +		asm volatile("pxor %xmm4,%xmm5");
> > +		asm volatile("pxor %xmm4,%xmm6");
> > +
> > +		asm volatile("movntdq %%xmm5,%0" : "=m" (p[i]));
> > +		asm volatile("movntdq %%xmm6,%0" : "=m" (q[i]));
> > +		asm volatile("movntdq %%xmm0,%0" : "=m" (r[i]));
> > +		asm volatile("movntdq %%xmm1,%0" : "=m" (s[i]));
> > +		asm volatile("movntdq %%xmm2,%0" : "=m" (t[i]));
> > +		asm volatile("movntdq %%xmm3,%0" : "=m" (u[i]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86_64
> > +/*
> > + * PAR6 (hexa parity with Cauchy matrix) SSSE3 implementation
> > + *
> > + * Note that it uses 16 registers, meaning that x64 is required.
> > + */
> > +void raid_par6_ssse3ext(int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *q;
> > +	uint8_t *r;
> > +	uint8_t *s;
> > +	uint8_t *t;
> > +	uint8_t *u;
> > +	int d, l;
> > +	size_t i;
> > +
> > +	l = nd - 1;
> > +	p = v[nd];
> > +	q = v[nd+1];
> > +	r = v[nd+2];
> > +	s = v[nd+3];
> > +	t = v[nd+4];
> > +	u = v[nd+5];
> > +
> > +	/* special case with only one data disk */
> > +	if (l == 0) {
> > +		for (i = 0; i < 6; ++i)
> > +			memcpy(v[1+i], v[0], size);
> > +		return;
> > +	}
> > +
> > +	asm_begin();
> > +
> > +	/* generic case with at least two data disks */
> > +	asm volatile("movdqa %0,%%xmm14" : : "m" (gfconst16.poly[0]));
> > +	asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.low4[0]));
> > +
> > +	for (i = 0; i < size; i += 16) {
> > +		/* last disk without the by two multiplication */
> > +		asm volatile("movdqa %0,%%xmm10" : : "m" (v[l][i]));
> > +
> > +		asm volatile("movdqa %xmm10,%xmm0");
> > +		asm volatile("movdqa %xmm10,%xmm1");
> > +
> > +		asm volatile("movdqa %xmm10,%xmm11");
> > +		asm volatile("psrlw  $4,%xmm11");
> > +		asm volatile("pand   %xmm15,%xmm10");
> > +		asm volatile("pand   %xmm15,%xmm11");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
> > +		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][0][1][0]));
> > +		asm volatile("pshufb %xmm10,%xmm2");
> > +		asm volatile("pshufb %xmm11,%xmm13");
> > +		asm volatile("pxor   %xmm13,%xmm2");
> > +
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][1][0][0]));
> > +		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][1][1][0]));
> > +		asm volatile("pshufb %xmm10,%xmm3");
> > +		asm volatile("pshufb %xmm11,%xmm13");
> > +		asm volatile("pxor   %xmm13,%xmm3");
> > +
> > +		asm volatile("movdqa %0,%%xmm4" : : "m" (gfgenpshufb[l][2][0][0]));
> > +		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][2][1][0]));
> > +		asm volatile("pshufb %xmm10,%xmm4");
> > +		asm volatile("pshufb %xmm11,%xmm13");
> > +		asm volatile("pxor   %xmm13,%xmm4");
> > +
> > +		asm volatile("movdqa %0,%%xmm5" : : "m" (gfgenpshufb[l][3][0][0]));
> > +		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][3][1][0]));
> > +		asm volatile("pshufb %xmm10,%xmm5");
> > +		asm volatile("pshufb %xmm11,%xmm13");
> > +		asm volatile("pxor   %xmm13,%xmm5");
> > +
> > +		/* intermediate disks */
> > +		for (d = l-1; d > 0; --d) {
> > +			asm volatile("movdqa %0,%%xmm10" : : "m" (v[d][i]));
> > +
> > +			asm volatile("pxor %xmm11,%xmm11");
> > +			asm volatile("pcmpgtb %xmm1,%xmm11");
> > +			asm volatile("paddb %xmm1,%xmm1");
> > +			asm volatile("pand %xmm14,%xmm11");
> > +			asm volatile("pxor %xmm11,%xmm1");
> > +
> > +			asm volatile("pxor %xmm10,%xmm0");
> > +			asm volatile("pxor %xmm10,%xmm1");
> > +
> > +			asm volatile("movdqa %xmm10,%xmm11");
> > +			asm volatile("psrlw  $4,%xmm11");
> > +			asm volatile("pand   %xmm15,%xmm10");
> > +			asm volatile("pand   %xmm15,%xmm11");
> > +
> > +			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][0][0][0]));
> > +			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][0][1][0]));
> > +			asm volatile("pshufb %xmm10,%xmm12");
> > +			asm volatile("pshufb %xmm11,%xmm13");
> > +			asm volatile("pxor   %xmm12,%xmm2");
> > +			asm volatile("pxor   %xmm13,%xmm2");
> > +
> > +			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][1][0][0]));
> > +			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][1][1][0]));
> > +			asm volatile("pshufb %xmm10,%xmm12");
> > +			asm volatile("pshufb %xmm11,%xmm13");
> > +			asm volatile("pxor   %xmm12,%xmm3");
> > +			asm volatile("pxor   %xmm13,%xmm3");
> > +
> > +			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][2][0][0]));
> > +			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][2][1][0]));
> > +			asm volatile("pshufb %xmm10,%xmm12");
> > +			asm volatile("pshufb %xmm11,%xmm13");
> > +			asm volatile("pxor   %xmm12,%xmm4");
> > +			asm volatile("pxor   %xmm13,%xmm4");
> > +
> > +			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][3][0][0]));
> > +			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][3][1][0]));
> > +			asm volatile("pshufb %xmm10,%xmm12");
> > +			asm volatile("pshufb %xmm11,%xmm13");
> > +			asm volatile("pxor   %xmm12,%xmm5");
> > +			asm volatile("pxor   %xmm13,%xmm5");
> > +		}
> > +
> > +		/* first disk with all coefficients at 1 */
> > +		asm volatile("movdqa %0,%%xmm10" : : "m" (v[0][i]));
> > +
> > +		asm volatile("pxor %xmm11,%xmm11");
> > +		asm volatile("pcmpgtb %xmm1,%xmm11");
> > +		asm volatile("paddb %xmm1,%xmm1");
> > +		asm volatile("pand %xmm14,%xmm11");
> > +		asm volatile("pxor %xmm11,%xmm1");
> > +
> > +		asm volatile("pxor %xmm10,%xmm0");
> > +		asm volatile("pxor %xmm10,%xmm1");
> > +		asm volatile("pxor %xmm10,%xmm2");
> > +		asm volatile("pxor %xmm10,%xmm3");
> > +		asm volatile("pxor %xmm10,%xmm4");
> > +		asm volatile("pxor %xmm10,%xmm5");
> > +
> > +		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
> > +		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
> > +		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
> > +		asm volatile("movntdq %%xmm3,%0" : "=m" (s[i]));
> > +		asm volatile("movntdq %%xmm4,%0" : "=m" (t[i]));
> > +		asm volatile("movntdq %%xmm5,%0" : "=m" (u[i]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +/*
> > + * RAID recovering for one disk SSSE3 implementation
> > + */
> > +void raid_rec1_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	uint8_t *p;
> > +	uint8_t *pa;
> > +	uint8_t G;
> > +	uint8_t V;
> > +	size_t i;
> > +
> > +	(void)nr; /* unused, it's always 1 */
> > +
> > +	/* if it's RAID5 uses the faster function */
> > +	if (ip[0] == 0) {
> > +		raid_rec1_par1(id, nd, size, vv);
> > +		return;
> > +	}
> > +
> > +#ifdef RAID_USE_RAID6_PQ
> > +	/* if it's RAID6 recovering with Q uses the faster function */
> > +	if (ip[0] == 1) {
> > +		raid6_datap_recov(nd + 2, size, id[0], vv);
> > +		return;
> > +	}
> > +#endif
> > +
> > +	/* setup the coefficients matrix */
> > +	G = A(ip[0], id[0]);
> > +
> > +	/* invert it to solve the system of linear equations */
> > +	V = inv(G);
> > +
> > +	/* compute delta parity */
> > +	raid_delta_gen(1, id, ip, nd, size, vv);
> > +
> > +	p = v[nd+ip[0]];
> > +	pa = v[id[0]];
> > +
> > +	asm_begin();
> > +
> > +	asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
> > +	asm volatile("movdqa %0,%%xmm4" : : "m" (gfmulpshufb[V][0][0]));
> > +	asm volatile("movdqa %0,%%xmm5" : : "m" (gfmulpshufb[V][1][0]));
> > +
> > +	for (i = 0; i < size; i += 16) {
> > +		asm volatile("movdqa %0,%%xmm0" : : "m" (p[i]));
> > +		asm volatile("movdqa %0,%%xmm1" : : "m" (pa[i]));
> > +		asm volatile("movdqa %xmm4,%xmm2");
> > +		asm volatile("movdqa %xmm5,%xmm3");
> > +		asm volatile("pxor   %xmm0,%xmm1");
> > +		asm volatile("movdqa %xmm1,%xmm0");
> > +		asm volatile("psrlw  $4,%xmm1");
> > +		asm volatile("pand   %xmm7,%xmm0");
> > +		asm volatile("pand   %xmm7,%xmm1");
> > +		asm volatile("pshufb %xmm0,%xmm2");
> > +		asm volatile("pshufb %xmm1,%xmm3");
> > +		asm volatile("pxor   %xmm3,%xmm2");
> > +		asm volatile("movdqa %%xmm2,%0" : "=m" (pa[i]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +/*
> > + * RAID recovering for two disks SSSE3 implementation
> > + */
> > +void raid_rec2_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	const int N = 2;
> > +	uint8_t *p[N];
> > +	uint8_t *pa[N];
> > +	uint8_t G[N*N];
> > +	uint8_t V[N*N];
> > +	size_t i;
> > +	int j, k;
> > +
> > +	(void)nr; /* unused, it's always 2 */
> > +
> > +#ifdef RAID_USE_RAID6_PQ
> > +	/* if it's RAID6 recovering with P and Q uses the faster function */
> > +	if (ip[0] == 0 && ip[1] == 1) {
> > +		raid6_2data_recov(nd + 2, size, id[0], id[1], vv);
> > +		return;
> > +	}
> > +#endif
> > +
> > +	/* setup the coefficients matrix */
> > +	for (j = 0; j < N; ++j)
> > +		for (k = 0; k < N; ++k)
> > +			G[j*N+k] = A(ip[j], id[k]);
> > +
> > +	/* invert it to solve the system of linear equations */
> > +	raid_invert(G, V, N);
> > +
> > +	/* compute delta parity */
> > +	raid_delta_gen(N, id, ip, nd, size, vv);
> > +
> > +	for (j = 0; j < N; ++j) {
> > +		p[j] = v[nd+ip[j]];
> > +		pa[j] = v[id[j]];
> > +	}
> > +
> > +	asm_begin();
> > +
> > +	asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
> > +
> > +	for (i = 0; i < size; i += 16) {
> > +		asm volatile("movdqa %0,%%xmm0" : : "m" (p[0][i]));
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (pa[0][i]));
> > +		asm volatile("movdqa %0,%%xmm1" : : "m" (p[1][i]));
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (pa[1][i]));
> > +		asm volatile("pxor   %xmm2,%xmm0");
> > +		asm volatile("pxor   %xmm3,%xmm1");
> > +
> > +		asm volatile("pxor %xmm6,%xmm6");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfmulpshufb[V[0]][0][0]));
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (gfmulpshufb[V[0]][1][0]));
> > +		asm volatile("movdqa %xmm0,%xmm4");
> > +		asm volatile("movdqa %xmm0,%xmm5");
> > +		asm volatile("psrlw  $4,%xmm5");
> > +		asm volatile("pand   %xmm7,%xmm4");
> > +		asm volatile("pand   %xmm7,%xmm5");
> > +		asm volatile("pshufb %xmm4,%xmm2");
> > +		asm volatile("pshufb %xmm5,%xmm3");
> > +		asm volatile("pxor   %xmm2,%xmm6");
> > +		asm volatile("pxor   %xmm3,%xmm6");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfmulpshufb[V[1]][0][0]));
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (gfmulpshufb[V[1]][1][0]));
> > +		asm volatile("movdqa %xmm1,%xmm4");
> > +		asm volatile("movdqa %xmm1,%xmm5");
> > +		asm volatile("psrlw  $4,%xmm5");
> > +		asm volatile("pand   %xmm7,%xmm4");
> > +		asm volatile("pand   %xmm7,%xmm5");
> > +		asm volatile("pshufb %xmm4,%xmm2");
> > +		asm volatile("pshufb %xmm5,%xmm3");
> > +		asm volatile("pxor   %xmm2,%xmm6");
> > +		asm volatile("pxor   %xmm3,%xmm6");
> > +
> > +		asm volatile("movdqa %%xmm6,%0" : "=m" (pa[0][i]));
> > +
> > +		asm volatile("pxor %xmm6,%xmm6");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfmulpshufb[V[2]][0][0]));
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (gfmulpshufb[V[2]][1][0]));
> > +		asm volatile("movdqa %xmm0,%xmm4");
> > +		asm volatile("movdqa %xmm0,%xmm5");
> > +		asm volatile("psrlw  $4,%xmm5");
> > +		asm volatile("pand   %xmm7,%xmm4");
> > +		asm volatile("pand   %xmm7,%xmm5");
> > +		asm volatile("pshufb %xmm4,%xmm2");
> > +		asm volatile("pshufb %xmm5,%xmm3");
> > +		asm volatile("pxor   %xmm2,%xmm6");
> > +		asm volatile("pxor   %xmm3,%xmm6");
> > +
> > +		asm volatile("movdqa %0,%%xmm2" : : "m" (gfmulpshufb[V[3]][0][0]));
> > +		asm volatile("movdqa %0,%%xmm3" : : "m" (gfmulpshufb[V[3]][1][0]));
> > +		asm volatile("movdqa %xmm1,%xmm4");
> > +		asm volatile("movdqa %xmm1,%xmm5");
> > +		asm volatile("psrlw  $4,%xmm5");
> > +		asm volatile("pand   %xmm7,%xmm4");
> > +		asm volatile("pand   %xmm7,%xmm5");
> > +		asm volatile("pshufb %xmm4,%xmm2");
> > +		asm volatile("pshufb %xmm5,%xmm3");
> > +		asm volatile("pxor   %xmm2,%xmm6");
> > +		asm volatile("pxor   %xmm3,%xmm6");
> > +
> > +		asm volatile("movdqa %%xmm6,%0" : "=m" (pa[1][i]));
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_X86
> > +/*
> > + * RAID recovering SSSE3 implementation
> > + */
> > +void raid_recX_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv)
> > +{
> > +	uint8_t **v = (uint8_t **)vv;
> > +	int N = nr;
> > +	uint8_t *p[RAID_PARITY_MAX];
> > +	uint8_t *pa[RAID_PARITY_MAX];
> > +	uint8_t G[RAID_PARITY_MAX*RAID_PARITY_MAX];
> > +	uint8_t V[RAID_PARITY_MAX*RAID_PARITY_MAX];
> > +	size_t i;
> > +	int j, k;
> > +
> > +	/* setup the coefficients matrix */
> > +	for (j = 0; j < N; ++j)
> > +		for (k = 0; k < N; ++k)
> > +			G[j*N+k] = A(ip[j], id[k]);
> > +
> > +	/* invert it to solve the system of linear equations */
> > +	raid_invert(G, V, N);
> > +
> > +	/* compute delta parity */
> > +	raid_delta_gen(N, id, ip, nd, size, vv);
> > +
> > +	for (j = 0; j < N; ++j) {
> > +		p[j] = v[nd+ip[j]];
> > +		pa[j] = v[id[j]];
> > +	}
> > +
> > +	asm_begin();
> > +
> > +	asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
> > +
> > +	for (i = 0; i < size; i += 16) {
> > +		uint8_t PD[RAID_PARITY_MAX][16] __aligned(16);
> > +
> > +		/* delta */
> > +		for (j = 0; j < N; ++j) {
> > +			asm volatile("movdqa %0,%%xmm0" : : "m" (p[j][i]));
> > +			asm volatile("movdqa %0,%%xmm1" : : "m" (pa[j][i]));
> > +			asm volatile("pxor   %xmm1,%xmm0");
> > +			asm volatile("movdqa %%xmm0,%0" : "=m" (PD[j][0]));
> > +		}
> > +
> > +		/* reconstruct */
> > +		for (j = 0; j < N; ++j) {
> > +			asm volatile("pxor %xmm0,%xmm0");
> > +			asm volatile("pxor %xmm1,%xmm1");
> > +
> > +			for (k = 0; k < N; ++k) {
> > +				uint8_t m = V[j*N+k];
> > +
> > +				asm volatile("movdqa %0,%%xmm2" : : "m" (gfmulpshufb[m][0][0]));
> > +				asm volatile("movdqa %0,%%xmm3" : : "m" (gfmulpshufb[m][1][0]));
> > +				asm volatile("movdqa %0,%%xmm4" : : "m" (PD[k][0]));
> > +				asm volatile("movdqa %xmm4,%xmm5");
> > +				asm volatile("psrlw  $4,%xmm5");
> > +				asm volatile("pand   %xmm7,%xmm4");
> > +				asm volatile("pand   %xmm7,%xmm5");
> > +				asm volatile("pshufb %xmm4,%xmm2");
> > +				asm volatile("pshufb %xmm5,%xmm3");
> > +				asm volatile("pxor   %xmm2,%xmm0");
> > +				asm volatile("pxor   %xmm3,%xmm1");
> > +			}
> > +
> > +			asm volatile("pxor %xmm1,%xmm0");
> > +			asm volatile("movdqa %%xmm0,%0" : "=m" (pa[j][i]));
> > +		}
> > +	}
> > +
> > +	asm_end();
> > +}
> > +#endif
> > +
> 



-- 
Andrea Mazzoleni
935A 2D3C 5C70 BCD6 CB0C  ED89 7C19 4321 6340 3F6D
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ