linux-kernel - [RFC] lib: raid: New RAID library supporting up to six parities

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <1388742436-3754-1-git-send-email-amadvance@gmail.com>
Date:	Fri,  3 Jan 2014 10:47:16 +0100
From:	Andrea Mazzoleni <amadvance@...il.com>
To:	neilb@...e.de
Cc:	linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
	linux-btrfs@...r.kernel.org, amadvance@...il.com
Subject: [RFC] lib: raid: New RAID library supporting up to six parities

This patch adds a new lib/raid directory, containing a new RAID support
based on a Cauchy matrix working for up to six parities, and backward
compatible with the existing RAID6 support.

It was developed for kernel 3.13-rc4, but it should work with any other
version because it's mostly formed of new files. The only modification
is about adding a new CONFIG_RAID_CAUCHY option in the "lib" configuration
section.

The main interface is defined in include/linux/raid/raid.h and provides
easy to use functions able to generate parities and to recover data.
This interface is different than the one provided by the RAID6 library,
because with more parities the number of recovering cases grows exponentially
and it's not feasible to have a different function for each one.

The library provides fast implementations using SSE2 and SSSE3 for x86/x64
and a portable C implementation working everythere.
If the RAID6 library is enabled in the kernel, its functionality is also used
to maintain the existing level of performance for the first two parities in
all the supported architectures.

At startup the module runs a very fast self test (about 1ms) to ensure that
the used functions are correct.
I verified that the module builds, loads and passes the self test in the x86,
and x64 architectures. I expects no problems in other platforms, but at now
they are not really tested. I only simulated similar conditions.

In the lib/raid/test directory are present also some user mode test programs:
selftest - Runs the same selftest executed at the module startup.
fulltest - Runs a more extensive test that checks all the built-in functions.
speetest - Runs a speed test in memory.

As a reference, in my icore7 2.7GHz the speedtest program reports:

...
Speed test using 16 data buffers of 4096 bytes, for a total of 64 KiB.
Memory blocks have a displacement of 64 bytes to improve cache performance.
The reported value is the aggregate bandwidth of all data blocks in MiB/s,
not counting parity blocks.

Memory write speed using the C memset() function:
  memset   33518

RAID functions used for computing the parity:
            int8   int32   int64    sse2   sse2e   ssse3  ssse3e
    par1           11762   21450   44621
    par2            3520    6176   18100   20338
    par3     848                                    8009    9210
    par4     659                                    6518    7303
    par5     531                                    4931    5363
    par6     430                                    4069    4471

RAID functions used for recovering:
            int8   ssse3
    rec1     591    1126
    rec2     272     456
    rec3      80     305
    rec4      49     216
    rec5      34     151
...

Legend:
parX functions to generate X parities
recX functions to recover X data blocks
int8 implemention based on 8 bits arithmetics
int32 implemention based on 32 bits arithmetics
int64 implemention based on 64 bits arithmetics
sse2 implemention based on SSE2
sse2e implemention based on SSE2 with 16 registers (x64)
ssse3 implemention based on SSSE3
ssse3e implemention based on SSSE3 with 16 registers (x64)

Signed-off-by: Andrea Mazzoleni <amadvance@...il.com>
---
This a bunch of new code, and some context could help to understand what's
going on.

This is a port of the RAID engine that I'm currently using in my hobby project
called SnapRAID. This project was initially based on the Linux RAID6 library,
and then limited to support only two parities like the Linux kernel itself.

How to extend the RAID6 logic to support more parities was not so obvious,
and it was also a recurring argument in the linux-raid mailing list in the
past years. Unfortunately, with no general solution ever proposed.

It took some time, but finally I found a way to do it keeping backward
compatibility with RAID6, using a specially built Cauchy matrix.

After implementing this support for my hobby project I got this solution also
discussed in the linux-raid/linux-btrfs mailing list in November in the thread
"Triple parity and beyond":

http://thread.gmane.org/gmane.comp.file-systems.btrfs/30159

This patch is the implementation of such discussion, done porting my existing
code to the kernel environment.

The code compiles without warnings with gcc -Wall -Wextra, with the clang
analyzer, and it runs cleany with valgrind.
The checkpatch.pl robot still report some issues. The errors are all false
positives that I was not able to workaround. The remaining warnings are
intentionally not fixed because that would make the code more difficult to read,
or are others false positive.

Please let me know what do you think. Any kind of feedback is welcome.

Thanks,
Andrea

 include/linux/raid/raid.h |   86 +++
 lib/Kconfig               |   12 +
 lib/Makefile              |    1 +
 lib/raid/Makefile         |   14 +
 lib/raid/cpu.h            |   44 ++
 lib/raid/gf.h             |  109 ++++
 lib/raid/int.c            |  567 ++++++++++++++++
 lib/raid/internal.h       |  146 +++++
 lib/raid/mktables.c       |  338 ++++++++++
 lib/raid/module.c         |  437 +++++++++++++
 lib/raid/raid.c           |  425 ++++++++++++
 lib/raid/sort.c           |   72 +++
 lib/raid/test/Makefile    |   33 +
 lib/raid/test/combo.h     |  155 +++++
 lib/raid/test/fulltest.c  |   76 +++
 lib/raid/test/memory.c    |   79 +++
 lib/raid/test/memory.h    |   74 +++
 lib/raid/test/selftest.c  |   41 ++
 lib/raid/test/speedtest.c |  567 ++++++++++++++++
 lib/raid/test/test.c      |  314 +++++++++
 lib/raid/test/test.h      |   59 ++
 lib/raid/test/usermode.h  |   83 +++
 lib/raid/test/xor.c       |   41 ++
 lib/raid/x86.c            | 1569 +++++++++++++++++++++++++++++++++++++++++++++
 24 files changed, 5342 insertions(+)
 create mode 100644 include/linux/raid/raid.h
 create mode 100644 lib/raid/Makefile
 create mode 100644 lib/raid/cpu.h
 create mode 100644 lib/raid/gf.h
 create mode 100644 lib/raid/int.c
 create mode 100644 lib/raid/internal.h
 create mode 100644 lib/raid/mktables.c
 create mode 100644 lib/raid/module.c
 create mode 100644 lib/raid/raid.c
 create mode 100644 lib/raid/sort.c
 create mode 100644 lib/raid/test/Makefile
 create mode 100644 lib/raid/test/combo.h
 create mode 100644 lib/raid/test/fulltest.c
 create mode 100644 lib/raid/test/memory.c
 create mode 100644 lib/raid/test/memory.h
 create mode 100644 lib/raid/test/selftest.c
 create mode 100644 lib/raid/test/speedtest.c
 create mode 100644 lib/raid/test/test.c
 create mode 100644 lib/raid/test/test.h
 create mode 100644 lib/raid/test/usermode.h
 create mode 100644 lib/raid/test/xor.c
 create mode 100644 lib/raid/x86.c

diff --git a/include/linux/raid/raid.h b/include/linux/raid/raid.h
new file mode 100644
index 0000000..9b5e441
--- /dev/null
+++ b/include/linux/raid/raid.h
@@ -0,0 +1,86 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __RAID_H
+#define __RAID_H
+
+#ifdef __KERNEL__ /* to build the user mode test */
+#include <linux/types.h>
+#endif
+
+/**
+ * Max number of parity disks supported.
+ */
+#define RAID_PARITY_MAX 6
+
+/**
+ * Maximum number of data disks supported.
+ */
+#define RAID_DATA_MAX 251
+
+/**
+ * Computes the parity.
+ *
+ * @nd Number of data blocks.
+ * @np Number of parities blocks to compute.
+ * @size Size of the blocks pointed by @v. It must be a multipler of 64.
+ * @v Vector of pointers to the blocks of data and parity.
+ *   It has (@nd + @np) elements. The starting elements are the blocks for
+ *   data, following with the parity blocks.
+ *   Each blocks has @size bytes.
+ */
+void raid_par(int nd, int np, size_t size, void **v);
+
+/**
+ * Recovers failures in data and parity blocks.
+ *
+ * All the data and parity blocks marked as bad in the @id and @ip vector
+ * are recovered and recomputed.
+ *
+ * The parities blocks to use for recovering are automatically selected from
+ * the ones NOT present in the @ip vector.
+ *
+ * Ensure to have @nrd + @nrp <= @np, otherwise recovering is not possible.
+ *
+ * @nrd Number of failed data blocks to recover.
+ * @id[] Vector of @nrd indexes of the data blocks to recover.
+ *   The indexes start from 0. They must be in order.
+ * @nrp Number of failed parity blocks to recover.
+ * @ip[] Vector of @nrp indexes of the parity blocks to recover.
+ *   The indexes start from 0. They must be in order.
+ *   All the parities not specified here are assumed correct, and they are
+ *   not recomputed.
+ * @nd Number of data blocks.
+ * @np Number of parity blocks.
+ * @size Size of the blocks pointed by @v. It must be a multipler of 64.
+ * @v Vector of pointers to the blocks of data and parity.
+ *   It has (@nd + @np) elements. The starting elements are the blocks
+ *   for data, following with the parity blocks.
+ *   Each blocks has @size bytes.
+ */
+void raid_rec(int nrd, int *id, int nrp, int *ip, int nd, int np, size_t size, void **v);
+
+/**
+ * Sorts a small vector of integers.
+ *
+ * If you have block indexes not in order, you can use this function to sort
+ * them before callign raid_rec().
+ *
+ * @n Number of integers. No more than RAID_PARITY_MAX.
+ * @v Vector of integers.
+ */
+void raid_sort(int n, int *v);
+
+#endif
+
diff --git a/lib/Kconfig b/lib/Kconfig
index 991c98b..a77ffbe 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -10,6 +10,18 @@ menu "Library routines"
 config RAID6_PQ
 	tristate
 
+config RAID_CAUCHY
+	tristate "RAID Cauchy functions"
+	help
+	  This option enables the RAID parity library based on a Cauchy matrix
+	  that supports up to six parities, and it's compatible with the
+	  existing RAID6 support.
+	  This library provides optimized functions for architectures with
+	  SSSE3 support.
+	  If the RAID6 module is enabled, it's automatically used to
+	  maintain the same performance level in all the architectures.
+	  Module will be called raid_cauchy.
+
 config BITREVERSE
 	tristate
 
diff --git a/lib/Makefile b/lib/Makefile
index a459c31..8b76716 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -79,6 +79,7 @@ obj-$(CONFIG_LZ4HC_COMPRESS) += lz4/
 obj-$(CONFIG_LZ4_DECOMPRESS) += lz4/
 obj-$(CONFIG_XZ_DEC) += xz/
 obj-$(CONFIG_RAID6_PQ) += raid6/
+obj-$(CONFIG_RAID_CAUCHY) += raid/
 
 lib-$(CONFIG_DECOMPRESS_GZIP) += decompress_inflate.o
 lib-$(CONFIG_DECOMPRESS_BZIP2) += decompress_bunzip2.o
diff --git a/lib/raid/Makefile b/lib/raid/Makefile
new file mode 100644
index 0000000..b8c1df5
--- /dev/null
+++ b/lib/raid/Makefile
@@ -0,0 +1,14 @@
+obj-$(CONFIG_RAID_CAUCHY) += raid_cauchy.o
+
+raid_cauchy-y	+= module.o raid.o tables.o int.o
+
+raid_cauchy-$(CONFIG_X86) += x86.o
+
+hostprogs-y	+= mktables
+
+quiet_cmd_mktable = TABLE   $@
+      cmd_mktable = $(obj)/mktables > $@ || ( rm -f $@ && exit 1 )
+
+targets += tables.c
+$(obj)/tables.c: $(obj)/mktables FORCE
+	$(call if_changed,mktable)
diff --git a/lib/raid/cpu.h b/lib/raid/cpu.h
new file mode 100644
index 0000000..3202fa9
--- /dev/null
+++ b/lib/raid/cpu.h
@@ -0,0 +1,44 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __RAID_CPU_H
+#define __RAID_CPU_H
+
+#ifdef CONFIG_X86
+static inline int raid_cpu_has_sse2(void)
+{
+	return boot_cpu_has(X86_FEATURE_XMM2);
+}
+
+static inline int raid_cpu_has_ssse3(void)
+{
+	/* checks also for SSE2 */
+	/* likely it's implicit but it doesn't harm */
+	return boot_cpu_has(X86_FEATURE_XMM2)
+		&& boot_cpu_has(X86_FEATURE_SSSE3);
+}
+
+static inline int raid_cpu_has_avx2(void)
+{
+	/* checks also for SSE2 and SSSE3 */
+	/* likely it's implicit but it doesn't harm */
+	return boot_cpu_has(X86_FEATURE_XMM2)
+		&& boot_cpu_has(X86_FEATURE_SSSE3)
+		&& boot_cpu_has(X86_FEATURE_AVX)
+		&& boot_cpu_has(X86_FEATURE_AVX2);
+}
+#endif
+
+#endif
+
diff --git a/lib/raid/gf.h b/lib/raid/gf.h
new file mode 100644
index 0000000..f444e63
--- /dev/null
+++ b/lib/raid/gf.h
@@ -0,0 +1,109 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __RAID_GF_H
+#define __RAID_GF_H
+
+/*
+ * Galois field operations.
+ *
+ * Basic range checks are implemented using BUG_ON().
+ */
+
+/*
+ * GF a*b.
+ */
+static __always_inline uint8_t mul(uint8_t a, uint8_t b)
+{
+	return gfmul[a][b];
+}
+
+/*
+ * GF 1/a.
+ * Not defined for a == 0.
+ */
+static __always_inline uint8_t inv(uint8_t v)
+{
+	BUG_ON(v == 0); /* division by zero */
+
+	return gfinv[v];
+}
+
+/*
+ * GF 2^a.
+ */
+static __always_inline uint8_t pow2(int v)
+{
+	BUG_ON(v < 0 || v > 254); /* invalid exponent */
+
+	return gfexp[v];
+}
+
+/*
+ * Gets the multiplication table for a specified value.
+ */
+static __always_inline const uint8_t *table(uint8_t v)
+{
+	return gfmul[v];
+}
+
+/*
+ * Gets the generator matrix coefficient for parity 'p' and disk 'd'.
+ */
+static __always_inline uint8_t A(int p, int d)
+{
+	return gfgen[p][d];
+}
+
+/*
+ * Dereference as uint8_t
+ */
+#define v_8(p) (*(uint8_t *)&(p))
+
+/*
+ * Dereference as uint32_t
+ */
+#define v_32(p) (*(uint32_t *)&(p))
+
+/*
+ * Dereference as uint64_t
+ */
+#define v_64(p) (*(uint64_t *)&(p))
+
+/*
+ * Multiply each byte of a uint32 by 2 in the GF(2^8).
+ */
+static __always_inline uint32_t x2_32(uint32_t v)
+{
+	uint32_t mask = v & 0x80808080U;
+	mask = (mask << 1) - (mask >> 7);
+	v = (v << 1) & 0xfefefefeU;
+	v ^= mask & 0x1d1d1d1dU;
+	return v;
+}
+
+/*
+ * Multiply each byte of a uint64 by 2 in the GF(2^8).
+ */
+static __always_inline uint64_t x2_64(uint64_t v)
+{
+	uint64_t mask = v & 0x8080808080808080ULL;
+	mask = (mask << 1) - (mask >> 7);
+	v = (v << 1) & 0xfefefefefefefefeULL;
+	v ^= mask & 0x1d1d1d1d1d1d1d1dULL;
+	return v;
+}
+
+#endif
+
diff --git a/lib/raid/int.c b/lib/raid/int.c
new file mode 100644
index 0000000..cd1e147
--- /dev/null
+++ b/lib/raid/int.c
@@ -0,0 +1,567 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "internal.h"
+#include "gf.h"
+
+/*
+ * PAR1 (RAID5 with xor) 32bit C implementation
+ */
+void raid_par1_int32(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	int d, l;
+	size_t i;
+
+	uint32_t p0;
+	uint32_t p1;
+
+	l = nd - 1;
+	p = v[nd];
+
+	for (i = 0; i < size; i += 8) {
+		p0 = v_32(v[l][i]);
+		p1 = v_32(v[l][i+4]);
+		for (d = l-1; d >= 0; --d) {
+			p0 ^= v_32(v[d][i]);
+			p1 ^= v_32(v[d][i+4]);
+		}
+		v_32(p[i]) = p0;
+		v_32(p[i+4]) = p1;
+	}
+}
+
+/*
+ * PAR1 (RAID5 with xor) 64bit C implementation
+ */
+void raid_par1_int64(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	int d, l;
+	size_t i;
+
+	uint64_t p0;
+	uint64_t p1;
+
+	l = nd - 1;
+	p = v[nd];
+
+	for (i = 0; i < size; i += 16) {
+		p0 = v_64(v[l][i]);
+		p1 = v_64(v[l][i+8]);
+		for (d = l-1; d >= 0; --d) {
+			p0 ^= v_64(v[d][i]);
+			p1 ^= v_64(v[d][i+8]);
+		}
+		v_64(p[i]) = p0;
+		v_64(p[i+8]) = p1;
+	}
+}
+
+/*
+ * PAR2 (RAID6 with powers of 2) 32bit C implementation
+ */
+void raid_par2_int32(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	int d, l;
+	size_t i;
+
+	uint32_t d0, q0, p0;
+	uint32_t d1, q1, p1;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+
+	for (i = 0; i < size; i += 8) {
+		q0 = p0 = v_32(v[l][i]);
+		q1 = p1 = v_32(v[l][i+4]);
+		for (d = l-1; d >= 0; --d) {
+			d0 = v_32(v[d][i]);
+			d1 = v_32(v[d][i+4]);
+
+			p0 ^= d0;
+			p1 ^= d1;
+
+			q0 = x2_32(q0);
+			q1 = x2_32(q1);
+
+			q0 ^= d0;
+			q1 ^= d1;
+		}
+		v_32(p[i]) = p0;
+		v_32(p[i+4]) = p1;
+		v_32(q[i]) = q0;
+		v_32(q[i+4]) = q1;
+	}
+}
+
+/*
+ * PAR2 (RAID6 with powers of 2) 64bit C implementation
+ */
+void raid_par2_int64(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	int d, l;
+	size_t i;
+
+	uint64_t d0, q0, p0;
+	uint64_t d1, q1, p1;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+
+	for (i = 0; i < size; i += 16) {
+		q0 = p0 = v_64(v[l][i]);
+		q1 = p1 = v_64(v[l][i+8]);
+		for (d = l-1; d >= 0; --d) {
+			d0 = v_64(v[d][i]);
+			d1 = v_64(v[d][i+8]);
+
+			p0 ^= d0;
+			p1 ^= d1;
+
+			q0 = x2_64(q0);
+			q1 = x2_64(q1);
+
+			q0 ^= d0;
+			q1 ^= d1;
+		}
+		v_64(p[i]) = p0;
+		v_64(p[i+8]) = p1;
+		v_64(q[i]) = q0;
+		v_64(q[i+8]) = q1;
+	}
+}
+
+/*
+ * PAR3 (triple parity with Cauchy matrix) 8bit C implementation
+ *
+ * Note that instead of a generic multiplication table, likely resulting
+ * in multiple cache misses, a precomputed table could be used.
+ * But this is only a kind of reference function, and we are not really
+ * interested in speed.
+ */
+void raid_par3_int8(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	int d, l;
+	size_t i;
+
+	uint8_t d0, r0, q0, p0;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+
+	for (i = 0; i < size; i += 1) {
+		p0 = q0 = r0 = 0;
+		for (d = l; d > 0; --d) {
+			d0 = v_8(v[d][i]);
+
+			p0 ^= d0;
+			q0 ^= gfmul[d0][gfgen[1][d]];
+			r0 ^= gfmul[d0][gfgen[2][d]];
+		}
+
+		/* first disk with all coefficients at 1 */
+		d0 = v_8(v[0][i]);
+
+		p0 ^= d0;
+		q0 ^= d0;
+		r0 ^= d0;
+
+		v_8(p[i]) = p0;
+		v_8(q[i]) = q0;
+		v_8(r[i]) = r0;
+	}
+}
+
+/*
+ * PAR4 (quad parity with Cauchy matrix) 8bit C implementation
+ *
+ * Note that instead of a generic multiplication table, likely resulting
+ * in multiple cache misses, a precomputed table could be used.
+ * But this is only a kind of reference function, and we are not really
+ * interested in speed.
+ */
+void raid_par4_int8(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	uint8_t *s;
+	int d, l;
+	size_t i;
+
+	uint8_t d0, s0, r0, q0, p0;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+	s = v[nd+3];
+
+	for (i = 0; i < size; i += 1) {
+		p0 = q0 = r0 = s0 = 0;
+		for (d = l; d > 0; --d) {
+			d0 = v_8(v[d][i]);
+
+			p0 ^= d0;
+			q0 ^= gfmul[d0][gfgen[1][d]];
+			r0 ^= gfmul[d0][gfgen[2][d]];
+			s0 ^= gfmul[d0][gfgen[3][d]];
+		}
+
+		/* first disk with all coefficients at 1 */
+		d0 = v_8(v[0][i]);
+
+		p0 ^= d0;
+		q0 ^= d0;
+		r0 ^= d0;
+		s0 ^= d0;
+
+		v_8(p[i]) = p0;
+		v_8(q[i]) = q0;
+		v_8(r[i]) = r0;
+		v_8(s[i]) = s0;
+	}
+}
+
+/*
+ * PAR5 (penta parity with Cauchy matrix) 8bit C implementation
+ *
+ * Note that instead of a generic multiplication table, likely resulting
+ * in multiple cache misses, a precomputed table could be used.
+ * But this is only a kind of reference function, and we are not really
+ * interested in speed.
+ */
+void raid_par5_int8(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	uint8_t *s;
+	uint8_t *t;
+	int d, l;
+	size_t i;
+
+	uint8_t d0, t0, s0, r0, q0, p0;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+	s = v[nd+3];
+	t = v[nd+4];
+
+	for (i = 0; i < size; i += 1) {
+		p0 = q0 = r0 = s0 = t0 = 0;
+		for (d = l; d > 0; --d) {
+			d0 = v_8(v[d][i]);
+
+			p0 ^= d0;
+			q0 ^= gfmul[d0][gfgen[1][d]];
+			r0 ^= gfmul[d0][gfgen[2][d]];
+			s0 ^= gfmul[d0][gfgen[3][d]];
+			t0 ^= gfmul[d0][gfgen[4][d]];
+		}
+
+		/* first disk with all coefficients at 1 */
+		d0 = v_8(v[0][i]);
+
+		p0 ^= d0;
+		q0 ^= d0;
+		r0 ^= d0;
+		s0 ^= d0;
+		t0 ^= d0;
+
+		v_8(p[i]) = p0;
+		v_8(q[i]) = q0;
+		v_8(r[i]) = r0;
+		v_8(s[i]) = s0;
+		v_8(t[i]) = t0;
+	}
+}
+
+/*
+ * PAR6 (hexa parity with Cauchy matrix) 8bit C implementation
+ *
+ * Note that instead of a generic multiplication table, likely resulting
+ * in multiple cache misses, a precomputed table could be used.
+ * But this is only a kind of reference function, and we are not really
+ * interested in speed.
+ */
+void raid_par6_int8(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	uint8_t *s;
+	uint8_t *t;
+	uint8_t *u;
+	int d, l;
+	size_t i;
+
+	uint8_t d0, u0, t0, s0, r0, q0, p0;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+	s = v[nd+3];
+	t = v[nd+4];
+	u = v[nd+5];
+
+	for (i = 0; i < size; i += 1) {
+		p0 = q0 = r0 = s0 = t0 = u0 = 0;
+		for (d = l; d > 0; --d) {
+			d0 = v_8(v[d][i]);
+
+			p0 ^= d0;
+			q0 ^= gfmul[d0][gfgen[1][d]];
+			r0 ^= gfmul[d0][gfgen[2][d]];
+			s0 ^= gfmul[d0][gfgen[3][d]];
+			t0 ^= gfmul[d0][gfgen[4][d]];
+			u0 ^= gfmul[d0][gfgen[5][d]];
+		}
+
+		/* first disk with all coefficients at 1 */
+		d0 = v_8(v[0][i]);
+
+		p0 ^= d0;
+		q0 ^= d0;
+		r0 ^= d0;
+		s0 ^= d0;
+		t0 ^= d0;
+		u0 ^= d0;
+
+		v_8(p[i]) = p0;
+		v_8(q[i]) = q0;
+		v_8(r[i]) = r0;
+		v_8(s[i]) = s0;
+		v_8(t[i]) = t0;
+		v_8(u[i]) = u0;
+	}
+}
+
+/*
+ * Recover failure of one data block at index id[0] using parity at index
+ * ip[0] for any RAID level.
+ *
+ * Starting from the equation:
+ *
+ * Pd = A[ip[0],id[0]] * Dx
+ *
+ * and solving we get:
+ *
+ * Dx = A[ip[0],id[0]]^-1 * Pd
+ */
+void raid_rec1_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *pa;
+	const uint8_t *T;
+	uint8_t G;
+	uint8_t V;
+	size_t i;
+
+	(void)nr; /* unused, it's always 1 */
+
+	/* if it's RAID5 uses the faster function */
+	if (ip[0] == 0) {
+		raid_rec1_par1(id, nd, size, vv);
+		return;
+	}
+
+#ifdef RAID_USE_RAID6_PQ
+	/* if it's RAID6 recovering with Q uses the faster function */
+	if (ip[0] == 1) {
+		raid6_datap_recov(nd + 2, size, id[0], vv);
+		return;
+	}
+#endif
+
+	/* setup the coefficients matrix */
+	G = A(ip[0], id[0]);
+
+	/* invert it to solve the system of linear equations */
+	V = inv(G);
+
+	/* get multiplication tables */
+	T = table(V);
+
+	/* compute delta parity */
+	raid_delta_gen(1, id, ip, nd, size, vv);
+
+	p = v[nd+ip[0]];
+	pa = v[id[0]];
+
+	for (i = 0; i < size; ++i) {
+		/* delta */
+		uint8_t Pd = p[i] ^ pa[i];
+
+		/* reconstruct */
+		pa[i] = T[Pd];
+	}
+}
+
+/*
+ * Recover failure of two data blocks at indexes id[0],id[1] using parity at
+ * indexes ip[0],ip[1] for any RAID level.
+ *
+ * Starting from the equations:
+ *
+ * Pd = A[ip[0],id[0]] * Dx + A[ip[0],id[1]] * Dy
+ * Qd = A[ip[1],id[0]] * Dx + A[ip[1],id[1]] * Dy
+ *
+ * we solve inverting the coefficients matrix.
+ */
+void raid_rec2_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *pa;
+	uint8_t *q;
+	uint8_t *qa;
+	const int N = 2;
+	const uint8_t *T[N][N];
+	uint8_t G[N*N];
+	uint8_t V[N*N];
+	size_t i;
+	int j, k;
+
+	(void)nr; /* unused, it's always 2 */
+
+	/* if it's RAID6 recovering with P and Q uses the faster function */
+	if (ip[0] == 0 && ip[1] == 1) {
+#ifdef RAID_USE_RAID6_PQ
+		raid6_2data_recov(nd + 2, size, id[0], id[1], vv);
+#else
+		raid_rec2_par2(id, ip, nd, size, vv);
+#endif
+		return;
+	}
+
+	/* setup the coefficients matrix */
+	for (j = 0; j < N; ++j)
+		for (k = 0; k < N; ++k)
+			G[j*N+k] = A(ip[j], id[k]);
+
+	/* invert it to solve the system of linear equations */
+	raid_invert(G, V, N);
+
+	/* get multiplication tables */
+	for (j = 0; j < N; ++j)
+		for (k = 0; k < N; ++k)
+			T[j][k] = table(V[j*N+k]);
+
+	/* compute delta parity */
+	raid_delta_gen(2, id, ip, nd, size, vv);
+
+	p = v[nd+ip[0]];
+	q = v[nd+ip[1]];
+	pa = v[id[0]];
+	qa = v[id[1]];
+
+	for (i = 0; i < size; ++i) {
+		/* delta */
+		uint8_t Pd = p[i] ^ pa[i];
+		uint8_t Qd = q[i] ^ qa[i];
+
+		/* reconstruct */
+		pa[i] = T[0][0][Pd] ^ T[0][1][Qd];
+		qa[i] = T[1][0][Pd] ^ T[1][1][Qd];
+	}
+}
+
+/*
+ * Recover failure of N data blocks at indexes id[N] using parity at indexes
+ * ip[N] for any RAID level.
+ *
+ * Starting from the N equations, with 0<=i<N :
+ *
+ * PD[i] = sum(A[ip[i],id[j]] * D[i]) 0<=j<N
+ *
+ * we solve inverting the coefficients matrix.
+ *
+ * Note that referring at previous equations you have:
+ * PD[0] = Pd, PD[1] = Qd, PD[2] = Rd, ...
+ * D[0] = Dx, D[1] = Dy, D[2] = Dz, ...
+ */
+void raid_recX_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p[RAID_PARITY_MAX];
+	uint8_t *pa[RAID_PARITY_MAX];
+	const uint8_t *T[RAID_PARITY_MAX][RAID_PARITY_MAX];
+	uint8_t G[RAID_PARITY_MAX*RAID_PARITY_MAX];
+	uint8_t V[RAID_PARITY_MAX*RAID_PARITY_MAX];
+	size_t i;
+	int j, k;
+
+	/* setup the coefficients matrix */
+	for (j = 0; j < nr; ++j)
+		for (k = 0; k < nr; ++k)
+			G[j*nr+k] = A(ip[j], id[k]);
+
+	/* invert it to solve the system of linear equations */
+	raid_invert(G, V, nr);
+
+	/* get multiplication tables */
+	for (j = 0; j < nr; ++j)
+		for (k = 0; k < nr; ++k)
+			T[j][k] = table(V[j*nr+k]);
+
+	/* compute delta parity */
+	raid_delta_gen(nr, id, ip, nd, size, vv);
+
+	for (j = 0; j < nr; ++j) {
+		p[j] = v[nd+ip[j]];
+		pa[j] = v[id[j]];
+	}
+
+	for (i = 0; i < size; ++i) {
+		uint8_t PD[RAID_PARITY_MAX];
+
+		/* delta */
+		for (j = 0; j < nr; ++j)
+			PD[j] = p[j][i] ^ pa[j][i];
+
+		/* reconstruct */
+		for (j = 0; j < nr; ++j) {
+			uint8_t b = 0;
+			for (k = 0; k < nr; ++k)
+				b ^= T[j][k][PD[k]];
+			pa[j][i] = b;
+		}
+	}
+}
+
diff --git a/lib/raid/internal.h b/lib/raid/internal.h
new file mode 100644
index 0000000..7d84458
--- /dev/null
+++ b/lib/raid/internal.h
@@ -0,0 +1,146 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __RAID_INTERNAL_H
+#define __RAID_INTERNAL_H
+
+/*
+ * Includes anything required for compatibility.
+ */
+#ifdef __KERNEL__ /* to build the user mode test */
+
+#include <linux/module.h>
+#include <linux/kconfig.h> /* for IS_* macros */
+#include <linux/export.h> /* for EXPORT_SYMBOL/EXPORT_SYMBOL_GPL */
+#include <linux/bug.h> /* for BUG_ON */
+#include <linux/gfp.h> /* for __get_free_pages */
+#include <linux/raid/xor.h> /* for xor_blocks */
+
+#ifdef CONFIG_X86
+#include <asm/i387.h> /* for kernel_fpu_begin/end() */
+#endif
+
+/* if we can use the RAID6 library */
+#if IS_BUILTIN(CONFIG_RAID6_PQ) \
+	|| (IS_MODULE(CONFIG_RAID6_PQ) && IS_MODULE(CONFIG_RAID6_CAUCHY))
+#define RAID_USE_RAID6_PQ 1
+#include <linux/raid/pq.h> /* for tables/functions */
+#endif
+
+#else
+
+#include "test/usermode.h"
+
+#endif
+
+/*
+ * Include the main header.
+ */
+#include <linux/raid/raid.h>
+
+/*
+ * Internal functions.
+ *
+ * These are intented to provide access for testing.
+ */
+void raid_init(void);
+int raid_selftest(void);
+int raid_speedtest(void);
+void raid_par_ref(int nd, int np, size_t size, void **vv);
+void raid_invert(uint8_t *M, uint8_t *V, int n);
+void raid_delta_gen(int nr, int *id, int *ip, int nd, size_t size, void **v);
+void raid_rec1_par1(int *id, int nd, size_t size, void **v);
+void raid_rec2_par2(int *id, int *ip, int nd, size_t size, void **vv);
+void raid_par1_xorblocks(int nd, size_t size, void **v);
+void raid_par1_int32(int nd, size_t size, void **vv);
+void raid_par1_int64(int nd, size_t size, void **vv);
+void raid_par1_sse2(int nd, size_t size, void **vv);
+void raid_par2_raid6(int nd, size_t size, void **vv);
+void raid_par2_int32(int nd, size_t size, void **vv);
+void raid_par2_int64(int nd, size_t size, void **vv);
+void raid_par2_sse2(int nd, size_t size, void **vv);
+void raid_par2_sse2ext(int nd, size_t size, void **vv);
+void raid_par3_int8(int nd, size_t size, void **vv);
+void raid_par3_ssse3(int nd, size_t size, void **vv);
+void raid_par3_ssse3ext(int nd, size_t size, void **vv);
+void raid_par4_int8(int nd, size_t size, void **vv);
+void raid_par4_ssse3(int nd, size_t size, void **vv);
+void raid_par4_ssse3ext(int nd, size_t size, void **vv);
+void raid_par5_int8(int nd, size_t size, void **vv);
+void raid_par5_ssse3(int nd, size_t size, void **vv);
+void raid_par5_ssse3ext(int nd, size_t size, void **vv);
+void raid_par6_int8(int nd, size_t size, void **vv);
+void raid_par6_ssse3(int nd, size_t size, void **vv);
+void raid_par6_ssse3ext(int nd, size_t size, void **vv);
+void raid_rec1_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv);
+void raid_rec2_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv);
+void raid_recX_int8(int nr, int *id, int *ip, int nd, size_t size, void **vv);
+void raid_rec1_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv);
+void raid_rec2_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv);
+void raid_recX_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv);
+
+/*
+ * Internal forwarders.
+ */
+extern void (*raid_par_ptr[RAID_PARITY_MAX])(
+	int nd, size_t size, void **vv);
+extern void (*raid_rec_ptr[RAID_PARITY_MAX])(
+	int nr, int *id, int *ip, int nd, size_t size, void **vv);
+
+/*
+ * Tables.
+ *
+ * Uses RAID6 tables if available, otherwise the ones in tables.c.
+ */
+#ifdef RAID_USE_RAID6_PQ
+#define gfmul raid6_gfmul
+#define gfinv raid6_gfinv
+#define gfexp raid6_gfexp
+#else
+extern const uint8_t raid_gfmul[256][256] __aligned(256);
+extern const uint8_t raid_gfexp[256] __aligned(256);
+extern const uint8_t raid_gfinv[256] __aligned(256);
+#define gfmul raid_gfmul
+#define gfexp raid_gfexp
+#define gfinv raid_gfinv
+#endif
+
+extern const uint8_t raid_gfcauchy[6][256] __aligned(256);
+extern const uint8_t raid_gfcauchypshufb[251][4][2][16] __aligned(256);
+extern const uint8_t raid_gfmulpshufb[256][2][16] __aligned(256);
+#define gfgen raid_gfcauchy
+#define gfgenpshufb raid_gfcauchypshufb
+#define gfmulpshufb raid_gfmulpshufb
+
+/*
+ * Assembler blocks.
+ */
+#ifdef __KERNEL__ /* to build the user mode test */
+#define asm_begin() \
+	kernel_fpu_begin()
+
+#define asm_end() \
+	do { \
+		asm volatile("sfence" : : : "memory"); \
+		kernel_fpu_end(); \
+	} while (0)
+#else
+#define asm_begin() \
+	do { } while (0)
+#define asm_end() \
+	asm volatile("sfence" : : : "memory")
+#endif
+
+#endif
+
diff --git a/lib/raid/mktables.c b/lib/raid/mktables.c
new file mode 100644
index 0000000..9c8e0e0
--- /dev/null
+++ b/lib/raid/mktables.c
@@ -0,0 +1,338 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+
+/**
+ * Multiplication in GF(2^8).
+ */
+static uint8_t gfmul(uint8_t a, uint8_t b)
+{
+	uint8_t v;
+
+	v = 0;
+	while (b)  {
+		if ((b & 1) != 0)
+			v ^= a;
+
+		if ((a & 0x80) != 0) {
+			a <<= 1;
+			a ^= 0x1d;
+		} else {
+			a <<= 1;
+		}
+
+		b >>= 1;
+	}
+
+	return v;
+}
+
+/**
+ * Inversion table in GF(2^8).
+ */
+uint8_t gfinv[256];
+
+/**
+ * Number of parity.
+ * This is the number of rows of the generation matrix.
+ */
+#define PARITY 6
+
+/**
+ * Number of disks.
+ * This is the number of columns of the generation matrix.
+ */
+#define DISK (257-PARITY)
+
+/**
+ * Setup the Cauchy matrix used to generate the parity.
+ */
+static void set_cauchy(uint8_t *matrix)
+{
+	int i, j;
+	uint8_t inv_x, y;
+
+	/*
+	 * First row is formed by all 1.
+	 *
+	 * This is an Extended Cauchy matrix built from a Cauchy matrix
+	 * adding the first row of all 1.
+	 */
+	for (i = 0; i < DISK; ++i)
+		matrix[0*DISK+i] = 1;
+
+	/*
+	 * Second row is formed by power of 2^i.
+	 *
+	 * This is the first row of the Cauchy matrix.
+	 *
+	 * Each element of the Cauchy matrix is in the form 1/(xi+yj)
+	 * where all xi, and yj must be different.
+	 *
+	 * Choosing xi = 2^-i and y0 = 0, we obtain for the first row:
+	 *
+	 * 1/(xi+y0) = 1/(2^-i + 0) = 2^i
+	 *
+	 * with 2^-i != 0 for any i
+	 */
+	inv_x = 1;
+	for (i = 0; i < DISK; ++i) {
+		matrix[1*DISK+i] = inv_x;
+		inv_x = gfmul(2, inv_x);
+	}
+
+	/*
+	 * Next rows of the Cauchy matrix.
+	 *
+	 * Continue forming the Cauchy matrix with yj = 2^j obtaining :
+	 *
+	 * 1/(xi+yj) = 1/(2^-i + 2^j)
+	 *
+	 * with xi != yj for any i,j with i>=0,j>=1,i+j<255
+	 */
+	y = 2;
+	for (j = 0; j < PARITY-2; ++j) {
+		inv_x = 1;
+		for (i = 0; i < DISK; ++i) {
+			uint8_t x = gfinv[inv_x];
+			matrix[(j+2)*DISK+i] = gfinv[y ^ x];
+			inv_x = gfmul(2, inv_x);
+		}
+
+		y = gfmul(2, y);
+	}
+
+	/*
+	 * Adjusts the matrix multipling each row for
+	 * the inverse of the first element in the row.
+	 *
+	 * This operation doesn't invalidate the property that all the square
+	 * submatrices are not singular.
+	 */
+	for (j = 0; j < PARITY-2; ++j) {
+		uint8_t f = gfinv[matrix[(j+2)*DISK]];
+
+		for (i = 0; i < DISK; ++i)
+			matrix[(j+2)*DISK+i] = gfmul(matrix[(j+2)*DISK+i], f);
+	}
+}
+
+/**
+ * Next power of 2.
+ */
+static unsigned np(unsigned v)
+{
+	--v;
+	v |= v >> 1;
+	v |= v >> 2;
+	v |= v >> 4;
+	v |= v >> 8;
+	v |= v >> 16;
+	++v;
+
+	return v;
+}
+
+int main(void)
+{
+	uint8_t v;
+	int i, j, k, p;
+	uint8_t matrix[PARITY * 256];
+
+	printf("/*\n");
+	printf(" * Copyright (C) 2013 Andrea Mazzoleni\n");
+	printf(" *\n");
+	printf(" * This program is free software: you can redistribute it and/or modify\n");
+	printf(" * it under the terms of the GNU General Public License as published by\n");
+	printf(" * the Free Software Foundation, either version 2 of the License, or\n");
+	printf(" * (at your option) any later version.\n");
+	printf(" *\n");
+	printf(" * This program is distributed in the hope that it will be useful,\n");
+	printf(" * but WITHOUT ANY WARRANTY; without even the implied warranty of\n");
+	printf(" * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n");
+	printf(" * GNU General Public License for more details.\n");
+	printf(" */\n");
+	printf("\n");
+
+	printf("#include \"internal.h\"\n");
+	printf("\n");
+
+	/* a*b */
+	printf("#ifndef RAID_USE_RAID6_PQ\n");
+	printf("const uint8_t __aligned(256) raid_gfmul[256][256] =\n");
+	printf("{\n");
+	for (i = 0; i < 256; ++i) {
+		printf("\t{\n");
+		for (j = 0; j < 256; ++j) {
+			if (j % 8 == 0)
+				printf("\t\t");
+			v = gfmul(i, j);
+			if (v == 1)
+				gfinv[i] = j;
+			printf("0x%02x,", (unsigned)v);
+			if (j % 8 == 7)
+				printf("\n");
+			else
+				printf(" ");
+		}
+		printf("\t},\n");
+	}
+	printf("};\n");
+	printf("EXPORT_SYMBOL(raid_gfmul);\n");
+	printf("#endif\n");
+	printf("\n");
+
+	/* 2^a */
+	printf("#ifndef RAID_USE_RAID6_PQ\n");
+	printf("const uint8_t __aligned(256) raid_gfexp[256] =\n");
+	printf("{\n");
+	v = 1;
+	for (i = 0; i < 256; ++i) {
+		if (i % 8 == 0)
+			printf("\t");
+		printf("0x%02x,", v);
+		v = gfmul(v, 2);
+		if (i % 8 == 7)
+			printf("\n");
+		else
+			printf(" ");
+	}
+	printf("};\n");
+	printf("EXPORT_SYMBOL(raid_gfexp);\n");
+	printf("#endif\n");
+	printf("\n");
+
+	/* 1/a */
+	printf("#ifndef RAID_USE_RAID6_PQ\n");
+	printf("const uint8_t __aligned(256) raid_gfinv[256] =\n");
+	printf("{\n");
+	printf("\t/* note that the first element is not significative */\n");
+	for (i = 0; i < 256; ++i) {
+		if (i % 8 == 0)
+			printf("\t");
+		if (i == 0)
+			v = 0;
+		else
+			v = gfinv[i];
+		printf("0x%02x,", v);
+		if (i % 8 == 7)
+			printf("\n");
+		else
+			printf(" ");
+	}
+	printf("};\n");
+	printf("EXPORT_SYMBOL(raid_gfinv);\n");
+	printf("#endif\n");
+	printf("\n");
+
+	/* cauchy matrix */
+	set_cauchy(matrix);
+
+	printf("/**\n");
+	printf(" * Cauchy matrix used to generate parity.\n");
+	printf(" * This matrix is valid for up to %u parity with %u data disks.\n", PARITY, DISK);
+	printf(" *\n");
+	for (p = 0; p < PARITY; ++p) {
+		printf(" * ");
+		for (i = 0; i < DISK; ++i)
+			printf("%02x ", matrix[p*DISK+i]);
+		printf("\n");
+	}
+	printf(" */\n");
+	printf("const uint8_t __aligned(256) raid_gfcauchy[%u][256] =\n", PARITY);
+	printf("{\n");
+	for (p = 0; p < PARITY; ++p) {
+		printf("\t{\n");
+		for (i = 0; i < DISK; ++i) {
+			if (i % 8 == 0)
+				printf("\t\t");
+			printf("0x%02x,", matrix[p*DISK+i]);
+			if (i % 8 == 7)
+				printf("\n");
+			else
+				printf(" ");
+		}
+		printf("\n\t},\n");
+	}
+	printf("};\n");
+	printf("EXPORT_SYMBOL(raid_gfcauchy);\n");
+	printf("\n");
+
+	printf("#ifdef CONFIG_X86\n");
+	printf("/**\n");
+	printf(" * PSHUFB tables for the Cauchy matrix.\n");
+	printf(" *\n");
+	printf(" * Indexes are [DISK][PARITY - 2][LH].\n");
+	printf(" * Where DISK is from 0 to %u, PARITY from 2 to %u, LH from 0 to 1.\n", DISK - 1, PARITY - 1);
+	printf(" */\n");
+	printf("const uint8_t __aligned(256) raid_gfcauchypshufb[%u][%u][2][16] =\n", DISK, np(PARITY - 2));
+	printf("{\n");
+	for (i = 0; i < DISK; ++i) {
+		printf("\t{\n");
+		for (p = 2; p < PARITY; ++p) {
+			printf("\t\t{\n");
+			for (j = 0; j < 2; ++j) {
+				printf("\t\t\t{ ");
+				for (k = 0; k < 16; ++k) {
+					v = gfmul(matrix[p*DISK+i], k);
+					if (j == 1)
+						v = gfmul(v, 16);
+					printf("0x%02x", (unsigned)v);
+					if (k != 15)
+						printf(", ");
+				}
+				printf(" },\n");
+			}
+			printf("\t\t},\n");
+		}
+		printf("\t},\n");
+	}
+	printf("};\n");
+	printf("EXPORT_SYMBOL(raid_gfcauchypshufb);\n");
+	printf("#endif\n\n");
+
+	printf("#ifdef CONFIG_X86\n");
+	printf("/**\n");
+	printf(" * PSHUFB tables for generic multiplication.\n");
+	printf(" *\n");
+	printf(" * Indexes are [MULTIPLER][LH].\n");
+	printf(" * Where MULTIPLER is from 0 to 255, LH from 0 to 1.\n");
+	printf(" */\n");
+	printf("const uint8_t __aligned(256) raid_gfmulpshufb[256][2][16] =\n");
+	printf("{\n");
+	for (i = 0; i < 256; ++i) {
+		printf("\t{\n");
+		for (j = 0; j < 2; ++j) {
+			printf("\t\t{ ");
+			for (k = 0; k < 16; ++k) {
+				v = gfmul(i, k);
+				if (j == 1)
+					v = gfmul(v, 16);
+				printf("0x%02x", (unsigned)v);
+				if (k != 15)
+					printf(", ");
+			}
+			printf(" },\n");
+		}
+		printf("\t},\n");
+	}
+	printf("};\n");
+	printf("EXPORT_SYMBOL(raid_gfmulpshufb);\n");
+	printf("#endif\n\n");
+
+	return 0;
+}
+
diff --git a/lib/raid/module.c b/lib/raid/module.c
new file mode 100644
index 0000000..3a4fba3
--- /dev/null
+++ b/lib/raid/module.c
@@ -0,0 +1,437 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "internal.h"
+#include "cpu.h"
+
+/*
+ * Initializes and selects the best algorithm.
+ */
+void raid_init(void)
+{
+	/* setup parity functions */
+	if (sizeof(void *) == 8) {
+		raid_par_ptr[0] = raid_par1_int64;
+		raid_par_ptr[1] = raid_par2_int64;
+	} else {
+		raid_par_ptr[0] = raid_par1_int32;
+		raid_par_ptr[1] = raid_par2_int32;
+	}
+	raid_par_ptr[2] = raid_par3_int8;
+	raid_par_ptr[3] = raid_par4_int8;
+	raid_par_ptr[4] = raid_par5_int8;
+	raid_par_ptr[5] = raid_par6_int8;
+
+	/* in kernel mode use the optimized xor_blocks() */
+#ifdef __KERNEL__
+	raid_par_ptr[0] = raid_par1_xorblocks;
+#endif
+	/* if RAID6 is present, use it */
+#ifdef RAID_USE_RAID6_PQ
+	raid_par_ptr[1] = raid_par2_raid6;
+#endif
+
+	/* optimized SSE2 functions */
+#ifdef CONFIG_X86
+	if (raid_cpu_has_sse2()) {
+		raid_par_ptr[0] = raid_par1_sse2;
+		raid_par_ptr[1] = raid_par2_sse2;
+#ifdef CONFIG_X86_64
+		raid_par_ptr[1] = raid_par2_sse2ext;
+#endif
+	}
+#endif
+
+	/* optimized SSSE3 functions */
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		raid_par_ptr[2] = raid_par3_ssse3;
+		raid_par_ptr[3] = raid_par4_ssse3;
+		raid_par_ptr[4] = raid_par5_ssse3;
+		raid_par_ptr[5] = raid_par6_ssse3;
+#ifdef CONFIG_X86_64
+		raid_par_ptr[2] = raid_par3_ssse3ext;
+		raid_par_ptr[3] = raid_par4_ssse3ext;
+		raid_par_ptr[4] = raid_par5_ssse3ext;
+		raid_par_ptr[5] = raid_par6_ssse3ext;
+#endif
+	}
+#endif
+
+	/* setup recovering functions */
+	raid_rec_ptr[0] = raid_rec1_int8;
+	raid_rec_ptr[1] = raid_rec2_int8;
+	raid_rec_ptr[2] = raid_recX_int8;
+	raid_rec_ptr[3] = raid_recX_int8;
+	raid_rec_ptr[4] = raid_recX_int8;
+	raid_rec_ptr[5] = raid_recX_int8;
+
+	/* optimized SSSE3 functions */
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		raid_rec_ptr[0] = raid_rec1_ssse3;
+		raid_rec_ptr[1] = raid_rec2_ssse3;
+		raid_rec_ptr[2] = raid_recX_ssse3;
+		raid_rec_ptr[3] = raid_recX_ssse3;
+		raid_rec_ptr[4] = raid_recX_ssse3;
+		raid_rec_ptr[5] = raid_recX_ssse3;
+	}
+#endif
+}
+
+/*
+ * Refence parity computation.
+ */
+void raid_par_ref(int nd, int np, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	size_t i;
+
+	for (i = 0; i < size; ++i) {
+		uint8_t p[RAID_PARITY_MAX];
+		int j, d;
+
+		for (j = 0; j < np; ++j)
+			p[j] = 0;
+
+		for (d = 0; d < nd; ++d) {
+			uint8_t b = v[d][i];
+
+			for (j = 0; j < np; ++j)
+				p[j] ^= gfmul[b][gfgen[j][d]];
+		}
+
+		for (j = 0; j < np; ++j)
+			v[nd + j][i] = p[j];
+	}
+}
+
+/*
+ * Size of the blocks to test.
+ */
+#define TEST_SIZE PAGE_SIZE
+
+/*
+ * Number of data blocks to test.
+ */
+#define TEST_COUNT (65536 / TEST_SIZE)
+
+/*
+ * Period for the speed test.
+ */
+#ifdef __KERNEL__ /* to build the user mode test */
+#define TEST_PERIOD 16
+#else
+#define TEST_PERIOD 512 /* more time in usermode */
+#endif
+
+/*
+ * Parity generation test.
+ */
+static int raid_test_par(int nd, int np, size_t size, void **v, void **ref)
+{
+	int i;
+	void *t[TEST_COUNT + RAID_PARITY_MAX];
+
+	/* setup data */
+	for (i = 0; i < nd; ++i)
+		t[i] = ref[i];
+
+	/* setup parity */
+	for (i = 0; i < np; ++i)
+		t[nd+i] = v[nd+i];
+
+	raid_par(nd, np, size, t);
+
+	/* compare parity */
+	for (i = 0; i < np; ++i) {
+		if (memcmp(t[nd+i], ref[nd+i], size) != 0) {
+			pr_err("raid: Self test failed!\n");
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Recovering test.
+ */
+static int raid_test_rec(int nrd, int *id, int nrp, int *ip, int nd, int np, size_t size, void **v, void **ref)
+{
+	int i, j;
+	void *t[TEST_COUNT + RAID_PARITY_MAX];
+
+	/* setup data */
+	for (i = 0, j = 0; i < nd; ++i) {
+		if (j < nrd && id[j] == i) {
+			t[i] = v[i];
+			++j;
+		} else {
+			t[i] = ref[i];
+		}
+	}
+
+	/* setup parity */
+	for (i = 0, j = 0; i < np; ++i) {
+		if (j < nrp && ip[j] == i) {
+			t[nd+i] = v[nd+i];
+			++j;
+		} else {
+			t[nd+i] = ref[nd+i];
+		}
+	}
+
+	raid_rec(nrd, id, nrp, ip, nd, np, size, t);
+
+	/* compare all data and parity */
+	for (i = 0; i < nd+np; ++i) {
+		if (t[i] != ref[i] && memcmp(t[i], ref[i], size) != 0) {
+			pr_err("raid: Self test failed!\n");
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Basic functionality self test.
+ */
+int raid_selftest(void)
+{
+	const int nd = TEST_COUNT;
+	const size_t size = TEST_SIZE;
+	const int nv = nd + RAID_PARITY_MAX * 2;
+	int order;
+	uint8_t *pages;
+	void *v[nd + RAID_PARITY_MAX * 2];
+	void *ref[nd + RAID_PARITY_MAX];
+	int id[RAID_PARITY_MAX];
+	int ip[RAID_PARITY_MAX];
+	int i, np;
+	int ret = 0;
+
+	/* ensure to have enough space for data */
+	BUG_ON(nd * size > 65536);
+
+	/* allocates pages for data and parity */
+	order = get_order(nv * size);
+	pages = (void *)__get_free_pages(GFP_KERNEL, order);
+	if (!pages) {
+		pr_err("raid: No memory available.\n");
+		return -ENOMEM;
+	}
+
+	/* setup working vector */
+	for (i = 0; i < nv; ++i)
+		v[i] = pages + size * i;
+
+	/* use the multiplication table as data */
+	for (i = 0; i < nd; ++i)
+		ref[i] = ((uint8_t *)gfmul) + size * i;
+
+	/* setup reference parity */
+	for (i = 0; i < RAID_PARITY_MAX; ++i)
+		ref[nd+i] = v[nd+RAID_PARITY_MAX+i];
+
+	/* compute reference parity */
+	raid_par_ref(nd, RAID_PARITY_MAX, size, ref);
+
+	/* test for each parity level */
+	for (np = 1; np <= RAID_PARITY_MAX; ++np) {
+		/* test parity generation */
+		ret = raid_test_par(nd, np, size, v, ref);
+		if (ret != 0)
+			goto bail;
+
+		/* test recovering with full broken data disks */
+		for (i = 0; i < np; ++i)
+			id[i] = nd - np + i;
+
+		ret = raid_test_rec(np, id, 0, ip, nd, np, size, v, ref);
+		if (ret != 0)
+			goto bail;
+
+		/* test recovering with half broken data and ending parity */
+		for (i = 0; i < np / 2; ++i)
+			id[i] = i;
+
+		for (i = 0; i < (np + 1) / 2; ++i)
+			ip[i] = np - np / 2 + i;
+
+		ret = raid_test_rec(np / 2, id, np / 2, ip, nd, np, size, v, ref);
+		if (ret != 0)
+			goto bail;
+
+		/* test recovering with half broken data and leading parity */
+		for (i = 0; i < np / 2; ++i)
+			id[i] = i;
+
+		for (i = 0; i < (np + 1) / 2; ++i)
+			ip[i] = i;
+
+		ret = raid_test_rec(np / 2, id, np / 2, ip, nd, np, size, v, ref);
+		if (ret != 0)
+			goto bail;
+	}
+
+bail:
+	free_pages((unsigned long)pages, order);
+
+	return ret;
+}
+
+/*
+ * Test the speed of a single function.
+ */
+static void raid_test_speed(
+	void (*func)(int nd, size_t size, void **vv),
+	const char *tag, const char *imp,
+	void **vv)
+{
+	unsigned count;
+	unsigned long j_start, j_stop;
+	unsigned long speed;
+
+	count = 0;
+
+	preempt_disable();
+
+	j_start = jiffies;
+	while (jiffies == j_start)
+		cpu_relax();
+
+	j_start = jiffies;
+	j_stop = j_start + TEST_PERIOD;
+	while (time_before(jiffies, j_stop)) {
+		func(TEST_COUNT, TEST_SIZE, vv);
+		++count;
+	}
+
+	preempt_enable();
+
+	speed = count * HZ / (TEST_PERIOD * 1024 * 1024 / (TEST_SIZE * TEST_COUNT));
+
+	pr_info("raid: %-4s %-6s %5ld MB/s\n", tag, imp, speed);
+}
+
+/*
+ * Basic speed test.
+ */
+int raid_speedtest(void)
+{
+	const int nd = TEST_COUNT;
+	const size_t size = TEST_SIZE;
+	int order;
+	uint8_t *pages;
+	void *v[nd + RAID_PARITY_MAX];
+	int i;
+
+	/* ensure to have enough space for data */
+	BUG_ON(nd * size > 65536);
+
+	/* allocates pages for parity */
+	order = get_order(RAID_PARITY_MAX * size);
+	pages = (void *)__get_free_pages(GFP_KERNEL, order);
+	if (!pages) {
+		pr_err("raid: No memory available.\n");
+		return -ENOMEM;
+	}
+
+	/* use the multiplication table as data */
+	for (i = 0; i < nd; ++i)
+		v[i] = ((uint8_t *)gfmul) + size * i;
+
+	/* setup parity */
+	for (i = 0; i < RAID_PARITY_MAX; ++i)
+		v[nd+i] = pages + size * i;
+
+	raid_test_speed(raid_par1_int32, "par1", "int32", v);
+	raid_test_speed(raid_par2_int32, "par2", "int32", v);
+	raid_test_speed(raid_par1_int64, "par1", "int64", v);
+	raid_test_speed(raid_par2_int64, "par2", "int64", v);
+	raid_test_speed(raid_par3_int8, "par3", "int8", v);
+	raid_test_speed(raid_par4_int8, "par4", "int8", v);
+	raid_test_speed(raid_par5_int8, "par5", "int8", v);
+	raid_test_speed(raid_par6_int8, "par6", "int8", v);
+#ifdef __KERNEL__
+	raid_test_speed(raid_par1_xorblocks, "par1", "xor", v);
+#endif
+#ifdef RAID_USE_RAID6_PQ
+	raid_test_speed(raid_par2_raid6, "par2", "raid6", v);
+#endif
+#ifdef CONFIG_X86
+	if (raid_cpu_has_sse2()) {
+		raid_test_speed(raid_par1_sse2, "par1", "sse2", v);
+		raid_test_speed(raid_par2_sse2, "par2", "sse2", v);
+	}
+	if (raid_cpu_has_ssse3()) {
+		raid_test_speed(raid_par3_ssse3, "par3", "ssse3", v);
+		raid_test_speed(raid_par4_ssse3, "par4", "ssse3", v);
+		raid_test_speed(raid_par5_ssse3, "par5", "ssse3", v);
+		raid_test_speed(raid_par6_ssse3, "par6", "ssse3", v);
+#ifdef CONFIG_X86_64
+		raid_test_speed(raid_par2_sse2ext, "par2", "sse2e", v);
+		raid_test_speed(raid_par3_ssse3ext, "par3", "ssse3e", v);
+		raid_test_speed(raid_par4_ssse3ext, "par4", "ssse3e", v);
+		raid_test_speed(raid_par5_ssse3ext, "par5", "ssse3e", v);
+		raid_test_speed(raid_par6_ssse3ext, "par6", "ssse3e", v);
+#endif
+	}
+#endif
+
+	free_pages((unsigned long)pages, order);
+
+	return 0;
+}
+
+#ifdef __KERNEL__ /* to build the user mode test */
+static int speedtest;
+
+int __init raid_cauchy_init(void)
+{
+	int ret;
+
+	raid_init();
+
+	ret = raid_selftest();
+	if (ret != 0)
+		return ret;
+
+#ifdef RAID_USE_RAID6_PQ
+	pr_info("raid: Self test passed (using raid6)\n");
+#else
+	pr_info("raid: Self test passed\n");
+#endif
+
+	if (speedtest)
+		raid_speedtest();
+
+	return 0;
+}
+
+static void raid_cauchy_exit(void)
+{
+}
+
+subsys_initcall(raid_cauchy_init);
+module_exit(raid_cauchy_exit);
+module_param(speedtest, int, 0);
+MODULE_PARM_DESC(speedtest, "Runs a startup speed test");
+MODULE_AUTHOR("Andrea Mazzoleni <amadvance@...il.com>");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("RAID Cauchy functions");
+#endif
+
diff --git a/lib/raid/raid.c b/lib/raid/raid.c
new file mode 100644
index 0000000..918cb67
--- /dev/null
+++ b/lib/raid/raid.c
@@ -0,0 +1,425 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "internal.h"
+#include "gf.h"
+
+/*
+ * This is a RAID implementation working in the Galois Field GF(2^8) with
+ * the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (285 decimal), and
+ * supporting up to six parity levels.
+ *
+ * For RAID5 and RAID6 it works as as described in the H. Peter Anvin's
+ * paper "The mathematics of RAID-6" [1]. Please refer to this paper for a
+ * complete explanation.
+ *
+ * To support triple parity, it was first evaluated and then dropped, an
+ * extension of the same approach, with additional parity coefficients set
+ * as powers of 2^-1, with equations:
+ *
+ * P = sum(Di)
+ * Q = sum(2^i * Di)
+ * R = sum(2^-i * Di) with 0<=i<N
+ *
+ * This approach works well for triple parity and it's very efficient,
+ * because we can implement very fast parallel multiplications and
+ * divisions by 2 in GF(2^8).
+ *
+ * It's also similar at the approach used by ZFS RAIDZ3, with the
+ * difference that ZFS uses powers of 4 instead of 2^-1.
+ *
+ * Unfortunately it doesn't work beyond triple parity, because whatever
+ * value we choose to generate the power coefficients to compute other
+ * parities, the resulting equations are not solvable for some
+ * combinations of missing disks.
+ *
+ * This is expected, because the Vandermonde matrix used to compute the
+ * parity has no guarantee to have all submatrices not singular
+ * [2, Chap 11, Problem 7] and this is a requirement to have
+ * a MDS (Maximum Distance Separable) code [2, Chap 11, Theorem 8].
+ *
+ * To overcome this limitation, we use a Cauchy matrix [3][4] to compute
+ * the parity. A Cauchy matrix has the property to have all the square
+ * submatrices not singular, resulting in always solvable equations,
+ * for any combination of missing disks.
+ *
+ * The problem of this approach is that it requires the use of
+ * generic multiplications, and not only by 2 or 2^-1, potentially
+ * affecting badly the performance.
+ *
+ * Hopefully there is a method to implement parallel multiplications
+ * using SSSE3 instructions [1][5]. Method competitive with the
+ * computation of triple parity using power coefficients.
+ *
+ * Another important property of the Cauchy matrix is that we can setup
+ * the first two rows with coeffients equal at the RAID5 and RAID6 approach
+ * decribed, resulting in a compatible extension, and requiring SSSE3
+ * instructions only if triple parity or beyond is used.
+ *
+ * The matrix is also adjusted, multipling each row by a constant factor
+ * to make the first column of all 1, to optimize the computation for
+ * the first disk.
+ *
+ * This results in the matrix A[row,col] defined as:
+ *
+ * 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01...
+ * 01 02 04 08 10 20 40 80 1d 3a 74 e8 cd 87 13 26 4c 98 2d 5a b4 75...
+ * 01 f5 d2 c4 9a 71 f1 7f fc 87 c1 c6 19 2f 40 55 3d ba 53 04 9c 61...
+ * 01 bb a6 d7 c7 07 ce 82 4a 2f a5 9b b6 60 f1 ad e7 f4 06 d2 df 2e...
+ * 01 97 7f 9c 7c 18 bd a2 58 1a da 74 70 a3 e5 47 29 07 f5 80 23 e9...
+ * 01 2b 3f cf 73 2c d6 ed cb 74 15 78 8a c1 17 c9 89 68 21 ab 76 3b...
+ *
+ * This matrix supports 6 level of parity, one for each row, for up to 251
+ * data disks, one for each column, with all the 377,342,351,231 square
+ * submatrices not singular, verified also with brute-force.
+ *
+ * This matrix can be extended to support any number of parities, just
+ * adding additional rows, and removing one column for each new row.
+ * (see mktables.c for more details in how the matrix is generated)
+ *
+ * In details, parity is computed as:
+ *
+ * P = sum(Di)
+ * Q = sum(2^i *  Di)
+ * R = sum(A[2,i] * Di)
+ * S = sum(A[3,i] * Di)
+ * T = sum(A[4,i] * Di)
+ * U = sum(A[5,i] * Di) with 0<=i<N
+ *
+ * To recover from a failure of six disks at indexes x,y,z,h,v,w,
+ * with 0<=x<y<z<h<v<w<N, we compute the parity of the available N-6
+ * disks as:
+ *
+ * Pa = sum(Di)
+ * Qa = sum(2^i * Di)
+ * Ra = sum(A[2,i] * Di)
+ * Sa = sum(A[3,i] * Di)
+ * Ta = sum(A[4,i] * Di)
+ * Ua = sum(A[5,i] * Di) with 0<=i<N,i!=x,i!=y,i!=z,i!=h,i!=v,i!=w.
+ *
+ * And if we define:
+ *
+ * Pd = Pa + P
+ * Qd = Qa + Q
+ * Rd = Ra + R
+ * Sd = Sa + S
+ * Td = Ta + T
+ * Ud = Ua + U
+ *
+ * we can sum these two sets of equations, obtaining:
+ *
+ * Pd =          Dx +          Dy +          Dz +          Dh +          Dv +          Dw
+ * Qd =    2^x * Dx +    2^y * Dy +    2^z * Dz +    2^h * Dh +    2^v * Dv +    2^w * Dw
+ * Rd = A[2,x] * Dx + A[2,y] * Dy + A[2,z] * Dz + A[2,h] * Dh + A[2,v] * Dv + A[2,w] * Dw
+ * Sd = A[3,x] * Dx + A[3,y] * Dy + A[3,z] * Dz + A[3,h] * Dh + A[3,v] * Dv + A[3,w] * Dw
+ * Td = A[4,x] * Dx + A[4,y] * Dy + A[4,z] * Dz + A[4,h] * Dh + A[4,v] * Dv + A[4,w] * Dw
+ * Ud = A[5,x] * Dx + A[5,y] * Dy + A[5,z] * Dz + A[5,h] * Dh + A[5,v] * Dv + A[5,w] * Dw
+ *
+ * A linear system always solvable because the coefficients matrix is
+ * always not singular due the properties of the matrix A[].
+ *
+ * Resulting speed in x64, with 16 data disks, using a stripe of 4 KiB,
+ * for a Core i7-3740QM CPU @ 2.7GHz is:
+ *
+ *           int8   int32   int64    sse2   sse2e   ssse3  ssse3e
+ *   par1           11469   21579   44743
+ *   par2            3474    6176   17930   20435
+ *   par3     850                                    7908    9069
+ *   par4     647                                    6357    7159
+ *   par5     527                                    5041    5412
+ *   par6     432                                    4094    4470
+ *
+ * Values are in MiB/s of data processed, not counting generated parity.
+ *
+ * References:
+ * [1] Anvin, "The mathematics of RAID-6", 2004
+ * [2] MacWilliams, Sloane, "The Theory of Error-Correcting Codes", 1977
+ * [3] Blomer, "An XOR-Based Erasure-Resilient Coding Scheme", 1995
+ * [4] Roth, "Introduction to Coding Theory", 2006
+ * [5] Plank, "Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions", 2013
+ */
+
+#ifdef __KERNEL__ /* to build the user mode test */
+/**
+ * Buffer filled with 0 used in recovering.
+ */
+static uint8_t raid_zero_block[PAGE_SIZE] __aligned(256);
+#else
+extern uint8_t raid_zero_block[] __aligned(256);
+#endif
+
+/*
+ * PAR1 (RAID5 with xor) implementation using the kernel xor_blocks()
+ * function.
+ */
+void raid_par1_xorblocks(int nd, size_t size, void **v)
+{
+	int i;
+
+	/* copy the first block */
+	memcpy(v[nd], v[0], size);
+
+	i = 1;
+	while (i < nd) {
+		int run = nd - i;
+
+		/* xor_blocks supports no more than MAX_XOR_BLOCKS blocks */
+		if (run > MAX_XOR_BLOCKS)
+			run = MAX_XOR_BLOCKS;
+
+		xor_blocks(run, size, v[nd], v + i);
+
+		i += run;
+	}
+}
+
+#ifdef RAID_USE_RAID6_PQ
+/**
+ * PAR2 (RAID6 with powers of 2) implementation using raid6 library.
+ */
+void raid_par2_raid6(int nd, size_t size, void **vv)
+{
+	raid6_call.gen_syndrome(nd + 2, size, vv);
+}
+#endif
+
+/* internal forwarder */
+void (*raid_par_ptr[RAID_PARITY_MAX])(int nd, size_t size, void **vv);
+
+void raid_par(int nd, int np, size_t size, void **v)
+{
+	BUG_ON(np < 1 || np > RAID_PARITY_MAX);
+	BUG_ON(size % 64 != 0);
+
+	raid_par_ptr[np - 1](nd, size, v);
+}
+EXPORT_SYMBOL_GPL(raid_par);
+
+/**
+ * Inverts the square matrix M of size nxn into V.
+ * We use Gauss elimination to invert.
+ */
+void raid_invert(uint8_t *M, uint8_t *V, int n)
+{
+	int i, j, k;
+
+	/* set the identity matrix in V */
+	for (i = 0; i < n; ++i)
+		for (j = 0; j < n; ++j)
+			V[i*n+j] = i == j;
+
+	/* for each element in the diagonal */
+	for (k = 0; k < n; ++k) {
+		uint8_t f;
+
+		/* the diagonal element cannot be 0 because */
+		/* we are inverting matrices with all the square submatrices */
+		/* not singular */
+		BUG_ON(M[k*n+k] == 0);
+
+		/* make the diagonal element to be 1 */
+		f = inv(M[k*n+k]);
+		for (j = 0; j < n; ++j) {
+			M[k*n+j] = mul(f, M[k*n+j]);
+			V[k*n+j] = mul(f, V[k*n+j]);
+		}
+
+		/* make all the elements over and under the diagonal to be 0 */
+		for (i = 0; i < n; ++i) {
+			if (i == k)
+				continue;
+			f = M[i*n+k];
+			for (j = 0; j < n; ++j) {
+				M[i*n+j] ^= mul(f, M[k*n+j]);
+				V[i*n+j] ^= mul(f, V[k*n+j]);
+			}
+		}
+	}
+}
+
+/**
+ * Computes the parity without the missing data blocks
+ * and store it in the buffers of such data blocks.
+ *
+ * This is the parity expressed as Pa,Qa,Ra,Sa,Ta,Ua
+ * in the equations.
+ *
+ * Note that all the other parities not in the ip[] vector
+ * are destroyed.
+ */
+void raid_delta_gen(int nr, int *id, int *ip, int nd, size_t size, void **v)
+{
+	void *p[RAID_PARITY_MAX];
+	void *pa[RAID_PARITY_MAX];
+	int i;
+
+	for (i = 0; i < nr; ++i) {
+		/* keep a copy of the parity buffer */
+		p[i] = v[nd+ip[i]];
+
+		/* buffer for missing data blocks */
+		pa[i] = v[id[i]];
+
+		/* set at zero the missing data blocks */
+		v[id[i]] = raid_zero_block;
+
+		/* compute the parity over the missing data blocks */
+		v[nd+ip[i]] = pa[i];
+	}
+
+	/* recompute the minimal parity required */
+	raid_par(nd, ip[nr - 1] + 1, size, v);
+
+	for (i = 0; i < nr; ++i) {
+		/* restore disk buffers as before */
+		v[id[i]] = pa[i];
+
+		/* restore parity buffers as before */
+		v[nd+ip[i]] = p[i];
+	}
+}
+
+/**
+ * Recover failure of one data block for PAR1.
+ *
+ * Starting from the equation:
+ *
+ * Pd = Dx
+ *
+ * and solving we get:
+ *
+ * Dx = Pd
+ */
+void raid_rec1_par1(int *id, int nd, size_t size, void **v)
+{
+	void *p;
+	void *pa;
+
+	/* for PAR1 we can directly compute the missing block */
+	/* and we don't need to use the zero buffer */
+	p = v[nd];
+	pa = v[id[0]];
+
+	/* use the parity as missing data block */
+	v[id[0]] = p;
+
+	/* compute the parity over the missing data block */
+	v[nd] = pa;
+
+	/* compute */
+	raid_par(nd, 1, size, v);
+
+	/* restore as before */
+	v[id[0]] = pa;
+	v[nd] = p;
+}
+
+/**
+ * Recover failure of two data blocks for PAR2.
+ *
+ * Starting from the equations:
+ *
+ * Pd = Dx + Dy
+ * Qd = 2^id[0] * Dx + 2^id[1] * Dy
+ *
+ * and solving we get:
+ *
+ *               1                     2^(-id[0])
+ * Dy = ------------------- * Pd + ------------------- * Qd
+ *      2^(id[1]-id[0]) + 1        2^(id[1]-id[0]) + 1
+ *
+ * Dx = Dy + Pd
+ *
+ * with conditions:
+ *
+ * 2^id[0] != 0
+ * 2^(id[1]-id[0]) + 1 != 0
+ *
+ * That are always satisfied for any 0<=id[0]<id[1]<255.
+ */
+void raid_rec2_par2(int *id, int *ip, int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	size_t i;
+	uint8_t *p;
+	uint8_t *pa;
+	uint8_t *q;
+	uint8_t *qa;
+	const uint8_t *T[2];
+
+	/* get multiplication tables */
+	T[0] = table(inv(pow2(id[1]-id[0]) ^ 1));
+	T[1] = table(inv(pow2(id[0]) ^ pow2(id[1])));
+
+	/* compute delta parity */
+	raid_delta_gen(2, id, ip, nd, size, vv);
+
+	p = v[nd];
+	q = v[nd+1];
+	pa = v[id[0]];
+	qa = v[id[1]];
+
+	for (i = 0; i < size; ++i) {
+		/* delta */
+		uint8_t Pd = p[i] ^ pa[i];
+		uint8_t Qd = q[i] ^ qa[i];
+
+		/* reconstruct */
+		uint8_t Dy = T[0][Pd] ^ T[1][Qd];
+		uint8_t Dx = Pd ^ Dy;
+
+		/* set */
+		pa[i] = Dx;
+		qa[i] = Dy;
+	}
+}
+
+/* internal forwarder */
+void (*raid_rec_ptr[RAID_PARITY_MAX])(
+	int nr, int *id, int *ip, int nd, size_t size, void **vv);
+
+void raid_rec(int nrd, int *id, int nrp, int *ip, int nd, int np, size_t size, void **v)
+{
+	BUG_ON(nrd > nd);
+	BUG_ON(nrd + nrp > np);
+	BUG_ON(size % 64 != 0);
+	BUG_ON(size > PAGE_SIZE);
+
+	/* if failed data is present */
+	if (nrd != 0) {
+		int iu[RAID_PARITY_MAX];
+		int i, j, k;
+
+		/* setup the vector of parities to use */
+		for (i = 0, j = 0, k = 0; i < np; ++i) {
+			if (j < nrp && ip[j] == i) {
+				/* this parity has to be recovered */
+				++j;
+			} else {
+				/* this parity is used for recovering */
+				iu[k] = i;
+				++k;
+			}
+		}
+
+		/* recover the data, and limit the parity use to needed ones */
+		raid_rec_ptr[nrd - 1](nrd, id, iu, nd, size, v);
+	}
+
+	/* recompute all the parities up to the last bad one */
+	if (nrp != 0)
+		raid_par(nd, ip[nrp - 1] + 1, size, v);
+}
+EXPORT_SYMBOL_GPL(raid_rec);
+
diff --git a/lib/raid/sort.c b/lib/raid/sort.c
new file mode 100644
index 0000000..8afac72
--- /dev/null
+++ b/lib/raid/sort.c
@@ -0,0 +1,72 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "internal.h"
+
+#define RAID_SWAP(a, b) \
+	do { \
+		if (v[a] > v[b]) { \
+			int t = v[a]; \
+			v[a] = v[b]; \
+			v[b] = t; \
+		} \
+	} while (0)
+
+void raid_sort(int n, int *v)
+{
+	/* sorting networks generated with Batcher's Merge-Exchange */
+	switch (n) {
+	case 2:
+		RAID_SWAP(0, 1);
+		break;
+	case 3:
+		RAID_SWAP(0, 2);
+		RAID_SWAP(0, 1);
+		RAID_SWAP(1, 2);
+		break;
+	case 4:
+		RAID_SWAP(0, 2);
+		RAID_SWAP(1, 3);
+		RAID_SWAP(0, 1);
+		RAID_SWAP(2, 3);
+		RAID_SWAP(1, 2);
+		break;
+	case 5:
+		RAID_SWAP(0, 4);
+		RAID_SWAP(0, 2);
+		RAID_SWAP(1, 3);
+		RAID_SWAP(2, 4);
+		RAID_SWAP(0, 1);
+		RAID_SWAP(2, 3);
+		RAID_SWAP(1, 4);
+		RAID_SWAP(1, 2);
+		RAID_SWAP(3, 4);
+		break;
+	case 6:
+		RAID_SWAP(0, 4);
+		RAID_SWAP(1, 5);
+		RAID_SWAP(0, 2);
+		RAID_SWAP(1, 3);
+		RAID_SWAP(2, 4);
+		RAID_SWAP(3, 5);
+		RAID_SWAP(0, 1);
+		RAID_SWAP(2, 3);
+		RAID_SWAP(4, 5);
+		RAID_SWAP(1, 4);
+		RAID_SWAP(1, 2);
+		RAID_SWAP(3, 4);
+		break;
+	}
+}
+
diff --git a/lib/raid/test/Makefile b/lib/raid/test/Makefile
new file mode 100644
index 0000000..04e8e1e
--- /dev/null
+++ b/lib/raid/test/Makefile
@@ -0,0 +1,33 @@
+#
+# This is a simple Makefile to test some of the RAID code
+# from userspace.
+#
+
+CC  = gcc
+CFLAGS = -I.. -I../../../include -Wall -Wextra -g -O2
+LD = ld
+OBJS = raid.o int.o x86.o tables.o memory.o test.o sort.o module.o xor.o
+
+.c.o:
+	$(CC) $(CFLAGS) -c -o $@ $<
+
+%.c: ../%.c
+	cp -f $< $@
+
+all: fulltest speedtest selftest
+
+fulltest: $(OBJS) fulltest.o
+	$(CC) $(CFLAGS) -o fulltest $^
+
+speedtest: $(OBJS) speedtest.o
+	$(CC) $(CFLAGS) -o speedtest $^
+
+selftest: $(OBJS) selftest.o
+	$(CC) $(CFLAGS) -o selftest $^
+
+tables.c: mktables
+	./mktables > tables.c
+
+clean:
+	rm -f *.o mktables.c mktables tables.c fulltest speedtest selftest
+
diff --git a/lib/raid/test/combo.h b/lib/raid/test/combo.h
new file mode 100644
index 0000000..31530a2
--- /dev/null
+++ b/lib/raid/test/combo.h
@@ -0,0 +1,155 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __RAID_COMBO_H
+#define __RAID_COMBO_H
+
+#include <assert.h>
+
+/**
+ * Get the first permutation with repetition of r of n elements.
+ *
+ * Typical use is with permutation_next() in the form :
+ *
+ * int i[R];
+ * permutation_first(R, N, i);
+ * do {
+ *    code using i[0], i[1], ..., i[R-1]
+ * } while (permutation_next(R, N, i));
+ *
+ * It's equivalent at the code :
+ *
+ * for(i[0]=0;i[0]<N;++i[0])
+ *     for(i[1]=0;i[1]<N;++i[1])
+ *        ...
+ *            for(i[R-2]=0;i[R-2]<N;++i[R-2])
+ *                for(i[R-1]=0;i[R-1]<N;++i[R-1])
+ *                    code using i[0], i[1], ..., i[R-1]
+ */
+static inline void permutation_first(int r, int n, int *c)
+{
+	int i;
+
+	(void)n; /* unused, but kept for clarity */
+	assert(0 < r && r <= n);
+
+	for (i = 0; i < r; ++i)
+		c[i] = 0;
+}
+
+/**
+ * Get the next permutation with repetition of r of n elements.
+ * Return ==0 when finished.
+ */
+static inline int permutation_next(int r, int n, int *c)
+{
+	int i = r - 1; /* present position */
+
+recurse:
+	/* next element at position i */
+	++c[i];
+
+	/* if the position has reached the max */
+	if (c[i] >= n) {
+
+		/* if we are at the first level, we have finished */
+		if (i == 0)
+			return 0;
+
+		/* increase the previous position */
+		--i;
+		goto recurse;
+	}
+
+	++i;
+
+	/* initialize all the next positions, if any */
+	while (i < r) {
+		c[i] = 0;
+		++i;
+	}
+
+	return 1;
+}
+
+/**
+ * Get the first combination without repetition of r of n elements.
+ *
+ * Typical use is with combination_next() in the form :
+ *
+ * int i[R];
+ * combination_first(R, N, i);
+ * do {
+ *    code using i[0], i[1], ..., i[R-1]
+ * } while (combination_next(R, N, i));
+ *
+ * It's equivalent at the code :
+ *
+ * for(i[0]=0;i[0]<N-(R-1);++i[0])
+ *     for(i[1]=i[0]+1;i[1]<N-(R-2);++i[1])
+ *        ...
+ *            for(i[R-2]=i[R-3]+1;i[R-2]<N-1;++i[R-2])
+ *                for(i[R-1]=i[R-2]+1;i[R-1]<N;++i[R-1])
+ *                    code using i[0], i[1], ..., i[R-1]
+ */
+static inline void combination_first(int r, int n, int *c)
+{
+	int i;
+
+	(void)n; /* unused, but kept for clarity */
+	assert(0 < r && r <= n);
+
+	for (i = 0; i < r; ++i)
+		c[i] = i;
+}
+
+/**
+ * Get the next combination without repetition of r of n elements.
+ * Return ==0 when finished.
+ */
+static inline int combination_next(int r, int n, int *c)
+{
+	int i = r - 1; /* present position */
+	int h = n; /* high limit for this position */
+
+recurse:
+	/* next element at position i */
+	++c[i];
+
+	/* if the position has reached the max */
+	if (c[i] >= h) {
+
+		/* if we are at the first level, we have finished */
+		if (i == 0)
+			return 0;
+
+		/* increase the previous position */
+		--i;
+		--h;
+		goto recurse;
+	}
+
+	++i;
+
+	/* initialize all the next positions, if any */
+	while (i < r) {
+		/* each position start at the next value of the previous one */
+		c[i] = c[i-1] + 1;
+		++i;
+	}
+
+	return 1;
+}
+#endif
+
diff --git a/lib/raid/test/fulltest.c b/lib/raid/test/fulltest.c
new file mode 100644
index 0000000..fa7c3dd
--- /dev/null
+++ b/lib/raid/test/fulltest.c
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "internal.h"
+#include "test.h"
+#include "cpu.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+
+/*
+ * Size of the blocks to test.
+ */
+#define TEST_SIZE 256
+
+uint8_t raid_zero_block[TEST_SIZE] __aligned(256);
+
+int main(void)
+{
+	raid_init();
+
+	printf("RAID Cauchy test suite\n\n");
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_sse2())
+		printf("Including x86 SSE2 functions\n");
+	if (raid_cpu_has_ssse3())
+		printf("Including x86 SSSE3 functions\n");
+#endif
+#ifdef CONFIG_X86_64
+	printf("Including x64 extended SSE register set\n");
+#endif
+
+	printf("\nPlease wait about 60 seconds...\n\n");
+
+	printf("Test sorting...\n");
+	if (raid_test_sort() != 0) {
+		printf("FAILED!\n");
+		exit(EXIT_FAILURE);
+	}
+	printf("Test combinations/permutations...\n");
+	if (raid_test_combo() != 0) {
+		printf("FAILED!\n");
+		exit(EXIT_FAILURE);
+	}
+	printf("Test parity generation with %u data disks...\n", RAID_DATA_MAX);
+	if (raid_test_par(RAID_DATA_MAX, TEST_SIZE) != 0) {
+		printf("FAILED!\n");
+		exit(EXIT_FAILURE);
+	}
+	printf("Test parity generation with 1 data disk...\n");
+	if (raid_test_par(1, TEST_SIZE) != 0) {
+		printf("FAILED!\n");
+		exit(EXIT_FAILURE);
+	}
+	printf("Test recovering with all combinations of 32 data and 6 parity blocks...\n");
+	if (raid_test_rec(32, TEST_SIZE) != 0) {
+		printf("FAILED!\n");
+		exit(EXIT_FAILURE);
+	}
+
+	printf("OK\n");
+	return 0;
+}
+
diff --git a/lib/raid/test/memory.c b/lib/raid/test/memory.c
new file mode 100644
index 0000000..6807ee4
--- /dev/null
+++ b/lib/raid/test/memory.c
@@ -0,0 +1,79 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "internal.h"
+#include "memory.h"
+
+void *raid_malloc_align(size_t size, void **freeptr)
+{
+	unsigned char *ptr;
+	uintptr_t offset;
+
+	ptr = malloc(size + RAID_MALLOC_ALIGN);
+	if (!ptr)
+		return 0;
+
+	*freeptr = ptr;
+
+	offset = ((uintptr_t)ptr) % RAID_MALLOC_ALIGN;
+
+	if (offset != 0)
+		ptr += RAID_MALLOC_ALIGN - offset;
+
+	return ptr;
+}
+
+void **raid_malloc_vector(int nd, int n, size_t size, void **freeptr)
+{
+	void **v;
+	unsigned char *va;
+	int i;
+
+	v = malloc(n * sizeof(void *));
+	if (!v)
+		return 0;
+
+	va = raid_malloc_align(n * (size + RAID_MALLOC_DISPLACEMENT), freeptr);
+	if (!va) {
+		free(v);
+		return 0;
+	}
+
+	for (i = 0; i < n; ++i) {
+		v[i] = va;
+		va += size + RAID_MALLOC_DISPLACEMENT;
+	}
+
+	/* reverse order of the data blocks */
+	/* because they are usually accessed from the last one */
+	for (i = 0; i < nd/2; ++i) {
+		void *ptr = v[i];
+		v[i] = v[nd - 1 - i];
+		v[nd - 1 - i] = ptr;
+	}
+
+	return v;
+}
+
+void raid_mrand_vector(int n, size_t size, void **vv)
+{
+	unsigned char **v = (unsigned char **)vv;
+	int i;
+	size_t j;
+
+	for (i = 0; i < n; ++i)
+		for (j = 0; j < size; ++j)
+			v[i][j] = rand();
+}
+
diff --git a/lib/raid/test/memory.h b/lib/raid/test/memory.h
new file mode 100644
index 0000000..b5c4e9a
--- /dev/null
+++ b/lib/raid/test/memory.h
@@ -0,0 +1,74 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __RAID_MEMORY_H
+#define __RAID_MEMORY_H
+
+/**
+ * Memory alignment provided by raid_malloc_align().
+ *
+ * It should guarantee good cache performance everywhere.
+ */
+#define RAID_MALLOC_ALIGN 256
+
+/**
+ * Memory displacement to avoid cache collisions on contiguous blocks,
+ * used by raid_malloc_vector().
+ *
+ * When allocating a sequence of blocks with a size of power of 2,
+ * there is the risk that the start of each block is mapped into the same
+ * cache line or prefetching prediction, resulting in collisions if you
+ * access all the blocks in parallel, from the start to the end.
+ *
+ * The selected value was choosen empirically with some speed tests
+ * with 16 data buffers of 4 KB.
+ *
+ * These are the results with no displacement:
+ *
+ *            int8   int32   int64    sse2   sse2e   ssse3  ssse3e
+ *    par1            6940   13971   29824
+ *    par2            2530    4675   14840   16485
+ *    par3     490                                    6859    7710
+ *
+ * These are the results with displacement resulting in improvments
+ * in the order of 20% or more:
+ *
+ *            int8   int32   int64    sse2   sse2e   ssse3  ssse3e
+ *    par1           11762   21450   44621
+ *    par2            3520    6176   18100   20338
+ *    par3     848                                    8009    9210
+ *
+ */
+#define RAID_MALLOC_DISPLACEMENT 64
+
+/**
+ * Aligned malloc.
+ */
+void *raid_malloc_align(size_t size, void **freeptr);
+
+/**
+ * Aligned vector allocation.
+ * Returns a vector of @n pointers, each one pointing to a block of
+ * the specified @size.
+ * The first @nd elements are reversed in order.
+ */
+void **raid_malloc_vector(int nd, int n, size_t size, void **freeptr);
+
+/**
+ * Fills the memory vector with random data.
+ */
+void raid_mrand_vector(int n, size_t size, void **vv);
+
+#endif
+
diff --git a/lib/raid/test/selftest.c b/lib/raid/test/selftest.c
new file mode 100644
index 0000000..c10db29
--- /dev/null
+++ b/lib/raid/test/selftest.c
@@ -0,0 +1,41 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "internal.h"
+#include "cpu.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+
+uint8_t raid_zero_block[PAGE_SIZE] __aligned(256);
+
+int main(void)
+{
+	raid_init();
+
+	printf("RAID Cauchy selftest\n\n");
+
+	printf("Self test...\n");
+	if (raid_selftest() != 0) {
+		printf("FAILED!\n");
+		exit(EXIT_FAILURE);
+	}
+	printf("OK\n\n");
+
+	printf("Speed test...\n");
+	raid_speedtest();
+
+	return 0;
+}
+
diff --git a/lib/raid/test/speedtest.c b/lib/raid/test/speedtest.c
new file mode 100644
index 0000000..d3784c3
--- /dev/null
+++ b/lib/raid/test/speedtest.c
@@ -0,0 +1,567 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "internal.h"
+#include "memory.h"
+#include "cpu.h"
+
+#include <sys/time.h>
+#include <stdio.h>
+#include <inttypes.h>
+
+/*
+ * Size of the blocks to test.
+ */
+#define TEST_SIZE PAGE_SIZE
+
+/*
+ * Number of data blocks to test.
+ */
+#define TEST_COUNT (65536 / TEST_SIZE)
+
+uint8_t raid_zero_block[TEST_SIZE] __aligned(256);
+
+/**
+ * Differential us of two timeval.
+ */
+static int64_t diffgettimeofday(struct timeval *start, struct timeval *stop)
+{
+	int64_t d;
+
+	d = 1000000LL * (stop->tv_sec - start->tv_sec);
+	d += stop->tv_usec - start->tv_usec;
+
+	return d;
+}
+
+/**
+ * Start time measurement.
+ */
+#define SPEED_START \
+	count = 0; \
+	gettimeofday(&start, 0); \
+	do { \
+		for (i = 0; i < delta; ++i)
+
+/**
+ * Stop time measurement.
+ */
+#define SPEED_STOP \
+		count += delta; \
+		gettimeofday(&stop, 0); \
+	} while (diffgettimeofday(&start, &stop) < 1000000LL); \
+	ds = size * (int64_t)count * nd; \
+	dt = diffgettimeofday(&start, &stop);
+
+void speed(void)
+{
+	struct timeval start;
+	struct timeval stop;
+	int64_t ds;
+	int64_t dt;
+	int i, j;
+	int id[RAID_PARITY_MAX];
+	int ip[RAID_PARITY_MAX];
+	int count;
+	int delta = 10;
+	int size = TEST_SIZE;
+	int nd = TEST_COUNT;
+	int nv;
+	void *v_alloc;
+	void **v;
+
+	nv = nd + RAID_PARITY_MAX;
+
+	v = raid_malloc_vector(nd, nv, size, &v_alloc);
+
+	/* initialize disks with fixed data */
+	for (i = 0; i < nd; ++i)
+		memset(v[i], i, size);
+
+	/* basic disks and parity mapping */
+	for (i = 0; i < RAID_PARITY_MAX; ++i) {
+		id[i] = i;
+		ip[i] = i;
+	}
+
+	printf("Speed test using %u data buffers of %u bytes, for a total of %u KiB.\n", nd, size, nd * size / 1024);
+	printf("Memory blocks have a displacement of %u bytes to improve cache performance.\n", RAID_MALLOC_DISPLACEMENT);
+	printf("The reported value is the aggregate bandwidth of all data blocks in MiB/s,\n");
+	printf("not counting parity blocks.\n");
+	printf("\n");
+
+	printf("Memory write speed using the C memset() function:\n");
+	printf("%8s", "memset");
+	fflush(stdout);
+
+	SPEED_START {
+		for (j = 0; j < nd; ++j)
+			memset(v[j], j, size);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	printf("\n");
+	printf("\n");
+
+	/* RAID table */
+	printf("RAID functions used for computing the parity:\n");
+	printf("%8s", "");
+	printf("%8s", "int8");
+	printf("%8s", "int32");
+	printf("%8s", "int64");
+#ifdef CONFIG_X86
+	printf("%8s", "sse2");
+#ifdef CONFIG_X86_64
+	printf("%8s", "sse2e");
+#endif
+	printf("%8s", "ssse3");
+#ifdef CONFIG_X86_64
+	printf("%8s", "ssse3e");
+#endif
+#endif
+	printf("\n");
+
+	/* PAR1 */
+	printf("%8s", "par1");
+	fflush(stdout);
+
+	printf("%8s", "");
+
+	SPEED_START {
+		raid_par1_int32(nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+	SPEED_START {
+		raid_par1_int64(nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_sse2()) {
+		SPEED_START {
+			raid_par1_sse2(nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+		fflush(stdout);
+	}
+#endif
+	printf("\n");
+
+	/* PAR2 */
+	printf("%8s", "par2");
+	fflush(stdout);
+
+	printf("%8s", "");
+
+	SPEED_START {
+		raid_par2_int32(nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+	SPEED_START {
+		raid_par2_int64(nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_sse2()) {
+		SPEED_START {
+			raid_par2_sse2(nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+		fflush(stdout);
+
+#ifdef CONFIG_X86_64
+		SPEED_START {
+			raid_par2_sse2ext(nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+		fflush(stdout);
+#endif
+	}
+#endif
+	printf("\n");
+
+	/* PAR3 */
+	printf("%8s", "par3");
+	fflush(stdout);
+
+	SPEED_START {
+		raid_par3_int8(nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+	printf("%8s", "");
+	printf("%8s", "");
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_sse2()) {
+		printf("%8s", "");
+
+#ifdef CONFIG_X86_64
+		printf("%8s", "");
+#endif
+	}
+#endif
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		SPEED_START {
+			raid_par3_ssse3(nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+		fflush(stdout);
+
+#ifdef CONFIG_X86_64
+		SPEED_START {
+			raid_par3_ssse3ext(nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+		fflush(stdout);
+#endif
+	}
+#endif
+	printf("\n");
+
+	/* PAR4 */
+	printf("%8s", "par4");
+	fflush(stdout);
+
+	SPEED_START {
+		raid_par4_int8(nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+	printf("%8s", "");
+	printf("%8s", "");
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_sse2()) {
+		printf("%8s", "");
+
+#ifdef CONFIG_X86_64
+		printf("%8s", "");
+#endif
+	}
+#endif
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		SPEED_START {
+			raid_par4_ssse3(nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+		fflush(stdout);
+
+#ifdef CONFIG_X86_64
+		SPEED_START {
+			raid_par4_ssse3ext(nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+		fflush(stdout);
+#endif
+	}
+#endif
+	printf("\n");
+
+	/* PAR5 */
+	printf("%8s", "par5");
+	fflush(stdout);
+
+	SPEED_START {
+		raid_par5_int8(nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+	printf("%8s", "");
+	printf("%8s", "");
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_sse2()) {
+		printf("%8s", "");
+
+#ifdef CONFIG_X86_64
+		printf("%8s", "");
+#endif
+	}
+#endif
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		SPEED_START {
+			raid_par5_ssse3(nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+		fflush(stdout);
+
+#ifdef CONFIG_X86_64
+		SPEED_START {
+			raid_par5_ssse3ext(nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+		fflush(stdout);
+#endif
+	}
+#endif
+	printf("\n");
+
+	/* PAR6 */
+	printf("%8s", "par6");
+	fflush(stdout);
+
+	SPEED_START {
+		raid_par6_int8(nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+	printf("%8s", "");
+	printf("%8s", "");
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_sse2()) {
+		printf("%8s", "");
+
+#ifdef CONFIG_X86_64
+		printf("%8s", "");
+#endif
+	}
+#endif
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		SPEED_START {
+			raid_par6_ssse3(nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+		fflush(stdout);
+
+#ifdef CONFIG_X86_64
+		SPEED_START {
+			raid_par6_ssse3ext(nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+		fflush(stdout);
+#endif
+	}
+#endif
+	printf("\n");
+	printf("\n");
+
+	/* recover table */
+	printf("RAID functions used for recovering:\n");
+	printf("%8s", "");
+	printf("%8s", "int8");
+#ifdef CONFIG_X86
+	printf("%8s", "ssse3");
+#endif
+	printf("\n");
+
+	printf("%8s", "rec1");
+	fflush(stdout);
+
+	SPEED_START {
+		for (j = 0; j < nd; ++j)
+			/* +1 to avoid PAR1 optimized case */
+			raid_rec1_int8(1, id, ip + 1, nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		SPEED_START {
+			for (j = 0; j < nd; ++j)
+				/* +1 to avoid PAR1 optimized case */
+				raid_rec1_ssse3(1, id, ip + 1, nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+	}
+#endif
+	printf("\n");
+
+	printf("%8s", "rec2");
+	fflush(stdout);
+
+	SPEED_START {
+		for (j = 0; j < nd; ++j)
+			/* +1 to avoid PAR2 optimized case */
+			raid_rec2_int8(2, id, ip + 1, nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		SPEED_START {
+			for (j = 0; j < nd; ++j)
+				/* +1 to avoid PAR2 optimized case */
+				raid_rec2_ssse3(2, id, ip + 1, nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+	}
+#endif
+	printf("\n");
+
+	printf("%8s", "rec3");
+	fflush(stdout);
+
+	SPEED_START {
+		for (j = 0; j < nd; ++j)
+			raid_recX_int8(3, id, ip, nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		SPEED_START {
+			for (j = 0; j < nd; ++j)
+				raid_recX_ssse3(3, id, ip, nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+	}
+#endif
+	printf("\n");
+
+	printf("%8s", "rec4");
+	fflush(stdout);
+
+	SPEED_START {
+		for (j = 0; j < nd; ++j)
+			raid_recX_int8(4, id, ip, nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		SPEED_START {
+			for (j = 0; j < nd; ++j)
+				raid_recX_ssse3(4, id, ip, nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+	}
+#endif
+	printf("\n");
+
+	printf("%8s", "rec5");
+	fflush(stdout);
+
+	SPEED_START {
+		for (j = 0; j < nd; ++j)
+			raid_recX_int8(5, id, ip, nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		SPEED_START {
+			for (j = 0; j < nd; ++j)
+				raid_recX_ssse3(5, id, ip, nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+	}
+#endif
+	printf("\n");
+
+	printf("%8s", "rec6");
+	fflush(stdout);
+
+	SPEED_START {
+		for (j = 0; j < nd; ++j)
+			raid_recX_int8(6, id, ip, nd, size, v);
+	} SPEED_STOP
+
+	printf("%8"PRIu64, ds / dt);
+	fflush(stdout);
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		SPEED_START {
+			for (j = 0; j < nd; ++j)
+				raid_recX_ssse3(6, id, ip, nd, size, v);
+		} SPEED_STOP
+
+		printf("%8"PRIu64, ds / dt);
+	}
+#endif
+	printf("\n");
+	printf("\n");
+
+	free(v_alloc);
+	free(v);
+}
+
+int main(void)
+{
+	raid_init();
+
+	printf("RAID Cauchy speed test\n\n");
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_sse2())
+		printf("Including x86 SSE2 functions\n");
+	if (raid_cpu_has_ssse3())
+		printf("Including x86 SSSE3 functions\n");
+#endif
+#ifdef CONFIG_X86_64
+	printf("Including x64 extended SSE register set\n");
+#endif
+
+	printf("\nPlease wait about 30 seconds...\n\n");
+
+	speed();
+
+	return 0;
+}
+
diff --git a/lib/raid/test/test.c b/lib/raid/test/test.c
new file mode 100644
index 0000000..69deacb
--- /dev/null
+++ b/lib/raid/test/test.c
@@ -0,0 +1,314 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "internal.h"
+#include "cpu.h"
+#include "combo.h"
+#include "memory.h"
+
+/**
+ * Binomial coefficient of n over r.
+ */
+static int ibc(int n, int r)
+{
+	if (r == 0 || n == r)
+		return 1;
+	else
+		return ibc(n - 1, r - 1) + ibc(n - 1, r);
+}
+
+/**
+ * Power n ^ r;
+ */
+static int ipow(int n, int r)
+{
+	int v = 1;
+	while (r) {
+		v *= n;
+		--r;
+	}
+	return v;
+}
+
+int raid_test_combo(void)
+{
+	int r;
+	int count;
+	int p[RAID_PARITY_MAX];
+
+	for (r = 1; r <= RAID_PARITY_MAX; ++r) {
+		/* count combination (r of RAID_PARITY_MAX) elements */
+		count = 0;
+		combination_first(r, RAID_PARITY_MAX, p);
+
+		do {
+			++count;
+		} while (combination_next(r, RAID_PARITY_MAX, p));
+
+		if (count != ibc(RAID_PARITY_MAX, r))
+			return -1;
+	}
+
+	for (r = 1; r <= RAID_PARITY_MAX; ++r) {
+		/* count permutation (r of RAID_PARITY_MAX) elements */
+		count = 0;
+		permutation_first(r, RAID_PARITY_MAX, p);
+
+		do {
+			++count;
+		} while (permutation_next(r, RAID_PARITY_MAX, p));
+
+		if (count != ipow(RAID_PARITY_MAX, r))
+			return -1;
+	}
+
+	return 0;
+}
+
+int raid_test_sort(void)
+{
+	int p[RAID_PARITY_MAX];
+	int r;
+
+	for (r = 1; r <= RAID_PARITY_MAX; ++r) {
+		permutation_first(r, RAID_PARITY_MAX, p);
+		do {
+			int i[RAID_PARITY_MAX];
+			int j;
+
+			/* make a copy */
+			for (j = 0; j < r; ++j)
+				i[j] = p[j];
+
+			raid_sort(r, i);
+
+			/* check order */
+			for (j = 1; j < r; ++j)
+				if (i[j-1] > i[j])
+					return -1;
+		} while (permutation_next(r, RAID_PARITY_MAX, p));
+	}
+
+	return 0;
+}
+
+int raid_test_rec(int nd, size_t size)
+{
+	void *v_alloc;
+	void **v;
+	void **data;
+	void **parity;
+	void **test;
+	void *data_save[RAID_PARITY_MAX];
+	void *parity_save[RAID_PARITY_MAX];
+	void *waste;
+	int nv;
+	int id[RAID_PARITY_MAX];
+	int ip[RAID_PARITY_MAX];
+	int i;
+	int j;
+	int nr;
+	void (*f[RAID_PARITY_MAX][4])(
+		int nr, int *id, int *ip, int nd, size_t size, void **vbuf);
+	int nf[RAID_PARITY_MAX];
+	int np;
+
+	np = RAID_PARITY_MAX;
+
+	nv = nd + np * 2 + 1;
+
+	v = raid_malloc_vector(nd, nv, size, &v_alloc);
+	if (!v)
+		return -1;
+
+	data = v;
+	parity = v + nd;
+	test = v + nd + np;
+
+	for (i = 0; i < np; ++i)
+		parity_save[i] = parity[i];
+
+	waste = v[nv-1];
+
+	/* fill data disk with random */
+	raid_mrand_vector(nd, size, v);
+
+	/* setup recov functions */
+	for (i = 0; i < np; ++i) {
+		nf[i] = 0;
+		if (i == 0) {
+			f[i][nf[i]++] = raid_rec1_int8;
+#ifdef CONFIG_X86
+			if (raid_cpu_has_ssse3())
+				f[i][nf[i]++] = raid_rec1_ssse3;
+#endif
+		} else if (i == 1) {
+			f[i][nf[i]++] = raid_rec2_int8;
+#ifdef CONFIG_X86
+			if (raid_cpu_has_ssse3())
+				f[i][nf[i]++] = raid_rec2_ssse3;
+#endif
+		} else {
+			f[i][nf[i]++] = raid_recX_int8;
+#ifdef CONFIG_X86
+			if (raid_cpu_has_ssse3())
+				f[i][nf[i]++] = raid_recX_ssse3;
+#endif
+		}
+	}
+
+	/* compute the parity */
+	raid_par_ref(nd, np, size, v);
+
+	/* set all the parity to the waste v */
+	for (i = 0; i < np; ++i)
+		parity[i] = waste;
+
+	/* all parity levels */
+	for (nr = 1; nr <= np; ++nr) {
+		/* all combinations (nr of nd) disks */
+		combination_first(nr, nd, id);
+		do {
+			/* all combinations (nr of np) parities */
+			combination_first(nr, np, ip);
+			do {
+				/* for each recover function */
+				for (j = 0; j < nf[nr-1]; ++j) {
+					/* set */
+					for (i = 0; i < nr; ++i) {
+						/* remove the missing data */
+						data_save[i] = data[id[i]];
+						data[id[i]] = test[i];
+						/* set the parity to use */
+						parity[ip[i]] = parity_save[ip[i]];
+					}
+
+					/* recover */
+					f[nr-1][j](nr, id, ip, nd, size, v);
+
+					/* check */
+					for (i = 0; i < nr; ++i)
+						if (memcmp(test[i], data_save[i], size) != 0)
+							goto bail;
+
+					/* restore */
+					for (i = 0; i < nr; ++i) {
+						/* restore the data */
+						data[id[i]] = data_save[i];
+						/* restore the parity */
+						parity[ip[i]] = waste;
+					}
+				}
+			} while (combination_next(nr, np, ip));
+		} while (combination_next(nr, nd, id));
+	}
+
+	free(v_alloc);
+	free(v);
+	return 0;
+
+bail:
+	free(v_alloc);
+	free(v);
+	return -1;
+}
+
+int raid_test_par(int nd, size_t size)
+{
+	void *v_alloc;
+	void **v;
+	int nv;
+	int i, j;
+	void (*f[64])(int nd, size_t size, void **vbuf);
+	int nf;
+	int np;
+
+	np = RAID_PARITY_MAX;
+
+	nv = nd + np * 2;
+
+	v = raid_malloc_vector(nd, nv, size, &v_alloc);
+	if (!v)
+		return -1;
+
+	/* fill with random */
+	raid_mrand_vector(nv, size, v);
+
+	/* compute the parity */
+	raid_par_ref(nd, np, size, v);
+
+	/* copy in back buffers */
+	for (i = 0; i < np; ++i)
+		memcpy(v[nd + np + i], v[nd + i], size);
+
+	/* load all the available functions */
+	nf = 0;
+
+	f[nf++] = raid_par1_xorblocks;
+	f[nf++] = raid_par1_int32;
+	f[nf++] = raid_par1_int64;
+	f[nf++] = raid_par2_int32;
+	f[nf++] = raid_par2_int64;
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_sse2()) {
+		f[nf++] = raid_par1_sse2;
+		f[nf++] = raid_par2_sse2;
+#ifdef CONFIG_X86_64
+		f[nf++] = raid_par2_sse2ext;
+#endif
+	}
+#endif
+
+	f[nf++] = raid_par3_int8;
+	f[nf++] = raid_par4_int8;
+	f[nf++] = raid_par5_int8;
+	f[nf++] = raid_par6_int8;
+
+#ifdef CONFIG_X86
+	if (raid_cpu_has_ssse3()) {
+		f[nf++] = raid_par3_ssse3;
+		f[nf++] = raid_par4_ssse3;
+		f[nf++] = raid_par5_ssse3;
+		f[nf++] = raid_par6_ssse3;
+#ifdef CONFIG_X86_64
+		f[nf++] = raid_par3_ssse3ext;
+		f[nf++] = raid_par4_ssse3ext;
+		f[nf++] = raid_par5_ssse3ext;
+		f[nf++] = raid_par6_ssse3ext;
+#endif
+	}
+#endif
+
+	/* check all the functions */
+	for (j = 0; j < nf; ++j) {
+		/* compute parity */
+		f[j](nd, size, v);
+
+		/* check it */
+		for (i = 0; i < np; ++i)
+			if (memcmp(v[nd + np + i], v[nd + i], size) != 0)
+				goto bail;
+	}
+
+	free(v_alloc);
+	free(v);
+	return 0;
+
+bail:
+	free(v_alloc);
+	free(v);
+	return -1;
+}
+
diff --git a/lib/raid/test/test.h b/lib/raid/test/test.h
new file mode 100644
index 0000000..67684fe
--- /dev/null
+++ b/lib/raid/test/test.h
@@ -0,0 +1,59 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __RAID_TEST_H
+#define __RAID_TEST_H
+
+/**
+ * Tests sorting functions.
+ *
+ * Test raid_sort() with all the possible combinations of elements to sort.
+ *
+ * Returns 0 on success.
+ */
+int raid_test_sort(void);
+
+/**
+ * Tests combination functions.
+ *
+ * Tests combination_first() and combination_next() for all the parity levels.
+ *
+ * Returns 0 on success.
+ */
+int raid_test_combo(void);
+
+/**
+ * Tests recovering functions.
+ *
+ * All the recovering functions are tested with all the combinations
+ * of failing disks and recovering parities.
+ *
+ * Take care that the test time grows exponentially with the number of disks.
+ *
+ * Returns 0 on success.
+ */
+int raid_test_rec(int nd, size_t size);
+
+/**
+ * Tests parity generation functions.
+ *
+ * All the parity generation functions are tested with the specified
+ * number of disks.
+ *
+ * Returns 0 on success.
+ */
+int raid_test_par(int nd, size_t size);
+
+#endif
+
diff --git a/lib/raid/test/usermode.h b/lib/raid/test/usermode.h
new file mode 100644
index 0000000..0f73cec
--- /dev/null
+++ b/lib/raid/test/usermode.h
@@ -0,0 +1,83 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __RAID_USERMODE_H
+#define __RAID_USERMODE_H
+
+/*
+ * Compatibility layer for user mode applications.
+ */
+#include <stdlib.h>
+#include <stdint.h>
+#include <assert.h>
+#include <string.h>
+#include <malloc.h>
+#include <errno.h>
+#include <sys/time.h>
+
+#define pr_err printf
+#define pr_info printf
+#define __aligned(a) __attribute__((aligned(a)))
+#define PAGE_SIZE 4096
+#define EXPORT_SYMBOL_GPL(a) int dummy_##a
+#define EXPORT_SYMBOL(a) int dummy_##a
+#if defined(__i386__)
+#define CONFIG_X86 1
+#define CONFIG_X86_32 1
+#endif
+#if defined(__x86_64__)
+#define CONFIG_X86 1
+#define CONFIG_X86_64 1
+#endif
+#define BUG_ON(a) assert(!(a))
+#define MAX_XOR_BLOCKS 1
+void xor_blocks(unsigned int count, unsigned int bytes, void *dest, void **srcs);
+#define GFP_KERNEL 0
+#define get_order(x) 5
+#define __get_free_pages(x, order) memalign(PAGE_SIZE, PAGE_SIZE * (1 << order))
+#define free_pages(x, order) free((void *)x)
+#define preempt_disable() do { } while (0)
+#define preempt_enable() do { } while (0)
+#define cpu_relax() do { } while (0)
+#define HZ 1000
+#define jiffies get_jiffies()
+static inline unsigned long get_jiffies(void)
+{
+	struct timeval t;
+	gettimeofday(&t, 0);
+	return t.tv_sec * 1000 + t.tv_usec / 1000;
+}
+#define time_before(x, y) ((x) < (y))
+
+#ifdef CONFIG_X86
+#define X86_FEATURE_XMM2 (0*32+26)
+#define X86_FEATURE_SSSE3 (4*32+9)
+#define X86_FEATURE_AVX (4*32+28)
+#define X86_FEATURE_AVX2 (9*32+5)
+
+static inline int boot_cpu_has(int flag)
+{
+	uint32_t eax, ebx, ecx, edx;
+
+	eax = (flag & 0x100) ? 7 : (flag & 0x20) ? 0x80000001 : 1;
+	ecx = 0;
+
+	asm volatile("cpuid" : "+a" (eax), "=b" (ebx), "=d" (edx), "+c" (ecx));
+
+	return ((flag & 0x100 ? ebx : (flag & 0x80) ? ecx : edx) >> (flag & 31)) & 1;
+}
+#endif /* CONFIG_X86 */
+
+#endif
+
diff --git a/lib/raid/test/xor.c b/lib/raid/test/xor.c
new file mode 100644
index 0000000..2d68636
--- /dev/null
+++ b/lib/raid/test/xor.c
@@ -0,0 +1,41 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "internal.h"
+
+/**
+ * Implementation of the kernel xor_blocks().
+ */
+void xor_blocks(unsigned int count, unsigned int bytes, void *dest, void **srcs)
+{
+	uint32_t *p1 = dest;
+	uint32_t *p2 = srcs[0];
+	long lines = bytes / (sizeof(uint32_t)) / 8;
+
+	BUG_ON(count != 1);
+
+	do {
+		p1[0] ^= p2[0];
+		p1[1] ^= p2[1];
+		p1[2] ^= p2[2];
+		p1[3] ^= p2[3];
+		p1[4] ^= p2[4];
+		p1[5] ^= p2[5];
+		p1[6] ^= p2[6];
+		p1[7] ^= p2[7];
+		p1 += 8;
+		p2 += 8;
+	} while (--lines > 0);
+}
+
diff --git a/lib/raid/x86.c b/lib/raid/x86.c
new file mode 100644
index 0000000..fac8a67
--- /dev/null
+++ b/lib/raid/x86.c
@@ -0,0 +1,1569 @@
+/*
+ * Copyright (C) 2013 Andrea Mazzoleni
+ *
+ * This program is free software: you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "internal.h"
+#include "gf.h"
+
+#ifdef CONFIG_X86
+/*
+ * PAR1 (RAID5 with xor) SSE2 implementation
+ *
+ * Note that we don't have the corresponding x64 sse2ext function using more
+ * registers because processing a block of 64 bytes already fills
+ * the typical cache block, and processing 128 bytes doesn't increase
+ * performance.
+ */
+void raid_par1_sse2(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	int d, l;
+	size_t i;
+
+	l = nd - 1;
+	p = v[nd];
+
+	asm_begin();
+
+	for (i = 0; i < size; i += 64) {
+		asm volatile("movdqa %0,%%xmm0" : : "m" (v[l][i]));
+		asm volatile("movdqa %0,%%xmm1" : : "m" (v[l][i+16]));
+		asm volatile("movdqa %0,%%xmm2" : : "m" (v[l][i+32]));
+		asm volatile("movdqa %0,%%xmm3" : : "m" (v[l][i+48]));
+		for (d = l-1; d >= 0; --d) {
+			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
+			asm volatile("movdqa %0,%%xmm5" : : "m" (v[d][i+16]));
+			asm volatile("movdqa %0,%%xmm6" : : "m" (v[d][i+32]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (v[d][i+48]));
+			asm volatile("pxor %xmm4,%xmm0");
+			asm volatile("pxor %xmm5,%xmm1");
+			asm volatile("pxor %xmm6,%xmm2");
+			asm volatile("pxor %xmm7,%xmm3");
+		}
+		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
+		asm volatile("movntdq %%xmm1,%0" : "=m" (p[i+16]));
+		asm volatile("movntdq %%xmm2,%0" : "=m" (p[i+32]));
+		asm volatile("movntdq %%xmm3,%0" : "=m" (p[i+48]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86
+static const struct gfconst16 {
+	uint8_t poly[16];
+	uint8_t low4[16];
+} gfconst16  __aligned(32) = {
+	{ 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d },
+	{ 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f },
+};
+#endif
+
+#ifdef CONFIG_X86
+/*
+ * PAR2 (RAID6 with powers of 2) SSE2 implementation
+ */
+void raid_par2_sse2(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	int d, l;
+	size_t i;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+
+	asm_begin();
+
+	asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
+
+	for (i = 0; i < size; i += 32) {
+		asm volatile("movdqa %0,%%xmm0" : : "m" (v[l][i]));
+		asm volatile("movdqa %0,%%xmm1" : : "m" (v[l][i+16]));
+		asm volatile("movdqa %xmm0,%xmm2");
+		asm volatile("movdqa %xmm1,%xmm3");
+		for (d = l-1; d >= 0; --d) {
+			asm volatile("pxor %xmm4,%xmm4");
+			asm volatile("pxor %xmm5,%xmm5");
+			asm volatile("pcmpgtb %xmm2,%xmm4");
+			asm volatile("pcmpgtb %xmm3,%xmm5");
+			asm volatile("paddb %xmm2,%xmm2");
+			asm volatile("paddb %xmm3,%xmm3");
+			asm volatile("pand %xmm7,%xmm4");
+			asm volatile("pand %xmm7,%xmm5");
+			asm volatile("pxor %xmm4,%xmm2");
+			asm volatile("pxor %xmm5,%xmm3");
+
+			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
+			asm volatile("movdqa %0,%%xmm5" : : "m" (v[d][i+16]));
+			asm volatile("pxor %xmm4,%xmm0");
+			asm volatile("pxor %xmm5,%xmm1");
+			asm volatile("pxor %xmm4,%xmm2");
+			asm volatile("pxor %xmm5,%xmm3");
+		}
+		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
+		asm volatile("movntdq %%xmm1,%0" : "=m" (p[i+16]));
+		asm volatile("movntdq %%xmm2,%0" : "=m" (q[i]));
+		asm volatile("movntdq %%xmm3,%0" : "=m" (q[i+16]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86_64
+/*
+ * PAR2 (RAID6 with powers of 2) SSE2 implementation
+ *
+ * Note that it uses 16 registers, meaning that x64 is required.
+ */
+void raid_par2_sse2ext(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	int d, l;
+	size_t i;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+
+	asm_begin();
+
+	asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.poly[0]));
+
+	for (i = 0; i < size; i += 64) {
+		asm volatile("movdqa %0,%%xmm0" : : "m" (v[l][i]));
+		asm volatile("movdqa %0,%%xmm1" : : "m" (v[l][i+16]));
+		asm volatile("movdqa %0,%%xmm2" : : "m" (v[l][i+32]));
+		asm volatile("movdqa %0,%%xmm3" : : "m" (v[l][i+48]));
+		asm volatile("movdqa %xmm0,%xmm4");
+		asm volatile("movdqa %xmm1,%xmm5");
+		asm volatile("movdqa %xmm2,%xmm6");
+		asm volatile("movdqa %xmm3,%xmm7");
+		for (d = l-1; d >= 0; --d) {
+			asm volatile("pxor %xmm8,%xmm8");
+			asm volatile("pxor %xmm9,%xmm9");
+			asm volatile("pxor %xmm10,%xmm10");
+			asm volatile("pxor %xmm11,%xmm11");
+			asm volatile("pcmpgtb %xmm4,%xmm8");
+			asm volatile("pcmpgtb %xmm5,%xmm9");
+			asm volatile("pcmpgtb %xmm6,%xmm10");
+			asm volatile("pcmpgtb %xmm7,%xmm11");
+			asm volatile("paddb %xmm4,%xmm4");
+			asm volatile("paddb %xmm5,%xmm5");
+			asm volatile("paddb %xmm6,%xmm6");
+			asm volatile("paddb %xmm7,%xmm7");
+			asm volatile("pand %xmm15,%xmm8");
+			asm volatile("pand %xmm15,%xmm9");
+			asm volatile("pand %xmm15,%xmm10");
+			asm volatile("pand %xmm15,%xmm11");
+			asm volatile("pxor %xmm8,%xmm4");
+			asm volatile("pxor %xmm9,%xmm5");
+			asm volatile("pxor %xmm10,%xmm6");
+			asm volatile("pxor %xmm11,%xmm7");
+
+			asm volatile("movdqa %0,%%xmm8" : : "m" (v[d][i]));
+			asm volatile("movdqa %0,%%xmm9" : : "m" (v[d][i+16]));
+			asm volatile("movdqa %0,%%xmm10" : : "m" (v[d][i+32]));
+			asm volatile("movdqa %0,%%xmm11" : : "m" (v[d][i+48]));
+			asm volatile("pxor %xmm8,%xmm0");
+			asm volatile("pxor %xmm9,%xmm1");
+			asm volatile("pxor %xmm10,%xmm2");
+			asm volatile("pxor %xmm11,%xmm3");
+			asm volatile("pxor %xmm8,%xmm4");
+			asm volatile("pxor %xmm9,%xmm5");
+			asm volatile("pxor %xmm10,%xmm6");
+			asm volatile("pxor %xmm11,%xmm7");
+		}
+		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
+		asm volatile("movntdq %%xmm1,%0" : "=m" (p[i+16]));
+		asm volatile("movntdq %%xmm2,%0" : "=m" (p[i+32]));
+		asm volatile("movntdq %%xmm3,%0" : "=m" (p[i+48]));
+		asm volatile("movntdq %%xmm4,%0" : "=m" (q[i]));
+		asm volatile("movntdq %%xmm5,%0" : "=m" (q[i+16]));
+		asm volatile("movntdq %%xmm6,%0" : "=m" (q[i+32]));
+		asm volatile("movntdq %%xmm7,%0" : "=m" (q[i+48]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86
+/*
+ * PAR3 (triple parity with Cauchy matrix) SSSE3 implementation
+ */
+void raid_par3_ssse3(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	int d, l;
+	size_t i;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+
+	/* special case with only one data disk */
+	if (l == 0) {
+		for (i = 0; i < 3; ++i)
+			memcpy(v[1+i], v[0], size);
+		return;
+	}
+
+	asm_begin();
+
+	/* generic case with at least two data disks */
+	asm volatile("movdqa %0,%%xmm3" : : "m" (gfconst16.poly[0]));
+	asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
+
+	for (i = 0; i < size; i += 16) {
+		/* last disk without the by two multiplication */
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
+
+		asm volatile("movdqa %xmm4,%xmm0");
+		asm volatile("movdqa %xmm4,%xmm1");
+
+		asm volatile("movdqa %xmm4,%xmm5");
+		asm volatile("psrlw  $4,%xmm5");
+		asm volatile("pand   %xmm7,%xmm4");
+		asm volatile("pand   %xmm7,%xmm5");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
+		asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[l][0][1][0]));
+		asm volatile("pshufb %xmm4,%xmm2");
+		asm volatile("pshufb %xmm5,%xmm6");
+		asm volatile("pxor   %xmm6,%xmm2");
+
+		/* intermediate disks */
+		for (d = l-1; d > 0; --d) {
+			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
+
+			asm volatile("pxor %xmm5,%xmm5");
+			asm volatile("pcmpgtb %xmm1,%xmm5");
+			asm volatile("paddb %xmm1,%xmm1");
+			asm volatile("pand %xmm3,%xmm5");
+			asm volatile("pxor %xmm5,%xmm1");
+
+			asm volatile("pxor %xmm4,%xmm0");
+			asm volatile("pxor %xmm4,%xmm1");
+
+			asm volatile("movdqa %xmm4,%xmm5");
+			asm volatile("psrlw  $4,%xmm5");
+			asm volatile("pand   %xmm7,%xmm4");
+			asm volatile("pand   %xmm7,%xmm5");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pxor   %xmm6,%xmm2");
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][1][0]));
+			asm volatile("pshufb %xmm5,%xmm6");
+			asm volatile("pxor   %xmm6,%xmm2");
+		}
+
+		/* first disk with all coefficients at 1 */
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
+
+		asm volatile("pxor %xmm5,%xmm5");
+		asm volatile("pcmpgtb %xmm1,%xmm5");
+		asm volatile("paddb %xmm1,%xmm1");
+		asm volatile("pand %xmm3,%xmm5");
+		asm volatile("pxor %xmm5,%xmm1");
+
+		asm volatile("pxor %xmm4,%xmm0");
+		asm volatile("pxor %xmm4,%xmm1");
+		asm volatile("pxor %xmm4,%xmm2");
+
+		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
+		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
+		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86_64
+/*
+ * PAR3 (triple parity with Cauchy matrix) SSSE3 implementation
+ *
+ * Note that it uses 16 registers, meaning that x64 is required.
+ */
+void raid_par3_ssse3ext(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	int d, l;
+	size_t i;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+
+	/* special case with only one data disk */
+	if (l == 0) {
+		for (i = 0; i < 3; ++i)
+			memcpy(v[1+i], v[0], size);
+		return;
+	}
+
+	asm_begin();
+
+	/* generic case with at least two data disks */
+	asm volatile("movdqa %0,%%xmm3" : : "m" (gfconst16.poly[0]));
+	asm volatile("movdqa %0,%%xmm11" : : "m" (gfconst16.low4[0]));
+
+	for (i = 0; i < size; i += 32) {
+		/* last disk without the by two multiplication */
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
+		asm volatile("movdqa %0,%%xmm12" : : "m" (v[l][i+16]));
+
+		asm volatile("movdqa %xmm4,%xmm0");
+		asm volatile("movdqa %xmm4,%xmm1");
+		asm volatile("movdqa %xmm12,%xmm8");
+		asm volatile("movdqa %xmm12,%xmm9");
+
+		asm volatile("movdqa %xmm4,%xmm5");
+		asm volatile("movdqa %xmm12,%xmm13");
+		asm volatile("psrlw  $4,%xmm5");
+		asm volatile("psrlw  $4,%xmm13");
+		asm volatile("pand   %xmm11,%xmm4");
+		asm volatile("pand   %xmm11,%xmm12");
+		asm volatile("pand   %xmm11,%xmm5");
+		asm volatile("pand   %xmm11,%xmm13");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][0][1][0]));
+		asm volatile("movdqa %xmm2,%xmm10");
+		asm volatile("movdqa %xmm7,%xmm15");
+		asm volatile("pshufb %xmm4,%xmm2");
+		asm volatile("pshufb %xmm12,%xmm10");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pshufb %xmm13,%xmm15");
+		asm volatile("pxor   %xmm7,%xmm2");
+		asm volatile("pxor   %xmm15,%xmm10");
+
+		/* intermediate disks */
+		for (d = l-1; d > 0; --d) {
+			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
+			asm volatile("movdqa %0,%%xmm12" : : "m" (v[d][i+16]));
+
+			asm volatile("pxor %xmm5,%xmm5");
+			asm volatile("pxor %xmm13,%xmm13");
+			asm volatile("pcmpgtb %xmm1,%xmm5");
+			asm volatile("pcmpgtb %xmm9,%xmm13");
+			asm volatile("paddb %xmm1,%xmm1");
+			asm volatile("paddb %xmm9,%xmm9");
+			asm volatile("pand %xmm3,%xmm5");
+			asm volatile("pand %xmm3,%xmm13");
+			asm volatile("pxor %xmm5,%xmm1");
+			asm volatile("pxor %xmm13,%xmm9");
+
+			asm volatile("pxor %xmm4,%xmm0");
+			asm volatile("pxor %xmm4,%xmm1");
+			asm volatile("pxor %xmm12,%xmm8");
+			asm volatile("pxor %xmm12,%xmm9");
+
+			asm volatile("movdqa %xmm4,%xmm5");
+			asm volatile("movdqa %xmm12,%xmm13");
+			asm volatile("psrlw  $4,%xmm5");
+			asm volatile("psrlw  $4,%xmm13");
+			asm volatile("pand   %xmm11,%xmm4");
+			asm volatile("pand   %xmm11,%xmm12");
+			asm volatile("pand   %xmm11,%xmm5");
+			asm volatile("pand   %xmm11,%xmm13");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][0][1][0]));
+			asm volatile("movdqa %xmm6,%xmm14");
+			asm volatile("movdqa %xmm7,%xmm15");
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm12,%xmm14");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pshufb %xmm13,%xmm15");
+			asm volatile("pxor   %xmm6,%xmm2");
+			asm volatile("pxor   %xmm14,%xmm10");
+			asm volatile("pxor   %xmm7,%xmm2");
+			asm volatile("pxor   %xmm15,%xmm10");
+		}
+
+		/* first disk with all coefficients at 1 */
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
+		asm volatile("movdqa %0,%%xmm12" : : "m" (v[0][i+16]));
+
+		asm volatile("pxor %xmm5,%xmm5");
+		asm volatile("pxor %xmm13,%xmm13");
+		asm volatile("pcmpgtb %xmm1,%xmm5");
+		asm volatile("pcmpgtb %xmm9,%xmm13");
+		asm volatile("paddb %xmm1,%xmm1");
+		asm volatile("paddb %xmm9,%xmm9");
+		asm volatile("pand %xmm3,%xmm5");
+		asm volatile("pand %xmm3,%xmm13");
+		asm volatile("pxor %xmm5,%xmm1");
+		asm volatile("pxor %xmm13,%xmm9");
+
+		asm volatile("pxor %xmm4,%xmm0");
+		asm volatile("pxor %xmm4,%xmm1");
+		asm volatile("pxor %xmm4,%xmm2");
+		asm volatile("pxor %xmm12,%xmm8");
+		asm volatile("pxor %xmm12,%xmm9");
+		asm volatile("pxor %xmm12,%xmm10");
+
+		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
+		asm volatile("movntdq %%xmm8,%0" : "=m" (p[i+16]));
+		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
+		asm volatile("movntdq %%xmm9,%0" : "=m" (q[i+16]));
+		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
+		asm volatile("movntdq %%xmm10,%0" : "=m" (r[i+16]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86
+/*
+ * PAR4 (quad parity with Cauchy matrix) SSSE3 implementation
+ */
+void raid_par4_ssse3(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	uint8_t *s;
+	int d, l;
+	size_t i;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+	s = v[nd+3];
+
+	/* special case with only one data disk */
+	if (l == 0) {
+		for (i = 0; i < 4; ++i)
+			memcpy(v[1+i], v[0], size);
+		return;
+	}
+
+	asm_begin();
+
+	/* generic case with at least two data disks */
+	for (i = 0; i < size; i += 16) {
+		/* last disk without the by two multiplication */
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
+
+		asm volatile("movdqa %xmm4,%xmm0");
+		asm volatile("movdqa %xmm4,%xmm1");
+
+		asm volatile("movdqa %xmm4,%xmm5");
+		asm volatile("psrlw  $4,%xmm5");
+		asm volatile("pand   %xmm7,%xmm4");
+		asm volatile("pand   %xmm7,%xmm5");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][0][1][0]));
+		asm volatile("pshufb %xmm4,%xmm2");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pxor   %xmm7,%xmm2");
+
+		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][1][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][1][1][0]));
+		asm volatile("pshufb %xmm4,%xmm3");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pxor   %xmm7,%xmm3");
+
+		/* intermediate disks */
+		for (d = l-1; d > 0; --d) {
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
+			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
+
+			asm volatile("pxor %xmm5,%xmm5");
+			asm volatile("pcmpgtb %xmm1,%xmm5");
+			asm volatile("paddb %xmm1,%xmm1");
+			asm volatile("pand %xmm7,%xmm5");
+			asm volatile("pxor %xmm5,%xmm1");
+
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
+
+			asm volatile("pxor %xmm4,%xmm0");
+			asm volatile("pxor %xmm4,%xmm1");
+
+			asm volatile("movdqa %xmm4,%xmm5");
+			asm volatile("psrlw  $4,%xmm5");
+			asm volatile("pand   %xmm7,%xmm4");
+			asm volatile("pand   %xmm7,%xmm5");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][0][1][0]));
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pxor   %xmm6,%xmm2");
+			asm volatile("pxor   %xmm7,%xmm2");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][1][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][1][1][0]));
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pxor   %xmm6,%xmm3");
+			asm volatile("pxor   %xmm7,%xmm3");
+		}
+
+		/* first disk with all coefficients at 1 */
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
+
+		asm volatile("pxor %xmm5,%xmm5");
+		asm volatile("pcmpgtb %xmm1,%xmm5");
+		asm volatile("paddb %xmm1,%xmm1");
+		asm volatile("pand %xmm7,%xmm5");
+		asm volatile("pxor %xmm5,%xmm1");
+
+		asm volatile("pxor %xmm4,%xmm0");
+		asm volatile("pxor %xmm4,%xmm1");
+		asm volatile("pxor %xmm4,%xmm2");
+		asm volatile("pxor %xmm4,%xmm3");
+
+		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
+		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
+		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
+		asm volatile("movntdq %%xmm3,%0" : "=m" (s[i]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86_64
+/*
+ * PAR4 (quad parity with Cauchy matrix) SSSE3 implementation
+ *
+ * Note that it uses 16 registers, meaning that x64 is required.
+ */
+void raid_par4_ssse3ext(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	uint8_t *s;
+	int d, l;
+	size_t i;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+	s = v[nd+3];
+
+	/* special case with only one data disk */
+	if (l == 0) {
+		for (i = 0; i < 4; ++i)
+			memcpy(v[1+i], v[0], size);
+		return;
+	}
+
+	asm_begin();
+
+	/* generic case with at least two data disks */
+	for (i = 0; i < size; i += 32) {
+		/* last disk without the by two multiplication */
+		asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.low4[0]));
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
+		asm volatile("movdqa %0,%%xmm12" : : "m" (v[l][i+16]));
+
+		asm volatile("movdqa %xmm4,%xmm0");
+		asm volatile("movdqa %xmm4,%xmm1");
+		asm volatile("movdqa %xmm12,%xmm8");
+		asm volatile("movdqa %xmm12,%xmm9");
+
+		asm volatile("movdqa %xmm4,%xmm5");
+		asm volatile("movdqa %xmm12,%xmm13");
+		asm volatile("psrlw  $4,%xmm5");
+		asm volatile("psrlw  $4,%xmm13");
+		asm volatile("pand   %xmm15,%xmm4");
+		asm volatile("pand   %xmm15,%xmm12");
+		asm volatile("pand   %xmm15,%xmm5");
+		asm volatile("pand   %xmm15,%xmm13");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][0][1][0]));
+		asm volatile("movdqa %xmm2,%xmm10");
+		asm volatile("movdqa %xmm7,%xmm15");
+		asm volatile("pshufb %xmm4,%xmm2");
+		asm volatile("pshufb %xmm12,%xmm10");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pshufb %xmm13,%xmm15");
+		asm volatile("pxor   %xmm7,%xmm2");
+		asm volatile("pxor   %xmm15,%xmm10");
+
+		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][1][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][1][1][0]));
+		asm volatile("movdqa %xmm3,%xmm11");
+		asm volatile("movdqa %xmm7,%xmm15");
+		asm volatile("pshufb %xmm4,%xmm3");
+		asm volatile("pshufb %xmm12,%xmm11");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pshufb %xmm13,%xmm15");
+		asm volatile("pxor   %xmm7,%xmm3");
+		asm volatile("pxor   %xmm15,%xmm11");
+
+		/* intermediate disks */
+		for (d = l-1; d > 0; --d) {
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
+			asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.low4[0]));
+			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
+			asm volatile("movdqa %0,%%xmm12" : : "m" (v[d][i+16]));
+
+			asm volatile("pxor %xmm5,%xmm5");
+			asm volatile("pxor %xmm13,%xmm13");
+			asm volatile("pcmpgtb %xmm1,%xmm5");
+			asm volatile("pcmpgtb %xmm9,%xmm13");
+			asm volatile("paddb %xmm1,%xmm1");
+			asm volatile("paddb %xmm9,%xmm9");
+			asm volatile("pand %xmm7,%xmm5");
+			asm volatile("pand %xmm7,%xmm13");
+			asm volatile("pxor %xmm5,%xmm1");
+			asm volatile("pxor %xmm13,%xmm9");
+
+			asm volatile("pxor %xmm4,%xmm0");
+			asm volatile("pxor %xmm4,%xmm1");
+			asm volatile("pxor %xmm12,%xmm8");
+			asm volatile("pxor %xmm12,%xmm9");
+
+			asm volatile("movdqa %xmm4,%xmm5");
+			asm volatile("movdqa %xmm12,%xmm13");
+			asm volatile("psrlw  $4,%xmm5");
+			asm volatile("psrlw  $4,%xmm13");
+			asm volatile("pand   %xmm15,%xmm4");
+			asm volatile("pand   %xmm15,%xmm12");
+			asm volatile("pand   %xmm15,%xmm5");
+			asm volatile("pand   %xmm15,%xmm13");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][0][1][0]));
+			asm volatile("movdqa %xmm6,%xmm14");
+			asm volatile("movdqa %xmm7,%xmm15");
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm12,%xmm14");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pshufb %xmm13,%xmm15");
+			asm volatile("pxor   %xmm6,%xmm2");
+			asm volatile("pxor   %xmm14,%xmm10");
+			asm volatile("pxor   %xmm7,%xmm2");
+			asm volatile("pxor   %xmm15,%xmm10");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][1][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][1][1][0]));
+			asm volatile("movdqa %xmm6,%xmm14");
+			asm volatile("movdqa %xmm7,%xmm15");
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm12,%xmm14");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pshufb %xmm13,%xmm15");
+			asm volatile("pxor   %xmm6,%xmm3");
+			asm volatile("pxor   %xmm14,%xmm11");
+			asm volatile("pxor   %xmm7,%xmm3");
+			asm volatile("pxor   %xmm15,%xmm11");
+		}
+
+		/* first disk with all coefficients at 1 */
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
+		asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.low4[0]));
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
+		asm volatile("movdqa %0,%%xmm12" : : "m" (v[0][i+16]));
+
+		asm volatile("pxor %xmm5,%xmm5");
+		asm volatile("pxor %xmm13,%xmm13");
+		asm volatile("pcmpgtb %xmm1,%xmm5");
+		asm volatile("pcmpgtb %xmm9,%xmm13");
+		asm volatile("paddb %xmm1,%xmm1");
+		asm volatile("paddb %xmm9,%xmm9");
+		asm volatile("pand %xmm7,%xmm5");
+		asm volatile("pand %xmm7,%xmm13");
+		asm volatile("pxor %xmm5,%xmm1");
+		asm volatile("pxor %xmm13,%xmm9");
+
+		asm volatile("pxor %xmm4,%xmm0");
+		asm volatile("pxor %xmm4,%xmm1");
+		asm volatile("pxor %xmm4,%xmm2");
+		asm volatile("pxor %xmm4,%xmm3");
+		asm volatile("pxor %xmm12,%xmm8");
+		asm volatile("pxor %xmm12,%xmm9");
+		asm volatile("pxor %xmm12,%xmm10");
+		asm volatile("pxor %xmm12,%xmm11");
+
+		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
+		asm volatile("movntdq %%xmm8,%0" : "=m" (p[i+16]));
+		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
+		asm volatile("movntdq %%xmm9,%0" : "=m" (q[i+16]));
+		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
+		asm volatile("movntdq %%xmm10,%0" : "=m" (r[i+16]));
+		asm volatile("movntdq %%xmm3,%0" : "=m" (s[i]));
+		asm volatile("movntdq %%xmm11,%0" : "=m" (s[i+16]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86
+/*
+ * PAR5 (penta parity with Cauchy matrix) SSSE3 implementation
+ */
+void raid_par5_ssse3(int nd, size_t size, void **vv)
+/* ensures that stack is aligned at 16 bytes because we allocate SSE registers in it */
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	uint8_t *s;
+	uint8_t *t;
+	int d, l;
+	size_t i;
+	uint8_t p0[16] __aligned(16);
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+	s = v[nd+3];
+	t = v[nd+4];
+
+	/* special case with only one data disk */
+	if (l == 0) {
+		for (i = 0; i < 5; ++i)
+			memcpy(v[1+i], v[0], size);
+		return;
+	}
+
+	asm_begin();
+
+	/* generic case with at least two data disks */
+	for (i = 0; i < size; i += 16) {
+		/* last disk without the by two multiplication */
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
+
+		asm volatile("movdqa %xmm4,%xmm0");
+		asm volatile("movdqa %%xmm4,%0" : "=m" (p0[0]));
+
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
+		asm volatile("movdqa %xmm4,%xmm5");
+		asm volatile("psrlw  $4,%xmm5");
+		asm volatile("pand   %xmm7,%xmm4");
+		asm volatile("pand   %xmm7,%xmm5");
+
+		asm volatile("movdqa %0,%%xmm1" : : "m" (gfgenpshufb[l][0][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][0][1][0]));
+		asm volatile("pshufb %xmm4,%xmm1");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pxor   %xmm7,%xmm1");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][1][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][1][1][0]));
+		asm volatile("pshufb %xmm4,%xmm2");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pxor   %xmm7,%xmm2");
+
+		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][2][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][2][1][0]));
+		asm volatile("pshufb %xmm4,%xmm3");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pxor   %xmm7,%xmm3");
+
+		/* intermediate disks */
+		for (d = l-1; d > 0; --d) {
+			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
+			asm volatile("movdqa %0,%%xmm6" : : "m" (p0[0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
+
+			asm volatile("pxor %xmm5,%xmm5");
+			asm volatile("pcmpgtb %xmm0,%xmm5");
+			asm volatile("paddb %xmm0,%xmm0");
+			asm volatile("pand %xmm7,%xmm5");
+			asm volatile("pxor %xmm5,%xmm0");
+
+			asm volatile("pxor %xmm4,%xmm0");
+			asm volatile("pxor %xmm4,%xmm6");
+			asm volatile("movdqa %%xmm6,%0" : "=m" (p0[0]));
+
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
+			asm volatile("movdqa %xmm4,%xmm5");
+			asm volatile("psrlw  $4,%xmm5");
+			asm volatile("pand   %xmm7,%xmm4");
+			asm volatile("pand   %xmm7,%xmm5");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][0][1][0]));
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pxor   %xmm6,%xmm1");
+			asm volatile("pxor   %xmm7,%xmm1");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][1][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][1][1][0]));
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pxor   %xmm6,%xmm2");
+			asm volatile("pxor   %xmm7,%xmm2");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][2][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][2][1][0]));
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pxor   %xmm6,%xmm3");
+			asm volatile("pxor   %xmm7,%xmm3");
+		}
+
+		/* first disk with all coefficients at 1 */
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
+		asm volatile("movdqa %0,%%xmm6" : : "m" (p0[0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
+
+		asm volatile("pxor %xmm5,%xmm5");
+		asm volatile("pcmpgtb %xmm0,%xmm5");
+		asm volatile("paddb %xmm0,%xmm0");
+		asm volatile("pand %xmm7,%xmm5");
+		asm volatile("pxor %xmm5,%xmm0");
+
+		asm volatile("pxor %xmm4,%xmm0");
+		asm volatile("pxor %xmm4,%xmm1");
+		asm volatile("pxor %xmm4,%xmm2");
+		asm volatile("pxor %xmm4,%xmm3");
+		asm volatile("pxor %xmm4,%xmm6");
+
+		asm volatile("movntdq %%xmm6,%0" : "=m" (p[i]));
+		asm volatile("movntdq %%xmm0,%0" : "=m" (q[i]));
+		asm volatile("movntdq %%xmm1,%0" : "=m" (r[i]));
+		asm volatile("movntdq %%xmm2,%0" : "=m" (s[i]));
+		asm volatile("movntdq %%xmm3,%0" : "=m" (t[i]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86_64
+/*
+ * PAR5 (penta parity with Cauchy matrix) SSSE3 implementation
+ *
+ * Note that it uses 16 registers, meaning that x64 is required.
+ */
+void raid_par5_ssse3ext(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	uint8_t *s;
+	uint8_t *t;
+	int d, l;
+	size_t i;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+	s = v[nd+3];
+	t = v[nd+4];
+
+	/* special case with only one data disk */
+	if (l == 0) {
+		for (i = 0; i < 5; ++i)
+			memcpy(v[1+i], v[0], size);
+		return;
+	}
+
+	asm_begin();
+
+	/* generic case with at least two data disks */
+	asm volatile("movdqa %0,%%xmm14" : : "m" (gfconst16.poly[0]));
+	asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.low4[0]));
+
+	for (i = 0; i < size; i += 16) {
+		/* last disk without the by two multiplication */
+		asm volatile("movdqa %0,%%xmm10" : : "m" (v[l][i]));
+
+		asm volatile("movdqa %xmm10,%xmm0");
+		asm volatile("movdqa %xmm10,%xmm1");
+
+		asm volatile("movdqa %xmm10,%xmm11");
+		asm volatile("psrlw  $4,%xmm11");
+		asm volatile("pand   %xmm15,%xmm10");
+		asm volatile("pand   %xmm15,%xmm11");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
+		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][0][1][0]));
+		asm volatile("pshufb %xmm10,%xmm2");
+		asm volatile("pshufb %xmm11,%xmm13");
+		asm volatile("pxor   %xmm13,%xmm2");
+
+		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][1][0][0]));
+		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][1][1][0]));
+		asm volatile("pshufb %xmm10,%xmm3");
+		asm volatile("pshufb %xmm11,%xmm13");
+		asm volatile("pxor   %xmm13,%xmm3");
+
+		asm volatile("movdqa %0,%%xmm4" : : "m" (gfgenpshufb[l][2][0][0]));
+		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][2][1][0]));
+		asm volatile("pshufb %xmm10,%xmm4");
+		asm volatile("pshufb %xmm11,%xmm13");
+		asm volatile("pxor   %xmm13,%xmm4");
+
+		/* intermediate disks */
+		for (d = l-1; d > 0; --d) {
+			asm volatile("movdqa %0,%%xmm10" : : "m" (v[d][i]));
+
+			asm volatile("pxor %xmm11,%xmm11");
+			asm volatile("pcmpgtb %xmm1,%xmm11");
+			asm volatile("paddb %xmm1,%xmm1");
+			asm volatile("pand %xmm14,%xmm11");
+			asm volatile("pxor %xmm11,%xmm1");
+
+			asm volatile("pxor %xmm10,%xmm0");
+			asm volatile("pxor %xmm10,%xmm1");
+
+			asm volatile("movdqa %xmm10,%xmm11");
+			asm volatile("psrlw  $4,%xmm11");
+			asm volatile("pand   %xmm15,%xmm10");
+			asm volatile("pand   %xmm15,%xmm11");
+
+			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][0][0][0]));
+			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][0][1][0]));
+			asm volatile("pshufb %xmm10,%xmm12");
+			asm volatile("pshufb %xmm11,%xmm13");
+			asm volatile("pxor   %xmm12,%xmm2");
+			asm volatile("pxor   %xmm13,%xmm2");
+
+			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][1][0][0]));
+			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][1][1][0]));
+			asm volatile("pshufb %xmm10,%xmm12");
+			asm volatile("pshufb %xmm11,%xmm13");
+			asm volatile("pxor   %xmm12,%xmm3");
+			asm volatile("pxor   %xmm13,%xmm3");
+
+			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][2][0][0]));
+			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][2][1][0]));
+			asm volatile("pshufb %xmm10,%xmm12");
+			asm volatile("pshufb %xmm11,%xmm13");
+			asm volatile("pxor   %xmm12,%xmm4");
+			asm volatile("pxor   %xmm13,%xmm4");
+		}
+
+		/* first disk with all coefficients at 1 */
+		asm volatile("movdqa %0,%%xmm10" : : "m" (v[0][i]));
+
+		asm volatile("pxor %xmm11,%xmm11");
+		asm volatile("pcmpgtb %xmm1,%xmm11");
+		asm volatile("paddb %xmm1,%xmm1");
+		asm volatile("pand %xmm14,%xmm11");
+		asm volatile("pxor %xmm11,%xmm1");
+
+		asm volatile("pxor %xmm10,%xmm0");
+		asm volatile("pxor %xmm10,%xmm1");
+		asm volatile("pxor %xmm10,%xmm2");
+		asm volatile("pxor %xmm10,%xmm3");
+		asm volatile("pxor %xmm10,%xmm4");
+
+		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
+		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
+		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
+		asm volatile("movntdq %%xmm3,%0" : "=m" (s[i]));
+		asm volatile("movntdq %%xmm4,%0" : "=m" (t[i]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86
+/*
+ * PAR6 (hexa parity with Cauchy matrix) SSSE3 implementation
+ */
+void raid_par6_ssse3(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	uint8_t *s;
+	uint8_t *t;
+	uint8_t *u;
+	int d, l;
+	size_t i;
+	uint8_t p0[16] __aligned(16);
+	uint8_t q0[16] __aligned(16);
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+	s = v[nd+3];
+	t = v[nd+4];
+	u = v[nd+5];
+
+	/* special case with only one data disk */
+	if (l == 0) {
+		for (i = 0; i < 6; ++i)
+			memcpy(v[1+i], v[0], size);
+		return;
+	}
+
+	asm_begin();
+
+	/* generic case with at least two data disks */
+	for (i = 0; i < size; i += 16) {
+		/* last disk without the by two multiplication */
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[l][i]));
+
+		asm volatile("movdqa %%xmm4,%0" : "=m" (p0[0]));
+		asm volatile("movdqa %%xmm4,%0" : "=m" (q0[0]));
+
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
+		asm volatile("movdqa %xmm4,%xmm5");
+		asm volatile("psrlw  $4,%xmm5");
+		asm volatile("pand   %xmm7,%xmm4");
+		asm volatile("pand   %xmm7,%xmm5");
+
+		asm volatile("movdqa %0,%%xmm0" : : "m" (gfgenpshufb[l][0][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][0][1][0]));
+		asm volatile("pshufb %xmm4,%xmm0");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pxor   %xmm7,%xmm0");
+
+		asm volatile("movdqa %0,%%xmm1" : : "m" (gfgenpshufb[l][1][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][1][1][0]));
+		asm volatile("pshufb %xmm4,%xmm1");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pxor   %xmm7,%xmm1");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][2][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][2][1][0]));
+		asm volatile("pshufb %xmm4,%xmm2");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pxor   %xmm7,%xmm2");
+
+		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][3][0][0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[l][3][1][0]));
+		asm volatile("pshufb %xmm4,%xmm3");
+		asm volatile("pshufb %xmm5,%xmm7");
+		asm volatile("pxor   %xmm7,%xmm3");
+
+		/* intermediate disks */
+		for (d = l-1; d > 0; --d) {
+			asm volatile("movdqa %0,%%xmm5" : : "m" (p0[0]));
+			asm volatile("movdqa %0,%%xmm6" : : "m" (q0[0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
+
+			asm volatile("pxor %xmm4,%xmm4");
+			asm volatile("pcmpgtb %xmm6,%xmm4");
+			asm volatile("paddb %xmm6,%xmm6");
+			asm volatile("pand %xmm7,%xmm4");
+			asm volatile("pxor %xmm4,%xmm6");
+
+			asm volatile("movdqa %0,%%xmm4" : : "m" (v[d][i]));
+
+			asm volatile("pxor %xmm4,%xmm5");
+			asm volatile("pxor %xmm4,%xmm6");
+			asm volatile("movdqa %%xmm5,%0" : "=m" (p0[0]));
+			asm volatile("movdqa %%xmm6,%0" : "=m" (q0[0]));
+
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
+			asm volatile("movdqa %xmm4,%xmm5");
+			asm volatile("psrlw  $4,%xmm5");
+			asm volatile("pand   %xmm7,%xmm4");
+			asm volatile("pand   %xmm7,%xmm5");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][0][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][0][1][0]));
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pxor   %xmm6,%xmm0");
+			asm volatile("pxor   %xmm7,%xmm0");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][1][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][1][1][0]));
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pxor   %xmm6,%xmm1");
+			asm volatile("pxor   %xmm7,%xmm1");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][2][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][2][1][0]));
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pxor   %xmm6,%xmm2");
+			asm volatile("pxor   %xmm7,%xmm2");
+
+			asm volatile("movdqa %0,%%xmm6" : : "m" (gfgenpshufb[d][3][0][0]));
+			asm volatile("movdqa %0,%%xmm7" : : "m" (gfgenpshufb[d][3][1][0]));
+			asm volatile("pshufb %xmm4,%xmm6");
+			asm volatile("pshufb %xmm5,%xmm7");
+			asm volatile("pxor   %xmm6,%xmm3");
+			asm volatile("pxor   %xmm7,%xmm3");
+		}
+
+		/* first disk with all coefficients at 1 */
+		asm volatile("movdqa %0,%%xmm5" : : "m" (p0[0]));
+		asm volatile("movdqa %0,%%xmm6" : : "m" (q0[0]));
+		asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.poly[0]));
+
+		asm volatile("pxor %xmm4,%xmm4");
+		asm volatile("pcmpgtb %xmm6,%xmm4");
+		asm volatile("paddb %xmm6,%xmm6");
+		asm volatile("pand %xmm7,%xmm4");
+		asm volatile("pxor %xmm4,%xmm6");
+
+		asm volatile("movdqa %0,%%xmm4" : : "m" (v[0][i]));
+		asm volatile("pxor %xmm4,%xmm0");
+		asm volatile("pxor %xmm4,%xmm1");
+		asm volatile("pxor %xmm4,%xmm2");
+		asm volatile("pxor %xmm4,%xmm3");
+		asm volatile("pxor %xmm4,%xmm5");
+		asm volatile("pxor %xmm4,%xmm6");
+
+		asm volatile("movntdq %%xmm5,%0" : "=m" (p[i]));
+		asm volatile("movntdq %%xmm6,%0" : "=m" (q[i]));
+		asm volatile("movntdq %%xmm0,%0" : "=m" (r[i]));
+		asm volatile("movntdq %%xmm1,%0" : "=m" (s[i]));
+		asm volatile("movntdq %%xmm2,%0" : "=m" (t[i]));
+		asm volatile("movntdq %%xmm3,%0" : "=m" (u[i]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86_64
+/*
+ * PAR6 (hexa parity with Cauchy matrix) SSSE3 implementation
+ *
+ * Note that it uses 16 registers, meaning that x64 is required.
+ */
+void raid_par6_ssse3ext(int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *q;
+	uint8_t *r;
+	uint8_t *s;
+	uint8_t *t;
+	uint8_t *u;
+	int d, l;
+	size_t i;
+
+	l = nd - 1;
+	p = v[nd];
+	q = v[nd+1];
+	r = v[nd+2];
+	s = v[nd+3];
+	t = v[nd+4];
+	u = v[nd+5];
+
+	/* special case with only one data disk */
+	if (l == 0) {
+		for (i = 0; i < 6; ++i)
+			memcpy(v[1+i], v[0], size);
+		return;
+	}
+
+	asm_begin();
+
+	/* generic case with at least two data disks */
+	asm volatile("movdqa %0,%%xmm14" : : "m" (gfconst16.poly[0]));
+	asm volatile("movdqa %0,%%xmm15" : : "m" (gfconst16.low4[0]));
+
+	for (i = 0; i < size; i += 16) {
+		/* last disk without the by two multiplication */
+		asm volatile("movdqa %0,%%xmm10" : : "m" (v[l][i]));
+
+		asm volatile("movdqa %xmm10,%xmm0");
+		asm volatile("movdqa %xmm10,%xmm1");
+
+		asm volatile("movdqa %xmm10,%xmm11");
+		asm volatile("psrlw  $4,%xmm11");
+		asm volatile("pand   %xmm15,%xmm10");
+		asm volatile("pand   %xmm15,%xmm11");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfgenpshufb[l][0][0][0]));
+		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][0][1][0]));
+		asm volatile("pshufb %xmm10,%xmm2");
+		asm volatile("pshufb %xmm11,%xmm13");
+		asm volatile("pxor   %xmm13,%xmm2");
+
+		asm volatile("movdqa %0,%%xmm3" : : "m" (gfgenpshufb[l][1][0][0]));
+		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][1][1][0]));
+		asm volatile("pshufb %xmm10,%xmm3");
+		asm volatile("pshufb %xmm11,%xmm13");
+		asm volatile("pxor   %xmm13,%xmm3");
+
+		asm volatile("movdqa %0,%%xmm4" : : "m" (gfgenpshufb[l][2][0][0]));
+		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][2][1][0]));
+		asm volatile("pshufb %xmm10,%xmm4");
+		asm volatile("pshufb %xmm11,%xmm13");
+		asm volatile("pxor   %xmm13,%xmm4");
+
+		asm volatile("movdqa %0,%%xmm5" : : "m" (gfgenpshufb[l][3][0][0]));
+		asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[l][3][1][0]));
+		asm volatile("pshufb %xmm10,%xmm5");
+		asm volatile("pshufb %xmm11,%xmm13");
+		asm volatile("pxor   %xmm13,%xmm5");
+
+		/* intermediate disks */
+		for (d = l-1; d > 0; --d) {
+			asm volatile("movdqa %0,%%xmm10" : : "m" (v[d][i]));
+
+			asm volatile("pxor %xmm11,%xmm11");
+			asm volatile("pcmpgtb %xmm1,%xmm11");
+			asm volatile("paddb %xmm1,%xmm1");
+			asm volatile("pand %xmm14,%xmm11");
+			asm volatile("pxor %xmm11,%xmm1");
+
+			asm volatile("pxor %xmm10,%xmm0");
+			asm volatile("pxor %xmm10,%xmm1");
+
+			asm volatile("movdqa %xmm10,%xmm11");
+			asm volatile("psrlw  $4,%xmm11");
+			asm volatile("pand   %xmm15,%xmm10");
+			asm volatile("pand   %xmm15,%xmm11");
+
+			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][0][0][0]));
+			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][0][1][0]));
+			asm volatile("pshufb %xmm10,%xmm12");
+			asm volatile("pshufb %xmm11,%xmm13");
+			asm volatile("pxor   %xmm12,%xmm2");
+			asm volatile("pxor   %xmm13,%xmm2");
+
+			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][1][0][0]));
+			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][1][1][0]));
+			asm volatile("pshufb %xmm10,%xmm12");
+			asm volatile("pshufb %xmm11,%xmm13");
+			asm volatile("pxor   %xmm12,%xmm3");
+			asm volatile("pxor   %xmm13,%xmm3");
+
+			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][2][0][0]));
+			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][2][1][0]));
+			asm volatile("pshufb %xmm10,%xmm12");
+			asm volatile("pshufb %xmm11,%xmm13");
+			asm volatile("pxor   %xmm12,%xmm4");
+			asm volatile("pxor   %xmm13,%xmm4");
+
+			asm volatile("movdqa %0,%%xmm12" : : "m" (gfgenpshufb[d][3][0][0]));
+			asm volatile("movdqa %0,%%xmm13" : : "m" (gfgenpshufb[d][3][1][0]));
+			asm volatile("pshufb %xmm10,%xmm12");
+			asm volatile("pshufb %xmm11,%xmm13");
+			asm volatile("pxor   %xmm12,%xmm5");
+			asm volatile("pxor   %xmm13,%xmm5");
+		}
+
+		/* first disk with all coefficients at 1 */
+		asm volatile("movdqa %0,%%xmm10" : : "m" (v[0][i]));
+
+		asm volatile("pxor %xmm11,%xmm11");
+		asm volatile("pcmpgtb %xmm1,%xmm11");
+		asm volatile("paddb %xmm1,%xmm1");
+		asm volatile("pand %xmm14,%xmm11");
+		asm volatile("pxor %xmm11,%xmm1");
+
+		asm volatile("pxor %xmm10,%xmm0");
+		asm volatile("pxor %xmm10,%xmm1");
+		asm volatile("pxor %xmm10,%xmm2");
+		asm volatile("pxor %xmm10,%xmm3");
+		asm volatile("pxor %xmm10,%xmm4");
+		asm volatile("pxor %xmm10,%xmm5");
+
+		asm volatile("movntdq %%xmm0,%0" : "=m" (p[i]));
+		asm volatile("movntdq %%xmm1,%0" : "=m" (q[i]));
+		asm volatile("movntdq %%xmm2,%0" : "=m" (r[i]));
+		asm volatile("movntdq %%xmm3,%0" : "=m" (s[i]));
+		asm volatile("movntdq %%xmm4,%0" : "=m" (t[i]));
+		asm volatile("movntdq %%xmm5,%0" : "=m" (u[i]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86
+/*
+ * RAID recovering for one disk SSSE3 implementation
+ */
+void raid_rec1_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	uint8_t *p;
+	uint8_t *pa;
+	uint8_t G;
+	uint8_t V;
+	size_t i;
+
+	(void)nr; /* unused, it's always 1 */
+
+	/* if it's RAID5 uses the faster function */
+	if (ip[0] == 0) {
+		raid_rec1_par1(id, nd, size, vv);
+		return;
+	}
+
+#ifdef RAID_USE_RAID6_PQ
+	/* if it's RAID6 recovering with Q uses the faster function */
+	if (ip[0] == 1) {
+		raid6_datap_recov(nd + 2, size, id[0], vv);
+		return;
+	}
+#endif
+
+	/* setup the coefficients matrix */
+	G = A(ip[0], id[0]);
+
+	/* invert it to solve the system of linear equations */
+	V = inv(G);
+
+	/* compute delta parity */
+	raid_delta_gen(1, id, ip, nd, size, vv);
+
+	p = v[nd+ip[0]];
+	pa = v[id[0]];
+
+	asm_begin();
+
+	asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
+	asm volatile("movdqa %0,%%xmm4" : : "m" (gfmulpshufb[V][0][0]));
+	asm volatile("movdqa %0,%%xmm5" : : "m" (gfmulpshufb[V][1][0]));
+
+	for (i = 0; i < size; i += 16) {
+		asm volatile("movdqa %0,%%xmm0" : : "m" (p[i]));
+		asm volatile("movdqa %0,%%xmm1" : : "m" (pa[i]));
+		asm volatile("movdqa %xmm4,%xmm2");
+		asm volatile("movdqa %xmm5,%xmm3");
+		asm volatile("pxor   %xmm0,%xmm1");
+		asm volatile("movdqa %xmm1,%xmm0");
+		asm volatile("psrlw  $4,%xmm1");
+		asm volatile("pand   %xmm7,%xmm0");
+		asm volatile("pand   %xmm7,%xmm1");
+		asm volatile("pshufb %xmm0,%xmm2");
+		asm volatile("pshufb %xmm1,%xmm3");
+		asm volatile("pxor   %xmm3,%xmm2");
+		asm volatile("movdqa %%xmm2,%0" : "=m" (pa[i]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86
+/*
+ * RAID recovering for two disks SSSE3 implementation
+ */
+void raid_rec2_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	const int N = 2;
+	uint8_t *p[N];
+	uint8_t *pa[N];
+	uint8_t G[N*N];
+	uint8_t V[N*N];
+	size_t i;
+	int j, k;
+
+	(void)nr; /* unused, it's always 2 */
+
+#ifdef RAID_USE_RAID6_PQ
+	/* if it's RAID6 recovering with P and Q uses the faster function */
+	if (ip[0] == 0 && ip[1] == 1) {
+		raid6_2data_recov(nd + 2, size, id[0], id[1], vv);
+		return;
+	}
+#endif
+
+	/* setup the coefficients matrix */
+	for (j = 0; j < N; ++j)
+		for (k = 0; k < N; ++k)
+			G[j*N+k] = A(ip[j], id[k]);
+
+	/* invert it to solve the system of linear equations */
+	raid_invert(G, V, N);
+
+	/* compute delta parity */
+	raid_delta_gen(N, id, ip, nd, size, vv);
+
+	for (j = 0; j < N; ++j) {
+		p[j] = v[nd+ip[j]];
+		pa[j] = v[id[j]];
+	}
+
+	asm_begin();
+
+	asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
+
+	for (i = 0; i < size; i += 16) {
+		asm volatile("movdqa %0,%%xmm0" : : "m" (p[0][i]));
+		asm volatile("movdqa %0,%%xmm2" : : "m" (pa[0][i]));
+		asm volatile("movdqa %0,%%xmm1" : : "m" (p[1][i]));
+		asm volatile("movdqa %0,%%xmm3" : : "m" (pa[1][i]));
+		asm volatile("pxor   %xmm2,%xmm0");
+		asm volatile("pxor   %xmm3,%xmm1");
+
+		asm volatile("pxor %xmm6,%xmm6");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfmulpshufb[V[0]][0][0]));
+		asm volatile("movdqa %0,%%xmm3" : : "m" (gfmulpshufb[V[0]][1][0]));
+		asm volatile("movdqa %xmm0,%xmm4");
+		asm volatile("movdqa %xmm0,%xmm5");
+		asm volatile("psrlw  $4,%xmm5");
+		asm volatile("pand   %xmm7,%xmm4");
+		asm volatile("pand   %xmm7,%xmm5");
+		asm volatile("pshufb %xmm4,%xmm2");
+		asm volatile("pshufb %xmm5,%xmm3");
+		asm volatile("pxor   %xmm2,%xmm6");
+		asm volatile("pxor   %xmm3,%xmm6");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfmulpshufb[V[1]][0][0]));
+		asm volatile("movdqa %0,%%xmm3" : : "m" (gfmulpshufb[V[1]][1][0]));
+		asm volatile("movdqa %xmm1,%xmm4");
+		asm volatile("movdqa %xmm1,%xmm5");
+		asm volatile("psrlw  $4,%xmm5");
+		asm volatile("pand   %xmm7,%xmm4");
+		asm volatile("pand   %xmm7,%xmm5");
+		asm volatile("pshufb %xmm4,%xmm2");
+		asm volatile("pshufb %xmm5,%xmm3");
+		asm volatile("pxor   %xmm2,%xmm6");
+		asm volatile("pxor   %xmm3,%xmm6");
+
+		asm volatile("movdqa %%xmm6,%0" : "=m" (pa[0][i]));
+
+		asm volatile("pxor %xmm6,%xmm6");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfmulpshufb[V[2]][0][0]));
+		asm volatile("movdqa %0,%%xmm3" : : "m" (gfmulpshufb[V[2]][1][0]));
+		asm volatile("movdqa %xmm0,%xmm4");
+		asm volatile("movdqa %xmm0,%xmm5");
+		asm volatile("psrlw  $4,%xmm5");
+		asm volatile("pand   %xmm7,%xmm4");
+		asm volatile("pand   %xmm7,%xmm5");
+		asm volatile("pshufb %xmm4,%xmm2");
+		asm volatile("pshufb %xmm5,%xmm3");
+		asm volatile("pxor   %xmm2,%xmm6");
+		asm volatile("pxor   %xmm3,%xmm6");
+
+		asm volatile("movdqa %0,%%xmm2" : : "m" (gfmulpshufb[V[3]][0][0]));
+		asm volatile("movdqa %0,%%xmm3" : : "m" (gfmulpshufb[V[3]][1][0]));
+		asm volatile("movdqa %xmm1,%xmm4");
+		asm volatile("movdqa %xmm1,%xmm5");
+		asm volatile("psrlw  $4,%xmm5");
+		asm volatile("pand   %xmm7,%xmm4");
+		asm volatile("pand   %xmm7,%xmm5");
+		asm volatile("pshufb %xmm4,%xmm2");
+		asm volatile("pshufb %xmm5,%xmm3");
+		asm volatile("pxor   %xmm2,%xmm6");
+		asm volatile("pxor   %xmm3,%xmm6");
+
+		asm volatile("movdqa %%xmm6,%0" : "=m" (pa[1][i]));
+	}
+
+	asm_end();
+}
+#endif
+
+#ifdef CONFIG_X86
+/*
+ * RAID recovering SSSE3 implementation
+ */
+void raid_recX_ssse3(int nr, int *id, int *ip, int nd, size_t size, void **vv)
+{
+	uint8_t **v = (uint8_t **)vv;
+	int N = nr;
+	uint8_t *p[RAID_PARITY_MAX];
+	uint8_t *pa[RAID_PARITY_MAX];
+	uint8_t G[RAID_PARITY_MAX*RAID_PARITY_MAX];
+	uint8_t V[RAID_PARITY_MAX*RAID_PARITY_MAX];
+	size_t i;
+	int j, k;
+
+	/* setup the coefficients matrix */
+	for (j = 0; j < N; ++j)
+		for (k = 0; k < N; ++k)
+			G[j*N+k] = A(ip[j], id[k]);
+
+	/* invert it to solve the system of linear equations */
+	raid_invert(G, V, N);
+
+	/* compute delta parity */
+	raid_delta_gen(N, id, ip, nd, size, vv);
+
+	for (j = 0; j < N; ++j) {
+		p[j] = v[nd+ip[j]];
+		pa[j] = v[id[j]];
+	}
+
+	asm_begin();
+
+	asm volatile("movdqa %0,%%xmm7" : : "m" (gfconst16.low4[0]));
+
+	for (i = 0; i < size; i += 16) {
+		uint8_t PD[RAID_PARITY_MAX][16] __aligned(16);
+
+		/* delta */
+		for (j = 0; j < N; ++j) {
+			asm volatile("movdqa %0,%%xmm0" : : "m" (p[j][i]));
+			asm volatile("movdqa %0,%%xmm1" : : "m" (pa[j][i]));
+			asm volatile("pxor   %xmm1,%xmm0");
+			asm volatile("movdqa %%xmm0,%0" : "=m" (PD[j][0]));
+		}
+
+		/* reconstruct */
+		for (j = 0; j < N; ++j) {
+			asm volatile("pxor %xmm0,%xmm0");
+			asm volatile("pxor %xmm1,%xmm1");
+
+			for (k = 0; k < N; ++k) {
+				uint8_t m = V[j*N+k];
+
+				asm volatile("movdqa %0,%%xmm2" : : "m" (gfmulpshufb[m][0][0]));
+				asm volatile("movdqa %0,%%xmm3" : : "m" (gfmulpshufb[m][1][0]));
+				asm volatile("movdqa %0,%%xmm4" : : "m" (PD[k][0]));
+				asm volatile("movdqa %xmm4,%xmm5");
+				asm volatile("psrlw  $4,%xmm5");
+				asm volatile("pand   %xmm7,%xmm4");
+				asm volatile("pand   %xmm7,%xmm5");
+				asm volatile("pshufb %xmm4,%xmm2");
+				asm volatile("pshufb %xmm5,%xmm3");
+				asm volatile("pxor   %xmm2,%xmm0");
+				asm volatile("pxor   %xmm3,%xmm1");
+			}
+
+			asm volatile("pxor %xmm1,%xmm0");
+			asm volatile("movdqa %%xmm0,%0" : "=m" (pa[j][i]));
+		}
+	}
+
+	asm_end();
+}
+#endif
+
-- 
1.7.12.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/