vendor/github.com/klauspost/reedsolomon/README.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395

# Reed-Solomon
[![GoDoc][1]][2] [![Build Status][3]][4]

[1]: https://godoc.org/github.com/klauspost/reedsolomon?status.svg
[2]: https://pkg.go.dev/github.com/klauspost/reedsolomon?tab=doc
[3]: https://travis-ci.org/klauspost/reedsolomon.svg?branch=master
[4]: https://travis-ci.org/klauspost/reedsolomon

Reed-Solomon Erasure Coding in Go, with speeds exceeding 1GB/s/cpu core implemented in pure Go.

This is a Go port of the [JavaReedSolomon](https://github.com/Backblaze/JavaReedSolomon) library released by 
[Backblaze](http://backblaze.com), with some additional optimizations.

For an introduction on erasure coding, see the post on the [Backblaze blog](https://www.backblaze.com/blog/reed-solomon/).

Package home: https://github.com/klauspost/reedsolomon

Godoc: https://pkg.go.dev/github.com/klauspost/reedsolomon?tab=doc

# Installation
To get the package use the standard:
```bash
go get -u github.com/klauspost/reedsolomon
```

# Changes

## May 2020

* ARM64 optimizations, up to 2.5x faster.
* Added [WithFastOneParityMatrix](https://pkg.go.dev/github.com/klauspost/reedsolomon?tab=doc#WithFastOneParityMatrix) for faster operation with 1 parity shard.
* Much better performance when using a limited number of goroutines.
* AVX512 is now using multiple cores.
* Stream processing overhaul, big speedups in most cases.
* AVX512 optimizations

## March 6, 2019

The pure Go implementation is about 30% faster. Minor tweaks to assembler implementations.

## February 8, 2019

AVX512 accelerated version added for Intel Skylake CPUs. This can give up to a 4x speed improvement as compared to AVX2.
See [here](https://github.com/klauspost/reedsolomon#performance-on-avx512) for more details.

## December 18, 2018

Assembly code for ppc64le has been contributed, this boosts performance by about 10x on this platform.

## November 18, 2017

Added [WithAutoGoroutines](https://godoc.org/github.com/klauspost/reedsolomon#WithAutoGoroutines) which will attempt 
to calculate the optimal number of goroutines to use based on your expected shard size and detected CPU.

## October 1, 2017

* [Cauchy Matrix](https://godoc.org/github.com/klauspost/reedsolomon#WithCauchyMatrix) is now an option. 
Thanks to [templexxx](https://github.com/templexxx) for the basis of this.

* Default maximum number of [goroutines](https://godoc.org/github.com/klauspost/reedsolomon#WithMaxGoroutines) 
has been increased for better multi-core scaling.

* After several requests the Reconstruct and ReconstructData now slices of zero length but sufficient capacity to 
be used instead of allocating new memory.

## August 26, 2017

*  The [`Encoder()`](https://godoc.org/github.com/klauspost/reedsolomon#Encoder) now contains an `Update` 
function contributed by [chenzhongtao](https://github.com/chenzhongtao).

* [Frank Wessels](https://github.com/fwessels) kindly contributed ARM 64 bit assembly, 
which gives a huge performance boost on this platform.

## July 20, 2017

`ReconstructData` added to [`Encoder`](https://godoc.org/github.com/klauspost/reedsolomon#Encoder) interface. 
This can cause compatibility issues if you implement your own Encoder. A simple workaround can be added:

```Go
func (e *YourEnc) ReconstructData(shards [][]byte) error {
	return ReconstructData(shards)
}
```

You can of course also do your own implementation. 
The [`StreamEncoder`](https://godoc.org/github.com/klauspost/reedsolomon#StreamEncoder) 
handles this without modifying the interface. 
This is a good lesson on why returning interfaces is not a good design.

# Usage

This section assumes you know the basics of Reed-Solomon encoding. 
A good start is this [Backblaze blog post](https://www.backblaze.com/blog/reed-solomon/).

This package performs the calculation of the parity sets. The usage is therefore relatively simple.

First of all, you need to choose your distribution of data and parity shards. 
A 'good' distribution is very subjective, and will depend a lot on your usage scenario. 
A good starting point is above 5 and below 257 data shards (the maximum supported number), 
and the number of parity shards to be 2 or above, and below the number of data shards.

To create an encoder with 10 data shards (where your data goes) and 3 parity shards (calculated):
```Go
    enc, err := reedsolomon.New(10, 3)
```
This encoder will work for all parity sets with this distribution of data and parity shards. 
The error will only be set if you specify 0 or negative values in any of the parameters, 
or if you specify more than 256 data shards.

If you will primarily be using it with one shard size it is recommended to use 
[`WithAutoGoroutines(shardSize)`](https://pkg.go.dev/github.com/klauspost/reedsolomon?tab=doc#WithAutoGoroutines)
as an additional parameter. This will attempt to calculate the optimal number of goroutines to use for the best speed.
It is not required that all shards are this size. 

The you send and receive data  is a simple slice of byte slices; `[][]byte`. 
In the example above, the top slice must have a length of 13.

```Go
    data := make([][]byte, 13)
```
You should then fill the 10 first slices with *equally sized* data, 
and create parity shards that will be populated with parity data. In this case we create the data in memory, 
but you could for instance also use [mmap](https://github.com/edsrzf/mmap-go) to map files.

```Go
    // Create all shards, size them at 50000 each
    for i := range input {
      data[i] := make([]byte, 50000)
    }
    
    
  // Fill some data into the data shards
    for i, in := range data[:10] {
      for j:= range in {
         in[j] = byte((i+j)&0xff)
      }
    }
```

To populate the parity shards, you simply call `Encode()` with your data.
```Go
    err = enc.Encode(data)
```
The only cases where you should get an error is, if the data shards aren't of equal size. 
The last 3 shards now contain parity data. You can verify this by calling `Verify()`:

```Go
    ok, err = enc.Verify(data)
```

The final (and important) part is to be able to reconstruct missing shards. 
For this to work, you need to know which parts of your data is missing. 
The encoder *does not know which parts are invalid*, so if data corruption is a likely scenario, 
you need to implement a hash check for each shard. 

If a byte has changed in your set, and you don't know which it is, there is no way to reconstruct the data set.

To indicate missing data, you set the shard to nil before calling `Reconstruct()`:

```Go
    // Delete two data shards
    data[3] = nil
    data[7] = nil
    
    // Reconstruct the missing shards
    err := enc.Reconstruct(data)
```
The missing data and parity shards will be recreated. If more than 3 shards are missing, the reconstruction will fail.

If you are only interested in the data shards (for reading purposes) you can call `ReconstructData()`:

```Go
    // Delete two data shards
    data[3] = nil
    data[7] = nil
    
    // Reconstruct just the missing data shards
    err := enc.ReconstructData(data)
```

So to sum up reconstruction:
* The number of data/parity shards must match the numbers used for encoding.
* The order of shards must be the same as used when encoding.
* You may only supply data you know is valid.
* Invalid shards should be set to nil.

For complete examples of an encoder and decoder see the 
[examples folder](https://github.com/klauspost/reedsolomon/tree/master/examples).

# Splitting/Joining Data

You might have a large slice of data. 
To help you split this, there are some helper functions that can split and join a single byte slice.

```Go
   bigfile, _ := ioutil.Readfile("myfile.data")
   
   // Split the file
   split, err := enc.Split(bigfile)
```
This will split the file into the number of data shards set when creating the encoder and create empty parity shards. 

An important thing to note is that you have to *keep track of the exact input size*. 
If the size of the input isn't divisible by the number of data shards, extra zeros will be inserted in the last shard.

To join a data set, use the `Join()` function, which will join the shards and write it to the `io.Writer` you supply: 
```Go
   // Join a data set and write it to io.Discard.
   err = enc.Join(io.Discard, data, len(bigfile))
```

# Streaming/Merging

It might seem like a limitation that all data should be in memory, 
but an important property is that *as long as the number of data/parity shards are the same, 
you can merge/split data sets*, and they will remain valid as a separate set.

```Go
    // Split the data set of 50000 elements into two of 25000
    splitA := make([][]byte, 13)
    splitB := make([][]byte, 13)
    
    // Merge into a 100000 element set
    merged := make([][]byte, 13)
    
    for i := range data {
      splitA[i] = data[i][:25000]
      splitB[i] = data[i][25000:]
      
      // Concatenate it to itself
	  merged[i] = append(make([]byte, 0, len(data[i])*2), data[i]...)
	  merged[i] = append(merged[i], data[i]...)
    }
    
    // Each part should still verify as ok.
    ok, err := enc.Verify(splitA)
    if ok && err == nil {
        log.Println("splitA ok")
    }
    
    ok, err = enc.Verify(splitB)
    if ok && err == nil {
        log.Println("splitB ok")
    }
    
    ok, err = enc.Verify(merge)
    if ok && err == nil {
        log.Println("merge ok")
    }
```

This means that if you have a data set that may not fit into memory, you can split processing into smaller blocks. 
For the best throughput, don't use too small blocks.

This also means that you can divide big input up into smaller blocks, and do reconstruction on parts of your data. 
This doesn't give the same flexibility of a higher number of data shards, but it will be much more performant.

# Streaming API

There has been added support for a streaming API, to help perform fully streaming operations, 
which enables you to do the same operations, but on streams. 
To use the stream API, use [`NewStream`](https://godoc.org/github.com/klauspost/reedsolomon#NewStream) function 
to create the encoding/decoding interfaces. 

You can use [`WithConcurrentStreams`](https://godoc.org/github.com/klauspost/reedsolomon#WithConcurrentStreams) 
to ready an interface that reads/writes concurrently from the streams.

You can specify the size of each operation using 
[`WithStreamBlockSize`](https://godoc.org/github.com/klauspost/reedsolomon#WithStreamBlockSize).
This will set the size of each read/write operation.

Input is delivered as `[]io.Reader`, output as `[]io.Writer`, and functionality corresponds to the in-memory API. 
Each stream must supply the same amount of data, similar to how each slice must be similar size with the in-memory API. 
If an error occurs in relation to a stream, 
a [`StreamReadError`](https://godoc.org/github.com/klauspost/reedsolomon#StreamReadError) 
or [`StreamWriteError`](https://godoc.org/github.com/klauspost/reedsolomon#StreamWriteError) 
will help you determine which stream was the offender.

There is no buffering or timeouts/retry specified. If you want to add that, you need to add it to the Reader/Writer.

For complete examples of a streaming encoder and decoder see the 
[examples folder](https://github.com/klauspost/reedsolomon/tree/master/examples).

# Advanced Options

You can modify internal options which affects how jobs are split between and processed by goroutines.

To create options, use the WithXXX functions. You can supply options to `New`, `NewStream`. 
If no Options are supplied, default options are used.

Example of how to supply options:

 ```Go
     enc, err := reedsolomon.New(10, 3, WithMaxGoroutines(25))
 ```


# Performance
Performance depends mainly on the number of parity shards. 
In rough terms, doubling the number of parity shards will double the encoding time.

Here are the throughput numbers with some different selections of data and parity shards. 
For reference each shard is 1MB random data, and 2 CPU cores are used for encoding.

| Data | Parity | Parity | MB/s   | SSSE3 MB/s  | SSSE3 Speed | Rel. Speed |
|------|--------|--------|--------|-------------|-------------|------------|
| 5    | 2      | 40%    | 576,11 | 2599,2      | 451%        | 100,00%    |
| 10   | 2      | 20%    | 587,73 | 3100,28     | 528%        | 102,02%    |
| 10   | 4      | 40%    | 298,38 | 2470,97     | 828%        | 51,79%     |
| 50   | 20     | 40%    | 59,81  | 713,28      | 1193%       | 10,38%     |

If `runtime.GOMAXPROCS()` is set to a value higher than 1, 
the encoder will use multiple goroutines to perform the calculations in `Verify`, `Encode` and `Reconstruct`.

Example of performance scaling on AMD Ryzen 3950X - 16 physical cores, 32 logical cores, AVX 2.
The example uses 10 blocks with 1MB data each and 4 parity blocks.

| Threads | Speed      |
|---------|------------|
| 1       | 9979 MB/s  |
| 2       | 18870 MB/s |
| 4       | 33697 MB/s |
| 8       | 51531 MB/s |
| 16      | 59204 MB/s |


Benchmarking `Reconstruct()` followed by a `Verify()` (=`all`) versus just calling `ReconstructData()` (=`data`) gives the following result:
```
benchmark                            all MB/s     data MB/s    speedup
BenchmarkReconstruct10x2x10000-8     2011.67      10530.10     5.23x
BenchmarkReconstruct50x5x50000-8     4585.41      14301.60     3.12x
BenchmarkReconstruct10x2x1M-8        8081.15      28216.41     3.49x
BenchmarkReconstruct5x2x1M-8         5780.07      28015.37     4.85x
BenchmarkReconstruct10x4x1M-8        4352.56      14367.61     3.30x
BenchmarkReconstruct50x20x1M-8       1364.35      4189.79      3.07x
BenchmarkReconstruct10x4x16M-8       1484.35      5779.53      3.89x
```

# Performance on AVX512

The performance on AVX512 has been accelerated for Intel CPUs. 
This gives speedups on a per-core basis typically up to 2x compared to 
AVX2 as can be seen in the following table:

```
[...]
```

This speedup has been achieved by computing multiple parity blocks in parallel as opposed to one after the other. 
In doing so it is possible to minimize the memory bandwidth required for loading all data shards. 
At the same time the calculations are performed in the 512-bit wide ZMM registers and the surplus of ZMM 
registers (32 in total) is used to keep more data around (most notably the matrix coefficients).

# Performance on ARM64 NEON

By exploiting NEON instructions the performance for ARM has been accelerated. 
Below are the performance numbers for a single core on an EC2 m6g.16xlarge (Graviton2) instance (Amazon Linux 2):

```
BenchmarkGalois128K-64        119562     10028 ns/op        13070.78 MB/s
BenchmarkGalois1M-64           14380     83424 ns/op        12569.22 MB/s
BenchmarkGaloisXor128K-64      96508     12432 ns/op        10543.29 MB/s
BenchmarkGaloisXor1M-64        10000    100322 ns/op        10452.13 MB/s
```

# Performance on ppc64le

The performance for ppc64le has been accelerated. 
This gives roughly a 10x performance improvement on this architecture as can been seen below:

```
benchmark                      old MB/s     new MB/s     speedup
BenchmarkGalois128K-160        948.87       8878.85      9.36x
BenchmarkGalois1M-160          968.85       9041.92      9.33x
BenchmarkGaloisXor128K-160     862.02       7905.00      9.17x
BenchmarkGaloisXor1M-160       784.60       6296.65      8.03x
```

# asm2plan9s

[asm2plan9s](https://github.com/fwessels/asm2plan9s) is used for assembling the AVX2 instructions into their BYTE/WORD/LONG equivalents.

# Links
* [Backblaze Open Sources Reed-Solomon Erasure Coding Source Code](https://www.backblaze.com/blog/reed-solomon/).
* [JavaReedSolomon](https://github.com/Backblaze/JavaReedSolomon). Compatible java library by Backblaze.
* [ocaml-reed-solomon-erasure](https://gitlab.com/darrenldl/ocaml-reed-solomon-erasure). Compatible OCaml implementation.
* [reedsolomon-c](https://github.com/jannson/reedsolomon-c). C version, compatible with output from this package.
* [Reed-Solomon Erasure Coding in Haskell](https://github.com/NicolasT/reedsolomon). Haskell port of the package with similar performance.
* [reed-solomon-erasure](https://github.com/darrenldl/reed-solomon-erasure). Compatible Rust implementation.
* [go-erasure](https://github.com/somethingnew2-0/go-erasure). A similar library using cgo, slower in my tests.
* [Screaming Fast Galois Field Arithmetic](http://www.snia.org/sites/default/files2/SDC2013/presentations/NewThinking/EthanMiller_Screaming_Fast_Galois_Field%20Arithmetic_SIMD%20Instructions.pdf). Basis for SSE3 optimizations.

# License

This code, as the original [JavaReedSolomon](https://github.com/Backblaze/JavaReedSolomon) is published under an MIT license. See LICENSE file for more information.