.. _benchmarks:

Benchmarks
==========

We currently use `pytest-benchmark <https://pytest-benchmark.readthedocs.io/>`_
to write tests that assess the time and resources taken by various tasks.
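
For illustration, a minimal benchmark test using the ``benchmark`` fixture
provided by the plugin might look like the sketch below; ``make_doc`` is a
hypothetical stand-in for an actual Soledad operation::

    def test_create_doc(benchmark):

        def make_doc():
            # hypothetical document-creation work being measured
            return {'content': 'payload' * 100}

        # the fixture calls make_doc repeatedly and records timing stats
        benchmark(make_doc)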

To run benchmark tests, once inside a cloned Soledad repository, do the
following::

    tox -e benchmark

Results of automated benchmarking for each commit in the repository can be seen
at https://benchmarks.leap.se/.

Benchmark tests also depend on ``tox`` and CouchDB. See the :ref:`tests` page
for more information on how to set up the test environment.

Test repetition
---------------

``pytest-benchmark`` runs tests multiple times so it can provide meaningful
statistics for the time taken by a typical run of a test function. The number
of times a test is run can be configured manually or automatically.

When automatically configured, the number of runs is decided by taking
multiple ``pytest-benchmark`` configuration parameters into account. See the
`corresponding documentation
<https://pytest-benchmark.readthedocs.io/en/stable/calibration.html>`_ for more
details on how automatic calibration works.
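
Calibration behaviour can be influenced from the command line with options
such as the following (the values shown here are illustrative, not the
project's actual configuration)::

    pytest --benchmark-min-rounds=5 --benchmark-max-time=1.0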

To balance a reasonable number of repetitions against a reasonable total run
time, we let ``pytest-benchmark`` choose the number of repetitions for faster
tests and manually limit the number of repetitions for slower ones.

Currently, tests for `synchronization` and `sqlcipher asynchronous document
creation` are fixed to run 4 times each. All other tests are left for
``pytest-benchmark`` to decide how many times to run. With this setup, the
benchmark suite takes approximately 7 minutes to run on our CI server. As the
suite is run twice (once for time and CPU stats and a second time for memory
stats), the whole benchmark run takes around 15 minutes.
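
As a sketch, fixing the number of rounds for a slow test can be done with the
fixture's ``pedantic`` mode, which skips automatic calibration. The ``sync``
callable below is a hypothetical stand-in for the actual code under test::

    def test_sync(benchmark):

        def sync():
            # hypothetical stand-in for a full client/server sync
            pass

        # run exactly 4 rounds of 1 iteration each, with no calibration
        benchmark.pedantic(sync, rounds=4, iterations=1)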

The actual number of times a test is run when calibration is done
automatically by ``pytest-benchmark`` depends on several parameters: the time
taken by a sample run and the configured minimum number of rounds and maximum
time allowed for a benchmark. For a snapshot of the number of rounds for each
test function, see `the Soledad benchmarks wiki page
<https://0xacab.org/leap/soledad/wikis/benchmarks>`_.

Sync size statistics
--------------------

Currently, the main use of Soledad is to synchronize client-encrypted email
data. Because of that, it makes sense to measure the time and resources taken
to synchronize an amount of data that is realistically comparable to a user's
mailbox.

To determine a good example dataset for synchronization tests, we used the
sizes of messages from one week of incoming and outgoing email flow at a
friendly provider. The resulting statistics are (all sizes in KB):

+--------+-----------+-----------+
|        | outgoing  | incoming  |
+========+===========+===========+
| min    | 0.675     | 0.461     |
+--------+-----------+-----------+
| max    | 25531.361 | 25571.748 |
+--------+-----------+-----------+
| mean   | 252.411   | 110.626   |
+--------+-----------+-----------+
| median | 5.320     | 14.974    |
+--------+-----------+-----------+
| mode   | 1.404     | 1.411     |
+--------+-----------+-----------+
| stddev | 1376.930  | 732.933   |
+--------+-----------+-----------+
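
For reference, statistics like the above can be computed with Python's
standard ``statistics`` module. The sizes below are a tiny hypothetical
sample, not the actual provider data::

    import statistics

    # hypothetical message sizes in KB
    sizes = [0.675, 1.404, 1.404, 5.320, 252.411]

    print('min   ', min(sizes))
    print('max   ', max(sizes))
    print('mean  ', statistics.mean(sizes))
    print('median', statistics.median(sizes))
    print('mode  ', statistics.mode(sizes))
    print('stddev', statistics.stdev(sizes))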

Sync test scenarios
-------------------

Ideally, we would run tests with a large data set (i.e. a high number of
documents and a large payload size), but that may be infeasible given time and
resource limitations. Because of that, we chose a smaller data set and assume
that behaviour scales roughly linearly, so results can be extrapolated to
larger sets.

Supposing a total data set size of 10MB, some possibilities for the number of
documents and document sizes for testing download and upload are listed below.
Scenarios marked in bold are the ones actually run by the current sync
benchmark tests, and you can see the current graphs for each one by following
the corresponding links:

* 10 x 1M
* **20 x 500K** (`upload <https://benchmarks.leap.se/test-dashboard_test_upload_20_500k.html>`_, `download <https://benchmarks.leap.se/test-dashboard_test_download_20_500k.html>`_)
* **100 x 100K** (`upload <https://benchmarks.leap.se/test-dashboard_test_upload_100_100k.html>`_, `download <https://benchmarks.leap.se/test-dashboard_test_download_100_100k.html>`_)
* 200 x 50K
* **1000 x 10K** (`upload <https://benchmarks.leap.se/test-dashboard_test_upload_1000_10k.html>`_, `download <https://benchmarks.leap.se/test-dashboard_test_download_1000_10k.html>`_)

In each of the above scenarios, all documents have the same size. If we want
to account for some variability in document sizes, it is sufficient to come up
with a simple scenario where the average, minimum and maximum sizes are
roughly consistent with the above statistics, like the following one:

* 60 x 15KB + 1 x 1MB
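
A sketch of how such scenarios could be expressed as parametrized benchmark
tests is shown below. The ``upload`` closure is a hypothetical stand-in for
the actual sync code, and payloads are filled with dummy data::

    import pytest

    # (number of documents, document size in bytes) for each bold scenario
    SCENARIOS = [
        (20, 500 * 1000),
        (100, 100 * 1000),
        (1000, 10 * 1000),
    ]

    @pytest.mark.parametrize('amount,size', SCENARIOS)
    def test_upload(benchmark, amount, size):
        docs = [{'content': 'x' * size} for _ in range(amount)]

        def upload():
            # hypothetical stand-in for uploading docs to the server
            for doc in docs:
                pass

        benchmark(upload)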