Instead of "realistic data", I chose to make the test data either all-zero (all-ascii-zero, to be precise)
or all-random and to benchmark the two cases separately,
so that if we see a performance regression, we can better determine its cause (deduplication or storage).
"help" is benchmarked to measure the minimum runtime of a command that basically does nothing.
(A rough pytest-benchmark sketch of this setup follows at the end of this message.)
also:
- refactor archiver execution core functionality into exec_cmd() so it can be used more flexibly (rough helper sketch below)
- tox: usually we want to skip benchmarks; only run them when requested manually (see the tox.ini sketch below)
- install pytest-benchmark; run tox with "-r" to get it installed into your .tox envs
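
A rough sketch of how such benchmarks could look with pytest-benchmark; the exec_cmd() import path,
the CLI arguments and the data size are illustrative assumptions, not the actual test code added here:

    import itertools
    import os

    import pytest

    from borg.testsuite.archiver import exec_cmd  # assumed import path for the new helper

    DATA_SIZE = 10 * 1024 * 1024  # 10 MiB per test file, arbitrary size for this sketch


    @pytest.fixture(params=["zeros", "random"])
    def testfile(tmp_path, request):
        # all-ascii-zero data deduplicates almost completely, random data not at all,
        # so a dedup vs. storage regression shows up in only one of the two runs
        path = tmp_path / "data"
        data = b"0" * DATA_SIZE if request.param == "zeros" else os.urandom(DATA_SIZE)
        path.write_bytes(data)
        return path


    def test_create(benchmark, testfile, tmp_path):
        repo = str(tmp_path / "repo")
        exec_cmd("init", "--encryption=none", repo)
        counter = itertools.count()  # unique archive name per benchmark round

        def create_archive():
            exec_cmd("create", "%s::bench-%d" % (repo, next(counter)), str(testfile))

        benchmark(create_archive)


    def test_help(benchmark):
        # "help" is the do-(almost)-nothing baseline
        benchmark(exec_cmd, "help")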
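
For context, this is roughly the shape such an in-process execution helper can take; the real exec_cmd()
differs in signature and details (e.g. optional forking), so the Archiver API calls below are assumptions:

    import sys
    from io import StringIO

    from borg.archiver import Archiver  # assumed location of the CLI front-end


    def exec_cmd(*args, archiver=None, **kw):
        """Run the archiver in-process with argv-style args, return (exit_code, output)."""
        orig_stdout, orig_stderr = sys.stdout, sys.stderr
        try:
            sys.stdout = sys.stderr = output = StringIO()
            if archiver is None:
                archiver = Archiver()
            try:
                parsed = archiver.parse_args(list(args))  # assumed Archiver API
                ret = archiver.run(parsed)
            except SystemExit as e:  # e.g. argparse exiting after printing usage
                ret = e.code
            return ret, output.getvalue()
        finally:
            sys.stdout, sys.stderr = orig_stdout, orig_stderr

The sketch runs the archiver in-process, which keeps subprocess startup overhead out of the measurements
and lets the same helper drive both regular tests and benchmarks; the real helper may handle this differently.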
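
And a hedged sketch of the tox side; env layout and defaults here are illustrative, not the actual config:

    # tox.ini (sketch)
    [testenv]
    deps =
        pytest
        pytest-benchmark
    # without extra args, benchmarks are skipped;
    # "tox -- --benchmark-only" runs only the benchmarks
    commands = pytest {posargs:--benchmark-skip}

Because the deps changed, existing .tox envs need "tox -r" so that pytest-benchmark actually gets installed into them.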