Display optimizations (between 2x00 and 8x00 times faster) (ignore, superseded by #160) #158

eldipa · 2022-06-21T12:11:22Z

Context

While the runtime of a general application using pyte is dominated by stream.feed for the standard geometry (24x80), the runtime of screen.display becomes dominant for larger geometries (240x800, 2400x80, 24x8000).

This is because screen.display does not use the fact that screen.buffer is sparse and iterates over the whole range of possible coordinates (x,y) in the screen, wasting time accessing non-existing entries in screen.buffer.

Proposal

This PR makes the screen.display method faster through four changes:

make screen.display aware that screen.buffer is sparse, so it iterates over the chars that actually exist rather than over the whole range of coordinates (bfeab39)
inline the generator into a for-loop: generators written in Python (not in C) perform worse than a traditional for-loop, so this change is an easy win (5b32e25)
remove an assert that was evaluated for every single char: the corresponding check was moved to the tests so we don't lose coverage (13ee784)
cache wcwidth on each char: while wcwidth is already a cached function (thanks to functools), calling it still costs a function call. We can avoid that by storing the result of wcwidth on the char during screen.draw and reusing it later in screen.display (c298bd3)

Results

For the standard geometry of 24x80 we got the following improvement on screen.display:

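| Benchmark                                                    | 0.8.1    | With optimizations              |
|--------------------------------------------------------------|----------|---------------------------------|
| [screen_display 24x80] cat-gpl3.input->Screen                | 656 us   | 135 us: 4.86x faster            |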
| [screen_display 24x80] cat-gpl3.input->DiffScreen            | 647 us   | 131 us: 4.93x faster            |
| [screen_display 24x80] cat-gpl3.input->HistoryScreen         | 693 us   | 137 us: 5.07x faster            |
| [screen_display 24x80] find-etc.input->Screen                | 672 us   | 84.6 us: 7.94x faster           |
| [screen_display 24x80] find-etc.input->DiffScreen            | 662 us   | 83.4 us: 7.94x faster           |
| [screen_display 24x80] find-etc.input->HistoryScreen         | 718 us   | 85.1 us: 8.43x faster           |
| [screen_display 24x80] htop.input->Screen                    | 602 us   | 246 us: 2.45x faster            |
| [screen_display 24x80] htop.input->DiffScreen                | 599 us   | 244 us: 2.46x faster            |
| [screen_display 24x80] htop.input->HistoryScreen             | 604 us   | 250 us: 2.42x faster            |
| [screen_display 24x80] ls.input->Screen                      | 660 us   | 137 us: 4.82x faster            |
| [screen_display 24x80] ls.input->DiffScreen                  | 663 us   | 136 us: 4.89x faster            |
| [screen_display 24x80] ls.input->HistoryScreen               | 678 us   | 136 us: 4.97x faster            |
| [screen_display 24x80] mc.input->Screen                      | 563 us   | 277 us: 2.03x faster            |
| [screen_display 24x80] mc.input->DiffScreen                  | 551 us   | 285 us: 1.93x faster            |
| [screen_display 24x80] mc.input->HistoryScreen               | 574 us   | 277 us: 2.07x faster            |
| [screen_display 24x80] top.input->Screen                     | 644 us   | 154 us: 4.19x faster            |
| [screen_display 24x80] top.input->DiffScreen                 | 649 us   | 152 us: 4.26x faster            |
| [screen_display 24x80] top.input->HistoryScreen              | 663 us   | 158 us: 4.20x faster            |
| [screen_display 24x80] vi.input->Screen                      | 623 us   | 165 us: 3.77x faster            |
| [screen_display 24x80] vi.input->DiffScreen                  | 622 us   | 170 us: 3.66x faster            |
| [screen_display 24x80] vi.input->HistoryScreen               | 647 us   | 169 us: 3.84x faster            |

For larger geometries we made screen.display x10, x100 and almost x1000 faster.

For stream.feed we got a mix of minimal improvements and minimal regressions (*):

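| Benchmark                                                    | 0.8.1    | With optimizations              |
|--------------------------------------------------------------|----------|---------------------------------|
| [stream_feed 24x80] cat-gpl3.input->Screen                   | 48.3 ms  | 49.2 ms: 1.02x slower           |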
| [stream_feed 24x80] cat-gpl3.input->DiffScreen               | 46.7 ms  | 47.6 ms: 1.02x slower           |
| [stream_feed 24x80] cat-gpl3.input->HistoryScreen            | 155 ms   | 149 ms: 1.04x faster            |
| [stream_feed 24x80] find-etc.input->DiffScreen               | 92.6 ms  | 96.7 ms: 1.04x slower           |
| [stream_feed 24x80] find-etc.input->HistoryScreen            | 319 ms   | 303 ms: 1.05x faster            |
| [stream_feed 24x80] htop.input->Screen                       | 21.9 ms  | 21.2 ms: 1.03x faster           |
| [stream_feed 24x80] htop.input->DiffScreen                   | 21.6 ms  | 21.2 ms: 1.02x faster           |
| [stream_feed 24x80] ls.input->Screen                         | 2.29 ms  | 2.23 ms: 1.03x faster           |
| [stream_feed 24x80] ls.input->DiffScreen                     | 2.19 ms  | 2.22 ms: 1.02x slower           |
| [stream_feed 24x80] ls.input->HistoryScreen                  | 7.17 ms  | 6.87 ms: 1.04x faster           |
| [stream_feed 24x80] mc.input->HistoryScreen                  | 46.5 ms  | 45.4 ms: 1.02x faster           |
| [stream_feed 24x80] top.input->Screen                        | 2.49 ms  | 2.41 ms: 1.03x faster           |
| [stream_feed 24x80] top.input->DiffScreen                    | 2.54 ms  | 2.45 ms: 1.04x faster           |
| [stream_feed 24x80] top.input->HistoryScreen                 | 7.69 ms  | 7.28 ms: 1.06x faster           |
| [stream_feed 24x80] vi.input->Screen                         | 4.72 ms  | 4.53 ms: 1.04x faster           |

(*) I don't think the stream.feed results are meaningful; the discrepancies look more like noise. In a separate analysis of pyperf (the tool we use for the benchmark), it seems that it uses the average of the samples instead of the minimum, which makes the results slightly unstable.
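To illustrate why the choice of statistic matters, here is a minimal sketch with made-up timings (not measurements from this PR). It only shows that a single noisy sample shifts the mean while leaving the minimum untouched; it makes no claim about how pyperf aggregates its samples internally.

```python
import statistics


def summarize(samples):
    """Return (mean, minimum) of a list of timing samples in seconds."""
    return statistics.mean(samples), min(samples)


# Hypothetical timings (invented for this sketch): the same workload
# measured twice, the second time with one noisy outlier.
quiet_run = [48.1e-3, 48.3e-3, 48.2e-3, 48.4e-3]
noisy_run = [48.1e-3, 48.3e-3, 48.2e-3, 52.0e-3]

quiet_mean, quiet_min = summarize(quiet_run)
noisy_mean, noisy_min = summarize(noisy_run)

print(f"mean: {quiet_mean * 1e3:.2f} ms vs {noisy_mean * 1e3:.2f} ms")
print(f"min:  {quiet_min * 1e3:.2f} ms vs {noisy_min * 1e3:.2f} ms")
```

With differences of only a few percent, as in the stream.feed table, an outlier of this size is enough to flip a result between "faster" and "slower" when means are compared.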

Full results are in benchmark_results/: one file has the performance for 0.8.1 while the other includes the optimizations. These benchmarks were executed with the auxiliary script fullbenchmark.

Since 0.8.1 pyte does not support Python 2.x anymore so it makes sense to upgrade one of its dev dependencies, pyperf.

Read the geometry of the screen to test from the environment, with a default of 24 lines by 80 columns. Add this and the input file to the Runner's metadata so it is preserved in the log file (if any).
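A minimal sketch of reading the geometry from the environment could look like the following. The variable name `GEOMETRY`, the `LINESxCOLUMNS` format and the helper name `read_geometry` are assumptions made for illustration; the benchmark script may use different names.

```python
import os


def read_geometry(environ=os.environ, default="24x80"):
    """Parse a 'LINESxCOLUMNS' string taken from the environment.

    GEOMETRY is a hypothetical variable name chosen for this sketch.
    """
    raw = environ.get("GEOMETRY", default)
    lines, columns = (int(part) for part in raw.lower().split("x"))
    if lines <= 0 or columns <= 0:
        raise ValueError(f"invalid geometry: {raw!r}")
    return lines, columns


lines, columns = read_geometry()
print(f"benchmarking with {lines} lines by {columns} columns")
```

Passing the mapping as a parameter keeps the helper easy to test without touching the real process environment.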

Implement three more benchmark scenarios for testing screen.display, screen.reset and screen.resize. For the standard 24x80 geometry these methods have a negligible cost; for larger geometries, however, they can be up to 100 times slower than stream.feed, so benchmarking them is important. Changed how the metadata is stored so that each bench_func call encodes which scenario we are testing, with which screen class and geometry.

Add a shell script that runs all the captured input files under different terminal geometries (24x80, 240x800, 2400x8000, 24x8000 and 2400x80). These settings aim to stress pyte with larger and larger screens (scaling both dimensions together and each dimension separately).

The input files in tests/captured must be loaded with ByteStream and not Stream; otherwise the \r characters are lost and the benchmark results may not reflect real scenarios.
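The commit message does not say where the \r characters get lost. One plausible mechanism, assumed here rather than confirmed by the PR, is Python's universal-newline translation when a file is opened in text mode. The sketch below uses only the standard library and a made-up capture to show the difference between text and binary reads.

```python
import os
import tempfile

# Made-up capture: terminal sessions typically contain carriage returns.
captured = b"first line\r\nprogress 10%\rprogress 100%\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "capture.input")
    with open(path, "wb") as handle:
        handle.write(captured)

    # Text mode with default newline handling rewrites "\r\n" and "\r" as "\n".
    with open(path, "r", encoding="utf-8") as handle:
        as_text = handle.read()

    # Binary mode returns the bytes exactly as they were captured.
    with open(path, "rb") as handle:
        as_bytes = handle.read()

print(repr(as_text))
print(repr(as_bytes))
```

Reading the capture as bytes and feeding it to a byte-oriented stream keeps the cursor-return sequences that a real terminal would have received.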

The former `for x in range(...)` implementation iterated over all the possible indexes (for columns and lines), wasting cycles because some of those indexes (in some cases most of them) pointed to non-existing entries. These non-existing entries were faked and a default character was returned in their place. This commit instead makes display iterate over the existing entries. When a gap between two entries is detected, the gap is filled with the same default character without having to pay for indexing non-entries.

Note: I found that in the current implementation of screen, screen.buffer may have entries (chars in a line) outside the width of the screen. In the display method those are filtered out; however, I'm not sure whether this is a real bug that went unnoticed because we never iterated over the data entries before. If it is, we may be wasting space by keeping in memory chars that are outside the screen.
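The difference between the two iteration strategies can be sketched with a simplified model. This is not pyte's code: the buffer here is a plain dict of dicts mapping `y -> x -> str`, wide characters are ignored, and the function names are made up for the example.

```python
def display_dense(buffer, lines, columns, default=" "):
    """Old approach: index every (y, x) coordinate, present or not."""
    rows = []
    for y in range(lines):
        line = buffer.get(y, {})
        rows.append("".join(line.get(x, default) for x in range(columns)))
    return rows


def display_sparse(buffer, lines, columns, default=" "):
    """New approach: walk only the stored entries and pad the gaps."""
    rows = []
    for y in range(lines):
        line = buffer.get(y)
        if not line:
            rows.append(default * columns)  # whole line is a gap
            continue
        parts = []
        prev_x = -1
        for x in sorted(line):
            if x >= columns:  # entries beyond the screen width are dropped
                break
            parts.append(default * (x - prev_x - 1))  # fill the gap
            parts.append(line[x])
            prev_x = x
        parts.append(default * (columns - prev_x - 1))  # trailing gap
        rows.append("".join(parts))
    return rows


# Sparse buffer: two chars on line 0, one visible and one off-screen on line 2.
buffer = {0: {0: "h", 1: "i"}, 2: {3: "x", 10: "y"}}
print(display_sparse(buffer, lines=3, columns=5))
```

In this model the dense version always performs `lines * columns` lookups, while the sparse version does work proportional to the number of stored entries plus one string multiplication per gap.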

Python generators (yield) and function calls are slower than normal for-loops. Inlining the code makes screen.display between x1 and x1.8 faster.
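As a rough illustration of the transformation (not the actual pyte code), the sketch below renders a line once through a nested generator and once with the generator body inlined into a plain for-loop. The cell layout and function names are hypothetical, and the timings it prints depend on the machine and Python version.

```python
import timeit


def render_with_generator(cells):
    """Build the line through a nested generator, one yield per cell."""
    def render():
        for data in cells:
            if data:  # skip empty stub cells
                yield data
    return "".join(render())


def render_inline(cells):
    """Same logic with the generator body inlined into a plain for-loop."""
    out = []
    for data in cells:
        if data:
            out.append(data)
    return "".join(out)


# Hypothetical 800-column line where every tenth cell is an empty stub.
cells = ["" if x % 10 == 9 else "x" for x in range(800)]

for func in (render_with_generator, render_inline):
    elapsed = timeit.timeit(lambda: func(cells), number=500)
    print(f"{func.__name__}: {elapsed / 500 * 1e6:.1f} us per call")
```

Both functions return the same string, so the change is purely a performance refactoring.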

The assert that checks the width of each char is removed from screen.display and moved into the tests. This ensures that our test suite maintains the same quality while making screen.display ~x1.7 faster.
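The general pattern is sketched below under loud assumptions: the invariant checked (every code point after the first in a cell is a combining mark) and the use of `unicodedata.combining` as a stand-in for a width lookup are illustrative choices, not the exact assert from pyte. The function names are hypothetical.

```python
import unicodedata


def render_line(cells):
    """Hot path: joins the cells without any per-character assert."""
    return "".join(cells)


def check_cells(cells):
    """Test-side check (hypothetical invariant): every code point after
    the first one in a cell must be a combining mark."""
    for data in cells:
        for extra in data[1:]:
            assert unicodedata.combining(extra) != 0, repr(data)


# "e" followed by U+0301 (combining acute accent) is a valid single cell.
cells = ["a", "e\u0301", "b"]
check_cells(cells)          # run from the test suite only
print(render_line(cells))   # production code pays no checking cost
```

Keeping the check in a helper that only the tests call preserves the coverage while removing the per-character cost from every display call.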

Instead of computing it on each screen.display, compute the width of the char once on screen.draw and store it in the Char tuple. This makes screen.display ~x1.10 to ~x1.20 faster and makes stream.feed only ~x1.01 slower in the worst case. This negative impact is due to the change in screen.draw, but measurements in my lab show inconsistent results (stream.feed didn't show a consistent performance regression, and ~x1.01 slower was the worst value I got).

eldipa · 2022-07-14T03:18:50Z

Closed, superseded by #160

eldipa added 5 commits June 17, 2022 15:25

Upgrade pyperf (drop support for Python 2.x)

8513fe8

Allow change the screen geometry

cabc0a5

Fix benchmark.py using ByteStream and not Stream

e0b0e8b

eldipa force-pushed the Display-Optimizations branch from b37bfaf to 011dcb6 Compare July 2, 2022 22:02

eldipa added 5 commits July 4, 2022 17:46

Enable optionally tracemalloc on full benchmark

eec4a2e

Inline generator into display inner loop

b3b7db4

Move assert out of prod code

de59245

eldipa force-pushed the Display-Optimizations branch from 011dcb6 to 020fce6 Compare July 4, 2022 22:26

eldipa changed the title ~~Display optimizations (between 2x00 and 8x00 times faster)~~ Display optimizations (between 2x00 and 8x00 times faster) (ignore, superseed by #160) Jul 14, 2022

eldipa closed this Jul 14, 2022

eldipa changed the title ~~Display optimizations (between 2x00 and 8x00 times faster) (ignore, superseed by #160)~~ Display optimizations (between 2x00 and 8x00 times faster) (ignore, superseded by #160) Jul 14, 2022
