Display optimizations (between 2x00 and 8x00 times faster) (ignore, superseded by #160) #158

Closed
wants to merge 10 commits into from


eldipa
Contributor

@eldipa eldipa commented Jun 21, 2022

Context

While the runtime of a general application using pyte is dominated by stream.feed for the standard geometry (24x80), the runtime of screen.display becomes dominant for larger geometries (240x800, 2400x80, 24x8000).

This is because screen.display does not use the fact that screen.buffer is sparse and iterates over the whole range of possible coordinates (x,y) in the screen, wasting time accessing non-existing entries in screen.buffer.

Proposal

This PR makes four changes to the screen.display method to make it faster:

  • make screen.display aware that screen.buffer is sparse and iterate over the chars that actually exist instead of over the whole range of coordinates (bfeab39)
  • inline the generator into a for-loop: generators coded in Python (not in C) perform worse than a traditional for-loop, so the change is an easy win (5b32e25)
  • remove an assert that was called for every single char: the corresponding check was moved to the tests so we don't lose coverage (13ee784)
  • cache wcwidth on each char: while wcwidth is already a function with a cache (thanks to functools), using it still requires a function call. We can avoid that by storing the result of wcwidth on the char during screen.draw and reusing it later in screen.display (c298bd3)

Results

For the standard geometry of 24x80 we got the following improvement on screen.display:

| Benchmark                                                    | 0.8.1    | With optimizations              |
| [screen_display 24x80] cat-gpl3.input->Screen                | 656 us   | 135 us: 4.86x faster            |
| [screen_display 24x80] cat-gpl3.input->DiffScreen            | 647 us   | 131 us: 4.93x faster            |
| [screen_display 24x80] cat-gpl3.input->HistoryScreen         | 693 us   | 137 us: 5.07x faster            |
| [screen_display 24x80] find-etc.input->Screen                | 672 us   | 84.6 us: 7.94x faster           |
| [screen_display 24x80] find-etc.input->DiffScreen            | 662 us   | 83.4 us: 7.94x faster           |
| [screen_display 24x80] find-etc.input->HistoryScreen         | 718 us   | 85.1 us: 8.43x faster           |
| [screen_display 24x80] htop.input->Screen                    | 602 us   | 246 us: 2.45x faster            |
| [screen_display 24x80] htop.input->DiffScreen                | 599 us   | 244 us: 2.46x faster            |
| [screen_display 24x80] htop.input->HistoryScreen             | 604 us   | 250 us: 2.42x faster            |
| [screen_display 24x80] ls.input->Screen                      | 660 us   | 137 us: 4.82x faster            |
| [screen_display 24x80] ls.input->DiffScreen                  | 663 us   | 136 us: 4.89x faster            |
| [screen_display 24x80] ls.input->HistoryScreen               | 678 us   | 136 us: 4.97x faster            |
| [screen_display 24x80] mc.input->Screen                      | 563 us   | 277 us: 2.03x faster            |
| [screen_display 24x80] mc.input->DiffScreen                  | 551 us   | 285 us: 1.93x faster            |
| [screen_display 24x80] mc.input->HistoryScreen               | 574 us   | 277 us: 2.07x faster            |
| [screen_display 24x80] top.input->Screen                     | 644 us   | 154 us: 4.19x faster            |
| [screen_display 24x80] top.input->DiffScreen                 | 649 us   | 152 us: 4.26x faster            |
| [screen_display 24x80] top.input->HistoryScreen              | 663 us   | 158 us: 4.20x faster            |
| [screen_display 24x80] vi.input->Screen                      | 623 us   | 165 us: 3.77x faster            |
| [screen_display 24x80] vi.input->DiffScreen                  | 622 us   | 170 us: 3.66x faster            |
| [screen_display 24x80] vi.input->HistoryScreen               | 647 us   | 169 us: 3.84x faster            |

For larger geometries we made screen.display x10, x100 and almost x1000 faster.

For stream.feed we got a minimal improvement and a minimal regression (*)

| Benchmark                                                    | 0.8.1    | With optimizations              |
| [stream_feed 24x80] cat-gpl3.input->Screen                   | 48.3 ms  | 49.2 ms: 1.02x slower           |
| [stream_feed 24x80] cat-gpl3.input->DiffScreen               | 46.7 ms  | 47.6 ms: 1.02x slower           |
| [stream_feed 24x80] cat-gpl3.input->HistoryScreen            | 155 ms   | 149 ms: 1.04x faster            |
| [stream_feed 24x80] find-etc.input->DiffScreen               | 92.6 ms  | 96.7 ms: 1.04x slower           |
| [stream_feed 24x80] find-etc.input->HistoryScreen            | 319 ms   | 303 ms: 1.05x faster            |
| [stream_feed 24x80] htop.input->Screen                       | 21.9 ms  | 21.2 ms: 1.03x faster           |
| [stream_feed 24x80] htop.input->DiffScreen                   | 21.6 ms  | 21.2 ms: 1.02x faster           |
| [stream_feed 24x80] ls.input->Screen                         | 2.29 ms  | 2.23 ms: 1.03x faster           |
| [stream_feed 24x80] ls.input->DiffScreen                     | 2.19 ms  | 2.22 ms: 1.02x slower           |
| [stream_feed 24x80] ls.input->HistoryScreen                  | 7.17 ms  | 6.87 ms: 1.04x faster           |
| [stream_feed 24x80] mc.input->HistoryScreen                  | 46.5 ms  | 45.4 ms: 1.02x faster           |
| [stream_feed 24x80] top.input->Screen                        | 2.49 ms  | 2.41 ms: 1.03x faster           |
| [stream_feed 24x80] top.input->DiffScreen                    | 2.54 ms  | 2.45 ms: 1.04x faster           |
| [stream_feed 24x80] top.input->HistoryScreen                 | 7.69 ms  | 7.28 ms: 1.06x faster           |
| [stream_feed 24x80] vi.input->Screen                         | 4.72 ms  | 4.53 ms: 1.04x faster           |

(*) I don't think the results for stream.feed are meaningful; the discrepancies look like noise. A separate analysis of pyperf (the tool that we use for the benchmarks) suggests that it uses the average instead of the minimum of the samples, which makes the results slightly unstable.
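To illustrate why the choice of summary statistic matters, here is a minimal sketch with made-up timing samples (the numbers are illustrative and not taken from the benchmark files). A couple of noisy samples shift the mean by a few percent, which is the same order of magnitude as the stream.feed differences above, while the minimum is unaffected.

```python
import statistics

# Hypothetical timing samples in milliseconds for two runs of the *same* code.
# Run B has two samples disturbed by system noise (e.g. a context switch).
run_a = [48.1, 48.3, 48.2, 48.4, 48.2]
run_b = [48.1, 48.3, 52.9, 48.4, 51.6]

mean_a, mean_b = statistics.mean(run_a), statistics.mean(run_b)
min_a, min_b = min(run_a), min(run_b)

# The mean reports a "regression" even though the code did not change;
# the minimum reports no difference.
print(f"mean: {mean_a:.2f} ms vs {mean_b:.2f} ms -> {mean_b / mean_a:.2f}x")
print(f"min:  {min_a:.2f} ms vs {min_b:.2f} ms -> {min_b / min_a:.2f}x")
```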

Full results are in benchmark_results/: one file has the performance for 0.8.1 while the other includes the optimizations. These benchmarks were executed with the auxiliary script fullbenchmark.

pyte has not supported Python 2.x since 0.8.1, so it makes sense
to upgrade one of its dev dependencies, pyperf.

Receive the geometry of the screen to test via the environment, with a
default of 24 lines by 80 columns.
Add this and the input file to the Runner's metadata so they are preserved in
the log file (if any).
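Reading the geometry from the environment could look roughly like the sketch below. The variable name `GEOMETRY` and the `LINESxCOLUMNS` format are assumptions made for illustration; the commit message does not state how the benchmark script names or encodes the setting.

```python
import os


def geometry_from_env(environ=os.environ, default="24x80"):
    """Parse 'LINESxCOLUMNS' (e.g. '240x800') into a (lines, columns) tuple."""
    # "GEOMETRY" is a hypothetical variable name, not taken from the PR.
    raw = environ.get("GEOMETRY", default)
    lines, columns = (int(part) for part in raw.lower().split("x"))
    return lines, columns


print(geometry_from_env({}))                       # falls back to the default
print(geometry_from_env({"GEOMETRY": "240x800"}))  # explicit geometry
```

Passing the mapping as a parameter keeps the helper easy to exercise without touching the real process environment.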
Implement three more benchmark scenarios for testing screen.display,
screen.reset and screen.resize.

For the standard 24x80 geometry these methods have a negligible cost;
however, for larger geometries they can be up to 100 times slower than
stream.feed, so benchmarking them is important.

Changed how the metadata is stored so that on each bench_func call we encode
which scenario we are testing, and with which screen class and geometry.

A shell script to test all the captured input files and run them
under different terminal geometries (24x80, 240x800, 2400x8000, 24x8000
and 2400x80).

These settings aim to stress pyte with larger and larger screens (by a
factor of 10 on both dimensions and on each dimension separately).

The input files in tests/captured must be loaded with ByteStream and
not Stream, otherwise the \r characters are lost and the benchmark results
may not reflect real scenarios.
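The commit message does not say how the `\r` characters get lost. One plausible mechanism (an assumption, not confirmed by the PR) is that feeding a text stream means reading the file in text mode, where Python's universal-newlines handling translates `\r\n` into `\n`. The sketch below shows that effect with a made-up two-line capture; it does not use pyte.

```python
import os
import tempfile

# Made-up stand-in for a captured terminal session.
captured = b"line one\r\nline two\r\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(captured)
    path = tmp.name

try:
    with open(path, "rb") as f:                    # bytes: nothing is translated
        as_bytes = f.read()
    with open(path, "r", encoding="utf-8") as f:   # text: universal newlines
        as_text = f.read()
finally:
    os.remove(path)

print(as_bytes.count(b"\r"))  # carriage returns survive in binary mode
print(as_text.count("\r"))    # carriage returns are gone in text mode
```

Without the carriage returns the cursor never goes back to column 0, so the screen would receive a different workload from the one that was recorded.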
The former `for x in range(...)` implementation iterated over all
the possible indexes (for columns and lines), wasting cycles because
some of those indexes (and in some cases most of them) pointed to
non-existing entries.

These non-existing entries were faked and a default character was
returned in their place.

This commit instead makes display iterate over the existing entries.
When a gap between two entries is detected, the gap is filled with the
same default character without having to pay for indexing non-entries.
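The idea can be sketched as follows. This is a simplified illustration, not the code of the commit: a line is modelled as a plain dict from column index to a single-width character, whereas pyte stores richer Char objects, and wide characters are ignored here. Sorting the keys is also an assumption about how the existing entries are visited in order.

```python
def render_dense(line, columns, default=" "):
    """Visit every column, even those with no entry (the former approach)."""
    return "".join(line.get(x, default) for x in range(columns))


def render_sparse(line, columns, default=" "):
    """Visit only existing entries and fill the gaps between them."""
    parts = []
    next_x = 0  # first column not yet written
    for x in sorted(line):
        if x >= columns:  # entries outside the screen width are ignored
            break
        parts.append(default * (x - next_x))  # fill the gap, if any
        parts.append(line[x])
        next_x = x + 1
    parts.append(default * (columns - next_x))  # trailing gap
    return "".join(parts)


# A hypothetical sparse line: column index -> character.
line = {0: "h", 1: "i", 5: "!", 90: "?"}
assert render_sparse(line, 10) == render_dense(line, 10) == "hi   !    "
```

The dense version performs one lookup per column, while the sparse version performs work proportional to the number of stored entries, which is why the gain grows with the screen width.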

Note: I found that in the current implementation of screen,
screen.buffer may have entries (chars in a line) outside of the width of
the screen. In the display method those are filtered out; however, I'm not
sure whether this is a real bug that went unnoticed because we never
iterated over the data entries. If it is, we may be wasting space
as we keep in memory chars that are outside of the screen.

Python generators (yield) and function calls are slower than normal
for-loops. Make screen.display between x1 and x1.8 faster by
inlining the code.
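A rough way to compare the two styles is sketched below. The function names and the data are made up for illustration and do not mirror the actual screen.display code. The measured difference depends on the interpreter version and the workload, so the script prints the timings rather than asserting which variant wins.

```python
import timeit

# Made-up data: 24 lines, each with a character on every other column.
LINES = [{x: "a" for x in range(0, 80, 2)} for _ in range(24)]


def display_with_generator(lines, columns=80):
    def render(line):  # nested generator, one instance per line
        for x in range(columns):
            yield line.get(x, " ")
    return ["".join(render(line)) for line in lines]


def display_inlined(lines, columns=80):
    result = []
    for line in lines:
        chars = []
        for x in range(columns):  # same work, without generator frames
            chars.append(line.get(x, " "))
        result.append("".join(chars))
    return result


# Both variants must produce exactly the same output.
assert display_with_generator(LINES) == display_inlined(LINES)
for fn in (display_with_generator, display_inlined):
    print(fn.__name__, timeit.timeit(lambda: fn(LINES), number=200))
```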
The assert that checks the width of each char is removed from
screen.display and moved into the tests. This ensures that our test
suite maintains the same quality while making
screen.display ~x1.7 faster.

Instead of computing it on each screen.display, compute the width of the
char once in screen.draw and store it in the Char tuple.

This makes screen.display ~x1.10 to ~x1.20 faster and it makes
stream.feed only ~x1.01 slower in the worst case. This negative impact
is due to the change in screen.draw, but measurements in my lab show
inconsistent results (stream.feed didn't show a consistent performance
regression and ~x1.01 slower was the worst value that I got).
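Storing the width alongside the character could look like the sketch below. It is a simplified illustration under stated assumptions: `char_width` is a stand-in built on the standard library rather than the real wcwidth, and this `Char` has only two fields, whereas pyte's Char tuple carries more attributes.

```python
import unicodedata
from typing import NamedTuple


def char_width(ch):
    """Stand-in for wcwidth: 2 for East Asian wide/fullwidth chars, else 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


class Char(NamedTuple):
    data: str
    width: int  # computed once, when the char is drawn


def draw(buffer_line, x, ch):
    """Store the char together with its width (done once, at draw time)."""
    buffer_line[x] = Char(ch, char_width(ch))


def display_width(buffer_line):
    """Reuse the stored widths; no width function is called at display time."""
    return sum(char.width for char in buffer_line.values())


line = {}
draw(line, 0, "a")
draw(line, 1, "世")
print(display_width(line))
```

The trade-off matches the one described above: a little extra work and one more field per char at draw time in exchange for a display path that only reads an attribute.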
@eldipa eldipa changed the title Display optimizations (between 2x00 and 8x00 times faster) Display optimizations (between 2x00 and 8x00 times faster) (ignore, superseed by #160) Jul 14, 2022
@eldipa
Contributor Author

eldipa commented Jul 14, 2022

Closed, superseded by #160

@eldipa eldipa closed this Jul 14, 2022
@eldipa eldipa changed the title Display optimizations (between 2x00 and 8x00 times faster) (ignore, superseed by #160) Display optimizations (between 2x00 and 8x00 times faster) (ignore, superseded by #160) Jul 14, 2022