
Dev.ej/compact lexicon #400

Merged
merged 4 commits into from
Sep 16, 2024

Conversation

joanise
Collaborator

@joanise joanise commented Sep 13, 2024

PR Goal?

A bit of a crazy idea to reduce the amount of memory used by the g2p library: the English lexicon takes a lot of space in memory because each word is stored in its own string, and each string object in Python is quite large, well beyond the size of the actual character sequence.

So I thought, let's see what happens if I compact them by joining them into blocks. The result: a) it saves 15 MB of RAM when loading langs, and b) there is no measurable speed cost. The real cost is the increased complexity of the find_alignment() function in g2p.mappings.utils, but it goes from six lines to 15 lines, so, really, not that bad.
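The per-string overhead the PR targets can be illustrated with a small sketch (hypothetical names, not the actual g2p code; it uses blocks of 16 and a `\x01` separator as illustrative assumptions):

```python
import sys

# Each CPython str object carries tens of bytes of overhead beyond its
# characters, so storing many short words as individual strings is costly.
words = ["cat", "dog", "house", "tree"] * 4  # 16 short entries

individual = sum(sys.getsizeof(w) for w in words)

# Joining entries into one block with a separator amortizes that overhead
# across the whole block.
BLOCK_SIZE = 16
block = "\x01".join(words[:BLOCK_SIZE])
compacted = sys.getsizeof(block)

assert compacted < individual  # one object header instead of sixteen

# Recovering entry i: find its block, split it, index within the block.
def get_entry(blocks, i, block_size=BLOCK_SIZE):
    return blocks[i // block_size].split("\x01")[i % block_size]

assert get_entry([block], 5) == words[5]
```

The trade-off is exactly the one described above: reads now pay a small split cost, in exchange for a much smaller resident footprint.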

I've tested carefully to make sure the results are identical, and speed is not adversely affected.

The langs.json.gz file is a tad larger, but the memory footprint from loading it is significantly smaller.

Fixes?

Alternative solution to the problem raised in #395

Feedback sought?

Careful review.

And, whether we can reasonably expect a memory mapping solution soon, which would make this patch irrelevant.

Priority?

low

Tests added?

I'd better do that... coverage was not as good as I expected, and I will fix that to make sure I've tested all the corner cases.

Yes

How to test?

  • g2p convert "some test" eng eng-ipa still works as expected
  • unit tests & CI pass

Confidence?

high

Version change?

nope

Contributor

github-actions bot commented Sep 13, 2024

CLI load time: 0:00.05
Pull Request HEAD: 291708d1aef2a87b9480ea116f89450bec6b01b7
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package


codecov bot commented Sep 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.89%. Comparing base (b315a6c) to head (291708d).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #400      +/-   ##
==========================================
+ Coverage   93.82%   93.89%   +0.06%     
==========================================
  Files          18       18              
  Lines        2575     2587      +12     
  Branches      577      580       +3     
==========================================
+ Hits         2416     2429      +13     
  Misses         91       91              
+ Partials       68       67       -1     


Store the lexicon in joined groups of 16 entries to reduce the str object
memory overhead.

Experimented with various block sizes to see the memory impact, measured
by running `import g2p; g2p.get_arpabet_lang()`:
 - original: 71MB
 - blocks of 4: 59MB
 - blocks of 16: 56MB
 - blocks of 256: 55MB
I decided the 15 MB RAM savings were worth it for blocks of 16, but the
gain beyond that is trivial and not worth it.

In terms of speed, the original code and blocks of 16 are the same, at
least within measurement error. The measurement was running
`g2p convert --file en.txt eng eng-ipa`, where en.txt is a file
containing all the words in the cmudict lexicon: the original and blocks
of 16 both took 20-21 seconds depending on the run.
At blocks of 256, I was getting 23 seconds; not a big difference, but
measurable, for no significant memory gain.
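A lookup over such blocked storage can be sketched as follows (hypothetical names and separator; the real logic lives in find_alignment() in g2p.mappings.utils and may differ):

```python
import bisect

SEP = "\x01"   # assumed separator, chosen to sort below any printable char
BLOCK = 16     # block size chosen in the commit above

def compact(entries, block=BLOCK):
    """Join sorted entries into fixed-size blocks to cut per-str overhead."""
    return [SEP.join(entries[i:i + block]) for i in range(0, len(entries), block)]

def find(blocks, word, block=BLOCK):
    """Return the index of `word` in the compacted sorted lexicon, or -1."""
    # Bisect over the first entry of each block to pick the candidate block.
    keys = [b.split(SEP, 1)[0] for b in blocks]
    i = bisect.bisect_right(keys, word) - 1
    if i < 0:
        return -1
    # Only this one block gets split, so the common case stays cheap.
    entries = blocks[i].split(SEP)
    j = bisect.bisect_left(entries, word)
    if j < len(entries) and entries[j] == word:
        return i * block + j
    return -1

lexicon = sorted(f"word{n:04d}" for n in range(100))
blocks = compact(lexicon)
assert find(blocks, "word0042") == 42
assert find(blocks, "nope") == -1
```

In real code the `keys` list would be precomputed once rather than rebuilt per lookup; it is inlined here to keep the sketch short.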
@dhdaines
Collaborator

I think this actually is the memory mapping solution we're looking for, because if we put all the data together in a single (or multiple, whatever) block, then... we can easily memory map that block, without needing any external dependencies.

But also, if we support some kind of compact trainable G2P rules, we don't need to store the lexicon entries that are predictable by the rules, so that's something to think about (I could resume my debogosifying efforts on Phonetisaurus for instance)

@joanise
Collaborator Author

joanise commented Sep 16, 2024

Before we use this as memory mapping, I'd want to convert the solution from splitting the compacted blocks in memory to reading the entries in place. But yeah, if we add the ability to access individual entries via string slices, we could turn the whole list into a single string, i.e., just one compact block.
For bisect to work on this, I guess we'd say new_pos = (right + left)//2, then scan back until we find \x01, and from there scan forward to the next \x01; that's the entry to examine.
I don't know if bisect supports anything but integral indexing -- probably not -- but it's a pretty darn simple algorithm to write ourselves (one I've written a number of times), so that might indeed be the solution.
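The scan-to-separator idea above can be written out directly (a hypothetical sketch, not g2p code; it assumes sorted, distinct, non-empty entries joined with \x01):

```python
SEP = "\x01"  # assumed separator between entries in the single blob

def find_in_blob(blob, word):
    """Binary search over one separator-joined string of sorted entries.

    As sketched in the discussion: pick a midpoint offset, scan back to
    the previous separator, scan forward to the next one, and compare
    the entry found there. Returns the entry's start offset, or -1.
    """
    left, right = 0, len(blob)
    while left < right:
        mid = (left + right) // 2
        # Scan back: start of the entry containing `mid`
        # (rfind returns -1 when mid is inside the first entry).
        start = blob.rfind(SEP, 0, mid) + 1
        # Scan forward: end of that entry.
        end = blob.find(SEP, start)
        if end == -1:
            end = len(blob)
        entry = blob[start:end]
        if entry == word:
            return start
        if entry < word:
            left = end + 1            # resume after this entry
        else:
            right = start - 1 if start > 0 else 0  # resume before it
    return -1

blob = SEP.join(["apple", "banana", "cherry", "date"])
assert find_in_blob(blob, "cherry") == blob.index("cherry")
assert find_in_blob(blob, "grape") == -1
```

Unlike integer-indexed bisect, each probe costs a backward and a forward scan, but both are bounded by the entry length, so the search stays O(log n) probes over the blob.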

Question @dhdaines: do we merge this anyway, take the 15 MB saving, and plan a future update with a single block as described above, or do we hold off on this PR?
I vote for merging this PR in, because

  • it would already let us turn the ReadAlong-Studio Heroku server back down to a 1x from the current 2x.
  • the one-block solution will take a bit longer to write and I don't have time right now for it.

@dhdaines
Collaborator

Question @dhdaines: do we merge this anyway, take the 15 MB saving, and plan a future update with a single block as described above, or do we hold off on this PR?

Yes, merge it now, because the internal representation of our rules/lexicons is not a public API/ABI so it really doesn't matter if we change it.

@joanise
Collaborator Author

joanise commented Sep 16, 2024

Yes, merge it now, because the internal representation of our rules/lexicons is not a public API/ABI so it really doesn't matter if we change it.

Sounds good.

😁 can you approve it then?

Collaborator

@dhdaines dhdaines left a comment


Approved, though you could add the assertions / comments mentioned if you want to make sure future generations understand what's going on :)

g2p/mappings/utils.py — two review comments, resolved
@joanise
Collaborator Author

joanise commented Sep 16, 2024

Yes, adding that additional documentation (and possibly assertions) would be helpful; this is definitely not obvious code. That's part of why I was so extensive with the unit testing: to confirm to myself that I didn't miss any corner cases.

@joanise
Collaborator Author

joanise commented Sep 16, 2024

Thanks for the careful review and these comments.

@joanise
Collaborator Author

joanise commented Sep 16, 2024

19be394 documents the algorithm much more clearly now.

@joanise joanise merged commit 53c78f1 into main Sep 16, 2024
8 checks passed
@joanise joanise deleted the dev.ej/compact-lexicon branch September 16, 2024 20:42