
Dev.ej/compact lexicon #400

Merged
merged 4 commits into from
Sep 16, 2024

Conversation

joanise
Collaborator

@joanise joanise commented Sep 13, 2024

PR Goal?

A bit of a crazy idea to reduce the amount of memory used by the g2p library: the English lexicon takes a lot of space in memory because each word is stored in its own string, and each string object in Python is quite large, well beyond the size of the actual character sequence.

So I thought, let's see what happens if I compact them by joining them into blocks. The result: a) it saves 15 MB of RAM when loading langs, and b) there is no measurable speed cost. The real cost is the increased complexity of the find_alignment() function in g2p.mappings.utils, but it goes from six lines to 15 lines, so, really, not that bad.
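The per-string overhead the PR targets can be illustrated with a small sketch (hypothetical names, not the actual g2p code; it uses blocks of 16 and a `\x01` separator as illustrative assumptions):

```python
import sys

# Each CPython str object carries tens of bytes of overhead beyond its
# characters, so storing many short words as individual strings is costly.
words = ["cat", "dog", "house", "tree"] * 4  # 16 short entries

individual = sum(sys.getsizeof(w) for w in words)

# Joining entries into one block with a separator amortizes that overhead
# across the whole block.
BLOCK_SIZE = 16
block = "\x01".join(words[:BLOCK_SIZE])
compacted = sys.getsizeof(block)

assert compacted < individual  # one object header instead of sixteen

# Recovering entry i: find its block, split it, index within the block.
def get_entry(blocks, i, block_size=BLOCK_SIZE):
    return blocks[i // block_size].split("\x01")[i % block_size]

assert get_entry([block], 5) == words[5]
```

The trade-off is exactly the one described above: reads now pay a small split cost, in exchange for a much smaller resident footprint.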

I've tested carefully to make sure the results are identical, and speed is not adversely affected.

The langs.json.gz file is a tad larger, but the memory footprint from loading it is significantly smaller.

Fixes?

Alternative solution to the problem raised in #395

Feedback sought?

Careful review.

And, whether we can reasonably expect a memory mapping solution soon, which would make this patch irrelevant.

Priority?

low

Tests added?

I'd better do that... coverage was not as good as I expected, and I will fix that to make sure I've tested all the corner cases.

Yes

How to test?

  • g2p convert "some test" eng eng-ipa still works as expected
  • unit tests & CI pass

Confidence?

high

Version change?

nope

Contributor

github-actions bot commented Sep 13, 2024

CLI load time: 0:00.05
Pull Request HEAD: 291708d1aef2a87b9480ea116f89450bec6b01b7
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package


codecov bot commented Sep 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.89%. Comparing base (b315a6c) to head (291708d).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #400      +/-   ##
==========================================
+ Coverage   93.82%   93.89%   +0.06%     
==========================================
  Files          18       18              
  Lines        2575     2587      +12     
  Branches      577      580       +3     
==========================================
+ Hits         2416     2429      +13     
  Misses         91       91              
+ Partials       68       67       -1     


Store the lexicon in joined groups of 16 entries to reduce the str object
memory overhead.

Experimented with various block sizes to see the memory impact, measured
by running `import g2p; g2p.get_arpabet_lang()`:
 - original: 71MB
 - blocks of 4: 59MB
 - blocks of 16: 56MB
 - blocks of 256: 55MB
I decided the 15 MB RAM savings were worth it for blocks of 16, but the
gain beyond that is trivial and not worth it.

In terms of speed, the original code and blocks of 16 are the same, at
least within measurement error. The measurement was running
`g2p convert --file en.txt eng eng-ipa`, where en.txt is a file
containing all the words in the cmudict lexicon: the original and blocks
of 16 both took 20-21 seconds depending on the run.
At blocks of 256, I was getting 23 seconds; not a big difference, but
measurable, for no significant memory gain.
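A lookup over such blocked storage can be sketched as follows (hypothetical names and separator; the real logic lives in find_alignment() in g2p.mappings.utils and may differ):

```python
import bisect

SEP = "\x01"   # assumed separator, chosen to sort below any printable char
BLOCK = 16     # block size chosen in the commit above

def compact(entries, block=BLOCK):
    """Join sorted entries into fixed-size blocks to cut per-str overhead."""
    return [SEP.join(entries[i:i + block]) for i in range(0, len(entries), block)]

def find(blocks, word, block=BLOCK):
    """Return the index of `word` in the compacted sorted lexicon, or -1."""
    # Bisect over the first entry of each block to pick the candidate block.
    keys = [b.split(SEP, 1)[0] for b in blocks]
    i = bisect.bisect_right(keys, word) - 1
    if i < 0:
        return -1
    # Only this one block gets split, so the common case stays cheap.
    entries = blocks[i].split(SEP)
    j = bisect.bisect_left(entries, word)
    if j < len(entries) and entries[j] == word:
        return i * block + j
    return -1

lexicon = sorted(f"word{n:04d}" for n in range(100))
blocks = compact(lexicon)
assert find(blocks, "word0042") == 42
assert find(blocks, "nope") == -1
```

In real code the `keys` list would be precomputed once rather than rebuilt per lookup; it is inlined here to keep the sketch short.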
@dhdaines
Collaborator

I think this actually is the memory mapping solution we're looking for, because if we put all the data together in a single (or multiple, whatever) block, then... we can easily memory map that block, without needing any external dependencies.

But also, if we support some kind of compact trainable G2P rules, we don't need to store the lexicon entries that are predictable by the rules, so that's something to think about (I could resume my debogosifying efforts on Phonetisaurus for instance)

@joanise
Collaborator Author

joanise commented Sep 16, 2024

Before we use this as memory mapping, I'd want to convert the solution from splitting the compacted blocks in memory to reading the entries in place. But yeah, if we add the ability to access individual entries via string slices, we could turn the whole list into a single string, i.e., just one compact block.
For bisect to work on this, I guess we'd say new_pos = (right + left)//2, then scan back until we find \x01, and from there scan forward to the next \x01; that's the entry to examine.
I don't know if bisect supports anything but integral indexing -- probably not -- but it's a pretty darn simple algorithm to write ourselves (one I've written a number of times), so that might indeed be the solution.
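The scan-to-separator idea above can be written out directly (a hypothetical sketch, not g2p code; it assumes sorted, distinct, non-empty entries joined with \x01):

```python
SEP = "\x01"  # assumed separator between entries in the single blob

def find_in_blob(blob, word):
    """Binary search over one separator-joined string of sorted entries.

    As sketched in the discussion: pick a midpoint offset, scan back to
    the previous separator, scan forward to the next one, and compare
    the entry found there. Returns the entry's start offset, or -1.
    """
    left, right = 0, len(blob)
    while left < right:
        mid = (left + right) // 2
        # Scan back: start of the entry containing `mid`
        # (rfind returns -1 when mid is inside the first entry).
        start = blob.rfind(SEP, 0, mid) + 1
        # Scan forward: end of that entry.
        end = blob.find(SEP, start)
        if end == -1:
            end = len(blob)
        entry = blob[start:end]
        if entry == word:
            return start
        if entry < word:
            left = end + 1            # resume after this entry
        else:
            right = start - 1 if start > 0 else 0  # resume before it
    return -1

blob = SEP.join(["apple", "banana", "cherry", "date"])
assert find_in_blob(blob, "cherry") == blob.index("cherry")
assert find_in_blob(blob, "grape") == -1
```

Unlike integer-indexed bisect, each probe costs a backward and a forward scan, but both are bounded by the entry length, so the search stays O(log n) probes over the blob.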

Question @dhdaines: do we merge this anyway, take the 15 MB saving, and plan a future update with a single block as described above, or do we hold off on this PR?
I vote for merging this PR in, because

  • it would already let us turn the ReadAlong-Studio Heroku server back down to a 1x from the current 2x.
  • the one-block solution will take a bit longer to write and I don't have time right now for it.

@dhdaines
Collaborator

Question @dhdaines: do we merge this anyway, take the 15 MB saving, and plan a future update with a single block as described above, or do we hold off on this PR?

Yes, merge it now, because the internal representation of our rules/lexicons is not a public API/ABI so it really doesn't matter if we change it.

@joanise
Collaborator Author

joanise commented Sep 16, 2024

Yes, merge it now, because the internal representation of our rules/lexicons is not a public API/ABI so it really doesn't matter if we change it.

Sounds good.

😁 can you approve it then?

Collaborator

@dhdaines dhdaines left a comment


Approved, though you could add the assertions / comments mentioned if you want to make sure future generations understand what's going on :)

g2p/mappings/utils.py — two review comments, resolved
@joanise
Collaborator Author

joanise commented Sep 16, 2024

Yes, adding that additional documentation (and possibly assertions) would be helpful; this is definitely not obvious code. That's part of why I was so extensive with the unit testing: to confirm to myself that I didn't miss any corner cases.

@joanise
Collaborator Author

joanise commented Sep 16, 2024

Thanks for the careful review and these comments.

@joanise
Collaborator Author

joanise commented Sep 16, 2024

19be394 documents the algorithm much more clearly now.

@joanise joanise merged commit 53c78f1 into main Sep 16, 2024
8 checks passed
@joanise joanise deleted the dev.ej/compact-lexicon branch September 16, 2024 20:42