Documentation: what does "code units" mean? #42

Open
fogzot opened this issue Sep 13, 2023 · 1 comment

fogzot commented Sep 13, 2023

The documentation for the text::cluster::Token module does not explain what a code unit is. From the example code in the shape module, it seems that the offset property is the index of the character in the text and len is its length when encoded as UTF-8. Is that correct?

In my code I don't use UTF-8 strings because I carry extra information alongside the text. Instead I keep an array of "chars" like this:

(char 'A') (char 'B') (kern -0.5pt) (char '🙃')

I suppose this is three tokens, but what values should I use for offset and len?

offset: 0 len: 'A'.len_utf8()
offset: 1 len: 'B'.len_utf8()
offset: 2 len: '🙃'.len_utf8()

Should the offset of the third token be 2 (logical index into the characters) or 3 (index into my array)?

declantsien commented

I assume we can build the tokens from a str and let char_indices compute the offset. It yields the byte offset of each char within the string:

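// Token and the trait that provides `properties()` must be in scope.
let source = "AB🙃";
let tokens: Vec<Token> = source
    .char_indices()
    .map(|(i, ch)| Token {
        ch,
        // Byte offset of `ch` within `source`.
        offset: i as u32,
        // Number of UTF-8 bytes used to encode `ch`.
        len: ch.len_utf8() as u8,
        info: ch.properties().into(),
        data: 0,
    })
    .collect();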
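
To make the offsets concrete, here is a minimal std-only sketch of the same idea. `SimpleToken` and `tokens_from_str` are hypothetical names made up for illustration, not part of the library. The string contains a two-byte character so that byte offsets and character indices no longer coincide.

```rust
/// Hypothetical stand-in for the library's `Token`, keeping only the
/// fields discussed in this thread.
struct SimpleToken {
    ch: char,
    offset: u32,
    len: u8,
}

/// Builds one token per char, using the byte offset from `char_indices`.
fn tokens_from_str(source: &str) -> Vec<SimpleToken> {
    source
        .char_indices()
        .map(|(i, ch)| SimpleToken {
            ch,
            offset: i as u32,
            len: ch.len_utf8() as u8,
        })
        .collect()
}

fn main() {
    // 'A' is 1 byte, U+00E9 is 2 bytes, and the emoji is 4 bytes in UTF-8.
    for t in tokens_from_str("A\u{e9}🙃") {
        println!("{:?}: offset {} len {}", t.ch, t.offset, t.len);
    }
}
```

Here the third token gets offset 3, not 2, because the second character occupies two bytes. In "AB🙃" the byte offsets and character indices happen to match only because 'A' and 'B' are one byte each.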

I use SourceRange like this. Its start and end are defined in code units, which here means bytes of the UTF-8 source, since that is how a str is indexed. You should get the idea.

source[source_range.to_range().start..source_range.to_range().end]
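
For the array from the question, one possible approach is to derive the offsets from the UTF-8 text the chars would form and keep the array index in a separate table. This is only a sketch under that assumption: the thread does not confirm that the library requires byte offsets, and `Item`, `SimpleToken`, `tokenize`, and `byte_range` are hypothetical names.

```rust
use std::ops::Range;

/// Hypothetical model of the array from the question.
#[allow(dead_code)] // the kern amount is not needed for tokenizing
enum Item {
    Char(char),
    Kern(f32),
}

/// Hypothetical stand-in for the library's `Token`.
struct SimpleToken {
    ch: char,
    offset: u32,
    len: u8,
}

/// Returns the reconstructed UTF-8 text, one token per char item, and for
/// each token the index of the item it came from.
fn tokenize(items: &[Item]) -> (String, Vec<SimpleToken>, Vec<usize>) {
    let mut text = String::new();
    let mut tokens = Vec::new();
    let mut item_index = Vec::new();
    for (i, item) in items.iter().enumerate() {
        if let Item::Char(ch) = item {
            tokens.push(SimpleToken {
                ch: *ch,
                // Running byte length of the text so far.
                offset: text.len() as u32,
                len: ch.len_utf8() as u8,
            });
            item_index.push(i);
            text.push(*ch);
        }
    }
    (text, tokens, item_index)
}

/// Byte range covered by a token, usable for slicing the text.
fn byte_range(token: &SimpleToken) -> Range<usize> {
    let start = token.offset as usize;
    start..start + token.len as usize
}

fn main() {
    let items = [
        Item::Char('A'),
        Item::Char('B'),
        Item::Kern(-0.5),
        Item::Char('🙃'),
    ];
    let (text, tokens, item_index) = tokenize(&items);
    for (token, i) in tokens.iter().zip(&item_index) {
        println!(
            "{:?}: offset {} len {} item {} slice {:?}",
            token.ch,
            token.offset,
            token.len,
            i,
            &text[byte_range(token)]
        );
    }
}
```

With this layout the third token has offset 2 and len 4, while the side table records that it came from array index 3, so both values stay available.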
