feat(percent-encoding): add support for preserving characters when decoding #970

@ForsakenHarmony ForsakenHarmony commented Sep 19, 2024

This is useful to match the behavior of JavaScript's decodeURI for example.
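To illustrate the behavior being matched: JavaScript's decodeURI leaves escape sequences alone when they decode to one of the reserved characters `;/?:@&=+$,#`. The following is a minimal, standalone sketch of that idea. The names `hex_val` and `decode_preserving` are hypothetical and are not the API added by this PR.

```rust
/// Converts one ASCII hex digit to its numeric value.
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Hypothetical helper (not the crate's API): decodes `%XX` sequences, except
/// those whose decoded byte is listed in `preserve`, which are copied through
/// unchanged. Malformed sequences are also copied through unchanged.
fn decode_preserving(input: &str, preserve: &[u8]) -> Vec<u8> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_val(bytes[i + 1]), hex_val(bytes[i + 2])) {
                let decoded = hi * 16 + lo;
                if preserve.contains(&decoded) {
                    // Keep the original three bytes, e.g. "%2F" stays "%2F".
                    out.extend_from_slice(&bytes[i..i + 3]);
                } else {
                    out.push(decoded);
                }
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    out
}
```

Called as `decode_preserving("a%20b%2Fc", b";/?:@&=+$,#")`, the space is decoded while `%2F` is kept, because `/` is in the preserved set.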

I've also made the functions not take a static reference to the ASCII set for inline inversions.

And I've changed AsciiSet::EMPTY (added in #969) to be a reference to match the existing constants (I'm not sure if they need to be references, but I feel like it's better to have the same behavior everywhere).

@ForsakenHarmony ForsakenHarmony changed the title feat: add support for preserving characters when decoding feat(percent-encoding): add support for preserving characters when decoding Sep 19, 2024
Contributor

joshka commented Sep 20, 2024

Copied from #969, to make this the canonical place to discuss:


I wasn't 100% sure about where to put the constant. I went with EMPTY being a constant on the AsciiSet as empty seems like an inherent property of a type, but the other constants seem like usages of AsciiSet. I was 70/30% on this being right, so wouldn't object to this being changed to be consistent with the other constants.

The rationale for making the constants references rather than just values also seemed odd to me. What was that necessary for?

@@ -79,7 +79,7 @@ const BITS_PER_CHUNK: usize = 8 * mem::size_of::<Chunk>();

 impl AsciiSet {
     /// An empty set.
-    pub const EMPTY: AsciiSet = AsciiSet {
+    pub const EMPTY: &'static AsciiSet = &AsciiSet {
Author

@ForsakenHarmony ForsakenHarmony Sep 24, 2024


@joshka (continuing here to have a thread)

I don't think there's a reason it needs to be a reference, but given that the existing constants are references, I think it makes sense to just have all of them be the same.

Changing the existing ones to not be references would be a breaking change, so this kinda seems like the only option to me.

I guess you could consider this a special case, but that does require calling the functions with &AsciiSet::EMPTY unlike the others, no?

Another option would be to change the function to take impl AsRef<AsciiSet>.
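The call-site difference under discussion can be sketched with a simplified stand-in for `AsciiSet`. Everything here is an assumption made for illustration: the struct layout is reduced to four `u32` chunks, and `SPACE_ONLY`, `count_in_set` and `count_generic` are hypothetical names that do not exist in the crate.

```rust
/// Simplified stand-in for the crate's `AsciiSet` (128 bits in four chunks).
pub struct AsciiSet {
    mask: [u32; 4],
}

impl AsciiSet {
    /// Stored by value, as in the original #969 version.
    pub const EMPTY: AsciiSet = AsciiSet { mask: [0; 4] };

    /// Returns a copy of the set with `byte` added (ASCII bytes only).
    pub const fn add(&self, byte: u8) -> AsciiSet {
        let mut mask = self.mask;
        mask[(byte / 32) as usize] |= 1u32 << (byte % 32);
        AsciiSet { mask }
    }

    pub fn contains(&self, byte: u8) -> bool {
        byte < 128 && (self.mask[(byte / 32) as usize] & (1u32 << (byte % 32))) != 0
    }
}

/// Hypothetical reference-typed constant, declared the way the existing
/// constants are.
pub const SPACE_ONLY: &AsciiSet = &AsciiSet::EMPTY.add(b' ');

/// Takes a `&'static AsciiSet`, like the existing public functions.
fn count_in_set(input: &str, set: &'static AsciiSet) -> usize {
    input.bytes().filter(|b| set.contains(*b)).count()
}

/// The `AsRef` alternative: accepts either a value or a reference.
impl AsRef<AsciiSet> for AsciiSet {
    fn as_ref(&self) -> &AsciiSet {
        self
    }
}

fn count_generic(input: &str, set: impl AsRef<AsciiSet>) -> usize {
    let set = set.as_ref();
    input.bytes().filter(|b| set.contains(*b)).count()
}
```

With a value-typed `EMPTY`, the first function needs `count_in_set(s, &AsciiSet::EMPTY)` while the reference constant is passed as `count_in_set(s, SPACE_ONLY)`. The generic version accepts both `AsciiSet::EMPTY` and `SPACE_ONLY` without an explicit `&`.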

Contributor

@joshka joshka Sep 25, 2024


I don't really know the design decisions behind this well enough to give a useful answer. I'd defer to the library maintainers for more understanding / context on that. Making a constant that is a reference to a constant value instead of just the value seems just a little odd to me. It also seems unlikely to me that an EMPTY const would ever be used except in some other constant expression, so it seems likely to me that a non-ref is still more correct here.

Rather than using AsRef as suggested, if I were fixing this up a bit to make it work with either values or refs, I'd add derived implementations for Clone and Copy, add impl Into<AsciiSet> for &'_ AsciiSet { fn into(self) -> AsciiSet { *self } }, and then change the methods to accept Into<AsciiSet> and the PercentEncoding struct to just store the value instead of a ref. This would be both backward compatible and obvious. The caveat is that I'm unsure whether this code is called in some performance-sensitive situation (e.g. where the nanoseconds matter). It's 16 bytes of memory copied instead of 8 bytes for the reference, so I'd hope a copy would be fine at some general level. It may not be if this is used in a super high volume scenario (e.g. a firewall or proxy). There aren't any benchmarks to imply that this would have some high perf needs.
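A rough sketch of that suggestion, under stated assumptions: `AsciiSet` is again a simplified stand-in, and `Encoder` and the `SPACE` constant are hypothetical names rather than the crate's types. The conversion is written as `From<&AsciiSet>`, which supplies the corresponding `Into` through the standard blanket impl.

```rust
/// Simplified stand-in for the crate's `AsciiSet`, now `Copy`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct AsciiSet {
    mask: [u32; 4],
}

impl AsciiSet {
    pub const EMPTY: AsciiSet = AsciiSet { mask: [0; 4] };

    /// Returns a copy of the set with `byte` added (ASCII bytes only).
    pub const fn add(&self, byte: u8) -> AsciiSet {
        let mut mask = self.mask;
        mask[(byte / 32) as usize] |= 1u32 << (byte % 32);
        AsciiSet { mask }
    }

    pub fn contains(&self, byte: u8) -> bool {
        byte < 128 && (self.mask[(byte / 32) as usize] & (1u32 << (byte % 32))) != 0
    }
}

/// Lets existing callers keep passing `&AsciiSet`.
impl<'a> From<&'a AsciiSet> for AsciiSet {
    fn from(set: &'a AsciiSet) -> AsciiSet {
        *set
    }
}

/// Hypothetical reference constant, shaped like the existing ones.
pub const SPACE: &AsciiSet = &AsciiSet::EMPTY.add(b' ');

/// Hypothetical encoder that stores the set by value instead of by reference.
pub struct Encoder {
    set: AsciiSet,
}

impl Encoder {
    pub fn new(set: impl Into<AsciiSet>) -> Self {
        Encoder { set: set.into() }
    }

    /// Percent-encodes non-ASCII bytes and any ASCII byte in the set.
    pub fn encode(&self, input: &str) -> String {
        let mut out = String::new();
        for b in input.bytes() {
            if !b.is_ascii() || self.set.contains(b) {
                out.push_str(&format!("%{:02X}", b));
            } else {
                out.push(b as char);
            }
        }
        out
    }
}
```

Both `Encoder::new(SPACE)` and `Encoder::new(AsciiSet::EMPTY.add(b' '))` compile, which is the backward-compatibility property described above. In this sketch `std::mem::size_of::<AsciiSet>()` is 16, matching the figure quoted in the comment.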

BTW, I meant to add, I'm definitely no expert on this crate. I added the functionality in the recent PR as I was trying to work out how to represent RFC-defined percent encodings, and they are defined in terms of combinations of sets rather than in terms of individual characters, so it made sense to have that available here.

Member

@Manishearth Manishearth Sep 25, 2024


The performance shouldn't matter too much either way. The more pressing thing for static-vs-const is the instruction size bloat if the const gets duplicated everywhere, but this is a small const. The compiler is also able to optimize in both ways at times. I'd say that the reference is slightly better just because of consistency with existing consts, but for smaller consts in general a straight up const is better overall.
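For reference, the three declaration forms being compared look like this in a simplified sketch. The struct is a stand-in, and `CONTROLS_MASK`, `CONTROLS_VALUE`, `CONTROLS_REF`, `CONTROLS_STATIC` and `is_control` are hypothetical names. The mask is assumed to cover bytes 0x00 to 0x1F plus 0x7F.

```rust
/// Simplified stand-in for the crate's `AsciiSet`.
pub struct AsciiSet {
    mask: [u32; 4],
}

/// Assumed bit layout: bytes 0x00..=0x1F in chunk 0, byte 0x7F as the top
/// bit of chunk 3.
const CONTROLS_MASK: [u32; 4] = [u32::MAX, 0, 0, 1 << 31];

/// By-value `const`: conceptually inlined at every use site.
pub const CONTROLS_VALUE: AsciiSet = AsciiSet { mask: CONTROLS_MASK };

/// Reference `const`: the form the existing constants use.
pub const CONTROLS_REF: &AsciiSet = &AsciiSet { mask: CONTROLS_MASK };

/// `static`: a single memory location shared by all uses.
pub static CONTROLS_STATIC: AsciiSet = AsciiSet { mask: CONTROLS_MASK };

fn is_control(set: &AsciiSet, byte: u8) -> bool {
    byte < 128 && (set.mask[(byte / 32) as usize] & (1u32 << (byte % 32))) != 0
}
```

All three can be passed to `is_control`: the value form as `&CONTROLS_VALUE`, the reference form as `CONTROLS_REF`, and the static as `&CONTROLS_STATIC`.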

Contributor


#976 does the refactoring mentioned

joshka added a commit to joshka/rust-url that referenced this pull request Sep 26, 2024
Refs take 8 bytes and the values take only 16 bytes, so there is not
a huge benefit to using references rather than values. PercentEncoding
is changed to store the AsciiSet as a value, and the functions that take
AsciiSet now take Into<AsciiSet> instead of &'static AsciiSet. This
allows existing code to continue to work without modification. The
AsciiSet consts (CONTROLS and NON_ALPHANUMERIC) are also changed to be
values, which is a breaking change, but will only affect code that
attempts to dereference them.

Discussion about the rationale for this change is at
<servo#970 (comment)>
joshka added a commit to joshka/rust-url that referenced this pull request Sep 27, 2024
Refs take 8 bytes and the values take only 16 bytes, so there is not
a huge benefit to using references rather than values. PercentEncoding
is changed to store the AsciiSet as a value, and the functions that
previously accepted a reference now accept a value. This is a breaking
change for users who were passing a reference to AsciiSet to the
functions in the public API.

The AsciiSet consts (CONTROLS, NON_ALPHANUMERIC, etc.) are also changed
to be values.

This is an alternative to the non-breaking change in
<servo#976>

Discussion about the rationale for this change is at
<servo#970 (comment)>