scanning query strings with offsets #12

Open · wants to merge 2 commits into master

Conversation


@fnl fnl commented Jun 19, 2013

Added an offset kw-parameter to dawg.prefixes() (and .iterprefixes()) to make it possible to iteratively scan a string for prefixes without having to take a slice of the query at each scan position. I.e., before::

dawg = DAWG([...])

def scan(query: str):
    for i in range(len(query)):
        yield dawg.prefixes(query[i:])

New behaviour::

def scan(query: str):
    for i in range(len(query)):
        yield dawg.prefixes(query, i)

fnl added 2 commits June 19, 2013 18:42
…t possible to search for prefixes at a given offset in the query, therefore avoiding to have to create a string slice for every single character when scanning strings
@kmike
Member

kmike commented Jun 22, 2013

Hi Florian,

Unfortunately, I can't merge this pull request as-is, for the following reasons:

  • tests fail on Travis under Python 2.x;
  • the feature looks a bit odd to me: could you explain your use case? When is it useful?
  • as implemented, 'offset' is a byte offset; this can produce unexpected results for non-Latin unicode strings (and lead to decoding errors, because an incorrectly passed offset could split the string in such a way that it is no longer decodable from utf8). I think the example you provided in the ticket won't work for multibyte chars.

@fnl
Author

fnl commented Jun 22, 2013

Hi Mikhail,

Thanks for your comments and for taking the time! To reply to them:

Sorry, I was not aware of the Python 2.x issues; I have stopped using Python 2.x myself.

Regarding the use case and the other two points, the whole change is about getting maximum speed/efficiency. If you want to scan a long string (say, a book or an article) for the contents of the FSA, you have to create a slice of the string ("string[offset:]") at every character position you want to scan from. To avoid this, I wanted to create a method where you always pass in the same string but scan for a match at a given offset. Also, I actually only need to scan for matches at token boundaries, not at every possible character, so I use your DAFSA to scan for matches only at very specific offsets, like so:

def YieldLongestMatches(dafsa, input_string):
    for off in TokenOffsets(input_string):
        matches = dafsa.prefixes(input_string, off)
        if matches:
            yield matches[-1]  # and, longest match only

However, as you also noticed, my current implementation leads to the same string being re-encoded to UTF-8 over and over again, which is indeed still very inefficient. Given how UTF-8 works (single-byte characters have a leading 0 bit, continuation bytes in a multi-byte sequence look like 10xxxxxx, a two-byte lead byte like 110xxxxx, and so forth), the principal problem you outline does not occur, however: only the whole string gets encoded/decoded, and only a correctly encodable prefix can ever match, so the possible problem you mention should never occur.
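
To illustrate that byte-pattern argument, here is a minimal sketch (my own illustration, not part of the library; Python 3 syntax) of checking whether a byte offset lands on a UTF-8 character boundary:

    def is_char_boundary(raw: bytes, offset: int) -> bool:
        """True if `offset` does not split a multi-byte UTF-8 sequence."""
        if offset <= 0 or offset >= len(raw):
            return True
        # Continuation bytes have the form 10xxxxxx (0x80..0xBF); an offset
        # pointing at one of them would cut a character in half.
        return (raw[offset] & 0xC0) != 0x80

    raw = "naïve".encode("utf8")  # the 'ï' occupies two bytes
    print([i for i in range(len(raw) + 1) if is_char_boundary(raw, i)])
    # -> [0, 1, 2, 4, 5, 6]; offset 3 would split the 'ï' sequence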

Therefore, overall, I need to make a few more changes to my request, namely:

  1. restore the original state of "prefixes(str)" and leave that method again as it was, because the recurrent encoding is still a very strong "speed-breaker"
  2. instead, create a new method, say, "rawprefixes(bytes, int)" that will allow a more efficient application, namely:

def YieldLongestMatches(dafsa, input_string):
    raw = input_string.encode('utf8')
    for off in RawTokenOffsets(raw):
        matches = dafsa.rawprefixes(raw, off)
        if matches:
            yield matches[-1].decode('utf8')

  3. Check that the new method works with Python 2.x as well. A question I have here: would a pyx method signature "rawprefixes(input:bytes, offset:int)" be correctly compiled in Py3k to use bytes, but in Py2x to use str?

Is that OK with you, too? I hope you understand my issue, and the enormous speedup this seemingly minor change brings when scanning large text collections; but if you do not like it, no hard feelings :). Let me know whether you are still interested in me contributing this change, and if so, whether this change (using "bytes" in the method signature) would work in Python 2.x!

Cheers,
fnl


@kmike
Member

kmike commented Jun 22, 2013

Hi Florian,

Regarding the use case and the other two points, the whole change is about getting maximum speed/efficiency. If you want to scan a long string (say, a book or an article) for the contents of the FSA, you have to create a slice of the string ("string[offset:]") at every character position you want to scan from. To avoid this, I wanted to create a method where you always pass in the same string but scan for a match at a given offset. Also, I actually only need to scan for matches at token boundaries, not at every possible character, so I use your DAFSA to scan for matches only at very specific offsets, like so:

def YieldLongestMatches(dafsa, input_string):
    for off in TokenOffsets(input_string):
        matches = dafsa.prefixes(input_string, off)
        if matches:
            yield matches[-1] # and, longest match only

Thanks for the explanation, your use case is clearer now.

Just an idea: I wonder if it is possible to bring memoryviews to the API instead of "bytes + offset". They look like a more standard/Pythonic way to take slices without memory copies. You'd then use memoryview(input_string.encode('utf8')) in Python code, take slices of that and pass the results to the DAFSA.
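
A minimal sketch of that idea (the call passing a memoryview to dafsa.prefixes is hypothetical here, just to show the intended shape; the memoryview wrapping and slicing is plain stdlib behaviour):

    def yield_longest_matches(dafsa, input_string):
        view = memoryview(input_string.encode('utf8'))  # wraps the bytes, no copy
        for off in range(len(view)):
            chunk = view[off:]               # zero-copy slice of the underlying buffer
            matches = dafsa.prefixes(chunk)  # hypothetical: memoryview accepted as a key
            if matches:
                # chunk.tobytes() would materialise an actual bytes copy
                # only where one is really needed
                yield matches[-1]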

Also, what about creating a longest_prefix method? It could be much faster than prefixes because it probably won't have to construct all the keys. I implemented this method for datrie (see https://github.com/kmike/datrie/blob/master/src/datrie.pyx#L291), but was too lazy (= waiting for someone else) to implement it for DAWG.
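
As a rough illustration of the intended semantics (a pure-Python stand-in, not the proposed C-level implementation), longest_prefix would return what prefixes yields last, without the caller having to build the full list:

    def longest_prefix(dawg, key, default=None):
        # Same result as `dawg.prefixes(key)[-1]` when a prefix exists; a native
        # implementation could stop at the deepest match instead of collecting
        # every shorter prefix first.
        matches = dawg.prefixes(key)
        return matches[-1] if matches else default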

However, as you also noticed, my current implementation leads to the same string being re-encoded to UTF-8 over and over again, which is indeed still very inefficient. Given how UTF-8 works (single-byte characters have a leading 0 bit, continuation bytes in a multi-byte sequence look like 10xxxxxx, a two-byte lead byte like 110xxxxx, and so forth), the principal problem you outline does not occur, however: only the whole string gets encoded/decoded, and only a correctly encodable prefix can ever match, so the possible problem you mention should never occur.

I see. I think that if we go this way (not memoryviews), it's worth mentioning in the source code (as a comment).

Therefore, overall, I need to make a few more changes to my request, namely:

  1. restore the original state of "prefixes(str)" and leave that method again as it was, because the recurrent encoding is still a very strong "speed-breaker"
  2. instead, create a new method, say, "rawprefixes(bytes, int)" that will allow a more efficient application, namely:
def YieldLongestMatches(dafsa, input_string):
    raw = input_string.encode('utf8')
    for off in RawTokenOffsets(raw):
        matches = dafsa.rawprefixes(raw, off)
        if matches:
            yield matches[-1].decode('utf8')

I wonder if it is possible to have a single 'prefixes' (and/or 'longest_prefix') method that supports both unicode and memoryviews and does a dispatch based on the argument type (I think it should return unicode regardless of the argument type). I think that a simple C-level type check shouldn't cause slowdowns (it didn't for __contains__). And with a longest_prefix method there won't be recurrent encoding (and, even better, no unnecessary construction of throw-away Python objects).

  3. Check that the new method works with Python 2.x as well. A question I have here: would a pyx method signature "rawprefixes(input:bytes, offset:int)" be correctly compiled in Py3k to use bytes, but in Py2x to use str?

Python 2.6+ has the "bytes" type, which is an alias for "str", so I think this will work without issues.

Is that OK with you, too? I hope you understand my issue, and the enormous speedup this seemingly minor change brings when scanning large text collections; but if you do not like it, no hard feelings :). Let me know whether you are still interested in me contributing this change, and if so, whether this change (using "bytes" in the method signature) would work in Python 2.x!

Thanks for working on this pull request and not just making changes that will work for your particular problem!

I'm very interested in library improvements, and I think your use case is valid. I'm not sure about the "bytes + offset" API because it looks like a workaround for Python's copying behaviour, and if we take it further, all C/C++ extensions will eventually need to add "start" and "stop" arguments to every method accepting bytes just to avoid Python-level slices, which looks inelegant and wrong. But if we can't make memoryviews work, or if they turn out to be inefficient (say, 2x slower for your problem), I'm fine with "bytes + offset".

@fnl
Author

fnl commented Jun 24, 2013

Hi Mikhail,

Thanks for your replies! First, to warn you in time: my wife is about to give birth to our son these days, so please excuse me if a reply of mine is delayed for a while…

Regarding the issue of memory-views, I must say I do miss a Java-like CharSequence type in Python, too. But to do that, you would have to create a new type that behaves exactly like str but allows addressing/indexing sub-regions of the underlying string. Nothing impossible at all, but still lots of work… Furthermore, the worry you brought up about providing optional start and end offsets on your methods is actually not that bad either; rather the contrary: all similar string-related methods (count, endswith, find, index, rfind, rindex, and startswith) already follow this format: method(str[, start[, end]]), so I think that providing similar offsets is a natural choice with respect to the official str API rather than something to worry about.

Your suggestion of adding a longest-prefix method would be a useful addition to the DAWG API: it is probably significantly faster, particularly because it avoids constructing a list, which should give a measurable speed benefit, and it is based on a common use case. I can add both approaches (all prefixes / longest only).

I am less convinced we should always return unicode/str (2.x/3.x) regardless of the input type (str/bytes, 2.x/3.x). That is confusing and contrary to programming best practices. If I pass bytes to a bytes method to extract a subsequence, I expect to get bytes back, and if I pass a string to a string method, I expect that type back, too, and nothing else. Anything else, I think, is very unexpected and will lead to problems sooner rather than later.

So I propose to add the following two methods to conform with the way
strings already work:

rawPrefixes(bytes[, int[, int]]) -> [ bytes ]
rawPrefix(bytes[, int[, int]]) -> bytes

Then, we might consider updating the current prefixes method accordingly
and add one more for the longest prefix only:

prefixes(str[, int[, int]]) -> [ str ]
prefix(str[, int[, int]]) -> str

I could call the longest-prefix-only methods "rawLongestPrefix" and "longestPrefix", but I think this makes for too long and ugly method names and would rather push that fact into the docstrings; let me know if you'd prefer the longer names.

Cheers,
Florian


@fnl
Author

fnl commented Jun 24, 2013

Sorry, I forgot about your wish for having a single, unified API; so rather than having two new "raw…" methods and a "prefix", I'd change the current "prefixes" method to:

def prefixes(self, seq, start=0, end=None):
    if isinstance(seq, bytes):
        return self._rawPrefixes(seq, start, end)
    else:
        return self._prefixes(seq, start, end)

I'd move the current "prefixes" to "_prefixes" and add the new
"_rawPrefixes" method, returning a list of unicode/str and str/bytes,
respectively.

And then all that once more for a new "prefix" (or "longestPrefix", if you
prefer) method, returning a single instance only.


@kmike
Member

kmike commented Jun 24, 2013

Hi Florian,

Congratulations! Say hello to the new Leitner from me :)

Regarding the issue of memory-views, I must say I do miss a Java-like CharSequence type in Python, too. But to do that, you would have to create a new type that behaves exactly like str but allows addressing/indexing sub-regions of the underlying string. Nothing impossible at all, but still lots of work… Furthermore, the worry you brought up about providing optional start and end offsets on your methods is actually not that bad either; rather the contrary: all similar string-related methods (count, endswith, find, index, rfind, rindex, and startswith) already follow this format: method(str[, start[, end]]), so I think that providing similar offsets is a natural choice with respect to the official str API rather than something to worry about.

I was talking about http://docs.python.org/dev/library/stdtypes.html#memory-views. Cython has built-in support for memoryviews (see http://docs.cython.org/src/userguide/memoryviews.html), including a simple syntax. The idea is to wrap the whole input_string (encoded to bytes) in a memoryview and do the slicing in Python (in your code), and to add support for memoryviews as keys (and maybe binary keys as well) to the library so it won't need an 'offset' parameter. The slices have a 'tobytes' method, so if these slices are needed in Python land, it is possible to use them as bytes. I don't think you need a new type unless you want to slice "str" directly (instead of "bytes"). Copyless "str" slicing can be more complex than "bytes" slicing because there is nothing in the stdlib to help, and it is not possible to compute the byte offset for a given char offset in variable-width encodings without iterating over the string.
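
A small example of the char-offset vs. byte-offset mismatch mentioned above (plain Python, just to illustrate the point):

    s = "größer"
    print(s.find("ß"))                                 # 3 (character offset)
    print(s.encode('utf8').find("ß".encode('utf8')))   # 4 (byte offset; 'ö' takes two bytes)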

Good point about startswith; now I'm not opposed to start/end arguments. But for str they should work on char offsets, and this could be harder to implement. For byte keys they could also be useful (and easier to implement), but I don't know which is better, start/end or memoryviews. I actually like start/end for simplicity (and they'll work in Python 2.6, unlike memoryviews).

Your suggestion of adding a longest-prefix method would be a useful addition to the DAWG API: it is probably significantly faster, particularly because it avoids constructing a list, which should give a measurable speed benefit, and it is based on a common use case. I can add both approaches (all prefixes / longest only).

Cool!

I am less convinced we should always return unicode/str (2.x/3.x) regardless of the input type (str/bytes, 2.x/3.x). That is confusing and contrary to programming best practices. If I pass bytes to a bytes method to extract a subsequence, I expect to get bytes back, and if I pass a string to a string method, I expect that type back, too, and nothing else. Anything else, I think, is very unexpected and will lead to problems sooner rather than later.

I think of this in a slightly different way: most methods should always return results of the same type regardless of their arguments; returning results of different types will lead to problems and hard-to-debug errors (the exception is generic functions like "max"). So by this logic, if we add support for "bytes" keys we should still return unicode, or we should create another function that always returns bytes. Also, the encoding is currently done by the library, and it assumes utf8 in various places, so there is no need to force the user to always decode results from utf8 (as you do in the example, for instance). When could a user possibly need the values of prefixes/keys/etc. as bytes, given that the result is text data encoded as utf8? Another point is that decoding from utf8 is faster from Cython, because Cython optimizes it by using the C API utf8-decoding function instead of a "bytes" object method call.

That said, I now like your suggestion of raw_prefixes, which would accept bytes (and maybe memoryviews) as input and provide start/end arguments. But should it return bytes? When could that feature be useful?

It looks like a unified "prefixes" method with "start/end" arguments is a bad idea, because start/end would mean different things for bytes and unicode (byte offset vs. char offset). If we go with start/end (not memoryviews), I'd prefer adding a raw_prefixes method and leaving prefixes as-is. The "start/end" API is not an issue for memoryviews, so with memoryviews we could have a single "prefixes" method that supports unicode, bytes and memoryviews as keys.

I could call the longest-prefix-only methods "rawLongestPrefix" and "longestPrefix", but I think this makes for too long and ugly method names and would rather push that fact into the docstrings; let me know if you'd prefer the longer names.

I'd prefer the longer names, "longest_prefix" and "raw_longest_prefix" :) Based on the method name, I'd expect a "prefix" method to raise an exception if several prefixes are found; the purpose of "longest_prefix" is clear. The API of DAWG was modelled after PyTrie (https://bitbucket.org/gsakkis/pytrie/src/1bbd6dec97df4a8f3ae0f0cf6b91586c1556e932/pytrie.py?at=default), and I think it is better to keep following it because some other packages are also trying to follow it (all my other similar packages: datrie, marisa-trie, hat-trie).

@Dobatymo Dobatymo mentioned this pull request Jul 18, 2018