Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect occurrences information #976

Merged
merged 41 commits into from
Dec 15, 2023
Merged

Conversation

panglesd
Copy link
Collaborator

@panglesd panglesd commented Jun 23, 2023

This PR is not fully finished, but I already open it to already leave the possibility to comment on the approach.

It adds to the source information the occurrences of each identifier.
This information can be used for at least two things:

  • Provide a "jump to documentation" from the source rendering (see for example this file, which has links to the documentation of List.fold_left, ...)
  • Count the number of occurrences of an identifier, for instance as a new information to rank search results. For instance, on odoc and all of its dependencies, we can find the number of occurrences for each identifier in this file, generated by the reference driver.

Concretely, this PR adds:

  • A --count-occurrences flag to the compile command, which makes odoc look for a .cmt file and gather all occurrences in it (as odoc Paths) as source information. If the source rendering is also activated, this information will be used
  • A count-occurences command, which Marshal a Map from odoc Identifier.t to int, corresponding to the number of occurrences.

What can be improved:

  • Benchmark this PR in terms of size and time
  • Add information of other kinds of nodes, such as constructors... Delayed to future PR
  • Handle "local" identifier (they are currently not counted in the total, neither they have links to documentation in the rendered source code) Delayed to future PR
  • Handle resolved but hidden identifiers in source rendering (they cannot jump to documentation since there is no generated documentation, but maybe could jump to implementation?)
  • testing can be improved

@panglesd panglesd force-pushed the occurrences-in-odoc branch 2 times, most recently from 2c2fe55 to 372cee0 Compare July 6, 2023 10:15
@panglesd
Copy link
Collaborator Author

panglesd commented Jul 6, 2023

I added path to constructors and their resolving function to the PR.
In terms of occurrences, both occurrences of constructors in pattern-matching and in expressions are counted.

I'd particularly appreciate a review on the code related to the addition and resolving of value path and constructor path!

@panglesd panglesd force-pushed the occurrences-in-odoc branch 3 times, most recently from 50f8e11 to c702368 Compare July 26, 2023 14:50
@panglesd panglesd force-pushed the occurrences-in-odoc branch 2 times, most recently from 343afbe to a3c8465 Compare August 4, 2023 08:19
test/occurrences/dune Outdated Show resolved Hide resolved
test/occurrences/dune Outdated Show resolved Hide resolved
test/odoc_print/dune Outdated Show resolved Hide resolved
test/occurrences/source.t/run.t Outdated Show resolved Hide resolved
test/odoc_print/occurrences_print.ml Outdated Show resolved Hide resolved
src/xref2/tools.ml Outdated Show resolved Hide resolved
src/xref2/link.ml Outdated Show resolved Hide resolved
src/model/paths.mli Outdated Show resolved Hide resolved
src/model/lang.ml Outdated Show resolved Hide resolved
src/loader/implementation.ml Outdated Show resolved Hide resolved
@panglesd
Copy link
Collaborator Author

panglesd commented Nov 3, 2023

Here is the state of this PR:

Jump from occurrence to implementation

It works quite well I think! Both inside a file, and "cross module".
The way it works is that ocaml paths are turned into odoc path, which are then turned into shapes, which are resolved.

Jump from occurrence to documentation

This works well for cross-compilation-unit occurrences, but not intra-compilation unit.
Currently, the ocaml path is turned into an odoc path, which is then resolved to jump to documentation.
Some paths are not rewritten to follow canonical. I'm not sure how to fix that exactly...

The way the "jump to documentation" links are display is temporary.

Count occurrences

There is a command to "count occurrences". It creates a "tree of occurrence information", where each identifier has a number of direct and indirect occurrences. A node is an identifier's information, the children are the information of all its "subidentifiers.
This command has options to control what we add in the table: hidden identifiers (or not), and only "persistent" identifiers (occurrences of something in its implementation unit will not be counted), or everything.
The "tree" table is marshalled in a file.

There is also a command to "aggregate" occurrences, which takes several marshalled tables, unmarshall them, and aggregate them in one table which is then marshall.

Constructors

Constuctors are a bit harder to resolve (there can be several M.A which do not resolve to the same constructor depending on the type of the expression), so despite some code being included in this PR, there is no support for constructor occurrences yet.

The test

A good way of looking at the current state and its problems is by reading the test.

Conclusion

You can play with the jump to occurrences and implementation here.
Please report any wrong behaviour!

@panglesd
Copy link
Collaborator Author

panglesd commented Dec 6, 2023

I have removed the two things that I could not get right in this PR:

  • Resolving of "internal occurrences". They were resolved in the ".mli" context, while they belong to the ".ml" context. Now, they are simply not counted, only "persistent" occurrences are counted. Internal occurrence counts can be added back in a future PR.
  • Rendering of "jump to documentation" links. They are hard to render in a way that accomodates laptops, touch devices, and which do not mess with the alignment in the code. They can be designed and added in another PR.

Copy link
Collaborator

@Julow Julow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work :)

doc/driver.mld Outdated
@@ -750,6 +760,7 @@ let compiled = compile_all () in
let linked = link_all compiled in
let () = index_generate () in
let _ = js_index () in
let _ = count_occurrences (Fpath.v "occurrences.txt") in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not a text file. Any way to render it in a readable way ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "print_occurrences" binary defined in the test allows to render it in a readable way. Do you think I should include in odoc itself a way to render it as readable?

Also, maybe the count-occurrences command should produce files with a "fixed" extension, e.g. *.occ-odoc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took inspiration from the constraints we give to the output name of a file representing a source tree.
So, the output file name of the count-occurrences and aggregate-occurrences must now have occurrences- prefix and odoc extension.

src/loader/implementation.ml Outdated Show resolved Hide resolved
src/loader/implementation.ml Show resolved Hide resolved
src/loader/implementation.ml Outdated Show resolved Hide resolved
src/loader/typedtree_traverse.ml Show resolved Hide resolved
src/model/paths.ml Outdated Show resolved Hide resolved
Comment on lines +150 to +149
Fs.Directory.mkdir_p (Fs.File.dirname dst);
let oc = open_out_bin (Fs.File.to_string dst) in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for later: It seems that all Odoc commands create the parent directories of files they create. It would be nice to have this written down in the form of a factorized function.

(Ok []) input
>>= fun files -> Ok (List.concat files)

let aggregate files file_list ~warnings_options:_ ~dst =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the input and output files could be large. What do you think of using a sorted line-based format for both the inputs and the output ?
Sorted files can be merged in a streaming way and the command no longer needs to hold all the inputs and outputs in memory at the same time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully agree that a data structure that would allow to merge several "occurrence tables" (even many of them) in O(n) time (n being the total number of elements) and O(t) space (t being the number of tables to merge) is possible as you suggest, interesting and a good thing to have, let's do that in a future PR!

src/xref2/compile.ml Outdated Show resolved Hide resolved
src/xref2/link.ml Outdated Show resolved Hide resolved
Signed-off-by: Paul-Elliot <[email protected]>
- Handle `Pextra_ty` in `is_persistent`
- alias `Tast_iterator.default_iterator` to improve readability
- Remove possibility for `jump_to` type to have different types for doc and impl
- Factorize all instances of `contains_double_underscore`
- Use `ModuleName.is_hidden` instead of inlining its definition...
- Handle `ClassType` occurrences in compile, and avoid future miss by having an
  exhaustive match.

Signed-off-by: Paul-Elliot <[email protected]>
- Remove now unused is_internal_rec. It was added for non-persistent values, but
this has been delayed to another PR
- Remove TODO comment on Classes when building the environment: classes cannot
contain items that are contained in our current set of occurrence kinds (values,
modules, module types, ...)

Signed-off-by: Paul-Elliot <[email protected]>
Similar to source trees: must have `odoc` extension and `occurrences-` prefix

Signed-off-by: Paul-Elliot <[email protected]>
They used to not be raised, as with non-persistent occurrences, some of them
could never be got rid of!

Signed-off-by: Paul-Elliot <[email protected]>
The resolving of constructor paths introduced was wrong and never tested. It is
removed in this commit as well.

Signed-off-by: Paul-Elliot <[email protected]>
Signed-off-by: Paul-Elliot <[email protected]>
@jonludlam
Copy link
Member

Could we please avoid the change from Map to Hashtbl in Ident_env? The reasoning behind this is a change suggested by @lpw25 a while back to resolve the 'root' of a reference during the load phase so that references end up looking more like paths. The reasoning behind that is to have lexical scoping of references rather than the dynamic scoping we currently have - that way we can avoid having to make this sort of change. Switching to Hashtbls will make that much trickier.

This is somewhat contingent on my guess that the change to ident_env is separable from the rest of the PR - if it isn't we might have to just commit this and solve the reference issue when we get around to it.

@panglesd
Copy link
Collaborator Author

panglesd commented Dec 13, 2023

Could we please avoid the change from Map to Hashtbl in Ident_env?

Sure, I'll revert that. If I remember correctly, it is again part of the changes I made for non-persistent occurrences, that are not included anymore in this PR. I left it as it seemed an improvement, but that was not deeply though!

Copy link
Collaborator

@Julow Julow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now perfect :) Let's merge!

@jonludlam
Copy link
Member

I agree, lovely stuff! We might want to revisit how and when we process stuff (see https://hackmd.io/__sMo7xRSEGDephPhtDpNQ for some thoughts), but this is all consistent with the current handling so no sense in delaying it.

@jonludlam jonludlam merged commit 100c348 into ocaml:master Dec 15, 2023
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants