Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Extract code blocks by ID #303

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
208 changes: 208 additions & 0 deletions notes/testable_examples.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,208 @@
# [DRAFT] Testable examples

Library authors are encouraged to include examples and short snippets of code
in documentation to demonstrate how to effectively use their library. Such code
snippets are included in docstrings as code blocks and therefor cannot be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/therefor/therefore

executed and tested in the same way regular source files are. This leads to
code duplication for library authors who want to make sure their examples can
be correctly executed, and to out of date examples when they forget to update
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"out-of-date" for the adjective.

them, as the library’s API changes.

To address this problem odoc implements the ability to extract code blocks from
documented interfaces and documentation pages (`mli` and `mld` files
respectively) into source code files. With this build systems can implement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/With this/With this,

user-friendly workflows for execution, testing and even promotion of corrected
examples. In addition, the extracted examples can be installed as documentation
assets and thus avoid the need to duplicate them as separate files for
distribution.

## Named code blocks

In the new version of odoc code blocks can be annotated with a file name. This
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/odoc/odoc,

file name is used by odoc to group related code blocks for extraction, and also
to correctly annotate the markup for syntax highlighting.

The following table demonstrates the two variants of code blocks: the
traditionally supported *anonymous* code blocks and the new *named* code
blocks.

| **Anonymous code block** | **Named code block** |
| ------------------------ | ----------------------------------- |
| `"{[" <content> "]}"` | `"{" <filename> "[" <content> "]}"` |

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing this proposal doesn't mention but must be clarified is what happens to named code block which are within stop comments (**/**) ... (**/**).

I think these should be included in the file aswell (they allow to hide setup minutiae you may not want to have in the rendered document).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But… do we get these in .mld files ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected the proposal, let me know if this is what you had in mind.

But… do we get these in .mld files ?

Maybe the extended code block annotation could have an option for that...


### Code extraction

Both named and anonymous code blocks can be extracted by odoc via the
command-line interface. Code blocks with the same file name in a given
documentation file will be concatenated and written into a file with that name.
Optionally a different output file name for a given group can be provided.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Optionally/Optionally,

Users are always required to provide an output file name for extraction of
anonymous code blocks.

To facilitate debugging and allow the tooling to implement expect-style
promotions, popularized by cram and dune, the extracted examples can be
optionally annotated with line numbers and the source file name (see [Line
number directives](https://caml.inria.fr/pub/docs/manual-ocaml/lex.html#sec86)
in the OCaml manual).

**Note**: unrelated code blocks do not need to have a unique file name, it is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean they will all go into one file example.ml? If so, this should read something like "unrelated code blocks can be put into the same file, and so don't need unique file names." However, why would it be good to put unrelated examples into one file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, why would it be good to put unrelated examples into one file?

If an interface has hundreds of functions each one of them with examples, will users want to provide a unique name to each code block to get code execution and syntax highlighting?

It is clear that filenames are not a satisfactory way to annotate code blocks, so this needs to be changed in the proposal anyway.

recommended to group them by using a file name like `examples.ml` or similar.

The described functionality will also be exposed as a library to facilitate
integration with build systems and test promotion tooling.

### Syntax highlighting

The file names used to annotated code blocks are also used by odoc to decide
what language should be used for syntax highlighting in the generated HTML. The
language is decided based on the file name’s extension.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As already said that may not be entirely sufficient (I guess clashes do exist). See below for more on this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clashes aside, I think a more structured annotation is needed indeed.


**Warning:** code blocks without a file name will not have syntax highlighting.
Once this feature is implemented, the currently used automatic language
inference should be disabled.
Copy link
Contributor

@dbuenzli dbuenzli Feb 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a good idea. Historically code blocks meant OCaml. If you didn't want syntax highlighting you would use a verbatim block ({v v}). For small anonymous snippets it's tedious to always have to specify the language. It's better to simply default to OCaml, that's what the ocamldoc language was designed for.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally dislike implicit promotions, but let's respect the history in this case. I updated the proposal to match your suggestion.



## Command-line interface

The following simplified manual page defines the command-line interface:


odoc-extract-code(1) Odoc Manual odoc-extract-code(1)


NAME
odoc-extract-code - Extract code blocks included in documentation files.

SYNOPSIS
odoc extract-code [OPTION]... FILE

OPTIONS
--name=NAME
The name of the code block to extract.

-o PATH, --output=PATH
Output file path. If omitted, the provided NAME will be used.
Required for extraction of anonymous code blocks.
Copy link
Contributor

@dbuenzli dbuenzli Feb 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For convenience, one could add an --all option which would extracts everything in one go and -o would then denote the destination directory. That way you can distribute a tutorial as an .mld file and simply instruct someone to odoc extract-code --all -o . tutorial.mld to be setup.

For example this gist and associated blog post could well be distributed as a single .mld file (see for example the "# Generate the html page" in the toplevel comment...).


--anonymous
Extract code blocks without name. Cannot be used with the `--name'
option.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not. Note however that my guess is that unless the author pays particular attention to her anonymous code block the result will likely be garbage most of the time (you may have code showing errors, or even get different languages...)

Copy link
Contributor

@dbuenzli dbuenzli Feb 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Echoing @jsomers desire for certain fileless senarios (though formally you could always simply specify a filename to use by convention). This could be --anonymous=TOKEN, and would extract anonymous code blocks that have the token TOKEN (e.g. ocamlx).

It would also prevent the likely garbage problem I mentioned, if completely unconstrained.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it anonymous if it has a name (ocamlx in your example)?

But I do understand your concerns. I will try to address this in the new proposal for code block annotations.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still regarding anonymous blocks:

The only thing I'm a little bit sceptical about is the extraction of
anonymous blocks, it will likely often be made of garbage so I wouldn't
bother.

It was one of the @lpw25's concerns, I believe. I imagine for large code bases
it would be useful to operate on existing unmarked code blocks in some way. So
if people do want to execute them they might get an opportunity to fix the
"garbage".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it anonymous if it has a name (ocamlx in your example)?

In the sense it has no filename.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically in the two forms here. The first one is a named code block and the second one is an anonymous one.

I quite like the idea of being able to extract by token, this leaves a lot of flexibity to authors to attribute meaning to what they want to extract.

Copy link
Collaborator Author

@rizo rizo Feb 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the idea of token annotations. Without wanting to commit to a concrete syntax, something like {ocaml id=BLOCK_ID[...]} could work, where BLOCK_ID could be anything including a file name. This of course would be used during extraction. I'd make the --output option required though, to clarify that the id is not necessarily a file name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it's preferable to have to direct support for filenames.


--with-line-numbers
Include line number and file name of the extracted code blocks.

FILE (required)
Input cmti, cmt, cmi, mli or mld file.


Odoc 11VERSION11 odoc-extract-code(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing is a way to extract the list of names and their language. Following the current "odoc way" this could well be odoc extract-code-targets which would list each on their line the set of names and their tokens (see below) existing in the object file.

filename token* 

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was in my plans but I forgot to include it. Thanks.



## Dune integration

Here is an excerpt from a documented interface file that demonstrates named code blocks.

**Io_utils.mli**

```ocaml
val read_file : string -> string
(** [read_file path] is the content of the file located at [path] read into a string.

{4 Examples}
Given a text file with the content:

{letters.txt[abcdef]}

The following example will print the number of letters in the file:

{count_letters.ml[
# let letters = read_file "assets/letters.txt";;
val letters : string = "abcdef"
# String.length letters;;
- : int = 6
]} *)
```

The user wants to test the two code blocks in this example and all the
anonymous code blocks. To achieve this, the library stanza can be instructed to
extract and execute the code blocks from the documentation:


```dune
(library
(public_name io-utils)
(name Io_utils)
(libraries base bos)
(documentation
(extract_code
(letters.txt as assets/letters.txt)
count_letters.ml
(:anonymous as examples.ml))
(execute_code examples.ml count_letters.ml)))
```

Here is a detailed description of these options:

- `(extract_code <filenames>)` where `<filenames>` field follows the [Ordered
set language](http://#). This is a set of code block names found in `mli`
files of the library that should be extracted into files. Where `:standard`
refers to all annotated code blocks found in the library. Optionally the name
of the extracted file can be changed by using the following form:
`(<code_block> as <filename>)`, for example, `(letters.txt as
assets/letters.txt)`. Untitled code blocks can be extracted by providing a
file name to a special `:anonymous` name: `(:anonymous as <filename>)`.
- `(execute_code <filenames>)` where `<filenames>` field follows the [Ordered
set language](http://#). This is a set of extracted code files that will be
compiled and executed during documentation generation. Currently only the
files with the `ml` and `re` extensions are supported.

With these two options it is possible to precisely control what gets extracted
and what gets executed. Furthermore the extracted files can also be installed
by dune.

The top-level `documentation` stanza for `mld` files can also be extended to
support these options.

----------

## Requirements

- Allow the errors to be highlighted in examples in the original file. Might require
https://github.com/ocaml/odoc/issues/147
- Produce `.corrected` files to allow dune (or other build systems) to support
promotion of corrected files.
- The code block name should contain the language information for syntax
highlighting.


## Questions

- Should odoc require code block annotations to be filenames with extension?
The extension could be used to identify the language and correctly do code
highlighting. On the other hand the code blocks could be annotated only with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe one way would be to simply have these two forms

{[:filename token* [ ... ]}
{[ token* [ ... ]}

without more constraints than that.

With filename a syntax that allows a relative file path (that way an .mld file can specify a source file hierarchy) and token arbitrary identifiers that allow the : character.

That way authors and processors are free to collaborate on the meaning they want to give to tokens. Of course odoc should standardize a few behaviour like:

  • On {[:f.$EXT [...]}, odoc HTML gen will try to highlight the block using the most likely language associated to the extension $EXT (if present).
  • On {[:f.$EXT lang:$L [...]}, odoc HTML gen will try to highlight using language $L.
  • On {[ lang:$L [...]}, odoc HTML gen will try to highlight using language $L.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a reasonable proposal. I also think this should be the focus for the actual work in the current PR. I'll elaborate on this a bit later.

the language name (*i.e.* `{ocaml[...]}`), but this would limit the scope of
the feature. In particular this would make it impossible to:
1. Write examples in code blocks that read input from files extracted from
other code blocks;
2. Explicitly select the examples that should be compiled (ignoring others);
3. Install multiple extracted examples without compiling them.
- Should “execution” of `mli` files be supported too? Might be useful for basic
type-checking of the signature items.
- Should code blocks with the same name from different `mli` and `mld` files
(in the same library) be extracted into the same file? This might be
problematic with anonymous code blocks. On the other hand the
`--with-line-numbers` can be used to keep track of the name of the original
file.


## Alternatives

- Examples could be loaded from existing files into odoc's output. This is more
limited than the current proposal because it does not allow to interleve
comments and code. But, on the other hand, would not need any additional
build tooling as the examples can be directly compiled/tested.
- Introduce something like `mlt` files where code is mixed with comments. These
files could be converted into `mld` files for HTML rendering. See
https://github.com/janestreet/toplevel_expect_test

3 changes: 2 additions & 1 deletion src/html/comment.ml
Original file line number Diff line number Diff line change
Expand Up @@ -251,7 +251,8 @@ let rec nestable_block_element
fun ?xref_base_uri ~to_syntax ~from_syntax -> function
| `Paragraph [{value = `Raw_markup (`Html, s); _}] -> Html.Unsafe.data s
| `Paragraph content -> Html.p (inline_element_list ?xref_base_uri content)
| `Code_block s ->
| `Code_block (_, s) ->
(* TODO(rizo): use code block id as a CSS class. *)
let open Tree in
(*
TODO: This will probably be replaced by a proper plugin / PPX system.
Expand Down
2 changes: 1 addition & 1 deletion src/model/comment.ml
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ type inline_element = [

type nestable_block_element = [
| `Paragraph of (inline_element with_location) list
| `Code_block of string
| `Code_block of string option * string
| `Verbatim of string
| `Modules of Reference.module_ list
| `List of
Expand Down
2 changes: 1 addition & 1 deletion src/parser/ast.ml
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ type inline_element = [

type nestable_block_element = [
| `Paragraph of (inline_element with_location) list
| `Code_block of string
| `Code_block of string option * string
| `Verbatim of string
| `Modules of Reference.module_ list
| `List of
Expand Down
9 changes: 6 additions & 3 deletions src/parser/lexer.mll
Original file line number Diff line number Diff line change
Expand Up @@ -201,11 +201,11 @@ let emit_verbatim input start_offset buffer =
let t = trim_trailing_blank_lines t in
emit input (`Verbatim t) ~start_offset

let code_block c =
let code_block ?id c =
let c = trim_leading_blank_lines c in
let c = trim_trailing_blank_lines c in
let c = trim_leading_whitespace c in
`Code_block c
`Code_block (id, c)



Expand Down Expand Up @@ -299,6 +299,9 @@ rule token input = parse
| (reference_start as start) ([^ '}']* as target) '}'
{ emit input (reference_token start target) }

| '{' (['a'-'z' 'A'-'Z']+ as id) '[' (code_block_text as c) "]}"
{ emit input (code_block ~id c) }

| "{[" (code_block_text as c) "]}"
{ emit input (code_block c) }

Expand Down Expand Up @@ -461,7 +464,7 @@ rule token input = parse
~start_offset:(Lexing.lexeme_end lexbuf)
(Parse_error.not_allowed
~what:(Token.describe `End)
~in_what:(Token.describe (`Code_block "")));
~in_what:(Token.describe (`Code_block (None, ""))));
emit input (code_block c) }


Expand Down
2 changes: 1 addition & 1 deletion src/parser/semantics.ml
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,7 @@ let rec nestable_block_element
| {value = `Paragraph content; location} ->
Location.at location (`Paragraph (inline_elements status content))

| {value = `Code_block _; _}
| {value = `Code_block (_, _); _}
| {value = `Verbatim _; _}
| {value = `Modules _; _} as element ->
element
Expand Down
4 changes: 2 additions & 2 deletions src/parser/syntax.ml
Original file line number Diff line number Diff line change
Expand Up @@ -855,7 +855,7 @@ let rec block_element_list
let acc = block::acc in
consume_block_elements ~parsed_a_tag `After_text acc

| {value = `Code_block s | `Verbatim s as token; location} as next_token ->
| {value = `Code_block (_, s) | `Verbatim s as token; location} as next_token ->
warn_if_after_tags next_token;
warn_if_after_text next_token;
if s = "" then
Expand All @@ -865,7 +865,7 @@ let rec block_element_list
junk input;
let block =
match token with
| `Code_block _ -> `Code_block s
| `Code_block (id, _) -> `Code_block (id, s)
| `Verbatim _ -> `Verbatim s
in
let block = accepted_in_all_contexts context block in
Expand Down
2 changes: 1 addition & 1 deletion src/parser/token.ml
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,7 @@ type t = [
| `Begin_link_with_replacement_text of string

(* Leaf block element markup. *)
| `Code_block of string
| `Code_block of string option * string
| `Verbatim of string
| `Modules of string

Expand Down
8 changes: 8 additions & 0 deletions test/html/cases/markup.mli
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,14 @@
ignore foo
]}

Code blocks can have an identifier attached to them:

{ocaml[
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading example? Using ocaml as the identifier makes it looks like an annotation specifying the language, rather than an identifier to enable references to this specific block from elsewhere.

Suggested change
{ocaml[
{print-two-plus-two[

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current definition of annotations needs to be changed in my opinion as it has some limitations. Unfortunately I haven't been available to work on this in the last few months.

To clarify the misunderstanding here: the code block annotation is not used to reference a specific block from elsewhere (it is not a unique identifier). It annotates a class of blocks to be extracted. It could be a language or a filename, or any other user-defined "class".

Not sure if this is helpful, but let me know if you have any other questions or suggestions.

# print_int (2 + 2);;
4
- : unit = ()
]}

There are also verbatim blocks:

{v
Expand Down
3 changes: 3 additions & 0 deletions test/parser/test.ml
Original file line number Diff line number Diff line change
Expand Up @@ -331,6 +331,9 @@ let tests : test_suite list = [
t "unterminated" "{[foo";
t "unterminated-bracket" "{[foo]";
t "trailing-cr" "{[foo\r]}";
(* t "basic-with-id" "{foo[bar]}"; *)
(* t "empty-with-id" "{lang[]}"; *)
(* t "with-style-id" "{b[foo]}"; *)
];

"verbatim", [
Expand Down
3 changes: 2 additions & 1 deletion test/print/print.ml
Original file line number Diff line number Diff line change
Expand Up @@ -275,7 +275,8 @@ struct
function
| `Paragraph es ->
List [Atom "paragraph"; List (List.map (at inline_element) es)]
| `Code_block c -> List [Atom "code_block"; Atom c]
| `Code_block (Some id, c) -> List [Atom "code_block"; Atom id; Atom c]
| `Code_block (None, c) -> List [Atom "code_block"; Atom c]
| `Verbatim t -> List [Atom "verbatim"; Atom t]
| `Modules ps ->
List [Atom "modules"; List (List.map Reference_to_sexp.reference ps)]
Expand Down