Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document.annotations.in(<schema>) #183

Open
colin-alexa opened this issue Aug 27, 2019 · 8 comments
Open

Document.annotations.in(<schema>) #183

colin-alexa opened this issue Aug 27, 2019 · 8 comments
Assignees
Labels
🎉 Feature A new feature or request

Comments

@colin-alexa
Copy link
Contributor

Right now the Document is the primary typed entity in atjson, which works fairly well in practice but is conceptually a little confusing. When we convert documents between different sources what we're really converting is the collection of annotations associated with that document. I think it would be conceptually simpler if the notion of a type or annotation schema existed primarily at the level of the annotation.

I propose adding a few types to the library:
AnnotationCollection<T extends AnnotationSchema> with public members AnnotationCollection<T>.convertTo(schema: <U extends AnnotationSchema>) : AnnotationCollection<U> and AnnotationCollection<T>.in(schema: <U extends AnnotationSchema>) : AnnotationCollection<U>

  • convertTo(schema: <U extends AnnotationSchema> : AnnotationCollection<U> is Document.convertTo, just scoped to work on a list of annotations instead
  • in(schema: <U extends AnnotationSchema>) : AnnotationCollection<U> returns a new AnnotationCollection where any unknown annotations in schema from the original collection are made known, and all other annotations are made unknown.

AnnotationSchema is the parent type of schema definitions. This would take over much of the current role occupied by Document and its subtypes. Converters would be defined between subtypes of AnnotationSchema rather than subtypes of Document. They would also obviously 'own' the definition of their annotations, and would have a content type string for marking their annotations during de/serialization.

This proposal doesn't give the library any additional power (since a document is just a piece of un-annotated media and an AnnotationCollection, and the media portion doesn't at all complicate the associated schema) but I think it would be helpful for explaining the library and would help generally separate independent concerns within the system

@tim-evans
Copy link
Collaborator

This is wonderful! I'd be interested in what the shape of a schema looks like. Is something like the following reasonable?

interface Schema {
  type: string;
  version: string;
  annotations: {
    [key: string]: typeof Annotation
  ];
}

Also, extending from this is a series of questions around code organization and how to define converters / fromRaw methods, etc.

@tim-evans
Copy link
Collaborator

tim-evans commented Aug 27, 2019

Some examples of how this would be used in the wild would be lovely too! Is this the correct invocation for something?

let doc = new Document({
  content: 'Hello, world',
  annotations: [{
    type: '-commonmark-p',
    start: 0,
    end: 12
  }]
}).in(CommonmarkSchema);

@colin-alexa
Copy link
Contributor Author

@tim-evans that schema interface looks chill to me.

regarding fromRaw I think it makes sense to keep the notion of a source but have it be solely responsible for consuming raw input and making an atjson document out of it (Honestly in my ideal world fromRaw would be an overloaded static method on the Document that like picks the correct implementation based on a user-specified desired return type)

I'm interested if people think there are any operations that currently live on the document that would make more sense moved to the annotations? .in seems like it would make sense in both places although I think it's most useful on the document. I think .where (and possibly .update?) make more sense on the annotations

@tim-evans
Copy link
Collaborator

tim-evans commented Aug 29, 2019

If you feel comfortable, could you define a general interface for Document and Collection so there's more of a bird's eye view of what would be affected?

(This is an example of what I would find most helpful:)

interface Schema {
}

export class Document<S extends Schema> {
  annotations: Collection<S>;
  new(): Document<S>;
  in<T extends Schema>(schema: T): Document<T>;
  convertTo<T extends Schema>(schema: T): Document<T>;
}

export class Collection<S extends Schema> {
  add();
  remove();
  replace();
}
// Equivalent old / new APIs
// -or-
// What old APIs would de-sugar to in terms of new APIs

@tim-evans
Copy link
Collaborator

tim-evans commented Sep 17, 2019

@tim-evans
Copy link
Collaborator

tim-evans commented Oct 7, 2019

Ok, so I've worked through this a bit and this requires a lot of moving parts, but there's a few things that we can do in preparation to start this refactor to minimize large changes everywhere:

  • Use named exports for annotations
  • Move all utility functions out from the annotations directory to the top level.

Unfortunately, the rest of the changes will require some level of coordination or API changes that require different API signatures.

@tim-evans
Copy link
Collaborator

tim-evans commented Oct 7, 2019

Also, here's a list of breaking changes that will likely be occurring in this proposal:

  • schema is an additional property for the constructor to a Document.
  • fromRaw on sources will no longer be used— schemas will be separated from the parsing. The suggestion here is to make the named export fromRaw be a function that returns an instantiated Document from the source. This maps fairly well to existing code.
  • @atjson/offset-annotations will be deprecated in favor of @atjson/schema-offset.
  • All sources should have their default export be the schema for that source.

@tim-evans tim-evans added the 🎉 Feature A new feature or request label Oct 8, 2019
@tim-evans
Copy link
Collaborator

tim-evans commented Nov 2, 2019

This feature is a bit... all encompassing. It is a major improvement to the core api, but definitely touches a whole load of stuff. To make these changes reasonably reviewable, we'll need to break it up into several steps. I think it makes the most sense to merge all of these changes into a release/1.0.0 branch, since this feels like one of the last breaking changes we would make with the core api.

There's a lot of ways to approach this, but I'm going to give some suggestions on what may be the most logical approach:

Change the schema

Make the schema on a source match our new schema interface. This will be one of the bigger changes, and is designed to be a bit less aggressive than breaking everything.

class Document {
  static schema: Array<AnnotationConstructor<any, any>>;
}

will be modified to be:

class Document {
  static schema: Schema;
  schema: Schema;
}

In addition, the schema will be added to the Document instance, which will allow improvements to type inference to occur after this change.

Improve type inferences:

This is a bit involved, but it involves making our querying and joins reflect the types passed in more accurately. We'll add two more overloads to the Document#where api that allows types and strings. The string key will be the name of the annotation in the schema, and the type will be the type of annotation:

doc.where(doc.schema.Bold) // Returns Collection<Bold>
doc.where("Bold") // Returns Collection<Bold>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🎉 Feature A new feature or request
Projects
None yet
Development

No branches or pull requests

3 participants