Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JupyterLab Telemetry Design #8

Open
jaipreet-s opened this issue Sep 15, 2019 · 14 comments
Open

JupyterLab Telemetry Design #8

jaipreet-s opened this issue Sep 15, 2019 · 14 comments

Comments

@jaipreet-s
Copy link
Member

Hi all,

I'm doing a quick recap of the proposed design as we're going to begin implementation on this basis.

Publishing and consuming events

Publishers call the recordEvent() API on the EventLog which emits event to each configured handlers. Handlers can be configured on the event log during instantiation. The EventLog will internally do schema validation and filter
events with allowed schemas.

const el = new EventLog({
  handlers: [handler1, handler2],
  allowedSchemas: ['org.jupyter.foo', 'org.jupyterlab.commands.docmanager:open']
})
el.recordEvent({
  schema: 'org.jupyter.foo',
  version: 1,
  body: {
    'foo': 'bar'
  }
});

The fan-out of an event to each handler is done using PhosphorJS Signals and each handler is simply a function with the signature

function consoleHandler(el: EventLog, events: EventLog.RecordedEvent[]) {
  console.log(`[Handler1] Received events ${JSON.stringify(events)}`)
}

There will be an in-built handler which sends events to the /eventlog endpoint on the notebook server.

JupyterLab commands

The EventLog has some configuration options which causes it to auto-subscribe to JupyterLab command executions (such as new Notebook opened, or terminal closed) and emit the to each handler. They still need to be whitelisting using the allowedSchemas property. This will be based off of Ian Rose's prototype

// This will record the "docmanager:open" command batched every 2 seconds
const el = new EventLog({
  handlers: [consoleHandler, consoleHandler2],
  allowedSchemas: ['org.jupyterlab.commands.docmanager:open'],
  commandRegistry: app.commands,
  commandEmitIntervalSeconds: 2
})

User Transparency

To make it transparent to users what evens are being recorded, the EventLog will publish the set of allowed event schemas to the JupyterLab Settings Registry. Users will be able to view what events are being recorded and can also modify the settings to opt-out of certain events.

{
    // Jupyter Telemetry
    // @jupyterlab/telemetry:plugin
    // Telemetry plgugin
    // *****************************************

    // Allowed telemetry events
    "allowedSchemas": [
        "org.jupyter.foo",
        "org.jupyterlab.commands.docmanager:open"
    ],
    // Disable or enable telemetry completely
    "telemetryEnabled": true
}

We'll have to deal with aggregating allowedSchemas from all EventLog instances but that is an implementation detail.

cc @Zsailer @yuvipanda @zuoyuanh

@Zsailer
Copy link
Member

Zsailer commented Sep 16, 2019

Thanks, @jaipreet-s! This is great—I'm excited to see this move forward!

Just a few comments...

each handler is simply a function

I'm not sure if we're shooting for the Python and JS side to match exactly, but I wanted to bring up a minor difference I see here in the Handlers. The logging handlers we're using in Python are objects with a few extra helpful methods for filtering, formatting, etc. (I wrote an jupyter_config example for filtering here). If we want to follow this design on the JS side, handlers would objects with this design. What do you think?

There will be an in-built handler which sends events to the /eventlog endpoint on the notebook server.

(*thinking out loud here)

Is the /eventlog endpoint a part of the core server or a server extension?

If it's part of the core server, what should the server do with these events? Assuming the source is trusted (a whole other converation, started in jupyter/telemetry#21), does the server pass incoming events hitting the /eventlog endpoint to all Python handlers on the server? Should the Python handlers be able to to validate, filter, etc. this data?

If it's a server extension, jupyterlab-telemetry would have it's own EventLog that users/admins configure separately from the notebook server. That seems a bit cleaner...

@jaipreet-s
Copy link
Member Author

@Zsailer - Thanks for the feedback!

If we want to follow this design on the JS side, handlers would objects with this design. What do you think?

In the Python side, we got filtering with little at the handler level because of the in-built capabilities of the logging module. There isn't an equivalent I'm aware of in the TS world. I'm proposing we take this incrementally by starting with filtering at the entire EventLog and then extend it to each handler, as required.

Is the /eventlog endpoint a part of the core server or a server extension?

This will be the /api/eventlog endpoint that is present on the notebook server (which is WIP as part of the PR in notebook jupyter/notebook#4752 and the relevant file)

Assuming the source is trusted (a whole other converation, started in jupyter/telemetry#21), does the server pass incoming events hitting the /eventlog endpoint to all Python handlers on the server? Should the Python handlers be able to to validate, filter, etc. this data?

Yes - the intention is to pass all whitelisted events to the server, and let the server EventLog configuration take care of which handlers to send the event to. This is useful in cases where an operator still wants to events occurring at the browser but keep common configuration across the server and browser.

BTW I'm missing some permissions on this repo which I have at the jupyter/telemetry repo, could you please check? :)

@Zsailer
Copy link
Member

Zsailer commented Sep 16, 2019

@jaipreet-s I added you to the list of maintainers :)

@Zsailer
Copy link
Member

Zsailer commented Sep 16, 2019

we got filtering with little at the handler level ... I'm proposing we take this incrementally

Right. Sounds good. I agree with taking an incremental approach. My question is whether we should start with a Handler object now (that lacks filtering etc. functionality for now) rather than using a function as the starting place. That way, we're not asking users to change from functions objects later? Or is this over-engineering 😆

This will be the /api/eventlog endpoint that is present on the notebook server.

Ah, yes. I forgot we added this in the notebook PR. I remember now... This means both EventLogs, server-side and client-side, must be configured by operators with the same set of whitelisted events. Then, validation happens twice, client-side then server-side. Correct? I'm not saying this is wrong, I'm just making sure I understand correctly :)

@jaipreet-s
Copy link
Member Author

jaipreet-s commented Sep 16, 2019

My question is whether we should start with a Handler object now (that lacks filtering etc. functionality for now) rather than using a function as the starting place.

The notion behind the function is because the "fan-out" of an event to each handler is managed by PhosphorJS signals, which accept a callback function for subscriber of an event. That is why the proposal was to start with a function so we get that aspect scalability for free. It looks very feasible to extend this to a Handler class and invoke the handleEvent() method of the handler instead.

This means both EventLogs, server-side and client-side, must be configured by operators with the same set of whitelisted events.

No its a good point - I've been thinking about this for other use-cases, and it may make sense to support regexes in the allowedSchemas. For e.g., a operator wanting to whitelist all JupyterLab commands could do allowedSchemas: [org.jupyterlab.commands.*]. An operator wanting to purely control the whitelisting configuration at the server, without duplicating to the client-side, could whitelist all events at the client-side using allowedSchemas: [*] and let the whitelist on the server do the actual work.

@Zsailer
Copy link
Member

Zsailer commented Sep 16, 2019

The notion behind the function is because the "fan-out" of an event to each handler is managed by PhosphorJS signals, which accept a callback function for subscriber of an event. That is why the proposal was to start with a function so we get that aspect scalability for free. It looks very feasible to extend this to a Handler class and invoke the handleEvent() method of the handler instead.

👍 great, that makes perfect sense to me. Thanks for explaining.

it may make sense to support regexes in the allowedSchemas.

That's a cool idea. We could definitely implement something like this. I'd be curious to hear what @yuvipanda thinks.

@Zsailer
Copy link
Member

Zsailer commented Sep 16, 2019

This proposal looks great to me!

@jaipreet-s
Copy link
Member Author

xref #9 which implements this

@ellisonbg
Copy link

Finally having a look at this. One question that comes up is about the relationship between handlers and allowed schemas. My understanding is that we would like an event log to be able to send different events to different handlers. For example, an operator might be sending events from one set of schemas to one handler, but an extension author may be collecting a different schema and sending those somewhere else. From the above discussion, it sounds like this usage case would be handled in the following manner:

  • The EventLog would need to be configured with the union of all allowedSchema that all handlers would want to collect.
  • Each individual handler would then be configured to filter out the schemas it isn't interested in.

A pro of this model is that there is a single authoritative source for the all allowedSchemas, so a user doesn't have to worry about tracking down that information across different places. A con however, is that the use won't actually know precisely which data is being sent where. Is this evaluation accurate? Where this becomes really critical to understand is with private or sensitive data. If some of the allowedSchemas emit such data, as a user I would want to know exactly where that data is actually being sent, not just where the data might be sent.

I will have a look at the implementation now as well. Thanks for this work and discussion!

@Zsailer
Copy link
Member

Zsailer commented Oct 3, 2019

My understanding is that we would like an event log to be able to send different events to different handlers.

👍

use won't actually know precisely which data is being sent where

@ellisonbg allowedSchemas acts as the first "filter" in the EventLog object. Then, each handler can have additional filters to route subsets of allowedSchemas to different places. We should be able to query each handler for the schema subset they allow. We can surface this information to users so they know where events are being sent.

@jaipreet-s
Copy link
Member Author

We should be able to query each handler for the schema subset they allow. We can surface this information to users so they know where events are being sent.

This can be codified in the Handler interface as well, for e.g. the allowedSchemas can be an abstract property. The EventLog then can query each handler's allowedSchemas property and take care of surfacing that to the user

@ellisonbg
Copy link

ellisonbg commented Oct 6, 2019 via email

@yuvipanda
Copy link
Collaborator

Thanks for working on this, @jaipreet-s!

I'm curious what you think of the schema naming recommendations here: https://github.com/jupyter/telemetry/blob/master/proposal/design.md#schema-naming-recommendations. I'd love for us to follow that for events we emit from inside the project, for the reasons outlined there.

@jaipreet-s
Copy link
Member Author

Agreed @yuvipanda. #11 tracks the implementation work for updating schema names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants