
Automatic Registration of metrics, nodes, MV, transforms etc as well as simpler way to allow any query dimensions available for Transform #1117

Open
bazooka720 opened this issue Jul 25, 2024 · 6 comments

Comments


bazooka720 commented Jul 25, 2024

Is there a simpler way to enable registration of queries as metrics and nodes (as well as allowing nodes other than source nodes to serve as the basis for transforms)?

One approach is a simple folder structure where the file name and a description comment at the top of each SQL file can be automatically consumed into DJ. Essentially, anything that makes it easy for existing projects to transition into DJ would help accelerate adoption of DJ.
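A minimal sketch of this folder convention, assuming each `.sql` file maps to one node named after the file and a leading `--` comment block is its description. The `discover_nodes` helper and the payload shape are hypothetical, not an existing DJ API:

```python
# Sketch: walk a project folder of .sql files and turn each one into a
# candidate DJ node. The file name becomes the node name and the leading
# "--" comment lines become the description. Illustrative only.
from pathlib import Path

def discover_nodes(project_dir: str) -> list[dict]:
    nodes = []
    for sql_file in sorted(Path(project_dir).rglob("*.sql")):
        lines = sql_file.read_text().splitlines()
        # Collect the leading "-- ..." comment lines as the description.
        description = []
        for line in lines:
            if line.startswith("--"):
                description.append(line.lstrip("- ").strip())
            else:
                break
        query = "\n".join(lines[len(description):]).strip()
        nodes.append({
            "name": sql_file.stem,               # file name -> node name
            "description": " ".join(description),
            "query": query,
        })
    return nodes
```

A client or CLI could then walk this list and register each entry with the DJ server.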

@agorajek @samredai, hope this summarizes our chat properly.
PS: See https://github.com/Bl3f/yato for an example; it's not exactly the same as above, but the intent is relevant.

@samredai (Contributor)

@bazooka720 I think this is a really interesting idea and should definitely be possible with a bit of work. It could replace our concept of a "project", which is essentially a collection of YAML files. However, that logic is part of the client library and runs locally; I mention that because the client doesn't have access to the SQL parser and builder, which are currently all part of the server-side code.

So the first question we need to answer is where this logic would live: should it all run client-side, or will the files be bundled and sent to the server for processing? If we decide it should all run client-side, the next question is how to share the SQL AST and related logic from the server with the client library (maybe break it out as a standalone shared library?).

@agorajek (Member)

Let's consider a single SQL file (a DBT model, a Looker explore, etc.). We could make a set of similar APIs:

  • register_sql( file-name|query-string, db_connection, target_namespace)
  • register_dbt( file-name|model-string, db_connection, target_namespace)
  • etc.

In all of the above cases we could parse the content of the first argument, find all the referenced tables, connect via db_connection to pull the schemas of the referenced tables, and create a transform node plus any other metric/dimension nodes based on the content. Both target_namespace and db_connection could be optional.

And once we get the above APIs right, we could simply apply them to a collection of such files using a client or a CLI.
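The proposed register_sql flow could be sketched roughly as below. Everything here is illustrative: the naive regex stands in for DJ's real SQL parser, SQLite's PRAGMA stands in for a warehouse's information schema, and both `register_sql` and the returned payload shape are hypothetical:

```python
# Hypothetical register_sql sketch: extract referenced tables (a real
# implementation would use DJ's server-side SQL parser, not a regex),
# pull each table's schema over db_connection, and assemble a
# transform-node payload. Not an existing DJ API.
import re
import sqlite3

TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def register_sql(query: str, db_connection: sqlite3.Connection,
                 target_namespace: str = "default") -> dict:
    tables = sorted(set(TABLE_RE.findall(query)))
    # Pull (column name, type) pairs for every referenced table.
    # SQLite's PRAGMA is used here as a stand-in for the warehouse schema.
    schemas = {
        t: [(row[1], row[2]) for row in
            db_connection.execute(f"PRAGMA table_info({t})")]
        for t in tables
    }
    return {
        "namespace": target_namespace,
        "type": "transform",
        "query": query,
        "upstream_tables": tables,
        "schemas": schemas,
    }
```

A register_dbt variant would differ mainly in how it resolves `ref()`/`source()` calls into table names before the same schema-pulling step.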

@bazooka720 (Author)

This makes a lot of sense @agorajek. It would allow pretty much instant onboarding for any company with pre-existing models. @samredai I think it would probably be easier for the client to push the logic to DJ (this is good because the client can change more easily and can set triggers to sync with DJ).

@samredai (Contributor)

Sounds good, I really like how this sounds. If we want to do the processing on the server side, I think we should lean into that and batch the entire set of local SQL files together into one request. Then the server can validate them all in one go, including the references between them.
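The batching idea could look something like this on the client side, assuming a hypothetical server endpoint that accepts the whole project at once (no such endpoint exists today; the payload shape is made up for illustration):

```python
# Sketch: bundle every local SQL file into a single request body so the
# server can validate the whole set, including cross-file references,
# in one pass. The payload shape and endpoint are hypothetical.
import json
from pathlib import Path

def build_batch_payload(project_dir: str) -> str:
    files = [
        {"path": str(p.relative_to(project_dir)), "sql": p.read_text()}
        for p in sorted(Path(project_dir).rglob("*.sql"))
    ]
    return json.dumps({"files": files})
```

The client would then POST this payload once (e.g. to a hypothetical `/projects/validate` route) instead of issuing one request per file, letting the server resolve references between files before committing anything.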

I'll try to create a set of these SQL files that models the tpch.sf1 sample schema that comes with the OSS Trino image, and then we can use that as a test suite when developing this.

@samredai (Contributor)

samredai commented Aug 1, 2024

> In all above cases we could parse the content of the 1st argument, find all the referenced tables, connect to db_connection to pull the schema of the referenced tables and create a transform node + any other metric|dim nodes based on the content. Both the target_namespace and db_connection could be optional.

@agorajek I think we should talk about this. If you look at the issue @vcknorket linked, I believe it's an alternative to what you're describing here. If we could register raw database SQL as a source node, that would prevent us from having to actually parse it and create a virtual DAG within DJ. This is really just a handy shortcut that's better than telling people to go create their view(s) on their own and then come register them as source nodes in DJ. It's still an open question how DJ will create/manage the view, but maybe it's as simple as: DJ creates the view if it doesn't exist, and updates it whenever it's changed in DJ.
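The "create it if missing, update it when it changes" behavior could be sketched like this, using in-memory SQLite as a stand-in warehouse. The `sync_view` helper is hypothetical; DJ does not currently expose anything like it:

```python
# Sketch: idempotent view management. Create the view if it's missing;
# if its stored definition differs from the desired one, drop and
# recreate it. SQLite stands in for the real warehouse.
import sqlite3

def sync_view(conn: sqlite3.Connection, name: str, definition: str) -> None:
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = ?",
        (name,),
    ).fetchone()
    desired = f"CREATE VIEW {name} AS {definition}"
    if row is None:
        conn.execute(desired)              # view missing: create it
    elif row[0] != desired:
        conn.execute(f"DROP VIEW {name}")  # definition changed: recreate
        conn.execute(desired)
```

Calling `sync_view` on every DJ-side edit would keep the warehouse view in step with the node definition without special-casing first-time creation.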

Once these views are in DJ as source nodes, nothing else really changes, meaning transforms/metrics/dimensions/cubes can be created on top of them just as happens today.

@agorajek (Member)

agorajek commented Aug 5, 2024

@samredai I think we are getting closer to a solution. I just read your suggestion above and I think it matches pretty closely with the solution described in #1127.
