Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create ERD from relationships in schema.yml files #84

Open
noahbruegmann opened this issue Apr 15, 2020 · 28 comments
Open

Create ERD from relationships in schema.yml files #84

noahbruegmann opened this issue Apr 15, 2020 · 28 comments
Labels

Comments

@noahbruegmann
Copy link

For complicated table relationships, it's nice to have an ERD (i.e. schema diagram) to help understand how everything fits together.

Schema.yml files already have relationship tests built into them that show the relationship between tables, so those could be parsed to get the data needed for the "keys," which may not actually be keys in most warehouse systems.

A first implementation of this functionality might just look at the schema.yml file and get any columns listed there so that only one data source is needed. A fuller implementation would get all columns from the tables.

Other thoughts:

  • Integrating the descriptions that are in the schema.yml files would also be great, perhaps with a tooltip next to the table or column name
  • If there are multiple schema.yml files in a project, it might be nice to color or box the tables that are known to be related to each other because they're in the same yml. Also would be nice to be able to filter for things in the same yml file. I think this is the nicest way to deal with being able to look at individual layers of a project, but would like someone else to think about that too
  • Being able to zoom in/out and pan around like in the graph view would be a nice feature here
@tayloramurphy
Copy link

We're thinking about this too. dbt docs are great, but a lot of the data in there isn't for the right downstream audience. If you have business users that want to know what to query and where, having a clear ERD can be more useful than saying "go read the dbt docs".

I agree that this should just be done in the schema.yml files, probably using the ref syntax with some specific variables for PKs, FKs, etc.

@aroder
Copy link

aroder commented Oct 27, 2020

We've tried a lot of solutions for this, none of which quite fit and none of which supported the schema.yml spec directly.

https://github.com/BurntSushi/erd, https://gojs.net/latest/index.html, https://pypi.org/project/ERAlchemy/, http://www.nomnoml.com/, https://dbdocs.io/

Right now, we have a functional tool that parses the schema.yml, specifically the relationships tests and columns to auto-generate an ERD

image

I've never worked in the dbt codebase, and there is a fair amount of work to make this something that will work generally. I'm interested in seeing this through, but would need some help

@aroder
Copy link

aroder commented Oct 27, 2020

@jtcohen6
Copy link
Contributor

jtcohen6 commented Oct 28, 2020

This is really, really cool!

Since dbt-docs is already built out of artifacts produced by dbt docs generate, namely manifest.json (which includes all tests and documentation) and catalog.json (which includes the set of all columns in the database), I think it would make sense to leverage the information available in those as inputs to the ERD.

E.g. grab the to/from components of all relationships tests in a v0.18.1-produced manifest:

jq '.nodes | .[] | select(.test_metadata.name=="relationships") | .test_metadata.kwargs' target/manifest.json
...
{
  "to": "ref('snowplow_web_page_context')",
  "field": "page_view_id",
  "column_name": "page_view_id",
  "model": "{{ ref('snowplow_web_events') }}"
}
...

So my questions are:

  • Tactically, is there a reason to prefer parsing the schema.yml files directly?
  • @tayloramurphy It sounds like you'd want ERD components to be an additional schema.yml specification, rather than relying on unique + relationships tests to define PKs + FKs, respectively. Could you say more about your thinking there?

@aroder
Copy link

aroder commented Oct 28, 2020

@jtcohen6 there is no benefit to parsing schema.yml directly. We don't have any experience working in the dbt docs code, so we just took the path of least resistance.

My preference would be to use the existing specification if possible. The column names, unique tests, and relationship tests provide the necessary metadata. If we are able to use the existing specification, then something like this would just work. Maybe we would need to add a toggle in dbt_project.yml to enable the feature, or something to that effect.

However, I can see a lot of edge cases that my initial assumptions are not good for. We stick to Kimball and star schemas pretty heavily, but what about people who do a normalized model and don't have a visually appealing star schema? What about the use case to see the entire data model in one visual?

@tayloramurphy
Copy link

@tayloramurphy It sounds like you'd want ERD components to be an additional schema.yml specification, rather than relying on unique + relationships tests to define PKs + FKs, respectively. Could you say more about your thinking there?

My first ideas for solutions typically lean more explicit than implicit. But if we're able to parse out relevant joins just based on testing I think it would make sense. Like @aroder mentioned there are corner cases on that.

  • if you have multiple unique tests for a given table would they all be parsed as primary keys?
  • limiting what gets drawn on an ERD would be tough
  • how do you define many-to-one, one-to-one, one-to-many joins?
  • can we limit columns appearing in the ERD tables?

Stuff like that. I think there are some clear easy wins / iterations that can happen on this which can certainly be improved over time. We're just doing things in LucidChart right now which makes me sad 😆 https://about.gitlab.com/handbook/business-ops/data-team/platform/edw/#first-iteration-in-gitlab-dwh

@RichardSharkey
Copy link

This is a fantastic idea and would certainly decrease the fragmentation of documentation. I think defining the explicit configuration would be preferrable, if I was to add an opinion to the mix, since you could then define the relationship style for the ERD connecting lines.

I'm wondering if using the exposures concept would be a good vehicle to start from, for this requirement. Since you are already configuring inferred relationships between a subset of objects, it would make sense to define how they are related. It would also mean you don't need to worry about the "limiting" factor straight away, since the expectation would be that you want to see all objects defined in the exposure.

With regards to limiting columns in the tables, perhaps you could have max limit of 50 columns with a collapse/expand option to hide the excess columns interactively.

Also, imagine being able to hover over the column in the ERD and seeing a tooltip of the docs description or a link to docs page.

@codigo-ergo-sum
Copy link

I want this feature as much as anyone, and having said that, I'm not so sure that the data in the schema.yml files provides the full information needed to do this. A couple of specific items that you need to get in order to build an ERD that are not present in schema.yml files:

  1. Primary Keys. One might argue, "Well, at the least, uniqueness tests indicate the equivalent of an enforced unique index on a table, and all a Primary Key is is choosing which of the one or more unique columns present in a table is the actual PK if there are more than one. At the least we could indicate unique columns on the ERD even if we can't show a PK". Even that might be the case but the challenge is that in schema.yml files you can call out things that are not actual columns in the table for a uniqueness test (and for many things in general). E.g. "col1 || col2: or "SubStr(col1, 2, 4)". How would that be shown on the ERD and how would relationships be shown from another table to a table where one of these calculations exists as being shown as unique?

  2. Column datatypes. We would need to interrogate INFORMATION_SCHEMA or the equivalent for these. That is doable, it just leads to another layer of complexity for the code. What if the columns defined in INFORMATION_SCHEMA don't match up to what is in the .yml files?

@noahbruegmann
Copy link
Author

  1. Primary Keys. One might argue, "Well, at the least, uniqueness tests indicate the equivalent of an enforced unique index on a table, and all a Primary Key is is choosing which of the one or more unique columns present in a table is the actual PK if there are more than one. At the least we could indicate unique columns on the ERD even if we can't show a PK". Even that might be the case but the challenge is that in schema.yml files you can call out things that are not actual columns in the table for a uniqueness test (and for many things in general). E.g. "col1 || col2: or "SubStr(col1, 2, 4)". How would that be shown on the ERD and how would relationships be shown from another table to a table where one of these calculations exists as being shown as unique?

To get around this, you could create some custom schema tests in a package that are reserved for things that you want to include in the ERD. They'd be custom tests that didn't really do anything interesting, but they could serve as a hook for helping decide what to pick up. This feels like it would be a nice balance between not having to define everything twice and still getting to a more explicit definition for the ERD, to address @tayloramurphy 's comment.

@jtcohen6
Copy link
Contributor

I'm wondering if using the exposures concept would be a good vehicle to start from, for this requirement. Since you are already configuring inferred relationships between a subset of objects, it would make sense to define how they are related. It would also mean you don't need to worry about the "limiting" factor straight away, since the expectation would be that you want to see all objects defined in the exposure.

I really like this. We could put an ERD right on the exposure's page in docs site.

Column datatypes. We would need to interrogate INFORMATION_SCHEMA or the equivalent for these. That is doable, it just leads to another layer of complexity for the code. What if the columns defined in INFORMATION_SCHEMA don't match up to what is in the .yml files?

We have access to column datatypes in catalog.json, populated by the dbt docs generate command, and they ultimately come from information schema or show/describe queries. I think that should be the source of truth.

@eduard1987
Copy link

We wanted to know if you have any plans to release capabilities for data modeling with GUI to reverse and forward engineer DDL.

@RickRen7575
Copy link

Hi we are also looking for this functionality! Would love to hear if there is a timeline on this.

@panasenco
Copy link

panasenco commented Jun 7, 2021

Also very interested in this functionality. While waiting for an official implementation, I created this PowerShell solution that parses manifest.json and produces a (honestly kinda ugly but readable) PlantUML diagram: https://gist.github.com/panasenco/31d4c12f0cadce91f576a818018970a0

@blake-enyart
Copy link

blake-enyart commented Sep 13, 2021

@jtcohen6, I'm interested in trying to go about developing the functionality here for injecting .png files into the markdown and attempting to have this be performed with the dbt docs generate command with perhaps some type of flag to indicate ERD generation.

Right now, I'm starting some initial prototyping with @aroder who has an early stage POC that generates the .png files from the schema.yml files via a short python script which identifies facts and dimensions based on naming conventions of fct_ and dim_. I know this isn't the ideal for a more permanent solution.

WIth that, do you have recommendations on how I might be able get started to either A) add in this existing script to the dbt core system or B) provide further guidance on what is needed to get started on this?

From there, I'm happy to carry on to the next phase of development that would leverage the manifest.json file like you recommended earlier up in this post.

Let me know what you think on next steps for this.

@blake-enyart
Copy link

@jtcohen6 @drewbanin (?) just wanted to follow-up on this and see if there are any insights on next steps for this.

@joeyfezster
Copy link

To the great dbt community - wondering if 2022 brings any update on this 😄

@aroder
Copy link

aroder commented Apr 11, 2022

@joeyfezster I'm looking for collaborators to make this more robust. It's working for us but could use some work #84 (comment)

@noahbruegmann
Copy link
Author

@joeyfezster I'm looking for collaborators to make this more robust. It's working for us but could use some work #84 (comment)

@aroder we have some code (written for other purposes) that lets us load up the network file and comprehend the compiled graph. If you have the visualization side, we might be able to back it with data in the way that Jeremy suggested above. Is that the part that you're looking for help with?

@brandco
Copy link

brandco commented Jun 24, 2022

I like this open source package for the R programing language:
https://krlmlr.github.io/dm/
Not sure how hard it would be to translate the dbt yml to what the dm package can parse (I don't know dbt well, myself)

@obar1
Copy link

obar1 commented Oct 16, 2022

I would solve adding a yml for tuning the docs generation with multiple options like so

  • add lineage graph
  • add column names
  • add docs
  • add tests
  • etc
    So I could create an erd or a full doc just changing the yaml config 😅

@datnguye
Copy link

Hey folks, I just published first version CLI for generating ERD-as-code in DBML from manifest.json, which can be easily used together in the generating dbt docs pipeline.

https://github.com/datnguye/dbterd

Check it out!
Feel free to let me know for any questions or requests.
Cheers

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 25, 2023
@eitsupi
Copy link

eitsupi commented Aug 25, 2023

unstale

@noahbruegmann
Copy link
Author

Thanks @eitsupi! Perhaps not surprising given that I opened the ticket initially, but I still think having a solution built into dbt would be great.

@github-actions github-actions bot removed the Stale label Aug 26, 2023
@jtcohen6 jtcohen6 removed the triage label Sep 22, 2023
@shawnchacko88
Copy link

This would be a great feature to update in dbt

@Grism
Copy link

Grism commented Jun 7, 2024

This would be a fantastic addition to DBT

@syedhanicf
Copy link

A solution like this built into dbt would be awesome

@kobejeedeluna-ms
Copy link

I would love to have this feature in dbt-docs. Will help to convince teams to move to DBT

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests