Create ERD from relationships in schema.yml files #84

noahbruegmann · 2020-04-15T00:38:11Z

For complicated table relationships, it's nice to have an ERD (i.e. schema diagram) to help understand how everything fits together.

Schema.yml files already have relationship tests built into them that show the relationship between tables, so those could be parsed to get the data needed for the "keys," which may not actually be keys in most warehouse systems.

A first implementation of this functionality might just look at the schema.yml file and get any columns listed there so that only one data source is needed. A fuller implementation would get all columns from the tables.

Other thoughts:

Integrating the descriptions that are in the schema.yml files would also be great, perhaps with a tooltip next to the table or column name
If there are multiple schema.yml files in a project, it might be nice to color or box the tables that are known to be related to each other because they're in the same yml. Also would be nice to be able to filter for things in the same yml file. I think this is the nicest way to deal with being able to look at individual layers of a project, but would like someone else to think about that too
Being able to zoom in/out and pan around like in the graph view would be a nice feature here

tayloramurphy · 2020-06-29T15:40:56Z

We're thinking about this too. dbt docs are great, but a lot of the data in there isn't for the right downstream audience. If you have business users that want to know what to query and where, having a clear ERD can be more useful than saying "go read the dbt docs".

I agree that this should just be done in the schema.yml files, probably using the ref syntax with some specific variables for PKs, FKs, etc.

aroder · 2020-10-27T21:31:20Z

We've tried a lot of solutions for this, none of which quite fit and none of which supported the schema.yml spec directly.

https://github.com/BurntSushi/erd, https://gojs.net/latest/index.html, https://pypi.org/project/ERAlchemy/, http://www.nomnoml.com/, https://dbdocs.io/

Right now, we have a functional tool that parses the schema.yml, specifically the relationships tests and columns to auto-generate an ERD

I've never worked in the dbt codebase, and there is a fair amount of work to make this something that will work generally. I'm interested in seeing this through, but would need some help

aroder · 2020-10-27T21:31:57Z

https://getdbt.slack.com/archives/C0VLNUUTZ/p1603828627482100

jtcohen6 · 2020-10-28T03:56:34Z

This is really, really cool!

Since dbt-docs is already built out of artifacts produced by dbt docs generate, namely manifest.json (which includes all tests and documentation) and catalog.json (which includes the set of all columns in the database), I think it would make sense to leverage the information available in those as inputs to the ERD.

E.g. grab the to/from components of all relationships tests in a v0.18.1-produced manifest:

jq '.nodes | .[] | select(.test_metadata.name=="relationships") | .test_metadata.kwargs' target/manifest.json

...
{
  "to": "ref('snowplow_web_page_context')",
  "field": "page_view_id",
  "column_name": "page_view_id",
  "model": "{{ ref('snowplow_web_events') }}"
}
...

So my questions are:

Tactically, is there a reason to prefer parsing the schema.yml files directly?
@tayloramurphy It sounds like you'd want ERD components to be an additional schema.yml specification, rather than relying on unique + relationships tests to define PKs + FKs, respectively. Could you say more about your thinking there?

aroder · 2020-10-28T13:38:36Z

@jtcohen6 there is no benefit to parsing schema.yml directly. We don't have any experience working in the dbt docs code, so we just took the path of least resistance.

My preference would be to use the existing specification if possible. The column names, unique tests, and relationship tests provide the necessary metadata. If we are able to use the existing specification, then something like this would just work. Maybe we would need to add a toggle in dbt_project.yml to enable the feature, or something to that effect.

However, I can see a lot of edge cases that my initial assumptions are not good for. We stick to Kimball and star schemas pretty heavily, but what about people who do a normalized model and don't have a visually appealing star schema? What about the use case to see the entire data model in one visual?

tayloramurphy · 2020-10-28T14:14:38Z

@tayloramurphy It sounds like you'd want ERD components to be an additional schema.yml specification, rather than relying on unique + relationships tests to define PKs + FKs, respectively. Could you say more about your thinking there?

My first ideas for solutions typically lean more explicit than implicit. But if we're able to parse out relevant joins just based on testing I think it would make sense. Like @aroder mentioned there are corner cases on that.

if you have multiple unique tests for a given table would they all be parsed as primary keys?
limiting what gets drawn on an ERD would be tough
how do you define many-to-one, one-to-one, one-to-many joins?
can we limit columns appearing in the ERD tables?

Stuff like that. I think there are some clear easy wins / iterations that can happen on this which can certainly be improved over time. We're just doing things in LucidChart right now which makes me sad 😆 https://about.gitlab.com/handbook/business-ops/data-team/platform/edw/#first-iteration-in-gitlab-dwh

RichardSharkey · 2020-12-18T10:48:14Z

This is a fantastic idea and would certainly decrease the fragmentation of documentation. I think defining the explicit configuration would be preferrable, if I was to add an opinion to the mix, since you could then define the relationship style for the ERD connecting lines.

I'm wondering if using the exposures concept would be a good vehicle to start from, for this requirement. Since you are already configuring inferred relationships between a subset of objects, it would make sense to define how they are related. It would also mean you don't need to worry about the "limiting" factor straight away, since the expectation would be that you want to see all objects defined in the exposure.

With regards to limiting columns in the tables, perhaps you could have max limit of 50 columns with a collapse/expand option to hide the excess columns interactively.

Also, imagine being able to hover over the column in the ERD and seeing a tooltip of the docs description or a link to docs page.

codigo-ergo-sum · 2020-12-18T16:58:54Z

I want this feature as much as anyone, and having said that, I'm not so sure that the data in the schema.yml files provides the full information needed to do this. A couple of specific items that you need to get in order to build an ERD that are not present in schema.yml files:

Primary Keys. One might argue, "Well, at the least, uniqueness tests indicate the equivalent of an enforced unique index on a table, and all a Primary Key is is choosing which of the one or more unique columns present in a table is the actual PK if there are more than one. At the least we could indicate unique columns on the ERD even if we can't show a PK". Even that might be the case but the challenge is that in schema.yml files you can call out things that are not actual columns in the table for a uniqueness test (and for many things in general). E.g. "col1 || col2: or "SubStr(col1, 2, 4)". How would that be shown on the ERD and how would relationships be shown from another table to a table where one of these calculations exists as being shown as unique?
Column datatypes. We would need to interrogate INFORMATION_SCHEMA or the equivalent for these. That is doable, it just leads to another layer of complexity for the code. What if the columns defined in INFORMATION_SCHEMA don't match up to what is in the .yml files?

noahbruegmann · 2020-12-18T17:30:01Z

Primary Keys. One might argue, "Well, at the least, uniqueness tests indicate the equivalent of an enforced unique index on a table, and all a Primary Key is is choosing which of the one or more unique columns present in a table is the actual PK if there are more than one. At the least we could indicate unique columns on the ERD even if we can't show a PK". Even that might be the case but the challenge is that in schema.yml files you can call out things that are not actual columns in the table for a uniqueness test (and for many things in general). E.g. "col1 || col2: or "SubStr(col1, 2, 4)". How would that be shown on the ERD and how would relationships be shown from another table to a table where one of these calculations exists as being shown as unique?

To get around this, you could create some custom schema tests in a package that are reserved for things that you want to include in the ERD. They'd be custom tests that didn't really do anything interesting, but they could serve as a hook for helping decide what to pick up. This feels like it would be a nice balance between not having to define everything twice and still getting to a more explicit definition for the ERD, to address @tayloramurphy 's comment.

jtcohen6 · 2020-12-18T18:05:06Z

I'm wondering if using the exposures concept would be a good vehicle to start from, for this requirement. Since you are already configuring inferred relationships between a subset of objects, it would make sense to define how they are related. It would also mean you don't need to worry about the "limiting" factor straight away, since the expectation would be that you want to see all objects defined in the exposure.

I really like this. We could put an ERD right on the exposure's page in docs site.

Column datatypes. We would need to interrogate INFORMATION_SCHEMA or the equivalent for these. That is doable, it just leads to another layer of complexity for the code. What if the columns defined in INFORMATION_SCHEMA don't match up to what is in the .yml files?

We have access to column datatypes in catalog.json, populated by the dbt docs generate command, and they ultimately come from information schema or show/describe queries. I think that should be the source of truth.

eduard1987 · 2021-02-05T19:45:45Z

We wanted to know if you have any plans to release capabilities for data modeling with GUI to reverse and forward engineer DDL.

RickRen7575 · 2021-04-21T15:39:12Z

Hi we are also looking for this functionality! Would love to hear if there is a timeline on this.

panasenco · 2021-06-07T23:48:21Z

Also very interested in this functionality. While waiting for an official implementation, I created this PowerShell solution that parses manifest.json and produces a (honestly kinda ugly but readable) PlantUML diagram: https://gist.github.com/panasenco/31d4c12f0cadce91f576a818018970a0

blake-enyart · 2021-09-13T06:02:32Z

@jtcohen6, I'm interested in trying to go about developing the functionality here for injecting .png files into the markdown and attempting to have this be performed with the dbt docs generate command with perhaps some type of flag to indicate ERD generation.

Right now, I'm starting some initial prototyping with @aroder who has an early stage POC that generates the .png files from the schema.yml files via a short python script which identifies facts and dimensions based on naming conventions of fct_ and dim_. I know this isn't the ideal for a more permanent solution.

WIth that, do you have recommendations on how I might be able get started to either A) add in this existing script to the dbt core system or B) provide further guidance on what is needed to get started on this?

From there, I'm happy to carry on to the next phase of development that would leverage the manifest.json file like you recommended earlier up in this post.

Let me know what you think on next steps for this.

blake-enyart · 2021-09-22T19:29:02Z

@jtcohen6 @drewbanin (?) just wanted to follow-up on this and see if there are any insights on next steps for this.

joeyfezster · 2022-03-24T15:19:24Z

To the great dbt community - wondering if 2022 brings any update on this 😄

aroder · 2022-04-11T18:10:04Z

@joeyfezster I'm looking for collaborators to make this more robust. It's working for us but could use some work #84 (comment)

noahbruegmann · 2022-04-26T17:14:00Z

@joeyfezster I'm looking for collaborators to make this more robust. It's working for us but could use some work #84 (comment)

@aroder we have some code (written for other purposes) that lets us load up the network file and comprehend the compiled graph. If you have the visualization side, we might be able to back it with data in the way that Jeremy suggested above. Is that the part that you're looking for help with?

brandco · 2022-06-24T00:55:36Z

I like this open source package for the R programing language:
https://krlmlr.github.io/dm/
Not sure how hard it would be to translate the dbt yml to what the dm package can parse (I don't know dbt well, myself)

obar1 · 2022-10-16T07:59:01Z

I would solve adding a yml for tuning the docs generation with multiple options like so

add lineage graph
add column names
add docs
add tests
etc
So I could create an erd or a full doc just changing the yaml config 😅

datnguye · 2023-02-25T14:09:11Z

Hey folks, I just published first version CLI for generating ERD-as-code in DBML from manifest.json, which can be easily used together in the generating dbt docs pipeline.

https://github.com/datnguye/dbterd

Check it out!
Feel free to let me know for any questions or requests.
Cheers

github-actions · 2023-08-25T01:43:37Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

eitsupi · 2023-08-25T12:27:32Z

unstale

noahbruegmann · 2023-08-25T15:41:04Z

Thanks @eitsupi! Perhaps not surprising given that I opened the ticket initially, but I still think having a solution built into dbt would be great.

shawnchacko88 · 2024-01-31T14:32:46Z

This would be a great feature to update in dbt

Grism · 2024-06-07T16:04:44Z

This would be a fantastic addition to DBT

syedhanicf · 2024-08-22T14:03:25Z

A solution like this built into dbt would be awesome

kobejeedeluna-ms · 2024-09-22T07:23:17Z

I would love to have this feature in dbt-docs. Will help to convince teams to move to DBT

dubravcik mentioned this issue Nov 18, 2020

generate model relation matrix from fk tests/relations #152

Closed

jtcohen6 mentioned this issue Feb 22, 2021

Storing test results in the database dbt-labs/dbt-core#2593

Closed

jtcohen6 mentioned this issue May 6, 2021

Support PK and FK declarations for ER diagram and modelling purposes dbt-labs/dbt-core#3295

Closed

jtcohen6 mentioned this issue Jul 18, 2021

Include required columns in exposures? dbt-labs/dbt-core#3533

Closed

jtcohen6 mentioned this issue Apr 19, 2022

[Feature] dbt should know about joins dbt-labs/dbt-core#4125

Closed

1 task

github-actions bot added the Stale label Aug 25, 2023

github-actions bot added the triage label Aug 25, 2023

github-actions bot removed the Stale label Aug 26, 2023

jtcohen6 removed the triage label Sep 22, 2023

github-actions bot added the triage label Jan 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create ERD from relationships in schema.yml files #84

Create ERD from relationships in schema.yml files #84

noahbruegmann commented Apr 15, 2020

tayloramurphy commented Jun 29, 2020

aroder commented Oct 27, 2020

aroder commented Oct 27, 2020

jtcohen6 commented Oct 28, 2020 •

edited

Loading

aroder commented Oct 28, 2020 •

edited

Loading

tayloramurphy commented Oct 28, 2020

RichardSharkey commented Dec 18, 2020

codigo-ergo-sum commented Dec 18, 2020

noahbruegmann commented Dec 18, 2020

jtcohen6 commented Dec 18, 2020

eduard1987 commented Feb 5, 2021

RickRen7575 commented Apr 21, 2021

panasenco commented Jun 7, 2021 •

edited

Loading

blake-enyart commented Sep 13, 2021 •

edited

Loading

blake-enyart commented Sep 22, 2021

joeyfezster commented Mar 24, 2022

aroder commented Apr 11, 2022

noahbruegmann commented Apr 26, 2022

brandco commented Jun 24, 2022 •

edited

Loading

obar1 commented Oct 16, 2022

datnguye commented Feb 25, 2023

github-actions bot commented Aug 25, 2023

eitsupi commented Aug 25, 2023

noahbruegmann commented Aug 25, 2023

shawnchacko88 commented Jan 31, 2024

Grism commented Jun 7, 2024

syedhanicf commented Aug 22, 2024

kobejeedeluna-ms commented Sep 22, 2024

Create ERD from relationships in schema.yml files #84

Create ERD from relationships in schema.yml files #84

Comments

noahbruegmann commented Apr 15, 2020

tayloramurphy commented Jun 29, 2020

aroder commented Oct 27, 2020

aroder commented Oct 27, 2020

jtcohen6 commented Oct 28, 2020 • edited Loading

aroder commented Oct 28, 2020 • edited Loading

tayloramurphy commented Oct 28, 2020

RichardSharkey commented Dec 18, 2020

codigo-ergo-sum commented Dec 18, 2020

noahbruegmann commented Dec 18, 2020

jtcohen6 commented Dec 18, 2020

eduard1987 commented Feb 5, 2021

RickRen7575 commented Apr 21, 2021

panasenco commented Jun 7, 2021 • edited Loading

blake-enyart commented Sep 13, 2021 • edited Loading

blake-enyart commented Sep 22, 2021

joeyfezster commented Mar 24, 2022

aroder commented Apr 11, 2022

noahbruegmann commented Apr 26, 2022

brandco commented Jun 24, 2022 • edited Loading

obar1 commented Oct 16, 2022

datnguye commented Feb 25, 2023

github-actions bot commented Aug 25, 2023

eitsupi commented Aug 25, 2023

noahbruegmann commented Aug 25, 2023

shawnchacko88 commented Jan 31, 2024

Grism commented Jun 7, 2024

syedhanicf commented Aug 22, 2024

kobejeedeluna-ms commented Sep 22, 2024

jtcohen6 commented Oct 28, 2020 •

edited

Loading

aroder commented Oct 28, 2020 •

edited

Loading

panasenco commented Jun 7, 2021 •

edited

Loading

blake-enyart commented Sep 13, 2021 •

edited

Loading

brandco commented Jun 24, 2022 •

edited

Loading