Spec improvement: Enable data to reside in files while still describing relationships via foreign keys for discoverability #964

Open
adrienDog opened this issue Jul 15, 2024 · 2 comments

Comments

@adrienDog

Schematize relationships of files

Summary of the issue

More and more data is not tabular (images, videos, HTML, etc.), but it still has relationships.

The current datapackage.org spec allows defining relationships between table fields, but it does not allow this for pure file resources.

A workaround is to have a special table resource that lists the files and leverages the foreignKeys spec, but this mixes data and schema concepts: one column of the table resource is a path to another resource.

Why is this a problem?

The great benefit of using datapackage.org is that it makes data discoverable, even programmatically: schemas are declared, so a whole database schema can be generated without reading the content of each resource. If a relationship is defined inside the data (the resource content), that discoverability is lost, because a script cannot find the relationship by following the datapackage.org standard alone.
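
To make the discoverability point concrete, here is a minimal Python sketch of such a script. It is an illustration only, not part of the spec or of any library: the helper names (as_list, iter_foreign_keys) are made up, the descriptor is assumed to be already parsed into a dict, and foreignKeys is assumed to sit inside schema as in Table Schema. It lists every declared relationship without opening a single data file.

```python
from typing import Iterator


def as_list(value) -> list[str]:
    """Accept either a single field name or a list of names."""
    return [value] if isinstance(value, str) else list(value)


def iter_foreign_keys(descriptor: dict) -> Iterator[tuple[str, list[str], str, list[str]]]:
    """Yield (resource, fields, target resource, target fields) from metadata only."""
    for resource in descriptor.get("resources", []):
        schema = resource.get("schema")
        if not isinstance(schema, dict):
            continue  # no inline schema (e.g. a pure file resource): nothing to discover
        for fk in schema.get("foreignKeys", []):
            reference = fk["reference"]
            # A missing or empty reference.resource is treated as a self-reference.
            target = reference.get("resource") or resource["name"]
            yield resource["name"], as_list(fk["fields"]), target, as_list(reference["fields"])


# Hypothetical descriptor mirroring the people/traits use case below.
descriptor = {
    "resources": [
        {"name": "people", "path": "people.csv",
         "schema": {"fields": [{"name": "id", "type": "string"}]}},
        {"name": "traits", "path": "traits.csv",
         "schema": {
             "fields": [{"name": "person_id", "type": "string"}],
             "foreignKeys": [
                 {"fields": ["person_id"],
                  "reference": {"resource": "people", "fields": ["id"]}}
             ],
         }},
        # A pure file resource: the descriptor carries no relationship for it.
        {"name": "profile", "path": "people_files/person1/profile.jpg"},
    ]
}

for source, fields, target, target_fields in iter_foreign_keys(descriptor):
    print(f"{source}{fields} -> {target}{target_fields}")
```

The profile.jpg resource never shows up in the output, which is exactly the gap this issue is about.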

What could be done?

We will use the following use case:

./people-data/
  |__ people.csv # list of people
  |__ traits.csv # several traits per person, one row = one trait
  |__ people_files/
    |__ person1/
      |__ profile.jpg
      |__ randomName.png
    |__ person2/
      |__ profile.jpg
      |__ randomName.png

Option 1: Special fileTable table resource that describes data located in files and its references to other resources via foreign keys

We extend the table type with a fileTable type, which requires the first field to be file_path.

  • file_path must be a valid path to a file in the datapackage

Q: Do we need to declare each file resource when they are described in a fileTable?

resources:
  - name: people
    type: table
    path: ./people.csv
    scheme: file
    format: csv
    mediatype: text/csv
    schema:
      fields:
        - name: id
          type: string
        - name: name
          type: string
  - name: traits
    type: table
    path: ./traits.csv
    scheme: file
    format: csv
    mediatype: text/csv
    schema:
      fields:
        - name: person_id
          type: string
        - name: eye_color
          type: string
      foreignKeys:
        - fields: ["person_id"]
          reference:
            resource: people
            fields: ["id"]
  - name: files
    type: fileTable # special resource type
    schema:
      fields:
        - name: file_path # first field has to be file_path
          type: string
        - name: person_id
          type: string
        - name: mediatype
          type: string
      foreignKeys:
        - fields: ["person_id"]
          reference:
            resource: people
            fields: ["id"]
    data:
      - file_path: ./people_files/person1/profile.jpg
        person_id: person1
        mediatype: image/jpeg
      - file_path: ./people_files/person1/randomName.png
        person_id: person1
        mediatype: image/png
      # etc ...
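
As a rough sketch of what a validator for the proposed fileTable could check, assuming the two rules stated above (first field is file_path, and each file_path resolves to a file inside the datapackage). The function name check_file_table and the error messages are hypothetical, and the resource is assumed to be already parsed into a dict:

```python
from pathlib import Path


def check_file_table(resource: dict, package_root: Path) -> list[str]:
    """Return a list of problems found in a (proposed) fileTable resource."""
    errors = []
    fields = (resource.get("schema") or {}).get("fields", [])
    if not fields or fields[0].get("name") != "file_path":
        errors.append("first field must be file_path")

    root = package_root.resolve()
    for index, row in enumerate(resource.get("data", [])):
        raw = row.get("file_path")
        if raw is None:
            errors.append(f"row {index}: missing file_path")
            continue
        target = (root / raw).resolve()
        # Path.is_relative_to requires Python 3.9+.
        if not target.is_relative_to(root):
            errors.append(f"row {index}: {raw} points outside the datapackage")
        elif not target.is_file():
            errors.append(f"row {index}: {raw} does not exist")
    return errors
```

Calling check_file_table(files_resource, Path("./people-data")) would return an empty list when every listed file is present under the package directory.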

Option 2: File resources can reference a virtualTable parent

The idea here is that the schema can still be defined in a table-like resource, but the data is in files that have to reference it AND have the same fields as properties.

This allows data frames to be in separate JSON files, for example, and leverages the existing metadata definitions for files, such as format, mediatype, etc.

resources:
  - name: people
    type: table
    path: ./people.csv
    scheme: file
    format: csv
    mediatype: text/csv
    schema:
      fields:
        - name: id
          type: string
        - name: name
          type: string
  - name: traits
    type: table
    path: ./traits.csv
    scheme: file
    format: csv
    mediatype: text/csv
    schema:
      fields:
        - name: person_id
          type: string
        - name: eye_color
          type: string
      foreignKeys:
        - fields: ["person_id"]
          reference:
            resource: people
            fields: ["id"]
  - name: files
    type: virtualTable # special resource type, virtual, only has schema
    schema:
      fields: # some fields have to be defined in each frame
        - name: person_id
          type: string
      foreignKeys:
        - fields: ["person_id"]
          reference:
            resource: people
            fields: ["id"]
  - name: people_files_person1_profile_jpg
    type: virtualTableFrame
    path: ./people_files/person1/profile.jpg
    mediatype: image/jpeg
    format: jpg
    data:
      person_id: person1
    # path: path/to/json/file.json {"person_id": "person1"}
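
A minimal sketch of the "same fields as properties" rule for Option 2, in Python. The example above does not show how a frame names its parent virtualTable, so this hypothetical helper (missing_frame_fields) simply takes both resources as arguments and reports which schema fields the frame's data does not provide:

```python
def missing_frame_fields(virtual_table: dict, frame: dict) -> list[str]:
    """Names of fields declared by the virtualTable but absent from the frame's data."""
    fields = (virtual_table.get("schema") or {}).get("fields", [])
    data = frame.get("data") or {}
    return [field["name"] for field in fields if field["name"] not in data]


# Hypothetical resources mirroring the example above, parsed into dicts.
files = {
    "name": "files",
    "type": "virtualTable",
    "schema": {"fields": [{"name": "person_id", "type": "string"}]},
}
profile = {
    "name": "people_files_person1_profile_jpg",
    "type": "virtualTableFrame",
    "path": "./people_files/person1/profile.jpg",
    "data": {"person_id": "person1"},
}

print(missing_frame_fields(files, profile))
```

A frame without a data block would be reported as missing person_id, so a package could be rejected before any file content is read.
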
@peterdesmet
Member

Hi, why not use option 1, but as a regular tabular data resource?

resources:
  - name: people
  - name: traits
  - name: files
    type: table # Regular tabular resource
    schema:
      fields:
        - name: person_id
          type: string
        - name: file_path # Does not need to be the first field
          type: string
          format: default # We could benefit here from an additional format "path", cf. "uri" for URLs
        - name: mediatype
          type: string
      foreignKeys:
        - fields: ["person_id"]
          reference:
            resource: people
            fields: ["id"]
    data:
      - person_id: person1
        file_path: ./people_files/person1/profile.jpg
        mediatype: image/jpeg
      - person_id: person1
        file_path: ./people_files/person1/randomName.png
        mediatype: image/png
      # etc ...
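
For illustration, a small Python sketch of how a consumer could find path-bearing fields from metadata alone if a "path" string format existed. That format is only the idea floated in the comment above and is not in the spec today; the function name path_fields is likewise hypothetical:

```python
def path_fields(resource: dict) -> list[str]:
    """Names of string fields flagged with the hypothetical format "path"."""
    fields = (resource.get("schema") or {}).get("fields", [])
    return [
        field["name"]
        for field in fields
        if field.get("type") == "string" and field.get("format") == "path"
    ]


# Hypothetical resource: the field name no longer matters, only its declared format.
media = {
    "name": "files",
    "type": "table",
    "schema": {
        "fields": [
            {"name": "person_id", "type": "string"},
            {"name": "image_file_path", "type": "string", "format": "path"},
            {"name": "mediatype", "type": "string"},
        ]
    },
}

print(path_fields(media))
```

With format: default, as in the example above, the same function returns an empty list, since nothing in the metadata marks the field as a path.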

We use this approach to describe media files in a camera trap study (including foreign keys):

@adrienDog
Author

Hi @peterdesmet 👋 Thank you very much for reviewing this 🙏

Why not a regular table resource?

I've described the limitation of this approach in the issue description:

Why is this a problem?
The great benefit of using datapackage.org is that it makes data discoverable, even programmatically: schemas are declared, so a whole database schema can be generated without reading the content of each resource. If a relationship is defined inside the data (the resource content), that discoverability is lost, because a script cannot find the relationship by following the datapackage.org standard alone.

In more practical terms: would a system be able to auto-discover that the data is also contained in files, just from the datapackage specs?
The answer is no at the moment: one has to know that there is a file_path field that points to files in the current datapackage:

  • what if there are several files?
  • what if the field is called image_file_path, img_path, etc.?

Basically, data in files is not a first-class citizen the way table resources are, and this limits automatic discoverability.

2 participants