More and more data is not tabular (images, videos, HTML, etc.), but it still has relationships.
The current datapackage.org spec allows defining relationships between table fields, but not between pure file resources.
A workaround is to have a special table resource that lists the files and leverages the foreignKeys spec, but this mixes data and schema concepts: one column of the table resource is a path to another resource.
Why is this a problem?
The great benefit of using datapackage.org is that it makes data discoverable, even programmatically: schemas are declared up front, so a whole database schema can be generated without reading the content of each resource. If a relationship is defined inside the data (the resource content), that discoverability is lost, because a script cannot find the relationship by following the datapackage.org standard alone.
What could be done?
We will use the following use case:
./people-data/
|__ people.csv           # list of people
|__ traits.csv           # several traits per person, one row = one trait
|__ people_files/
    |__ person1/
    |   |__ profile.jpg
    |   |__ randomName.png
    |__ person2/
        |__ profile.jpg
        |__ randomName.png
Option 1: A special fileTable table resource that describes data located in files and its references to other resources via foreign keys
We extend the table type with a fileTable type, which requires the first field to be file_path.
file_path must be a valid path to a file in the datapackage.
Q: Do we need to declare each file as a resource when it is already described in a fileTable?
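The two constraints of Option 1 could be checked roughly as follows. This is only a sketch: `fileTable` is the type proposed in this issue, not an existing one, and the helper `check_file_table` and its arguments are hypothetical (rows are assumed to be already parsed into dicts).

```python
from pathlib import Path


def check_file_table(resource, rows, package_dir):
    """Check the two fileTable rules proposed above; return a list of error strings."""
    fields = resource["schema"]["fields"]
    if not fields or fields[0]["name"] != "file_path":
        return ["first field must be 'file_path'"]

    errors = []
    root = Path(package_dir).resolve()
    for index, row in enumerate(rows):
        target = (root / row["file_path"]).resolve()
        if not target.is_relative_to(root):  # requires Python 3.9+
            errors.append(f"row {index}: {row['file_path']} points outside the package")
        elif not target.is_file():
            errors.append(f"row {index}: {row['file_path']} not found")
    return errors
```

Note that the second rule can only be verified by reading the rows, which is exactly the data/schema mixing described in the summary.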
Option 2: File resources can reference a virtualTable parent
The idea here is that the schema is still defined in a table-like resource, but the data lives in files that must reference that resource AND carry the same fields as properties.
This allows data frames to be stored in separate JSON files, for example, and leverages the existing metadata definitions for files, such as format, mediatype, etc.
resources:
  - name: people
    type: table
    path: ./people.csv
    scheme: file
    format: csv
    mediatype: text/csv
    schema:
      fields:
        - name: id
          type: string
        - name: name
          type: string
  - name: traits
    type: table
    path: ./traits.csv
    scheme: file
    format: csv
    mediatype: text/csv
    schema:
      fields:
        - name: person_id
          type: string
        - name: eye_color
          type: string
      foreignKeys:
        - fields: ["person_id"]
          reference:
            resource: people
            fields: ["id"]
  - name: files
    type: virtualTable # special resource type, virtual, only has schema
    schema:
      fields: # some fields have to be defined in each frame
        - name: person_id
          type: string
      foreignKeys:
        - fields: ["person_id"]
          reference:
            resource: people
            fields: ["id"]
  - name: people_files_person1_profile_jpg
    type: virtualTableFrame
    path: ./people_files/person1/profile.jpg
    mediatype: image/jpeg
    format: jpg
    data:
      person_id: person1
      # path: path/to/json/file.json {"person_id": "person1"}
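The "same fields as properties" rule of Option 2 could be checked from the descriptor alone, along these lines. This is a sketch under assumptions: `virtualTable` and `virtualTableFrame` are the types proposed here, the example above does not show how a frame names its parent, so the code assumes the package contains exactly one `virtualTable`, and it assumes `data` is given inline rather than as a path to a JSON file. The name `check_frames` is hypothetical.

```python
def check_frames(descriptor):
    """Map each frame name to the field names it is missing or adds unexpectedly."""
    resources = descriptor["resources"]
    tables = [r for r in resources if r.get("type") == "virtualTable"]
    if len(tables) != 1:
        # Assumption of this sketch: a single parent, so no explicit link is needed.
        raise ValueError("this sketch assumes exactly one virtualTable resource")
    expected = {field["name"] for field in tables[0]["schema"]["fields"]}

    problems = {}
    for resource in resources:
        if resource.get("type") != "virtualTableFrame":
            continue
        actual = set(resource.get("data", {}))  # inline data only
        if actual != expected:
            problems[resource["name"]] = {
                "missing": sorted(expected - actual),
                "unexpected": sorted(actual - expected),
            }
    return problems


# Hypothetical package: one valid frame and one frame without its person_id.
example = {
    "resources": [
        {
            "name": "files",
            "type": "virtualTable",
            "schema": {"fields": [{"name": "person_id", "type": "string"}]},
        },
        {
            "name": "people_files_person1_profile_jpg",
            "type": "virtualTableFrame",
            "path": "./people_files/person1/profile.jpg",
            "data": {"person_id": "person1"},
        },
        {
            "name": "people_files_person2_profile_jpg",
            "type": "virtualTableFrame",
            "path": "./people_files/person2/profile.jpg",
            "data": {},
        },
    ]
}

frame_problems = check_frames(example)
```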
Hi, why not use option 1, but as a regular tabular data resource?
resources:
  - name: people
  - name: traits
  - name: files
    type: table # Regular tabular resource
    schema:
      fields:
        - name: person_id
          type: string
        - name: file_path # Does not need to be the first field
          type: string
          format: default # We could benefit here from an additional format "path", cf. "uri" for URLs
        - name: mediatype
          type: string
      foreignKeys:
        - fields: ["person_id"]
          reference:
            resource: people
            fields: ["id"]
    data:
      - person_id: person1
        file_path: ./people_files/person1/profile.jpg
        mediatype: image/jpeg
      - person_id: person1
        file_path: ./people_files/person1/randomName.png
        mediatype: image/png
      # etc ...
We use this approach to describe media files in a camera trap study (including foreign keys):
Hi @peterdesmet 👋 Thank you very much for reviewing this 🙏
Why not a regular table resource?
I've described the limitation of this approach in the description:
Why is this a problem?
Well, the great benefit of using datapackage.org is to enable discoverability of data and a programmatic one even. So that schemas are defined and a whole database schema can be generated without reading the content of each resource. If the relationship is defined inside the data (resource content), then discoverability cannot happen because a script cannot discover these relationships by purely following the datapackage.org standard.
In more practical terms: would a system be able to auto-discover that the data is also contained in files, just from the datapackage specs?
The answer is no at the moment: one has to know that there is a file_path field that relates to the current datapackage:
What if there are several files?
What if the field is called image_file_path, img_path, etc.?
Basically, data in files is not a first-class citizen the way table resources are, and that limits automatic discoverability.
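For comparison, here is a sketch of what discovery could look like if the field itself declared that it holds a path, for instance through the additional `path` format floated in the comment above. That format does not exist in the current spec; it and the helper name `find_path_fields` are assumptions made purely for illustration.

```python
def find_path_fields(descriptor):
    """Return (resource_name, field_name) pairs for string fields marked format 'path'."""
    found = []
    for resource in descriptor.get("resources", []):
        schema = resource.get("schema")
        if not isinstance(schema, dict):
            continue
        for field in schema.get("fields", []):
            # "path" is a hypothetical format value, not part of the current spec.
            if field.get("type") == "string" and field.get("format") == "path":
                found.append((resource["name"], field["name"]))
    return found


# Hypothetical descriptor: the column name no longer matters, only its declared format.
media_package = {
    "resources": [
        {
            "name": "files",
            "type": "table",
            "schema": {
                "fields": [
                    {"name": "person_id", "type": "string"},
                    {"name": "img_path", "type": "string", "format": "path"},
                    {"name": "thumbnail_path", "type": "string", "format": "path"},
                ]
            },
        }
    ]
}

path_fields = find_path_fields(media_package)
```

With such a marker, several file columns and arbitrary column names could both be handled without prior knowledge of the package.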
Schematize relationships of files