-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements to RO-Crate in Python tutorial #4850
Changes from all commits
0788634
3f594b1
3f84706
67987c3
da7a176
c576a19
8df1ab1
bc347fe
cff2ad3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,6 +20,9 @@ contributions: | |
- kinow | ||
editing: | ||
- hexylena | ||
- elichad | ||
testing: | ||
- elichad | ||
funding: | ||
- by-covid | ||
license: Apache-2.0 | ||
|
@@ -32,32 +35,34 @@ This tutorial will show you how to manipulate [RO-Crates](https://w3id.org/ro/cr | |
|
||
> <agenda-title></agenda-title> | ||
> | ||
> In this tutorial, you will learn how to create a git repo, and begin working with it. | ||
> In this tutorial, we will cover: | ||
> | ||
> 1. TOC | ||
> {:toc} | ||
> | ||
{: .agenda} | ||
|
||
|
||
Let's start by installing the library via [pip](https://docs.python.org/3/installing/). Note that the name of the package is `rocrate`. | ||
|
||
```bash | ||
pip install rocrate | ||
``` | ||
|
||
|
||
## Creating an RO-Crate | ||
# Creating an RO-Crate | ||
|
||
In its simplest form, an RO-Crate is a directory tree with an `ro-crate-metadata.json` file at the top level. This file contains metadata about the other files and directories, represented by [data entities](https://www.researchobject.org/ro-crate/1.1/data-entities.html). These metadata consist both of properties of the data entities themselves and of other, non-digital entities called [contextual entities](https://www.researchobject.org/ro-crate/1.1/contextual-entities.html). A contextual entity can represent, for instance, a person, an organization or an event. | ||
|
||
Suppose Alice and Bob worked on a research project together, and then wrote a paper about it; additionally, Alice prepared a spreadsheet containing experimental data, which Bob then used to generate a diagram. For the purpose of this tutorial, you can just create dummy files for the documents: | ||
Suppose Alice and Bob worked on a research project together, and then wrote a paper about it; additionally, Alice prepared a spreadsheet containing experimental data, which Bob then used to generate a diagram. For the purpose of this tutorial, you can just create placeholder files for the documents: | ||
|
||
```bash | ||
mkdir exp | ||
touch exp/paper.pdf | ||
touch exp/results.csv | ||
touch exp/diagram.svg | ||
```python | ||
import os | ||
|
||
data_dir = "exp" | ||
os.mkdir(data_dir) | ||
|
||
for filename in ["paper.pdf", "results.csv", "diagram.svg"]: | ||
with open(os.path.join(data_dir, filename), "w") as file: | ||
pass | ||
``` | ||
|
||
Let's make an RO-Crate to represent this information: | ||
|
@@ -105,15 +110,16 @@ table["author"] = alice | |
diagram["author"] = bob | ||
``` | ||
|
||
You can also add whole directories together with their contents. In RO-Crate, a directory is represented by the `Dataset` entity: | ||
|
||
```bash | ||
mkdir exp/logs | ||
touch exp/logs/log1.txt | ||
touch exp/logs/log2.txt | ||
Comment on lines
-110
to
-113
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think these should also work with |
||
``` | ||
You can also add whole directories together with their contents. In an RO-Crate, a directory is represented by the `Dataset` entity: | ||
|
||
```python | ||
logs_dir = os.path.join(data_dir, "logs") | ||
os.mkdir(logs_dir) | ||
|
||
for filename in ["log1.txt", "log2.txt"]: | ||
with open(os.path.join(logs_dir, filename), "w") as file: | ||
pass | ||
|
||
logs = crate.add_dataset("exp/logs") | ||
``` | ||
|
||
|
@@ -123,15 +129,22 @@ Finally, we serialize the crate to disk: | |
crate.write("exp_crate") | ||
``` | ||
|
||
This should generate an `exp_crate` directory containing copies of all the files we added and an `ro-crate-metadata.json` file containing a JSON-LD representation of the metadata. Note that we have chosen a different destination path for the diagram, while the paper and the spreadsheet have been placed at the top level with their names unchanged (the default). | ||
This should generate an `exp_crate` directory containing copies of all the files we added and an `ro-crate-metadata.json` file containing a [JSON-LD](https://json-ld.org) representation of the metadata. Note that we have chosen a different destination path for the diagram, while the paper and the spreadsheet have been placed at the top level with their names unchanged (the default). | ||
|
||
Some applications and services support RO-Crates stored as archives. To save the crate in zip format, you can use `write_zip`: | ||
|
||
```python | ||
crate.write_zip("exp_crate.zip") | ||
``` | ||
|
||
### Appending elements to property values | ||
> <comment-title>How `rocrate` handles the contents of `exp/logs`</comment-title> | ||
> | ||
> Exploring the `exp_crate` directory, we see that all files and directories contained in `exp/logs` have been added recursively to the crate. However, in the `ro-crate-metadata.json` file, only the top level Dataset with `@id` `"exp/logs"` is listed. This is because we used `crate.add_dataset("exp/logs")` rather than adding every file individually. There is no requirement to represent every file and folder within the crate in the `ro-crate-metadata.json` file - in fact, if there were many files in the crate it would be impractical to do so. | ||
> | ||
> If you do want to add files and directories recursively to the metadata, use `crate.add_tree` instead of `crate.add_dataset` (but note that it only works on local directory trees). | ||
{: .comment} | ||
|
||
## Appending elements to property values | ||
|
||
What ro-crate-py entities actually store is their JSON representation: | ||
|
||
|
@@ -155,7 +168,9 @@ paper.properties() | |
When `paper["author"]` is accessed, a new list containing the `alice` and `bob` entities is generated on the fly. For this reason, calling `append` on `paper["author"]` won't actually modify the `paper` entity in any way. To add an author, use the `append_to` method instead: | ||
|
||
```python | ||
donald = crate.add(Person(crate, "https://en.wikipedia.org/wiki/Donald_Duck")) | ||
donald = crate.add(Person(crate, "https://en.wikipedia.org/wiki/Donald_Duck", properties={ | ||
"name": "Donald Duck" | ||
})) | ||
paper.append_to("author", donald) | ||
``` | ||
|
||
|
@@ -167,7 +182,7 @@ for n in "Mickey_Mouse", "Scrooge_McDuck": | |
donald.append_to("follows", p) | ||
``` | ||
|
||
### Adding remote entities | ||
## Adding remote entities | ||
|
||
Data entities can also be remote: | ||
|
||
|
@@ -188,7 +203,7 @@ If you add `fetch_remote=True` to the `add_file` call, however, the library (whe | |
|
||
Another option that influences the behavior when dealing with remote entities is `validate_url`, also `False` by default: if it's set to `True`, when the crate is serialized, the library will try to open the URL to add / update metadata such as the content's length and format. | ||
|
||
### Adding entities with an arbitrary type | ||
## Adding entities with an arbitrary type | ||
|
||
An entity can be of any type listed in the [RO-Crate context](https://www.researchobject.org/ro-crate/1.1/context.jsonld). However, only a few of them have a counterpart (e.g., `File`) in the library's class hierarchy, either because they are very common or because they are associated with specific functionality that can be conveniently embedded in the class implementation. In other cases, you can explicitly pass the type via the `properties` argument: | ||
|
||
|
@@ -210,7 +225,7 @@ Note that entities can have multiple types, e.g.: | |
"@type" = ["File", "SoftwareSourceCode"] | ||
``` | ||
|
||
## Consuming an RO-Crate | ||
# Consuming an RO-Crate | ||
|
||
An existing RO-Crate package can be loaded from a directory or zip file: | ||
|
||
|
@@ -221,8 +236,8 @@ for e in crate.get_entities(): | |
``` | ||
|
||
``` | ||
ro-crate-metadata.json CreativeWork | ||
./ Dataset | ||
ro-crate-metadata.json CreativeWork | ||
paper.pdf File | ||
results.csv File | ||
images/figure.svg File | ||
|
@@ -231,7 +246,7 @@ https://orcid.org/0000-0000-0000-0001 Person | |
... | ||
``` | ||
|
||
The first two entities shown in the output are the [metadata file descriptor](https://www.researchobject.org/ro-crate/1.1/metadata.html) and the [root data entity](https://www.researchobject.org/ro-crate/1.1/root-data-entity.html), respectively. The former represents the metadata file, while the latter represents the whole crate. These are special entities managed by the `ROCrate` object, and are always present. The other entities are the ones we added in the [section on RO-Crate creation](#creating-an-ro-crate). As shown above, `get_entities` allows to iterate over all entities in the crate. You can also access only data entities with `crate.data_entities` and only contextual entities with `crate.contextual_entities`. For instance: | ||
The first two entities shown in the output are the [root data entity](https://www.researchobject.org/ro-crate/1.1/root-data-entity.html) and the [metadata file descriptor](https://www.researchobject.org/ro-crate/1.1/metadata.html), respectively. The former represents the whole crate, while the latter represents the metadata file. These are special entities managed by the `ROCrate` object, and are always present. The other entities are the ones we added in the [section on RO-Crate creation](#creating-an-ro-crate). As shown above, `get_entities` allows to iterate over all entities in the crate. You can also access only data entities with `crate.data_entities` and only contextual entities with `crate.contextual_entities`. For instance: | ||
|
||
```python | ||
for e in crate.data_entities: | ||
|
@@ -256,8 +271,11 @@ You can fetch an entity by its `@id` as follows: | |
article = crate.dereference("paper.pdf") # or crate.get("paper.pdf") | ||
``` | ||
|
||
# Command Line Interface | ||
|
||
## Command Line Interface | ||
> <comment-title>Jupyter Notebook users: switch to a terminal</comment-title> | ||
> The code cells in this section use Unix shell commands, which can't be run within a notebook. Open a Unix/Linux terminal to follow along. | ||
{: .comment} | ||
|
||
`ro-crate-py` includes a hierarchical command line interface: the `rocrate` tool. `rocrate` is the top-level command, while specific functionalities are provided via sub-commands. Currently, the tool allows to initialize a directory tree as an RO-Crate (`rocrate init`) and to modify the metadata of an existing RO-Crate (`rocrate add`). | ||
Comment on lines
-260
to
280
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure. One alternative would be to add leading "!" to the commands, but then it will read too jupyter-specific. Would the split involve spreading the tutorial over two separate pages? That would make it a bit harder to follow. |
||
|
||
|
@@ -274,7 +292,7 @@ Commands: | |
write-zip | ||
``` | ||
|
||
### Crate initialization | ||
## Crate initialization | ||
|
||
The `rocrate init` command explores a directory tree and generates an RO-Crate metadata file (`ro-crate-metadata.json`) listing all files and directories as `File` and `Dataset` entities, respectively. | ||
|
||
|
@@ -291,9 +309,9 @@ Options: | |
|
||
The command acts on the current directory, unless the `-c` option is specified. The metadata file is added (overwritten if present) to the directory at the top level, turning it into an RO-Crate. | ||
|
||
### Adding items to the crate | ||
## Adding items to the crate | ||
|
||
The `rocrate add` command allows to add workflows and other entity types (currently [testing-related metadata](https://crs4.github.io/life_monitor/workflow_testing_ro_crate)) to an RO-Crate: | ||
The `rocrate add` command allows to add files, datasets (directories), workflows, and other entity types (currently [testing-related metadata](https://crs4.github.io/life_monitor/workflow_testing_ro_crate)) to an RO-Crate: | ||
|
||
```console | ||
$ rocrate add --help | ||
|
@@ -303,6 +321,8 @@ Options: | |
--help Show this message and exit. | ||
|
||
Commands: | ||
dataset | ||
file | ||
test-definition | ||
test-instance | ||
test-suite | ||
|
@@ -311,7 +331,7 @@ Commands: | |
|
||
Note that data entities (e.g., workflows) must already be present in the directory tree: the effect of the command is to register them in the metadata file. | ||
|
||
### Example | ||
## Example | ||
|
||
To run the following commands, we need a copy of the ro-crate-py repository: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
given that this test case hasn't changed in four years, ok, that's reasonable. If it'd been modified more recently I'd say let's stick with the github clone just to ensure it stays fresh and doesn't need to be updated separately. Zenodo is the right choice, yes. Please create a record, add @simleo's as a contributor to it (his information is in contributors.yaml of course), and add it to the GTN collection if you could! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think a separate Zenodo record is a bit overkill, and it would need to be kept in sync with the documentation. That sounds risky since there would be only one person able to update the Zenodo record, while changes to the GitHub repo can be made through pull requests. This is why I'd rather stick with the git clone. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can't figure out a good waya to clone just the subdir with a sparse checkout, so, could be fine to clone everything. We could restrict it with |
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add linting for this, good change.