-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements to RO-Crate in Python tutorial #4850
Improvements to RO-Crate in Python tutorial #4850
Conversation
I saw you recently on some registration list and was like --- "What can this be, is Eli back?" --- so happy to see you again! :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Welcome back @elichad! These are good improvements, thanks for making them. I've pinged the contributors in slack as well.
|
||
In its simplest form, an RO-Crate is a directory tree with an `ro-crate-metadata.json` file at the top level. This file contains metadata about the other files and directories, represented by [data entities](https://www.researchobject.org/ro-crate/1.1/data-entities.html). These metadata consist both of properties of the data entities themselves and of other, non-digital entities called [contextual entities](https://www.researchobject.org/ro-crate/1.1/contextual-entities.html). A contextual entity can represent, for instance, a person, an organization or an event. | ||
|
||
Suppose Alice and Bob worked on a research project together, and then wrote a paper about it; additionally, Alice prepared a spreadsheet containing experimental data, which Bob then used to generate a diagram. For the purpose of this tutorial, you can just create dummy files for the documents: | ||
Suppose Alice and Bob worked on a research project together, and then wrote a paper about it; additionally, Alice prepared a spreadsheet containing experimental data, which Bob then used to generate a diagram. For the purpose of this tutorial, you can just create placeholder files for the documents: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add linting for this, good change.
```bash | ||
mkdir exp/logs | ||
touch exp/logs/log1.txt | ||
touch exp/logs/log2.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these should also work with !
prefixes, but, the python version is more pythonic
## Command Line Interface | ||
> <comment-title>Jupyter Notebook users: switch to a terminal</comment-title> | ||
> The code cells in this section use Unix shell commands, which can't be run within a notebook. Open a Unix/Linux terminal to follow along. | ||
{: .comment} | ||
|
||
`ro-crate-py` includes a hierarchical command line interface: the `rocrate` tool. `rocrate` is the top-level command, while specific functionalities are provided via sub-commands. Currently, the tool allows to initialize a directory tree as an RO-Crate (`rocrate init`) and to modify the metadata of an existing RO-Crate (`rocrate add`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. One alternative would be to add leading "!" to the commands, but then it will read too jupyter-specific. Would the split involve spreading the tutorial over two separate pages? That would make it a bit harder to follow.
@@ -311,7 +331,7 @@ Commands: | |||
|
|||
Note that data entities (e.g., workflows) must already be present in the directory tree: the effect of the command is to register them in the metadata file. | |||
|
|||
### Example | |||
## Example | |||
|
|||
To run the following commands, we need a copy of the ro-crate-py repository: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command-line interface section has the student clone an entire repository just to get one folder of test data to run rocrate on. I think it'd be better to host this test data in another repository by itself or as a downloadable zip file, which would simplify the commands needed - I'm happy to update the tutorial accordingly but what's the best approach (or most common in the GTN) to storing the data? Zenodo?
given that this test case hasn't changed in four years, ok, that's reasonable. If it'd been modified more recently I'd say let's stick with the github clone just to ensure it stays fresh and doesn't need to be updated separately.
Zenodo is the right choice, yes. Please create a record, add @simleo's as a contributor to it (his information is in contributors.yaml of course), and add it to the GTN collection if you could!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a separate Zenodo record is a bit overkill, and it would need to be kept in sync with the documentation. That sounds risky since there would be only one person able to update the Zenodo record, while changes to the GitHub repo can be made through pull requests. This is why I'd rather stick with the git clone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't figure out a good waya to clone just the subdir with a sparse checkout, so, could be fine to clone everything.
We could restrict it with --depth=1
to make it a bit smaller.
Thanks for this update, Eli. I'm a bit taken aback because I never thought of this tutorial to be something "alive" that would need updating, but only a training component written for the 2023 Smorgasbord. If the intent is to keep it in sync with the development of ro-crate-py, there's the matter of avoiding duplication / out-of-sync with the ro-crate-py docs (which are in the repository's README.md), as noted by Eli in this issue. Eli's proposed solution (see the issue linked above) is to keep updating this tutorial and make it the only documentation source. We could do that, though I have a few doubts. E.g. is there a way to see the rendering of a PR such as this one before it's merged? Is the language just GitHub-flavored markdown or something more specialized to the GTN? |
I think it could be nice? The GTN is a very FAIR training platform, it could be a good way to ensure that it's seen by a number of people. It's not necessarily that we want to 'take over' your documentation, since you're obviously in the best place to maintain it long term, and this would mean using a different platform and changes going through peer-review which I recognise is an added burden for you @simleo. I think there are some benefits to having it in the GTN but I can recognise those aren't necessarily aligned with your primary goals of delivering RO-Crate's python library, so, at least from our side, thanks to contributors like @elichad, and potentially others in the future, we can consider also maintaining parallel versions, as an additional place to advertise this really nice library.
Yes!
The fully rendered preview is potentially more important in this case since this tutorial uses one of the advanced GTN features of converting a markdown document directly into a Jupyter notebook which users can use if they're not following along in their own terminal.
It is GFM, you can use all the familiar markdown constructs But there are some jekyll specific, and GTN-specific additions on top. Jekyll templates out variables such as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I managed to render the updated page using GitPod, thanks @hexylena! The changes look good to me.
I've given you merge permissions @simleo for the future! |
Hello again GTN team, a new job has brought me back :)
Suggesting these changes after following the tutorial myself:
get_entities()
because./ Dataset
always appears firstNone
when you print authors in a later section causing confusionexp/logs
dataset in the crateOne question:
The command-line interface section has the student clone an entire repository just to get one folder of test data to run
rocrate
on. I think it'd be better to host this test data in another repository by itself or as a downloadable zip file, which would simplify the commands needed - I'm happy to update the tutorial accordingly but what's the best approach (or most common in the GTN) to storing the data? Zenodo?Requesting feedback from @simleo as a maintainer of the rocrate package