Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to RO-Crate in Python tutorial #4850

Merged
merged 9 commits into from
Mar 20, 2024

Conversation

elichad
Copy link
Contributor

@elichad elichad commented Mar 19, 2024

Hello again GTN team, a new job has brought me back :)

Suggesting these changes after following the tutorial myself:

  • updated output of get_entities() because ./ Dataset always appears first
  • added "name" property to Donald Duck because if you re-export the crate and then continue with the tutorial, the name can appear as None when you print authors in a later section causing confusion
  • switched bash commands to Python where possible so it's easier to follow along with the auto-generated notebook
  • added explanation of what happens when including the exp/logs dataset in the crate
  • added extra CLI commands

One question:
The command-line interface section has the student clone an entire repository just to get one folder of test data to run rocrate on. I think it'd be better to host this test data in another repository by itself or as a downloadable zip file, which would simplify the commands needed - I'm happy to update the tutorial accordingly but what's the best approach (or most common in the GTN) to storing the data? Zenodo?

Requesting feedback from @simleo as a maintainer of the rocrate package

@bgruening
Copy link
Member

Hello again GTN team, a new job has brought me back :)

I saw you recently on some registration list and was like --- "What can this be, is Eli back?" --- so happy to see you again! :)

Copy link
Member

@hexylena hexylena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome back @elichad! These are good improvements, thanks for making them. I've pinged the contributors in slack as well.


In its simplest form, an RO-Crate is a directory tree with an `ro-crate-metadata.json` file at the top level. This file contains metadata about the other files and directories, represented by [data entities](https://www.researchobject.org/ro-crate/1.1/data-entities.html). These metadata consist both of properties of the data entities themselves and of other, non-digital entities called [contextual entities](https://www.researchobject.org/ro-crate/1.1/contextual-entities.html). A contextual entity can represent, for instance, a person, an organization or an event.

Suppose Alice and Bob worked on a research project together, and then wrote a paper about it; additionally, Alice prepared a spreadsheet containing experimental data, which Bob then used to generate a diagram. For the purpose of this tutorial, you can just create dummy files for the documents:
Suppose Alice and Bob worked on a research project together, and then wrote a paper about it; additionally, Alice prepared a spreadsheet containing experimental data, which Bob then used to generate a diagram. For the purpose of this tutorial, you can just create placeholder files for the documents:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add linting for this, good change.

Comment on lines -110 to -113
```bash
mkdir exp/logs
touch exp/logs/log1.txt
touch exp/logs/log2.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should also work with ! prefixes, but, the python version is more pythonic

Comment on lines -260 to 280
## Command Line Interface
> <comment-title>Jupyter Notebook users: switch to a terminal</comment-title>
> The code cells in this section use Unix shell commands, which can't be run within a notebook. Open a Unix/Linux terminal to follow along.
{: .comment}

`ro-crate-py` includes a hierarchical command line interface: the `rocrate` tool. `rocrate` is the top-level command, while specific functionalities are provided via sub-commands. Currently, the tool allows to initialize a directory tree as an RO-Crate (`rocrate init`) and to modify the metadata of an existing RO-Crate (`rocrate add`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider splitting the tutorial in two here, and then use a bash notebook for the second half with the cli. @simleo @stain what do y'all think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. One alternative would be to add leading "!" to the commands, but then it will read too jupyter-specific. Would the split involve spreading the tutorial over two separate pages? That would make it a bit harder to follow.

@@ -311,7 +331,7 @@ Commands:

Note that data entities (e.g., workflows) must already be present in the directory tree: the effect of the command is to register them in the metadata file.

### Example
## Example

To run the following commands, we need a copy of the ro-crate-py repository:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command-line interface section has the student clone an entire repository just to get one folder of test data to run rocrate on. I think it'd be better to host this test data in another repository by itself or as a downloadable zip file, which would simplify the commands needed - I'm happy to update the tutorial accordingly but what's the best approach (or most common in the GTN) to storing the data? Zenodo?

given that this test case hasn't changed in four years, ok, that's reasonable. If it'd been modified more recently I'd say let's stick with the github clone just to ensure it stays fresh and doesn't need to be updated separately.

Zenodo is the right choice, yes. Please create a record, add @simleo's as a contributor to it (his information is in contributors.yaml of course), and add it to the GTN collection if you could!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a separate Zenodo record is a bit overkill, and it would need to be kept in sync with the documentation. That sounds risky since there would be only one person able to update the Zenodo record, while changes to the GitHub repo can be made through pull requests. This is why I'd rather stick with the git clone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't figure out a good waya to clone just the subdir with a sparse checkout, so, could be fine to clone everything.

We could restrict it with --depth=1 to make it a bit smaller.

@simleo
Copy link
Collaborator

simleo commented Mar 20, 2024

Thanks for this update, Eli. I'm a bit taken aback because I never thought of this tutorial to be something "alive" that would need updating, but only a training component written for the 2023 Smorgasbord. If the intent is to keep it in sync with the development of ro-crate-py, there's the matter of avoiding duplication / out-of-sync with the ro-crate-py docs (which are in the repository's README.md), as noted by Eli in this issue. Eli's proposed solution (see the issue linked above) is to keep updating this tutorial and make it the only documentation source. We could do that, though I have a few doubts. E.g. is there a way to see the rendering of a PR such as this one before it's merged? Is the language just GitHub-flavored markdown or something more specialized to the GTN?

@hexylena
Copy link
Member

hexylena commented Mar 20, 2024

If the intent is to keep it in sync with the development of ro-crate-py,

I think it could be nice? The GTN is a very FAIR training platform, it could be a good way to ensure that it's seen by a number of people.

It's not necessarily that we want to 'take over' your documentation, since you're obviously in the best place to maintain it long term, and this would mean using a different platform and changes going through peer-review which I recognise is an added burden for you @simleo.

I think there are some benefits to having it in the GTN but I can recognise those aren't necessarily aligned with your primary goals of delivering RO-Crate's python library, so, at least from our side, thanks to contributors like @elichad, and potentially others in the future, we can consider also maintaining parallel versions, as an additional place to advertise this really nice library.

Eli's proposed solution (see the issue linked above) is to keep updating this tutorial and make it the only documentation source. We could do that, though I have a few doubts. E.g. is there a way to see the rendering of a PR such as this one before it's merged?

Yes!

  • if you just want a basic rendering of markdown there is the github view.
  • If you wish to see a fully rendered preview, it's absolutely possible. (this can also be done locally, but many part-time contributors find gitpod easier.)

The fully rendered preview is potentially more important in this case since this tutorial uses one of the advanced GTN features of converting a markdown document directly into a Jupyter notebook which users can use if they're not following along in their own terminal.

Is the language just GitHub-flavored markdown or something more specialized to the GTN?

It is GFM, you can use all the familiar markdown constructs

But there are some jekyll specific, and GTN-specific additions on top.

Jekyll templates out variables such as {{ site.baseurl }} for us, GTN specific additions include features like tip boxes which are written with standard markdown blockquotes plus a css class, plus a GTN specific title <tip-title>..</tip-title> feature.

Copy link
Collaborator

@simleo simleo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I managed to render the updated page using GitPod, thanks @hexylena! The changes look good to me.

@hexylena
Copy link
Member

I've given you merge permissions @simleo for the future!

@hexylena hexylena merged commit 06dd849 into galaxyproject:main Mar 20, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants