Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

final merge: Offline SBOL3 <-> GenBank Converter; Integration with previous online Converter #186

Merged
merged 143 commits into from
Apr 11, 2023

Conversation

mohitdmak
Copy link
Collaborator

@mohitdmak mohitdmak commented Oct 19, 2022

About the Project:

⇒ Description:

  • Until now, conversion between the file formats of GenBank and SBOL3 used to happen using an online converter, which was called by the SBOL-utilities package.
  • This was not an ideal approach for the conversion mostly due to 3 reasons:
    1. The online converter's hosting service could have downtime, leading to outages and unavailability of the conversion to and from GenBank and SBOL3, and moreover, users could not infer the reasons for this when it happened.
    2. The online converter didn't completely convert all aspects of the GenBank file (including those properties which could not be directly stored in SBOL3, but could be useful in a round trip to restore back into GenBank). The new converter handles this really well and is able to store all properties and aspects of the GenBank which do not have a corresponding SBOL3 field.
    3. The online converter used to convert GenBank files to SBOL2 and then SBOL-utilities used to locally convert the SBOL2 to SBOL3, this was not ideal too, and the new offline converter directly supports conversion between GenBank and SBOL3.
  • Thus, this feature addition would solve all the above issues, and support direct, offline conversion between SBOL3 and GenBank file formats.

⇒ Project Repository:
https://github.com/SynBioDex/SBOL-utilities

⇒ Project Proposal:
[google-summer-of-code-2022-proposal.pdf](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/bfc495bd-2261-4700-bf4c-82fe3770af02/google-summer-of-code-2022-proposal.pdf)

⇒ Project Profile:
[Google Summer of Code](https://summerofcode.withgoogle.com/programs/2022/projects/IvKQhkk9)

⇒ Project Description:
nrnb/GoogleSummerOfCode#183

⇒ Constituent Pull Requests:
https://github.com/SynBioDex/SBOL-utilities/pulls?q=label%3Asbol3-genbank+

Project Blog:

  • Detailed notes about design decisions, issue priorities, future prospects pertaining to each relevant PR have been documented in my blog.
  • It also includes timelines and dependencies for each PR.
  • NOTE: This blog would be of great help to contributors who are willing to understand, in depth, the progress of the whole project from ideation to implementation. You are welcome to approach me for any feedback, query, or request pertaining to this project through the blog or through mail.
  • Here is the link to the blog: https://mohitdmak.notion.site/GSoC-22-Conversion-between-SBOL3-and-GenBank-0c2efaf26fde493ab7914581322458e0

Credits:

  • I was able to bring this project to a successful completion under the guidance of my mentors for the project:

mohitdmak and others added 30 commits June 17, 2022 21:29
Co-authored-by: Tom Mitchell <tcmitchell@users.noreply.github.com>
add: initial simple conversion gb/sbol3 with sample files
Co-authored-by: Tom Mitchell <tcmitchell@users.noreply.github.com>
add: genbank conversion support for multiple records and features
Copy link
Contributor

@jakebeal jakebeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 0-based range error needs to be corrected.

@mohitdmak
Copy link
Collaborator Author

mohitdmak commented Oct 23, 2022

The 0-based range error needs to be corrected.

This turned out to be an issue with BioPython, as it parsed the start Locations of GenBank files based on 0-indexing instead of 1-indexing. I have created a PR for the fix, where we convert the start value to 1-indexed when converting from GenBank to SBOL3 and 0-indexed for the opposite direction.

@jakebeal
Copy link
Contributor

jakebeal commented Mar 7, 2023

It appears that something changed between biopython 1.79 and 1.81 that is causing the tests to fail. This will need investigation.

SeqFeature is no longer hashable as of BioPython 1.80. Try an alternate
implementation that avoids using SeqFeature instances as keys in a dict.
Store the loc_positions on a new property of SeqFeature. Akin to monkey
patching.
@tcmitchell
Copy link
Collaborator

Hey @jakebeal can this one be merged now? The tests are all passing after fixing the BioPython Hashable change.

…s be 1) add a non-standard sbol#locationPosition type, and 2) reference to an unrecognized Location from a SequenceFeature.

Temporarily address this by reverting to the prior behavior of truncating fuzzy ranges into ranges.
@jakebeal
Copy link
Contributor

I am working through some errors that come up in its usage with the iGEM workflow. Once all of those are either resolved or sufficiently identified to defer into other tickets, I will merge this.

jakebeal and others added 8 commits March 27, 2023 11:14
…vel.

The CustomReferenceProperty and CustomStructuredCommentProperty classes in the direct Genbank conversion were set up as TopLevel classes when they should have been Identified objects.
This caused a great deal of problems, since their identities were still those of Identified objects.
I have also changed some incorrectly plural properties to singular names, following the standard SBOL convention (e.g., the "accessions" variable is listed as a set of individual "accession" properties)

After this commit, I will follow with others that clean up now-unused code and style issues.
…mTopLevel classes into behaving like CustomIdentified
In the direct SBOL3->GenBank converter, there were a number of fields only filled from saved data from GenBank->SBOL conversions
Change these to default to taking information from SBOL as well:
- 'molecule_type' annotation defaults based on Component.types (SBO material)
- 'topology' annotation defaults based on Component.types (SO topology)
- 'sequence_version' defaults to 1
- 'label' qualifier for features defaults to name

Likewise, do not attempt to use missing GenBank information if it is not present.
This affects annotations 'date', 'data_file_division', 'molecule_type', 'organism', 'source', 'topology', 'accessions', and 'sequence_version', and feature qualifier 'label'

Other improvements:
- tolerate deprecated values for orientations.
- search multiple roles in component to find SO role for conversion
- workaround patch for tyto URI normalization

Also update tests:
- Confirm that version>1 round-trips correctly
- Add an sbol#inline value for an orientation
- Provide diffs when GenBank conversions fail
- More comprehensive test of ignoring non-converting SBOL properties
- Version should always be set
- No auto-generated names in converted feature (displayID is sufficient, and generating names just confuses round-tripping)

Finally, style was cleaned up in various places
@jakebeal
Copy link
Contributor

jakebeal commented Apr 9, 2023

Once PR #203 and PR #206 are merged, this branch will be ready to merge.

There is still one major outstanding issue that will need to be addressed, which is the "flattening" of hierarchical Component objects to include the features of their SubComponents. I have filed that as #207, and it can be addressed separately from this PR

@jakebeal jakebeal merged commit 2634729 into develop Apr 11, 2023
@jakebeal jakebeal deleted the feature/sbol3-genbank-conversion branch April 11, 2023 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request sbol3-genbank sbol3 / genbank conversion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants