Added the two datasets from the surf publication (#195) #199

Merged: 18 commits, Oct 7, 2024

bdeadman
Collaborator

Borylation and Minisci datasets from the SURF publication (ChemRxiv, 2024, DOI: 10.26434/chemrxiv-2023-nfq7h-v2). These are reactions which have been collected from the literature and summarised in SURF format by @alexarnimueller.

The Jupyter Notebook used to convert the datasets is located at bdeadman/surf/surf2ord_troubleshooting.ipynb. The surf2ord.py script has been modified to output data into the latest ord-schema version and preferred style.

Notes:

  • Provenance data was not found in the Minisci dataset so this is assumed to also be @alexarnimueller
  • In the borylation dataset several rows had catalyst_1 only defined by the CAS number. In all but 1 of these I have found a SMILES string to approximate the catalyst.
  • surf2ord now assigns each reagent/catalyst/reactant/solvent to a separate input instead of collecting them together by role. Inputs with multiple components would be used when they are known to be added as a solution or mixture.
  • rxn_type has been converted to REACTION_TYPE instead of NAME (makes it compatible with ord-schema >0.3.38)
  • CAS numbers have been converted to the CAS_NUMBER type instead of NAME
  • Isolated analysis type in SURF has been defined as WEIGHT analysis type in ORD.
  • dataset_name and dataset_description options added to the surf2ord function. This ensures the output dataset passes validations in ord-schema >0.3.86. Placeholder text is included by default so the text can be edited afterwards.
  • Fixed the code so it no longer multiplies fractional yields by 100 * 100.
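
The yield fix in the last note can be illustrated with a minimal sketch. This is not the actual surf2ord.py code: `yield_to_percentage` and its `is_fraction` flag are hypothetical names, and the sketch assumes the caller already knows whether the source column holds fractions (0 to 1) or percentages, so that the scaling is applied exactly once.

```python
def yield_to_percentage(value: float, is_fraction: bool) -> float:
    """Return a yield as a percentage, scaling a fractional yield exactly once.

    Hypothetical helper for illustration; not part of surf2ord or ord-schema.
    """
    if value < 0:
        raise ValueError(f"yield cannot be negative: {value}")
    # Multiply by 100 only when the input is a fraction; a value that is
    # already a percentage is passed through unchanged (no second scaling).
    return value * 100.0 if is_fraction else value


half = yield_to_percentage(0.5, is_fraction=True)
already_percent = yield_to_percentage(137.0, is_fraction=False)
```

An explicit flag is used instead of guessing from the magnitude, because a value such as 1.0 could mean either 1% or 100%.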

* Added the two datasets from the surf publication

Borylation dataset and minisci dataset.

* Create test_file

Adding the file to make a commit. It will be subsequently removed in a new commit. I think this will reset the check target branch test.

* Delete data/test_file

Removing this file because it has served its purpose. The check target branch test was reset to see the correct target branch.

Change summary:

Filename Added Removed Changed
data/borylation_ord.pbtxt 0 0 0
data/minisci_ord.pbtxt 0 0 0
0 0 0


Change summary:

Filename Added Removed Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz 1111 0 0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz 130 0 0
1241 0 0

@bdeadman
Collaborator Author

@connorcoley @skearnes I was able to pull this through into open-reaction-database/ord-data:#195 without getting approval. It should probably be checked at this stage before it is approved to go into main.

@skearnes skearnes closed this Aug 14, 2024
@skearnes skearnes reopened this Aug 14, 2024
@skearnes
Contributor

> @connorcoley @skearnes I was able to pull this through into open-reaction-database/ord-data:#195 without getting approval. It should probably be checked at this stage before it is approved to go into main.

Yes, that's expected; we don't protect any branches except for main by requiring approvals.


Change summary:

Filename Added Removed Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz 1111 0 0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz 130 0 0
1241 0 0

@skearnes
Contributor

@qai222 do you have time to take a look at these for correctness?


Change summary:

Filename Added Removed Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz 1111 0 0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz 130 0 0
1241 0 0

@skearnes skearnes closed this Aug 14, 2024
@skearnes skearnes reopened this Aug 14, 2024

Change summary:

Filename Added Removed Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz 1111 0 0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz 130 0 0
1241 0 0

Collaborator

@qai222 qai222 left a comment

I reviewed the pbtxt files from #195, specifically borylation_ord.pbtxt and minisci_ord.pbtxt.

  1. not sure why the analysis type for yield is isolated

        analyses {
          key: "product_1_isolated"
          value {
            type: CUSTOM
            details: "isolated"
          }

  2. some reactions have yield higher than 100

        identifiers {
          type: CUSTOM
          details: "rxn_id from SURF table"
          value: "lit_pub_bo_9"
        }
        ...
        measurements {
          analysis_key: "product_1_GC"
          type: YIELD
          percentage {
            value: 137.0
          }
        }

@bdeadman
Collaborator Author

For item 2, this is how it is recorded in the SURF table file. Since it is a GC yield I suspect it is just a calibration error. While not ideal, I think we need to report it as it is written. There are already >100% yields in the ORD from the USPTO data.

This particular reaction has come from this paper: https://pubs.acs.org/doi/10.1021/acscatal.0c00152. Unfortunately the rxn does not appear in the SI, and I don't have access to the paper.

@qai222
Collaborator

qai222 commented Aug 21, 2024

> For item 2, this is how it is recorded in the SURF table file. Since it is a GC yield I suspect it is just a calibration error. While not ideal, I think we need to report it as it is written. There are already >100% yields in the ORD from the USPTO data.
>
> This particular reaction has come from this paper: https://pubs.acs.org/doi/10.1021/acscatal.0c00152. Unfortunately the rxn does not appear in the SI, and I don't have access to the paper.

Yeah, the paper says "Yields were determined by gas chromatography and are based on moles of B2pin2." in the Table 2 caption. I agree we should report it as it is.
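
The "report as written" decision can still be paired with a non-destructive check. The following is a minimal sketch under stated assumptions: `find_over_100_yields` is a hypothetical helper, the input is assumed to be plain (reaction id, yield percentage) pairs already extracted from a dataset, and `example_rxn_a` is an invented record used only for contrast. It lists yields above 100% for reviewers without modifying any value.

```python
from typing import Iterable, List, Tuple


def find_over_100_yields(
    records: Iterable[Tuple[str, float]],
) -> List[Tuple[str, float]]:
    """Return the (reaction_id, yield_percent) pairs whose yield exceeds 100%.

    Hypothetical helper for illustration. The values are only reported,
    never clipped or corrected, so the source data stays as written.
    """
    return [(rxn_id, pct) for rxn_id, pct in records if pct > 100.0]


# "lit_pub_bo_9" and 137.0 come from the review comment above;
# "example_rxn_a" is an invented record.
records = [("lit_pub_bo_9", 137.0), ("example_rxn_a", 92.0)]
flagged = find_over_100_yields(records)
```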


Change summary:

Filename Added Removed Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz 1111 0 0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pbtxt 0 0 0
1111 0 0


Change summary:

Filename Added Removed Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz 1111 0 0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz 130 0 0
data/fe/ord_dataset-feaf1b793c6d408aaec1cac7cc3ceadc.pb.gz 130 0 0
1371 0 0


Change summary:

Filename Added Removed Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz 1111 0 0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz 130 0 0
1241 0 0

@bdeadman
Collaborator Author

@qai222 I fixed the surf2ord program to correctly label "isolated" yield types as the WEIGHT measurement type enum in ORD. The problem was some rogue capitalization in their code.
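
A capitalization problem of this kind is commonly avoided by normalizing the label before the lookup. The sketch below is an illustration only, not the actual surf2ord fix: `surf_analysis_to_ord` and the mapping table are hypothetical, and the ORD analysis types are represented as plain strings (`WEIGHT`, `GC`, `CUSTOM`, the names that appear in this thread) rather than ord-schema enum values.

```python
# Hypothetical mapping from lowercase SURF analysis labels to ORD analysis
# type names; only the labels discussed in this thread are included.
ANALYSIS_TYPE_BY_SURF_LABEL = {
    "isolated": "WEIGHT",
    "gc": "GC",
}


def surf_analysis_to_ord(label: str) -> str:
    """Map a SURF analysis label to an ORD analysis type name.

    The label is stripped and lowercased first, so "Isolated", "ISOLATED"
    and "isolated" all resolve to the same entry. Unknown labels fall back
    to "CUSTOM".
    """
    key = label.strip().lower()
    return ANALYSIS_TYPE_BY_SURF_LABEL.get(key, "CUSTOM")


isolated_type = surf_analysis_to_ord("Isolated")
```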

As discussed above, the >100% yields in point 2 are left as they were reported.

I made a bit of a mess when replacing the minisci dataset file, but it looks like it has been resolved now, and the PR only shows the 2 datasets and the reaction count is correct. I downloaded them to confirm they are the correct versions.

@bdeadman
Collaborator Author

bdeadman commented Oct 3, 2024

@skearnes these two have been peer-reviewed by @qai222, and now need the PR to be approved by a reviewer with write access.

@skearnes skearnes closed this Oct 7, 2024
@skearnes skearnes reopened this Oct 7, 2024

github-actions bot commented Oct 7, 2024

Change summary:

Filename Added Removed Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz 1111 0 0
data/fe/ord_dataset-feaf1b793c6d408aaec1cac7cc3ceadc.pb.gz 130 0 0
1241 0 0


github-actions bot commented Oct 7, 2024

Change summary:

Filename Added Removed Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz 1111 0 0
data/fe/ord_dataset-feaf1b793c6d408aaec1cac7cc3ceadc.pb.gz 130 0 0
1241 0 0

@skearnes skearnes closed this Oct 7, 2024
@skearnes skearnes reopened this Oct 7, 2024
@skearnes skearnes enabled auto-merge (squash) October 7, 2024 14:51

github-actions bot commented Oct 7, 2024

Change summary:

Filename Added Removed Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz 1111 0 0
data/fe/ord_dataset-feaf1b793c6d408aaec1cac7cc3ceadc.pb.gz 130 0 0
1241 0 0

@skearnes skearnes merged commit e453754 into main Oct 7, 2024
4 checks passed
@skearnes skearnes deleted the #195 branch October 7, 2024 14:52