Added the two datasets from the surf publication (#195) #199

bdeadman · 2024-08-13T21:31:39Z

Borylation and Minisci datasets from the SURF publication (ChemRxiv, 2024, DOI: 10.26434/chemrxiv-2023-nfq7h-v2). These are reactions that have been collected from the literature and summarised in SURF format by @alexarnimueller.

The Jupyter Notebook used to convert the datasets is located at bdeadman/surf/surf2ord_troubleshooting.ipynb. The surf2ord.py script has been modified to output data into the latest ord-schema version and preferred style.

Notes:

* Provenance data was not found in the Minisci dataset, so the provider is assumed to also be @alexarnimueller.
* In the borylation dataset, several rows had catalyst_1 defined only by a CAS number. For all but one of these I have found a SMILES string to approximate the catalyst.
* surf2ord now assigns each reagent/catalyst/reactant/solvent to a separate input instead of collecting them together by role. Inputs with multiple components would be used when the components are known to be added as a solution or mixture.
* rxn_type is now recorded as the REACTION_TYPE identifier type instead of NAME (this makes it compatible with ord-schema >0.3.38).
* CAS numbers are now recorded as the CAS_NUMBER identifier type instead of NAME.
* The "isolated" analysis type in SURF is defined as the WEIGHT analysis type in ORD.
* dataset_name and dataset_description options have been added to the surf2ord function. This ensures the output dataset passes validation in ord-schema >0.3.86. Placeholder text is included by default so that it can be edited afterwards.
* Fixed the code so it no longer multiplies fractional yields by 100 * 100.
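
To illustrate two of the notes above (one input per component, and converting a fractional yield to a percentage only once), here is a minimal sketch in plain Python. It is not the surf2ord code: the function names, the dictionary layout, and the `<role>_<n>` key scheme are assumptions made for illustration, and no ord-schema calls are used.

```python
from typing import Dict, List, Tuple


def build_inputs(components: List[Tuple[str, str]]) -> Dict[str, dict]:
    """Give every component its own input, keyed as <role>_<n>.

    `components` is a list of (role, smiles) pairs. This layout is a
    hypothetical stand-in for a SURF row, not the real table format.
    """
    inputs: Dict[str, dict] = {}
    counts: Dict[str, int] = {}
    for role, smiles in components:
        counts[role] = counts.get(role, 0) + 1
        # One input per component, rather than one input per role.
        inputs[f"{role}_{counts[role]}"] = {
            "reaction_role": role.upper(),
            "smiles": smiles,
        }
    return inputs


def fraction_to_percent(fraction: float) -> float:
    """Convert a fractional yield (e.g. 0.85) to a percentage exactly once."""
    # Rounding only guards against floating-point noise from the product.
    return round(fraction * 100.0, 6)


if __name__ == "__main__":
    row = [("solvent", "C1CCOC1"), ("catalyst", "CC"), ("solvent", "CO")]
    print(list(build_inputs(row)))
    print(fraction_to_percent(0.85))
```

Keeping the conversion in a single helper makes it harder to apply the factor of 100 twice by accident.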

* Added the two datasets from the surf publication
  Borylation dataset and minisci dataset.
* Create test_file
  Adding the file to make a commit. It will be subsequently removed in a new commit. I think this will reset the check target branch test.
* Delete data/test_file
  Removing this file because it has served its purpose. The check target branch test was reset to see the correct target branch.

github-actions · 2024-08-13T21:33:07Z

Change summary:

Filename	Added	Removed	Changed
data/borylation_ord.pbtxt	0	0	0
data/minisci_ord.pbtxt	0	0	0
	0	0	0

github-actions · 2024-08-13T21:39:29Z

Change summary:

Filename	Added	Removed	Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz	1111	0	0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz	130	0	0
	1241	0	0

bdeadman · 2024-08-13T21:44:50Z

@connorcoley @skearnes I was able to pull this through into open-reaction-database/ord-data:#195 without getting approval. It should probably be checked at this stage before it is approved to go into main.

skearnes · 2024-08-14T21:22:46Z

> @connorcoley @skearnes I was able to pull this through into open-reaction-database/ord-data:#195 without getting approval. It should probably be checked at this stage before it is approved to go into main.

Yes, that's expected; we don't protect any branches except for main by requiring approvals.

github-actions · 2024-08-14T21:23:26Z

Change summary:

Filename	Added	Removed	Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz	1111	0	0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz	130	0	0
	1241	0	0

skearnes · 2024-08-14T21:26:46Z

@qai222 do you have time to take a look at these for correctness?

github-actions · 2024-08-14T21:47:20Z

Change summary:

Filename	Added	Removed	Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz	1111	0	0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz	130	0	0
	1241	0	0

github-actions · 2024-08-14T21:48:45Z

Change summary:

Filename	Added	Removed	Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz	1111	0	0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz	130	0	0
	1241	0	0

qai222

I reviewed the pbtxt files from #195, specifically borylation_ord.pbtxt and minisci_ord.pbtxt.

1. Not sure why the analysis type for yield is "isolated":

    analyses {
      key: "product_1_isolated"
      value {
        type: CUSTOM
        details: "isolated"
      }
    }

2. Some reactions have a yield higher than 100:

    identifiers {
      type: CUSTOM
      details: "rxn_id from SURF table"
      value: "lit_pub_bo_9"
    }
    ...
    measurements {
      analysis_key: "product_1_GC"
      type: YIELD
      percentage {
        value: 137.0
      }
    }

bdeadman · 2024-08-21T23:00:28Z

For item 2, this is how it is recorded in the SURF table file. Since it is a GC yield I suspect it is just a calibration error. While not ideal, I think we need to report it as it is written. There are already >100% yields in the ORD from the USPTO data.

This particular reaction has come from this paper: https://pubs.acs.org/doi/10.1021/acscatal.0c00152. Unfortunately the rxn does not appear in the SI, and I don't have access to the paper.

qai222 · 2024-08-21T23:33:39Z

> For item 2, this is how it is recorded in the SURF table file. Since it is a GC yield I suspect it is just a calibration error. While not ideal, I think we need to report it as it is written. There are already >100% yields in the ORD from the USPTO data.
>
> This particular reaction has come from this paper: https://pubs.acs.org/doi/10.1021/acscatal.0c00152. Unfortunately the rxn does not appear in the SI, and I don't have access to the paper.

Yeah, the caption of Table 2 in the paper says "Yields were determined by gas chromatography and are based on moles of B2pin2." I agree we should report it as it is.
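
Since the >100% yields are kept as reported, it may still be useful to list them for reviewers without changing any values. The following is a rough sketch only: it works on plain dictionaries rather than ord-schema messages, and the record layout, the function name, and the second example record are hypothetical.

```python
from typing import Dict, List


def flag_high_yields(reactions: List[Dict], limit: float = 100.0) -> List[str]:
    """Return the rxn_id of every reaction with a YIELD above `limit`.

    The input records are left untouched; this only reports.
    """
    flagged = []
    for rxn in reactions:
        for measurement in rxn.get("measurements", []):
            is_yield = measurement.get("type") == "YIELD"
            if is_yield and measurement.get("percentage", 0.0) > limit:
                flagged.append(rxn["rxn_id"])
                break  # one hit per reaction is enough
    return flagged


if __name__ == "__main__":
    reactions = [
        # Values taken from the excerpt quoted in the review.
        {"rxn_id": "lit_pub_bo_9",
         "measurements": [{"type": "YIELD", "percentage": 137.0}]},
        # Hypothetical record, included only for contrast.
        {"rxn_id": "example_rxn",
         "measurements": [{"type": "YIELD", "percentage": 85.0}]},
    ]
    print(flag_high_yields(reactions))
```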

github-actions · 2024-09-16T16:20:16Z

Change summary:

Filename	Added	Removed	Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz	1111	0	0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pbtxt	0	0	0
	1111	0	0

This reverts commit b8ca100.

…data into #195

github-actions · 2024-09-16T16:30:11Z

Change summary:

Filename	Added	Removed	Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz	1111	0	0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz	130	0	0
data/fe/ord_dataset-feaf1b793c6d408aaec1cac7cc3ceadc.pb.gz	130	0	0
	1371	0	0

…data into #195

github-actions · 2024-09-16T16:34:19Z

Change summary:

Filename	Added	Removed	Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz	1111	0	0
data/c1/ord_dataset-c1e2bd4243aa49448b1e61636463c7cf.pb.gz	130	0	0
	1241	0	0

bdeadman · 2024-09-17T15:06:52Z

@qai222 I fixed the surf2ord program so that "isolated" yields are correctly labelled with the WEIGHT analysis type enum in ORD. The problem was some rogue capitalization in their code.

As discussed above, for point 2 the >100% yields are left as they were reported.

I made a bit of a mess when replacing the minisci dataset file, but it looks like it has been resolved now, and the PR only shows the 2 datasets and the reaction count is correct. I downloaded them to confirm they are the correct versions.
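
For readers curious about the capitalization issue mentioned above, this is a minimal sketch of a case-insensitive lookup from a SURF analysis label to an ORD analysis type name. It is not the actual surf2ord fix: the mapping table and function name are assumptions, only the "isolated" to WEIGHT pairing and the CUSTOM fallback come from this thread, and the enum names are handled as plain strings rather than ord-schema values.

```python
from typing import Tuple

# Hypothetical mapping; keys are stored in lower case on purpose.
SURF_TO_ORD_ANALYSIS = {
    "isolated": "WEIGHT",
    "gc": "GC",
}


def ord_analysis_type(surf_label: str) -> Tuple[str, str]:
    """Map a SURF analysis label to (ORD analysis type name, details).

    The label is normalised before the lookup, so "Isolated",
    "ISOLATED" and " isolated " all resolve to WEIGHT. Unknown labels
    fall back to CUSTOM, with the original label kept as details.
    """
    key = surf_label.strip().lower()
    return SURF_TO_ORD_ANALYSIS.get(key, "CUSTOM"), surf_label


if __name__ == "__main__":
    for label in ("isolated", "Isolated", "GC", "HPLC-UV"):
        print(label, "->", ord_analysis_type(label))
```

Normalising once at the lookup boundary avoids having to list every capitalization variant in the table.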

bdeadman · 2024-10-03T15:47:06Z

@skearnes these two have been peer-reviewed by @qai222, and now need the PR to be approved by a reviewer with write access.

github-actions · 2024-10-07T14:49:03Z

Change summary:

Filename	Added	Removed	Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz	1111	0	0
data/fe/ord_dataset-feaf1b793c6d408aaec1cac7cc3ceadc.pb.gz	130	0	0
	1241	0	0

github-actions · 2024-10-07T14:50:50Z

Change summary:

Filename	Added	Removed	Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz	1111	0	0
data/fe/ord_dataset-feaf1b793c6d408aaec1cac7cc3ceadc.pb.gz	130	0	0
	1241	0	0

github-actions · 2024-10-07T14:52:30Z

Change summary:

Filename	Added	Removed	Changed
data/99/ord_dataset-99c23cf435dc42f1af884053bc8b11c7.pb.gz	1111	0	0
data/fe/ord_dataset-feaf1b793c6d408aaec1cac7cc3ceadc.pb.gz	130	0	0
	1241	0	0

github-actions and others added 3 commits August 13, 2024 21:33

Update submission

2392665

Create test_file.txt

9624acb

Delete test_file.txt

2703d49

Update badges

539d2dd

bdeadman requested review from skearnes, connorcoley and qai222 August 13, 2024 21:42

skearnes closed this Aug 14, 2024

skearnes reopened this Aug 14, 2024

Merge branch 'main' into #195

c3ef932

Update badges

3bc0789

skearnes closed this Aug 14, 2024

skearnes reopened this Aug 14, 2024

qai222 reviewed Aug 18, 2024

updated the minisci dataset

b8ca100

github-actions and others added 3 commits September 16, 2024 16:20

Update submission

527a8f4

Revert "updated the minisci dataset"

de65b6c

This reverts commit b8ca100.

Merge branch '#195' of https://github.com/open-reaction-database/ord-…

73036f3

…data into #195

github-actions and others added 3 commits September 16, 2024 16:30

Update badges

756bb46

replaced minisci

2e39c9b

Merge branch '#195' of https://github.com/open-reaction-database/ord-…

b628cfd

…data into #195

Update submission

5bc9cef

qai222 approved these changes Sep 18, 2024

skearnes closed this Oct 7, 2024

skearnes reopened this Oct 7, 2024

github-actions and others added 2 commits October 7, 2024 14:49

Update badges

ac1a96d

Merge branch 'main' into #195

af79fa1

skearnes approved these changes Oct 7, 2024

Update badges

f08e87b

skearnes closed this Oct 7, 2024

skearnes reopened this Oct 7, 2024

skearnes enabled auto-merge (squash) October 7, 2024 14:51

skearnes merged commit e453754 into main Oct 7, 2024
4 checks passed

skearnes deleted the #195 branch October 7, 2024 14:52
