JATS validation fails if footnotes include block quotes #5570

Closed
coryschires opened this issue Jun 10, 2019 · 12 comments

@coryschires

Background

The JATS spec does not allow you to have block quotes (i.e. <disp-quote>) inside footnotes (i.e. <fn>).

Frankly, I think this is an odd and unreasonable restriction. Some authors – and some disciplines, such as legal scholarship – make heavy use of footnotes. As far as I know, there's nothing fundamentally wrong with placing a block quote inside a footnote.

Problem

I am encountering real-world examples of this problem, so I need some sort of workaround.

To be clear, here's an example of invalid JATS:

<fn>
  <p> ... </p>
  <disp-quote> ... </disp-quote>
</fn>

This JATS XML would fail validation with the error: Element fn content does not follow the DTD, expecting (label? , p+), got (p disp-quote)
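The content model in that error message can be approximated locally for quick checks. The following is a minimal sketch, not a substitute for DTD validation: it only looks at the direct children of a single `<fn>` and compares their order against `(label? , p+)`. The function name `fn_is_valid` is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Content model quoted in the validator error: (label? , p+)
# Child tags are encoded as "tag tag tag " so a regex can stand in for the DTD.
FN_MODEL = re.compile(r"(label )?(p )+")


def fn_is_valid(fn_xml: str) -> bool:
    """Return True if the direct children of <fn> match (label?, p+)."""
    fn = ET.fromstring(fn_xml)
    child_tags = "".join(child.tag + " " for child in fn)
    return FN_MODEL.fullmatch(child_tags) is not None


invalid = "<fn><p>text</p><disp-quote><p>quote</p></disp-quote></fn>"
valid = "<fn><label>1</label><p>text</p></fn>"
print(fn_is_valid(invalid))
print(fn_is_valid(valid))
```

The check ignores attributes and the content of the children themselves, so a real validator such as the PMC tool is still needed for a definitive answer.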

Steps to recreate

Pandoc version: 2.7.2
Files: jats_example.zip

To reproduce the issue described above

  1. pandoc -s --metadata-file metadata.json --to jats example.md -o output.xml
  2. Validate output.xml using the PMC online validation tool: https://www.ncbi.nlm.nih.gov/pmc/tools/xmlchecker
  3. The validation tool will display the errors described above

Solution

After examining the JATS spec, I think I have a solid workaround. I want to wrap the <disp-quote> element in a <p> element. This will ensure the JATS is valid while only minimally changing the semantic meaning. Unless it's too much work, it would be nice if we could also include a specific-use attribute.

So in practice, we would convert this:

<fn>
  <p> ... </p>
  <disp-quote> ... </disp-quote>
</fn>

Into this:

<fn>
  <p> ... </p>
  <p specific-use="wrapper">
    <disp-quote> ... </disp-quote>
  </p>
</fn>

For sure, this is a little weird. From a semantic (or even just commonsense) standpoint, it doesn't make sense to have a block quote inside a paragraph. But this is allowed / valid in JATS. In fact, here's a proof of concept demonstrating that it's valid / okay to wrap a <disp-quote> in a <p> tag. You can download this file and run it through the PMC Validator to confirm.

Finally, in case it's unclear, I only want to wrap <disp-quote> when nested inside <fn> (i.e. I don't want to wrap all <disp-quote>).
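For anyone who wants to apply this workaround to already-generated XML, here is a rough sketch of the transformation as a post-processing step. It is an illustration only, not how pandoc implements it: the function name `wrap_quotes_in_footnotes` is hypothetical, and the sketch assumes the JATS has no namespaced element names.

```python
import xml.etree.ElementTree as ET


def wrap_quotes_in_footnotes(root: ET.Element) -> None:
    """Wrap each <disp-quote> that is a direct child of an <fn>
    in <p specific-use="wrapper">. Quotes elsewhere are left alone."""
    for fn in list(root.iter("fn")):
        for index, child in enumerate(list(fn)):
            if child.tag != "disp-quote":
                continue
            wrapper = ET.Element("p", {"specific-use": "wrapper"})
            # Keep any trailing whitespace attached to the new wrapper.
            wrapper.tail = child.tail
            child.tail = None
            fn.remove(child)
            wrapper.append(child)
            fn.insert(index, wrapper)


doc = ET.fromstring(
    "<body>"
    "<disp-quote><p>top-level quote</p></disp-quote>"
    "<fn><p>note</p><disp-quote><p>quoted</p></disp-quote></fn>"
    "</body>"
)
wrap_quotes_in_footnotes(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Only the quote inside `<fn>` gains a wrapper; the top-level `<disp-quote>` is untouched, which matches the scope described above.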

Questions

First, I can't decide if this fix should be made directly in the core JATS writer or only in my code (e.g. using a custom filter). Personally, I am leaning toward the core JATS writer because, imo, the JATS writer should strive to produce valid JATS and thus everyone would benefit from this fix. However, I can also imagine y'all feeling like this problem is too specific and should be solved in the client's code rather than Pandoc. And, of course, it's really not my decision to make. So... Let me know what y'all think.

Second, if y'all think this should be solved in the client's code, then I could use some help writing a Lua filter for this use case. I have successfully written some basic Lua filters in the past, but this problem is proving trickier than I expected. Seems like Pandoc's AST expects paragraphs to include a list of inline elements, but I'm trying to nest a block quote, which is also a block element. Anyway, for whatever reason, it's not working as expected, so any advice would be very much appreciated.

Thanks again for maintaining Pandoc! It's an amazing tool!

@jgm
Owner

jgm commented Jun 10, 2019

I think it makes sense to make pandoc do this transformation when generating JATS.

@coryschires
Author

Okay, great!

I don't want to slam you all at once with a bunch of issues. But I am seeing a few very similar problems, and it probably makes sense to at least show you the full scope of these problems, so that you can consider the issue comprehensively.

  • JATS validation fails if list items (i.e. <list-item>) include block quotes (i.e. <disp-quote>)
  • JATS validation fails if a <caption> element includes a <list>

Both these problems are very similar to the one outlined above:

  • They seem like reasonable use cases, imo – and I have encountered real-world examples.
  • They can be "fixed" by wrapping the offending element in a <p>.

I'd be happy to write up each of these as separate issues with recreation examples, etc – whatever is most convenient for you.

And overall, I'm happy to help in any way I can. Just let me know what I can do.

Thanks again!

@jgm
Owner

jgm commented Jun 10, 2019

Is it safe to assume that

  • every block-level element OTHER than a paragraph must be wrapped in a <p> inside a node?
  • every block-level element other than a paragraph CAN validly be wrapped in a <p>?

@coryschires
Author

Good questions. The JATS spec should be helpful here as they strictly specify how elements may be nested. For the full story, I recommend looking at the Document Hierarchy Diagrams.

But to answer your question:

  • <fn> – Footnotes may contain (label? , p+) which means 0-1 <label> followed by 1+ <p>. So in this case I think it would make sense to wrap a paragraph around any element nested within an <fn> other than label or p.
  • <caption> – Captions may contain (title? , p*) which means 0-1 <title> followed by 0+ <p>. So, I think it would make sense to wrap a paragraph around any element nested within a <caption> other than title or p.
  • <list-item> – List items may contain (label? , title? , (p | def-list | list)+) which means 0-1 <label>, then 0-1 <title>, then 1+ <p>, <def-list>, or <list> in any order. So the <list-item> element allows a greater variety of child elements, which may make your solution slightly more complex. But, similar to the other elements, a solution could be to wrap any element not included in that list.
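The three content models above boil down to a small lookup table. The sketch below encodes them as sets of permitted direct children; the names `ALLOWED_CHILDREN` and `needs_wrapper` are hypothetical, and the table deliberately ignores ordering and cardinality (for example, that `<label>` must come first).

```python
# Direct children each parent accepts, per the content models quoted above.
ALLOWED_CHILDREN = {
    "fn": {"label", "p"},
    "caption": {"title", "p"},
    "list-item": {"label", "title", "p", "def-list", "list"},
}


def needs_wrapper(parent: str, child: str) -> bool:
    """True if `child` may not sit directly inside `parent`.

    Parents that are not in the table are assumed to need no wrapping.
    """
    allowed = ALLOWED_CHILDREN.get(parent)
    return allowed is not None and child not in allowed


print(needs_wrapper("fn", "disp-quote"))
print(needs_wrapper("list-item", "list"))
print(needs_wrapper("caption", "list"))
```

Whether the offending child can legally go inside a `<p>` is a separate question that this table does not answer.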

Hopefully that helps! Let me know if you have any other questions.

@jgm
Owner

jgm commented Jun 11, 2019

OK, according to the spec, <p> can contain:

  • fig
  • table-wrap
  • graphic
  • disp-quote
  • code
  • preformat
  • list
  • def-list

That covers all block-level elements, as rendered by pandoc, except:

  • p (Para)
  • ref-list (Div with id refs)
  • title (Header)

Here's the complete list:

(#PCDATA | email | ext-link | uri | inline-supplementary-material | related-article | related-object | 
address | alternatives | array | boxed-text | chem-struct-wrap | code | fig | fig-group | graphic | media |
 preformat | supplementary-material | table-wrap | table-wrap-group | disp-formula | disp-formula-group |
 citation-alternatives | element-citation | mixed-citation | nlm-citation | bold | fixed-case | italic |
 monospace | overline | roman | sans-serif | sc | strike | underline | ruby | award-id | funding-source |
 open-access | chem-struct | inline-formula | inline-graphic | private-char | def-list | list | tex-math |
 mml:math | abbrev | milestone-end | milestone-start | named-content | styled-content | disp-quote |
 speech | statement | verse-group | fn | target | xref | sub | sup)*
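The claim that only `p`, `ref-list`, and `title` cannot be wrapped can be checked mechanically against the content model quoted above. This is a small sketch: the element names are copied from the list in this comment (minus `#PCDATA`), and the set of block-level elements pandoc emits is taken from the two bullet lists above rather than from pandoc's source.

```python
# Elements permitted inside <p>, copied from the content model above.
ALLOWED_IN_P = set("""
email ext-link uri inline-supplementary-material related-article related-object
address alternatives array boxed-text chem-struct-wrap code fig fig-group graphic media
preformat supplementary-material table-wrap table-wrap-group disp-formula disp-formula-group
citation-alternatives element-citation mixed-citation nlm-citation bold fixed-case italic
monospace overline roman sans-serif sc strike underline ruby award-id funding-source
open-access chem-struct inline-formula inline-graphic private-char def-list list tex-math
mml:math abbrev milestone-end milestone-start named-content styled-content disp-quote
speech statement verse-group fn target xref sub sup
""".split())

# Block-level elements pandoc renders, as enumerated in the comment above.
PANDOC_BLOCK_ELEMENTS = {
    "fig", "table-wrap", "graphic", "disp-quote", "code",
    "preformat", "list", "def-list", "p", "ref-list", "title",
}

cannot_wrap = sorted(PANDOC_BLOCK_ELEMENTS - ALLOWED_IN_P)
print(cannot_wrap)  # ['p', 'ref-list', 'title']
```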

@jgm
Owner

jgm commented Jun 11, 2019

So I think what you suggest is basically right:

  • Inside <fn>, wrap everything other than a Para in a <p>. If the note contains a Header or Div with id refs, which is theoretically possible, this will break. Special-case Header by downgrading to <p>; special-case Div#refs by downgrading to regular Div.

  • Inside <list-item>, wrap everything other than a Para or a List in a <p>. Special-case Headers, which can't be wrapped, by downgrading them to <p>.

  • We don't yet need to worry about <caption>, because pandoc doesn't yet allow block-level content there.

I think the cleanest way to do the wrapping is to add a needsWrap parameter to blocksToJATS with type Block -> Bool. blocksToJATS can then handle the wrapping.
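To illustrate the shape of that design, here is a Python analogue of passing a `Block -> Bool` predicate into the rendering function. It is not pandoc's Haskell code: the `Block` class, the pre-rendered `xml` field, and the predicate names are stand-ins invented for this sketch, and the Header and Div#refs special cases are left out. Reading "a List" as pandoc's BulletList and OrderedList is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Block:
    kind: str  # pandoc block constructor name, e.g. "Para" or "BlockQuote"
    xml: str   # hypothetical stand-in for the block's rendered JATS


def blocks_to_jats(needs_wrap: Callable[[Block], bool], blocks: List[Block]) -> str:
    """Render blocks, wrapping those the caller's predicate flags."""
    parts = []
    for block in blocks:
        if needs_wrap(block):
            parts.append(f'<p specific-use="wrapper">{block.xml}</p>')
        else:
            parts.append(block.xml)
    return "".join(parts)


# Context-specific predicates following the rules proposed above.
def wrap_in_footnote(block: Block) -> bool:
    return block.kind != "Para"


def wrap_in_list_item(block: Block) -> bool:
    return block.kind not in {"Para", "BulletList", "OrderedList"}


note = [
    Block("Para", "<p>See:</p>"),
    Block("BlockQuote", "<disp-quote><p>q</p></disp-quote>"),
]
print("<fn>" + blocks_to_jats(wrap_in_footnote, note) + "</fn>")
```

Keeping the predicate as a parameter means the footnote and list-item writers share one rendering loop and differ only in which blocks they flag.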

jgm closed this as completed in 550d949 on Jun 11, 2019
@jgm
Owner

jgm commented Jun 11, 2019

Done. If you could use this version to generate some documents and make sure they validate, that would be great.

@coryschires
Author

Wow so fast! I'd be happy to test this fix against my problem articles and follow up.

I do have a related question though... Is there an easy way for me to install the bleeding edge in order to test this? (Otherwise, I could wait for you to cut a new version which would also be fine with me.)

If it matters, currently I am installing via Dockerfile:

RUN wget https://github.com/jgm/pandoc/releases/download/2.7.2/pandoc-2.7.2-1-amd64.deb \
  && dpkg -i pandoc-2.7.2-1-amd64.deb \
  && rm pandoc-2.7.2-1-amd64.deb

@jgm
Owner

jgm commented Jun 11, 2019

You could try https://github.com/pandoc-extras/pandoc-nightly
(by tomorrow it should include this change)
Or follow the instructions on the website for installation from source.

@coryschires
Author

Great! I can test it tomorrow using the nightly build. I'll follow up on this thread to let you know how it works.

Thanks again!

@coryschires
Author

Following up here... I am testing my documents against the latest version, 2.7.3. Good news and bad news:

Good news

The <p specific-use="wrapper"> workaround discussed in this issue appears to be working very well. I have tested against the ~10 articles where I have encountered this bug in the wild and all of them are fixed 💥

Bad news

I believe version 2.7.3 also introduced a fairly major regression in the JATS writer. I suspect the bug was introduced in #5511:

Properly handle footnotes (#5511) according to “best practice.” (Group them at the end in <fn-group> and use elements to link them.)

I agree that grouping footnotes at the end of the document is better. However, this change created a bug where JATS output fails validation if the document does not contain any footnotes. The specific validation error I am seeing is:

fn-group: validity error : Element fn-group content does not follow the DTD, expecting (label? , title? , fn+), got ()

If it's not obvious, that means that <fn-group> must contain at least one child <fn>. So I assume the fix would require an if condition or something like that:

If the document contains no footnotes
Then do not include empty <fn-group> tags

Finally, I can confirm that when I add a footnote(s) anywhere in the document, the validation error disappears.
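The condition described above can be sketched as follows. This is an illustration of the idea, not the fix that went into pandoc: the function names are hypothetical, and placing the group inside `<back>` is an assumption made for the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def build_fn_group(footnotes: List[ET.Element]) -> Optional[ET.Element]:
    """Return an <fn-group> holding the footnotes, or None if there are none.

    The DTD requires (label? , title? , fn+), so an empty group is never valid.
    """
    if not footnotes:
        return None
    group = ET.Element("fn-group")
    group.extend(footnotes)
    return group


def build_back(footnotes: List[ET.Element]) -> ET.Element:
    """Build <back>, adding <fn-group> only when footnotes exist."""
    back = ET.Element("back")
    group = build_fn_group(footnotes)
    # Compare against None explicitly: an Element with no children is falsy.
    if group is not None:
        back.append(group)
    return back


fn = ET.fromstring('<fn id="fn1"><p>A note.</p></fn>')
print(ET.tostring(build_back([fn]), encoding="unicode"))
print(ET.tostring(build_back([]), encoding="unicode"))
```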

Would you like me to create a separate issue for this bug?

@jgm
Owner

jgm commented Jun 21, 2019

@mb21 has fixed this regression, thanks for the testing!
