Write new Format and Formats types, some helper functions #78

Open. despresc wants to merge 7 commits into master.

despresc
Contributor

@despresc despresc commented Sep 4, 2020

Edit: The examples of Formats matching below are now wrong, and listMatches is now listSubformats (see this comment).

An initial Format and Formats design (discussed a little here and in the documentation below for the new Text.Pandoc.Format module) to resolve jgm/pandoc#547.

The Format type is straightforward. It just enumerates all of the input and output formats that pandoc currently recognizes, as well as the openxml format that the manual says to use instead of docx. These happen to be nearly all of the Format strings that pandoc deals with, but there are two others that I found (and that aren't in the new Format yet):

  • noteref (used in a RawInline internally in Readers.HTML)
  • doc (used to trigger an "unknown reader" error in App.FormatHeuristics)

The Formats type and the matches function are intended to capture the fuzziness in matching particular formats, so that the writers can figure out when to include raw content:

-- It is intended that x `matches` y when a format x can be included in a context with format y
-- without other modification, more or less. This is how Format is used by writers, not how
-- it is used in the --to option (where Html is an alias for Html5).

-- TeX content can always be included in a TeX or LaTeX context
TeX `matches` TeX = True
TeX `matches` LaTeX = True

-- LaTeX can't always be included in a TeX context
LaTeX `matches` TeX = False

-- Replacing (TeX, LaTeX) with (Html, Html5) in the above also works

The matches work this way because the writers are normally faced with something like RawBlock f str, know what they're writing (more or less), and want to know if str should be included. They can do this by asking if f `matches` LaTeX or f `matches` Html5 (the HTML writer would do both when writing Html5, the former to see if supported math should be included). More general Formats patterns can also be constructed:

TeX `matches` (LaTeX `or` Html) = True
LaTeX `matches` (LaTeX `except` Beamer) = False
LaTeX `matches` (Beamer `except` TeX) = True
TeX `matches` (LaTeX `and` Html) = False
LaTeX `matches` not Html = True

This behaviour is defined in listMatches, which lists all the formats that will match the given format, i.e. it lists all of the super-formats (more general formats) of the given format. I tried to base listMatches on what pandoc currently does, so all of the Markdown* formats are equivalent to each other, Epub3 is a sub-format of Html and Html5, and so on. But that might be too conservative:

  • The commonmark* formats aren't recognized by any writer. Should these be equivalent (or otherwise related) to markdown*?
  • The docbook4 and docbook5 formats aren't recognized. Are these sub-formats of docbook?
  • The jats_* formats aren't recognized. I think they form a chain of sub-formats, all more specific than jats?
  • The asciidoctor format is not recognized. It might be equivalent to asciidoc?

There might be other cases I missed.


I haven't written it yet, but there might need to be a

toConcreteFormat :: Format -> Format
toConcreteFormat Html = Html5
toConcreteFormat Html5 = Html5
-- etc.

function that handles formats in the --to sense, as being possible aliases for a particular concrete format.

@despresc
Contributor Author

despresc commented Sep 4, 2020

Actually, if Formats is going to be used in an IfFormatBlock or an IfFormatInline element, then the Boolean operations and match should probably be written slightly differently (the super-format strategy in listMatches is still fine, at least).

The reason is that in those elements, the roles in matches are reversed. Instead of being faced with an unknown Format and wanting to know if it matches their concrete one, the writers are given an unknown Formats pattern and have to match their concrete format against it.

That means that this

ifFormatBlock Html content

would probably be interpreted as saying "render content in any Html-based format". Right now, it means "render content in any context that is pure Html", so content would never be rendered, since Html is not an output format. Even if pattern evaluation were changed for these blocks, people would also expect

ifFormatBlock (Html `except` Html5) content

to mean "render content in any Html-based format except Html5 ones", when right now it means "render content in any Html context that can't be included in an Html5 context", which happens to be no contexts at all.

But this can be fixed.

@despresc
Contributor Author

despresc commented Sep 4, 2020

I updated the request. Now the idea is that if a writer for a concrete format f like Html5 encounters an IfFormatBlock p content (once that block exists), it will render content when f `matches` p. So the following are true:

TeX `matches` TeX = True
TeX `matches` LaTeX = False
LaTeX `matches` TeX = True
LaTeX `matches` LaTeX = True
LaTeX `matches` Beamer = False
LaTeX `matches` (LaTeX `or` Html) = True
LaTeX `matches` (TeX `except` Beamer) = True
LaTeX `matches` (TeX `except` Markdown) = True
LaTeX `matches` (LaTeX `and` Html) = False
LaTeX `matches` not Html = True

There is also a castsTo function that writers can use to test raw elements with unknown format x, where x `castsTo` f if x can be included directly into f without modification. For example, the Html5 writer might look at the following:

x `castsTo` Html5 -- see if the raw element can be included directly
x `castsTo` LaTeX -- see if the raw element represents supported math

The listMatches function (which listed the super-formats of a given Format) is now listSubformats (and lists the sub-formats of a given Format) to handle these changes, meaning that matches is now the inverse relation to what it was before: x `matches` y when x is a subformat of y :: Format.
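A minimal sketch of how the updated relations could fit together, assuming a heavily reduced Format enumeration and a plain set-backed Formats. The names orF, andF, exceptF and notF are hypothetical stand-ins for the PR's combinators (renamed to avoid clashing with the Prelude), and the sub-format lists are only the tex and html families mentioned in this thread, not the module's actual tables.

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

-- Hypothetical, heavily reduced stand-in for the PR's Format enumeration.
data Format = TeX | LaTeX | Beamer | ConTeXt | Html | Html4 | Html5 | Markdown
  deriving (Eq, Ord, Show, Enum, Bounded)

-- A pattern is modelled as the set of concrete formats it accepts.
newtype Formats = Formats (Set Format)
  deriving (Eq, Show)

-- Every format at least as specific as the argument (assumed relations).
listSubformats :: Format -> Formats
listSubformats f = Formats . Set.fromList $ case f of
  TeX   -> [TeX, LaTeX, Beamer, ConTeXt]
  LaTeX -> [LaTeX, Beamer]
  Html  -> [Html, Html4, Html5]
  _     -> [f]

-- Boolean combinators on patterns.
orF, andF, exceptF :: Formats -> Formats -> Formats
orF     (Formats a) (Formats b) = Formats (Set.union a b)
andF    (Formats a) (Formats b) = Formats (Set.intersection a b)
exceptF (Formats a) (Formats b) = Formats (Set.difference a b)

-- Complement relative to the finite set of all formats.
notF :: Formats -> Formats
notF (Formats a) = Formats (Set.fromList [minBound .. maxBound] `Set.difference` a)

-- A writer for concrete format f renders conditional content when f matches p.
matches :: Format -> Formats -> Bool
matches f (Formats s) = f `Set.member` s

-- Raw content tagged x can go into output f when f is a sub-format of x.
castsTo :: Format -> Format -> Bool
x `castsTo` f = f `matches` listSubformats x
```

With these definitions, a bare Format used as a pattern corresponds to listSubformats of that format, so LaTeX `matches` listSubformats TeX holds while TeX `matches` listSubformats LaTeX does not.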

@jgm
Owner

jgm commented Sep 5, 2020

I like the concept here. The thing that is giving me pause is that some of these formats are really just aliases for another format with a particular bundle of extensions. E.g., asciidoc and asciidoctor; commonmark and commonmark_x; markdown and markdown_phpextra.

So extension bundles that happen to have aliases get counted as Formats, but Formats can't represent arbitrary extension packages. That seems a bit odd to me.

@tarleb
Contributor

tarleb commented Sep 5, 2020

Maybe we don't have to solve the aliases/extensions problem in here. Not all of pandoc's format handling has to rely on the formats defined here. In jgm/pandoc#5118, my approach was to distinguish between KnownFormats (i.e., Format in this PR) and other formats like "flavored formats" (format combined with extensions) and IOFormat (a Format or a Lua script path).

@despresc
Contributor Author

despresc commented Sep 5, 2020

That should be handled to some extent, if I understand correctly. Right now listSubformats x will give a Formats set of all the formats that x can be directly included in. So

-- exactly = Formats . Set.fromList
listSubformats Markdown = exactly
  [Markdown, MarkdownGithub, MarkdownMmd, MarkdownPhpExtra, MarkdownStrict]
listSubformats MarkdownPhpExtra = exactly
  [Markdown, MarkdownGithub, MarkdownMmd, MarkdownPhpExtra, MarkdownStrict]

and castsTo is implemented with listSubformats, so, say, MarkdownGithub `castsTo` MarkdownPhpExtra = True. This also works in the combinators, so

Markdown `matches` MarkdownGithub = True
Markdown `matches` (MarkdownGithub `except` MarkdownPhpExtra) = False

I was conservative and didn't include relationships I didn't see in pandoc, but asciidoc and asciidoctor could be considered equivalent by listSubformats like this. So could commonmark and commonmark_x.

@despresc
Contributor Author

despresc commented Sep 5, 2020

Oh, if you meant that Format includes more things than there are readers and writers individually, then yes, separate types could work there.

@jgm
Owner

jgm commented Sep 5, 2020

That should be handled to some extent,

Yes, I understand that the matches relation allows "subtypes" of formats. What seems awkward to me is that, as I think of it, markdown_phpextra is just markdown + extensions W, X, Y, and Z. So it seems strange that you can represent this in a Format but you can't represent arbitrary combinations, e.g. markdown + W and Y.

One approach would be not to include things like markdown_phpextra or markdown_x or asciidoctor in the list of formats. (This would correspond to the way I've been thinking of it all along, though it isn't enforced since this is just "stringly typed.")

@despresc
Contributor Author

despresc commented Sep 5, 2020

I understand now. It is a little awkward that this can't express things like markdown+old_dashes. The types here could be broadened to include that, but things would start to get unwieldy.

The only reason they're there is that right now pandoc does understand some of them as valid RawBlock formats. This was the example I saw. But many formats don't have this kind of detection (asciidoctor isn't recognized in a format block). So that could be an outlier, and should be removed.

@jgm
Owner

jgm commented Sep 5, 2020

I think we can work things out so that we don't need to check for e.g. markdown_github.
(Part of this would mean parsing format specifiers with raw_attribute so that we can include the core Format even if they use an alias.)

Anyway, just to record what I think are the three alternatives:

  1. Include a Format constructor for every format name recognized by pandoc, including aliases like markdown_github.

  2. Only include constructors for the "core" types that aren't just aliases for packages of extensions, and perhaps for some generic types (TeX, HTML) that include several core types (though we may not actually need these any more if we have sets of Formats).

  3. Make the Format specifiers expressive enough that you can include packages of extensions. (But this opens up many new cans of worms, e.g. how to define inclusion relations among formats that include different packages of extensions: should base+A+B be considered a subtype of base+A?)

I'm leaning towards 2, but I'm not really sure.

@despresc
Contributor Author

despresc commented Sep 5, 2020

(2) should also be fine as an alternative to what's done in this request, since it's nearly what's done now, unless preserving users' ability to use those variant markdown names is important. They wouldn't actually be useful otherwise, because the Markdown writer never knows what markdown variant it's writing at the moment, so it would never discriminate on the variant markdown types. That holds for the other extension-implying variants too.

I was leaning toward keeping Format in RawBlock, and formats like Html, because Writers.Markdown still needs to render the format in a RawBlock. It could check that, e.g., the Formats is exactly [Html4, Html5] and know to render Html, but I'm not sure if that's wise. If those types of formats were kept, it could become Formats and the writer could just pick one to render, but I'm also not sure if that's wise.

@despresc
Contributor Author

despresc commented Sep 5, 2020

If that were the case, then there would certainly need to be a couple of types for enumerating the reader and writer formats. There could even be a total ReaderFormat -> Reader m function in Readers, and an analogue in writers.

The only other odd use of Format is here, where the HTML reader uses the "noteref" format to store raw HTML for later use. That may be refactorable, but we could also have

data Format = KnownFormat KnownFormat | CustomFormat Text
data KnownFormat = {- what Format is now -}

as was previously proposed (the writers would simply drop a CustomFormat block).

@despresc
Contributor Author

despresc commented Sep 5, 2020

Sorry, it stores an identifier in it, not raw HTML.

@despresc
Contributor Author

despresc commented Sep 6, 2020

The following Format elements could be kept if we go with option (2), since their writers know they're writing them and they're not identical to other formats:

  • jats_archiving, jats_publishing, jats_articleauthoring (increasingly restrictive subformats of jats according to this comment)
  • muse
  • commonmark (is it a super-format of markdown?)
  • docbook4 and docbook5, subformats of docbook
  • the html slides formats (dzslides and the others), probably subformats of html and either html4 or html5 (not sure if any are related to the others)

I mention this because they aren't recognized as Format strings now.

@tarleb
Contributor

tarleb commented Sep 6, 2020

jats_archiving, jats_publishing, jats_articleauthoring (increasingly restrictive subformats of jats according to this comment)

That comment of mine could have been clearer: the tag sets are a partial order, in that jats_archiving ⊇ jats_publishing and jats_archiving ⊇ jats_articleauthoring, but there is no ordering relation between jats_publishing and jats_articleauthoring.

@despresc
Contributor Author

despresc commented Sep 6, 2020

Interesting @tarleb, so there is a diamond in the Format partial order after all.

This latest commit separates out these types from Format:

data ReaderFormat
data KnownWriterFormat
data WriterFormat = KnownWriterFormat KnownWriterFormat | CustomLuaWriter FilePath

I've kept Formats that writers can distinguish (so epub* is still in, which would represent a little extra discrimination from what's done now), and the formats in this comment are still in, with those sub-format relations. (Except for the fixed jats relations).

@despresc
Contributor Author

despresc commented Sep 6, 2020

There's clearly a lot of overlap between ReaderFormat and KnownWriterFormat, so those could be made into a single type, at the cost of having extra formats for Readers and Writers.

@despresc
Contributor Author

despresc commented Sep 6, 2020

The Format could be changed to

data KnownFormat = {- what's there now -}
data Format = KnownFormat KnownFormat | CustomFormat Text

to accommodate Readers.HTML and filter writers. Then the Formats type would have to become more complex to remain a Boolean algebra, because Format would no longer be finite, so the not (Formats s) = Formats $ anyFormat \\ s definition couldn't be used.


Edit: the type in the other pull request would work, I think:

data Formats = OneOf (Set Format) | NoneOf (Set Format)
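A rough sketch of why the OneOf/NoneOf representation stays a Boolean algebra even when Format is infinite, assuming a cut-down KnownFormat and hypothetical combinator names (notF, andF, orF). Complement becomes a constructor swap, so no "set of all formats" is needed, and "or" is derived from "and" and "not" via De Morgan.

```haskell
import Data.Set (Set)
import qualified Data.Set as Set
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical, cut-down enumeration of known formats.
data KnownFormat = Html4 | Html5 | LaTeX | Beamer
  deriving (Eq, Ord, Show, Enum, Bounded)

data Format = KnownFormat KnownFormat | CustomFormat Text
  deriving (Eq, Ord, Show)

-- Finite sets and cofinite sets of formats.
data Formats = OneOf (Set Format) | NoneOf (Set Format)
  deriving (Eq, Show)

matches :: Format -> Formats -> Bool
matches f (OneOf s)  = f `Set.member` s
matches f (NoneOf s) = f `Set.notMember` s

-- Complement never enumerates the (infinite) universe of formats.
notF :: Formats -> Formats
notF (OneOf s)  = NoneOf s
notF (NoneOf s) = OneOf s

-- Intersection needs one case per combination of finite and cofinite sets.
andF :: Formats -> Formats -> Formats
andF (OneOf a)  (OneOf b)  = OneOf  (Set.intersection a b)
andF (OneOf a)  (NoneOf b) = OneOf  (Set.difference a b)
andF (NoneOf a) (OneOf b)  = OneOf  (Set.difference b a)
andF (NoneOf a) (NoneOf b) = NoneOf (Set.union a b)

-- Union follows from De Morgan's law.
orF :: Formats -> Formats -> Formats
orF p q = notF (notF p `andF` notF q)

-- An illustrative custom format that a filter might use (made-up name).
exampleCustom :: Format
exampleCustom = CustomFormat (T.pack "my-filter")
```

Under this sketch, exampleCustom matches notF (OneOf (Set.fromList [KnownFormat LaTeX])), which is the behaviour a finite complement could not express.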

@tarleb
Contributor

tarleb commented Sep 6, 2020

Apologies, it seems that last info of mine was wrong. I tried to remember which tags are supported by the JATS authoring set, but not the publishing tag set, but couldn't find any. Apparently I misremembered. In fact, Wikipedia says this about the authoring tag set: "Formally this model a subset of the Publishing model."

@despresc
Contributor Author

despresc commented Sep 7, 2020

The reader and writer enumeration types don't need to be in this package, if they stay split from Format. I think it would require a version bump whenever an extension-implying variant output were added, though that might not happen all that often. It could definitely be put somewhere in pandoc (maybe in its own module, since it would need to be used by Extensions).

@despresc
Contributor Author

despresc commented Sep 7, 2020

Sorry that these commits are a bit messy. I can tidy them up at the end if you'd like.

@despresc
Contributor Author

despresc commented Sep 8, 2020

That should be option (2) finished. No changes to Definition yet.

@jgm
Owner

jgm commented Sep 10, 2020

I was leaning toward keeping Format in RawBlock, and formats like Html, because Writers.Markdown still needs to render the format in a RawBlock. It could check that, e.g., the Formats is exactly [Html4, Html5] and know to render Html, but I'm not sure if that's wise.

I don't think I understood this comment. I would have thought that the Markdown writer would render the thing if the set of formats included either Html4 or Html5. Why would it be necessary to check for both? A finite atomic boolean algebra (sets of atomic formats) would be the simplest representation to work with if we could manage it.

If that were the case, then there would certainly need to be a couple of types for enumerating the reader and writer formats. There could even be a total ReaderFormat -> Reader m function in Readers, and an analogue in writers.

I think this is a good idea; we currently use strings in the readers and writers lookup tables, but a total function would be nice. (Of course we'd still need something that associates strings with Formats in both directions.)
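As an illustration of the total-function idea, here is a small sketch. Everything in it is hypothetical: the three-constructor ReaderFormat, the toy Reader type (pandoc's real readers have a richer type), and the names readerFor, formatName and parseFormatName. It shows the lookup being total on the enumeration, with strings handled by a separate pair of conversions.

```haskell
-- Hypothetical enumeration of a few reader formats.
data ReaderFormat = ReadMarkdown | ReadHtml | ReadLaTeX
  deriving (Eq, Show, Enum, Bounded)

-- Toy stand-in for a reader: splits the input into chunks.
newtype Reader = Reader { runReader :: String -> [String] }

-- Total: a missing case is a compile-time warning, not a failed runtime lookup.
readerFor :: ReaderFormat -> Reader
readerFor ReadMarkdown = Reader lines
readerFor ReadHtml     = Reader words
readerFor ReadLaTeX    = Reader (\s -> [s])

-- Format to string.
formatName :: ReaderFormat -> String
formatName ReadMarkdown = "markdown"
formatName ReadHtml     = "html"
formatName ReadLaTeX    = "latex"

-- String to format, derived from formatName so the two cannot drift apart.
parseFormatName :: String -> Maybe ReaderFormat
parseFormatName s = lookup s [ (formatName f, f) | f <- [minBound .. maxBound] ]
```

Only parseFormatName can fail, so the "unknown reader" error would be confined to the point where user input is parsed.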

The only other odd use of Format is here, where the HTML reader uses the "noteref" format to store raw HTML for later use. That may be refactorable, but we could also have

If it's just one loose end, then I think it would be good to figure out whether we can write it another way.

JATS tags:

I wasn't sure about the upshot of the discussion with @tarleb. Do we need these different jats formats to stand in an inclusion relation or not? (I would have thought that someone who explicitly writes jats_XXX in a raw attribute wants it to be rendered in just that output format; they could always write jats if they want to be indiscriminate.)

Just looking for the simplest thing that works!

@despresc
Contributor Author

Agreed, finding the simplest solution is best.

Just to lay out why I gave the new Format this complexity (and to repeat things I'm sure you know), I'm operating under a model of Format based on the html and tex family of format strings. In pandoc now, there are currently the automatic inclusions

tex -> latex -> beamer
tex -> context
html -> html4
html -> html5

You can have a latex raw element and a beamer raw element, and both will be included in beamer output, but only the latex one will be included in latex output. (And neither will be included in context output).

So the markdown writer would have to check the different html4 and html5 cases, I think, because it needs to know whether to render the raw element format tag as {=html}, {=html4}, or {=html5}. Otherwise information gets lost: an html5 raw element would get an html format tag, which would signal it can be included in html4, something that doesn't currently happen. Or an html tag would be inappropriately specialized.
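The inclusion chains above can be sketched as a small graph, assuming only the five edges listed in this comment. The names edges, includedIn and acceptingOutputs are made up for the illustration. The last function shows why the tag matters to the Markdown writer: an html5 element and an html element are accepted by different sets of outputs.

```haskell
data Format = Tex | Latex | Beamer | Context | Html | Html4 | Html5
  deriving (Eq, Show, Enum, Bounded)

-- Direct inclusions: raw content tagged with the first format is accepted
-- when writing the second.
edges :: [(Format, Format)]
edges =
  [ (Tex, Latex), (Latex, Beamer), (Tex, Context)
  , (Html, Html4), (Html, Html5)
  ]

-- Reflexive, transitive closure of edges (terminates because the graph is acyclic).
includedIn :: Format -> Format -> Bool
includedIn raw out =
  raw == out || or [ next `includedIn` out | (from, next) <- edges, from == raw ]

-- Every output format that would include a raw element with the given tag.
acceptingOutputs :: Format -> [Format]
acceptingOutputs raw = filter (raw `includedIn`) [minBound .. maxBound]
```

Here acceptingOutputs Html5 is [Html5] while acceptingOutputs Html is [Html, Html4, Html5], so rewriting an html5 tag as html would widen where the element appears.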

Having the jats_, epub, and HTML slide formats in Format is an extension of Pandoc's current behaviour. It doesn't actually recognize any of them right now. To be consistent with the tex formats, I kept in Format all of the formats that are distinct enough that the writers can tell if they're writing them (so more than just Pandoc extension variants), and gave them implicit inclusion behaviours with listSubformats that are analogous to the latex -> beamer inclusion behaviour. That's why the jats_publishing -> jats_archiving inclusion is intended to happen, since the publishing tag set is a subset of the archiving tag set. But these implicit inclusions (or all of these formats) can be removed without any loss in current Pandoc behaviour, and it would make Format simpler (if a little more inconsistent in what it can express, in my opinion).

The noteref thing can probably be replaced with a Span with a suitable class or key-value pair. If allowing that kind of storage behaviour (in pandoc or a filter) isn't that important, then having Format be just an enumeration (instead of KnownFormat KnownFormat | CustomFormat Text) would be simpler.

@jgm
Owner

jgm commented Sep 11, 2020

I kept in Format all of the formats that are distinct enough that the writers can tell if they're writing them (so more than just Pandoc extension variants

This seems a good principle.

@jgm
Owner

jgm commented Sep 11, 2020

I thought of a possible drawback of the use of Set to represent these things -- not sure how serious it is.
In native output, it's going to look very hairy if someone specifies e.g. "all but LaTeX and docx". It will render as a giant list, and it won't be perspicuous which ones are left out.
An alternative would be to make the format an algebraic data type, e.g.

data Format =
    BasicFormat
  | And Format Format
  | Or Format Format
  | Not Format

Then we could just write Not (Or LaTeX Docx). The matches function would have to evaluate this expression, which is straightforward enough.
With this approach, adding a Custom format wouldn't cause problems for the Boolean structure: just add

  | CustomFormat Text
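One way the evaluation could look, as a sketch rather than a proposal for the actual types: the atoms are split into a separate hypothetical Atom type (with String instead of Text to keep the example dependency-free), FormatExpr is the expression tree, and matches walks it for a concrete atom.

```haskell
-- Hypothetical, tiny set of basic formats.
data Basic = LaTeX | Docx | Html5
  deriving (Eq, Show)

-- A concrete format: either a known one or a custom name.
data Atom = Known Basic | Custom String
  deriving (Eq, Show)

-- The Boolean expression tree from the comment above.
data FormatExpr
  = Atom Atom
  | And FormatExpr FormatExpr
  | Or FormatExpr FormatExpr
  | Not FormatExpr
  deriving (Eq, Show)

-- Evaluate the expression for one concrete format.
matches :: Atom -> FormatExpr -> Bool
matches f (Atom a)  = f == a
matches f (And p q) = matches f p && matches f q
matches f (Or p q)  = matches f p || matches f q
matches f (Not p)   = not (matches f p)
```

The native rendering of "all but LaTeX and docx" then stays as short as Not (Or (Atom (Known LaTeX)) (Atom (Known Docx))), and a Custom atom needs no special handling under Not.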

@despresc
Contributor Author

In the current approach (from the other PR) with

data Formats = OneOfFormats (Set Format) | NoneOfFormats (Set Format)

something like that would be

not (LaTeX `or` OpenXml) 
  = NoneOfFormats (fromList [KnownFormat Beamer,KnownFormat LaTeX,KnownFormat OpenXml])

(Docx is no longer in KnownFormat). That's still fairly long, I suppose, with the KnownFormat/CustomFormat distinction.

Writing the algebra directly is conceptually simpler too. I'm not sure if it would be slower or faster, since the sets involved will tend to be pretty small.

@despresc
Contributor Author

For Writers.Markdown (and a couple of other writers for formats that support raw elements), one simple thing that might work if Formats is allowed on raw elements is to arrange KnownFormat so that for x, y :: KnownFormat, if x is a super-format of y, then x <= y in the KnownFormat order. That way, if something needs to pick a single format that matches a Formats, it can take the least and be assured that it will have picked a most general format, if not the most general format.

A test could be added to make sure that holds. The fact that KnownFormat wouldn't be alphabetized might possibly be a problem.
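A sketch of that ordering idea and the accompanying test, assuming generic formats such as Html and TeX remain in KnownFormat. The constructor order, superFormats, orderRespectsInclusion and pickTag are all hypothetical; the derived Ord instance follows declaration order, so listing super-formats first gives the property described.

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

-- Constructors are declared so that every super-format precedes its sub-formats.
data KnownFormat = Html | Html4 | Html5 | TeX | LaTeX | Beamer
  deriving (Eq, Ord, Show, Enum, Bounded)

-- Strict super-formats of each format (assumed relations).
superFormats :: KnownFormat -> [KnownFormat]
superFormats Html4  = [Html]
superFormats Html5  = [Html]
superFormats LaTeX  = [TeX]
superFormats Beamer = [TeX, LaTeX]
superFormats _      = []

-- The property a test would check: super-formats sort no later than sub-formats.
orderRespectsInclusion :: Bool
orderRespectsInclusion =
  and [ x <= y | y <- [minBound .. maxBound], x <- superFormats y ]

-- Pick a single tag for a set of formats: the least element, if there is one.
pickTag :: Set KnownFormat -> Maybe KnownFormat
pickTag = Set.lookupMin
```

For a set containing Html, Html4 and Html5, pickTag returns Just Html; for the empty set it returns Nothing, which is the case the fallback question below is about.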

What would be the fallback if a raw element is encountered by Writers.Markdown and it has a Formats that doesn't match any format at all? Maybe just render it as a code block/inline, or decline to render it at all?

@despresc
Contributor Author

Oh, but that would only work if formats like Html were kept in KnownFormat. So maybe that isn't a good solution after all.

A new Format type enumerates exactly what formats Pandoc can recognize
in some way (input, output, raw content).

A new Formats type and related functions can be used in future
IfFormatBlock and IfFormatInline blocks, and are used to implement the
castsTo function that writers can use to determine when they can
include raw content in their output.

The known reader and writer formats are now enumerated separately from
Format, and Format is now much smaller.

The new CustomFormat will allow for filter writers to use raw elements
with custom formats, and to use custom formats in
conditionally-rendered elements. They are invisible to castsToKnown
and knownMatch.

The Formats type had to be modified to accommodate that change, and
additional conversion and convenience functions were added.

jats_articleauthoring is a super-format of jats_publishing. The
listSubformats relation was modified to take this into account.

The new definition of Formats requires a more involved definition of
'and', so new tests ensure that 'and p q' behaves like it should. The
other functions are defined in terms of 'and' and 'not' (which plainly
does what it should), so they should work automatically.

The toKnownFormat and fromKnownFormat functions convert between text
strings and KnownFormats.

The ReaderFormat and WriterFormat types were removed. They can be
added in a module in pandoc.
Successfully merging this pull request may close these issues.

Make Format an enumerated type