Write new Format and Formats types, some helper functions #78

despresc · 2020-09-04T16:01:50Z

Edit: The examples of Formats matching below are now wrong, and listMatches is now listSubformats (see this comment).

An initial Format and Formats design (discussed a little here and in the documentation below for the new Text.Pandoc.Format module) to resolve jgm/pandoc#547.

The Format type is straightforward. It just enumerates all of the input and output formats that pandoc currently recognizes, as well as the openxml format that the manual says to use instead of docx. These happen to be nearly all of the Format strings that pandoc deals with, but there are two others that I found (and that aren't in the new Format yet):

noteref (used in a RawInline internally in Readers.HTML)
doc (used to trigger an "unknown reader" error in App.FormatHeuristics)

The Formats type and the matches function are intended to capture the fuzziness in matching particular formats, so that the writers can figure out when to include raw content:

-- It is intended that x `matches` y when a format x can be included in a context with format y
-- without other modification, more or less. This is how Format is used by writers, not how
-- it is used in the --to option (where Html is an alias for Html5).

-- TeX content can always be included in a TeX or LaTeX context
TeX `matches` TeX = True
TeX `matches` LaTeX = True

-- LaTeX can't always be included in a TeX context
LaTeX `matches` TeX = False

-- Replacing (TeX, LaTeX) with (Html, Html5) in the above also works

The matches work this way because the writers are normally faced with something like RawBlock f str, know what they're writing (more or less), and want to know if str should be included. They can do this by asking if f `matches` LaTeX or f `matches` Html5 (the HTML writer would do both when writing Html5, the former to see if supported math should be included). More general Formats patterns can also be constructed:

TeX `matches` (LaTeX `or` Html) = True
LaTeX `matches` (LaTeX `except` Beamer) = False
LaTeX `matches` (Beamer `except` TeX) = True
TeX `matches` (LaTeX `and` Html) = False
LaTeX `matches` not Html = True

This behaviour is defined in listMatches, which lists all the formats that will match the given format, i.e. it lists all of the super-formats (more general formats) of the given format. I tried to base listMatches on what pandoc currently does, so all of the Markdown* formats are equivalent to each other, Epub3 is a sub-format of Html and Html5, and so on. But that might be too conservative:

The commonmark* formats aren't recognized by any writer. Should these be equivalent (or otherwise related) to markdown*?
The docbook4 and docbook5 formats aren't recognized. Are these sub-formats of docbook?
The jats_* formats aren't recognized. I think they form a chain of sub-formats, all more specific than jats?
The asciidoctor format is not recognized. It might be equivalent to asciidoc?

There might be other cases I missed.

I haven't written it yet, but there might need to be a

toConcreteFormat :: Format -> Format
toConcreteFormat Html = Html5
toConcreteFormat Html5 = Html5
-- etc.

function that handles formats in the --to sense, as being possible aliases for a particular concrete format.

despresc · 2020-09-04T16:24:57Z

Actually, if Formats is going to be used in an IfFormatBlock or an IfFormatInline element, then the Boolean operations and match should probably be written slightly differently (the super-format strategy in listMatches is still fine, at least).

The reason is that in those elements, the roles in matches are reversed. Instead of being faced with an unknown Format and wanting to know if it matches our concrete one, the writers are given an unknown Formats pattern and have to match their concrete format against it.

That means that this

ifFormatBlock Html content

would probably be interpreted as saying "render content in any Html-based format". Right now, it means "render content in any context that is pure Html", so content would never be rendered, since Html is not an output format. Even if pattern evaluation were changed for these blocks, people would also expect

ifFormatBlock (Html `except` Html5) content

to mean "render content in any Html-based format except Html5 ones", when right now it means "render content in any Html context that can't be included in an Html5 context", which happens to be no contexts at all.

But this can be fixed.

despresc · 2020-09-04T18:53:01Z

I updated the request. Now the idea is that if a writer for a concrete format f like Html5 encounters an IfFormatBlock p content (once that block exists), it will render content when f `matches` p. So the following are true:

TeX `matches` TeX = True
TeX `matches` LaTeX = False
LaTeX `matches` TeX = True
LaTeX `matches` LaTeX = True
LaTeX `matches` Beamer = False
LaTeX `matches` (LaTeX `or` Html) = True
LaTeX `matches` (TeX `except` Beamer) = True
LaTeX `matches` (TeX `except` Markdown) = True
LaTeX `matches` (LaTeX `and` Html) = False
LaTeX `matches` not Html = True

There is also a castsTo function that writers can use to test raw elements with unknown format x, where x `castsTo` f if x can be included directly into f without modification. For example, the Html5 writer might look at the following:

x `castsTo` Html5 -- see if the raw element can be included directly
x `castsTo` LaTeX -- see if the raw element represents supported math

The listMatches function (which listed the super-formats of a given Format) is now listSubformats (and lists the sub-formats of a given Format) to handle these changes, meaning that matches is now the inverse relation to what it was before: x `matches` y when x is a subformat of y :: Format.
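The updated relations can be sketched roughly as follows. This is only an illustration under assumptions: the Format type here is a tiny stand-in for the one in the PR, the sub-format table is reduced to the tex and html families, and the combinator names orF, andF and exceptF are hypothetical (chosen to avoid clashing with the Prelude). A bare Format on the right-hand side of the examples above is assumed to stand for listSubformats of that format.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

-- Hypothetical, much-reduced stand-in for the PR's Format type.
data Format = TeX | LaTeX | Beamer | ConTeXt | Html | Html4 | Html5
  deriving (Show, Eq, Ord, Enum, Bounded)

-- A pattern is modelled as a plain set of formats.
newtype Formats = Formats (Set Format)
  deriving (Show, Eq)

exactly :: [Format] -> Formats
exactly = Formats . Set.fromList

-- Every format is a sub-format of itself; the chains follow
-- tex -> latex -> beamer, tex -> context, html -> html4/html5.
listSubformats :: Format -> Formats
listSubformats TeX   = exactly [TeX, LaTeX, Beamer, ConTeXt]
listSubformats LaTeX = exactly [LaTeX, Beamer]
listSubformats Html  = exactly [Html, Html4, Html5]
listSubformats f     = exactly [f]

-- Hypothetical names for the Boolean combinators.
orF, andF, exceptF :: Formats -> Formats -> Formats
orF     (Formats a) (Formats b) = Formats (Set.union a b)
andF    (Formats a) (Formats b) = Formats (Set.intersection a b)
exceptF (Formats a) (Formats b) = Formats (Set.difference a b)

-- A writer's concrete format matches a pattern when it is in the set.
matches :: Format -> Formats -> Bool
matches f (Formats s) = f `Set.member` s

-- Raw content tagged x can go into output f when f is a sub-format of x.
castsTo :: Format -> Format -> Bool
castsTo x f = f `matches` listSubformats x

main :: IO ()
main = do
  let sub = listSubformats
  print (LaTeX `matches` sub TeX)                         -- True
  print (TeX `matches` sub LaTeX)                         -- False
  print (LaTeX `matches` (sub TeX `exceptF` sub Beamer))  -- True
  print (LaTeX `matches` (sub LaTeX `andF` sub Html))     -- False
  print (TeX `castsTo` LaTeX)                             -- True
  print (LaTeX `castsTo` TeX)                             -- False
```

With this reading, castsTo is just matches with the roles of the two arguments swapped, which is the reversal described earlier in the thread.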

jgm · 2020-09-05T18:59:30Z

I like the concept here. The thing that is giving me pause is that some of these formats are really just aliases for another format with a particular bundle of extensions. E.g., asciidoc and asciidoctor; commonmark and commonmark_x; markdown and markdown_phpextra.

So extension bundles that happen to have aliases get counted as Formats, but Formats can't represent arbitrary extension packages. That seems a bit odd to me.

tarleb · 2020-09-05T19:59:04Z

Maybe we don't have to solve the aliases/extensions problem in here. Not all of pandoc's format handling has to rely on the formats defined here. In jgm/pandoc#5118, my approach was to distinguish between KnownFormats (i.e., Format in this PR) and other formats like "flavored formats" (format combined with extensions) and IOFormat (a Format or a Lua script path).

despresc · 2020-09-05T20:01:52Z

That should be handled to some extent, if I understand correctly. Right now listSubformats x will give a Formats set of all the formats that x can be directly included in. So

-- exactly = Formats . Set.fromList
listSubformats Markdown = exactly
  [Markdown, MarkdownGithub, MarkdownMmd, MarkdownPhpExtra, MarkdownStrict]
listSubformats MarkdownPhpExtra = exactly
  [Markdown, MarkdownGithub, MarkdownMmd, MarkdownPhpExtra, MarkdownStrict]

and castsTo is implemented with listSubformats, so, say, MarkdownGithub `castsTo` MarkdownPhpExtra = True. This also works in the combinators, so

Markdown `matches` MarkdownGithub = True
Markdown `matches` (MarkdownGithub `except` MarkdownPhpExtra) = False

I was conservative and didn't include relationships I didn't see in pandoc, but asciidoc and asciidoctor could be considered equivalent by listSubformats like this. So could commonmark and commonmark_x.

despresc · 2020-09-05T20:06:50Z

Oh, if you meant that Format includes more things than there are readers and writers individually, then yes, separate types could work there.

jgm · 2020-09-05T21:30:46Z

That should be handled to some extent,

Yes, I understand that the matches relation allows "subtypes" of formats. What seems awkward to me is that, as I think of it, markdown_phpextra is just markdown + extensions W, X, Y, and Z. So it seems strange that you can represent this in a Format but you can't represent arbitrary combinations, e.g. markdown + W and Y.

One approach would be not to include things like markdown_phpextra or markdown_x or asciidoctor in the list of formats. (This would correspond to the way I've been thinking of it all along, though it isn't enforced since this is just "stringly typed.")

despresc · 2020-09-05T22:36:53Z

I understand now. It is a little awkward that this can't express things like markdown+old_dashes. The types here could be broadened to include that, but things would start to get unwieldy.

The only reason they're there is that right now pandoc does understand some of them as valid RawBlock formats. This was the example I saw. But many formats don't have this kind of detection (asciidoctor isn't recognized in a format block). So that could be an outlier, and should be removed.

jgm · 2020-09-05T22:42:32Z

I think we can work things out so that we don't need to check for e.g. markdown_github.
(Part of this would mean parsing format specifiers with raw_attribute so that we can include the core Format even if they use an alias.)

Anyway, just to record what I think are the three alternatives:

Include a Format constructor for every format name recognized by pandoc, including aliases like markdown_github.
Only include constructors for the "core" types that aren't just aliases for packages of extensions, and perhaps for some generic types (TeX, HTML) that include several core types (though we may not actually need these any more if we have sets of Formats).
Make the Format specifiers expressive enough that you can include packages of extensions. (But this opens up many new cans of worms, e.g. how to define inclusion relations among formats that include different packages of extensions: should base+A+B be considered a subtype of base+A?)

I'm leaning towards 2, but I'm not really sure.

despresc · 2020-09-05T23:01:36Z

(2) should also be fine as an alternative to what's done in this request, since it's nearly what's done now, unless preserving users' ability to use those variant markdown names is important. They wouldn't actually be useful otherwise, because the Markdown writer never knows what markdown variant it's writing at the moment, so it would never discriminate on the variant markdown types. That holds for the other extension-implying variants too.

I was leaning toward keeping Format in RawBlock, and formats like Html, because Writers.Markdown still needs to render the format in a RawBlock. It could check that, e.g., the Formats is exactly [Html4, Html5] and know to render Html, but I'm not sure if that's wise. If those types of formats were kept, it could become Formats and the writer could just pick one to render, but I'm also not sure if that's wise.

despresc · 2020-09-05T23:12:50Z

If that were the case, then there would certainly need to be a couple of types for enumerating the reader and writer formats. There could even be a total ReaderFormat -> Reader m function in Readers, and an analogue in writers.

The only other odd use of Format is here, where the HTML reader uses the "noteref" format to store raw HTML for later use. That may be refactorable, but we could also have

data Format = KnownFormat KnownFormat | CustomFormat Text
data KnownFormat = {- what Format is now -}

as was previously proposed (the writers would simply drop a CustomFormat block).

despresc · 2020-09-05T23:18:26Z

Sorry, it stores an identifier in it, not raw HTML.

despresc · 2020-09-06T03:46:47Z

The following Format elements could be kept if we go with option (2), since their writers know they're writing them and they're not identical to other formats:

jats_archiving, jats_publishing, jats_articleauthoring (increasingly restrictive subformats of jats according to this comment)
muse
commonmark (is it a super-format of markdown?)
docbook4 and docbook5, subformats of docbook
the html slides formats (dzslides and the others), probably subformats of html and either html4 or html5 (not sure if any are related to the others)

I mention this because they aren't recognized as Format strings now.

tarleb · 2020-09-06T08:56:06Z

jats_archiving, jats_publishing, jats_articleauthoring (increasingly restrictive subformats of jats according to this comment)

That comment of mine could have been clearer: the tag sets are a partial order, in that jats_archiving ≻ jats_publishing and jats_archiving ≻ jats_articleauthoring, but there is no ordering relation between jats_publishing and jats_articleauthoring.

despresc · 2020-09-06T14:26:53Z

Interesting @tarleb, so there is a diamond in the Format partial order after all.

This latest commit separates out these types from Format:

data ReaderFormat
data KnownWriterFormat
data WriterFormat = KnownWriterFormat KnownWriterFormat | CustomLuaWriter FilePath

I've kept Formats that writers can distinguish (so epub* is still in, which would represent a little extra discrimination from what's done now), and the formats in this comment are still in, with those sub-format relations. (Except for the fixed jats relations).

despresc · 2020-09-06T14:30:16Z

There's clearly a lot of overlap between ReaderFormat and KnownWriterFormat, so those could be made into a single type, at the cost of having extra formats for Readers and Writers.

despresc · 2020-09-06T15:20:24Z

The Format could be changed to

data KnownFormat = {- what's there now -}
data Format = KnownFormat KnownFormat | CustomFormat Text

to accommodate Readers.HTML and filter writers. Then the Formats type would have to become more complex to remain a Boolean algebra, because Format would no longer be finite, so the not (Formats s) = Formats $ anyFormat \\ s definition couldn't be used.

Edit: the type in the other pull request would work, I think:

data Formats = OneOf (Set Format) | NoneOf (Set Format)
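A minimal sketch of how the OneOf/NoneOf representation could stay closed under the Boolean operations even though Format is no longer finite. The function names notF, andF and orF are hypothetical, String stands in for Text to keep the example dependency-free, and the KnownFormat constructors are just a small sample.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

data KnownFormat = Html | LaTeX | Beamer | OpenXml
  deriving (Show, Eq, Ord, Enum, Bounded)

-- String is used here instead of Text purely for illustration.
data Format = KnownFormat KnownFormat | CustomFormat String
  deriving (Show, Eq, Ord)

-- OneOf s denotes the finite set s; NoneOf s denotes its complement.
data Formats = OneOf (Set Format) | NoneOf (Set Format)
  deriving (Show, Eq)

-- Complement just flips the tag, so no enumeration of all formats is needed.
notF :: Formats -> Formats
notF (OneOf s)  = NoneOf s
notF (NoneOf s) = OneOf s

-- Intersection needs a case per combination of finite and cofinite sets.
andF :: Formats -> Formats -> Formats
andF (OneOf a)  (OneOf b)  = OneOf (Set.intersection a b)
andF (OneOf a)  (NoneOf b) = OneOf (Set.difference a b)
andF (NoneOf a) (OneOf b)  = OneOf (Set.difference b a)
andF (NoneOf a) (NoneOf b) = NoneOf (Set.union a b)

-- Union follows from De Morgan's law.
orF :: Formats -> Formats -> Formats
orF p q = notF (notF p `andF` notF q)

matches :: Format -> Formats -> Bool
matches f (OneOf s)  = f `Set.member` s
matches f (NoneOf s) = f `Set.notMember` s

main :: IO ()
main = do
  let p = notF (OneOf (Set.fromList [KnownFormat LaTeX, KnownFormat OpenXml]))
  print (KnownFormat Html `matches` p)     -- True
  print (KnownFormat LaTeX `matches` p)    -- False
  print (CustomFormat "mine" `matches` p)  -- True
```

Because a finite set and a cofinite set can never denote the same collection of formats, the derived Eq instance compares patterns by meaning rather than only by shape.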

tarleb · 2020-09-06T21:05:56Z

Apologies, it seems that last info of mine was wrong. I tried to remember which tags are supported by the JATS authoring set, but not the publishing tag set, but couldn't find any. Apparently I misremembered. In fact, Wikipedia says this about the authoring tag set: "Formally this model a subset of the Publishing model."

despresc · 2020-09-07T02:44:02Z

The reader and writer enumeration types don't need to be in this package, if they stay split from Format. I think it would require a version bump whenever an extension-implying variant output were added, though that might not happen all that often. It could definitely be put somewhere in pandoc (maybe in its own module, since it would need to be used by Extensions).

despresc · 2020-09-07T17:51:58Z

Sorry that these commits are a bit messy. I can tidy them up at the end if you'd like.

despresc · 2020-09-08T00:48:05Z

That should be option (2) finished. No changes to Definition yet.

jgm · 2020-09-10T16:35:51Z

I was leaning toward keeping Format in RawBlock, and formats like Html, because Writers.Markdown still needs to render the format in a RawBlock. It could check that, e.g., the Formats is exactly [Html4, Html5] and know to render Html, but I'm not sure if that's wise.

I don't think I understood this comment. I would have thought that the Markdown writer would render the thing if the set of formats included either Html4 or Html5. Why would it be necessary to check for both? A finite atomic boolean algebra (sets of atomic formats) would be the simplest representation to work with if we could manage it.

If that were the case, then there would certainly need to be a couple of types for enumerating the reader and writer formats. There could even be a total ReaderFormat -> Reader m function in Readers, and an analogue in writers.

I think this is a good idea; we currently use strings in the readers and writers lookup tables, but a total function would be nice. (Of course we'd still need something that associates strings with Formats in both directions.)
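To illustrate what a total function buys over a string-keyed table, here is a toy sketch. Everything in it is hypothetical: the ReaderFormat constructors are a small sample, and Reader is a placeholder type, not pandoc's actual reader type.

```haskell
-- Hypothetical enumeration of reader formats.
data ReaderFormat = ReadMarkdown | ReadHtml | ReadLaTeX
  deriving (Show, Eq, Enum, Bounded)

-- Placeholder: a real reader would produce a Pandoc document in some monad.
type Reader = String -> String

-- Total: every constructor has a case, and with -Wincomplete-patterns
-- GHC warns if a newly added constructor is not handled here.
getReader :: ReaderFormat -> Reader
getReader ReadMarkdown = \src -> "markdown reader saw: " ++ src
getReader ReadHtml     = \src -> "html reader saw: " ++ src
getReader ReadLaTeX    = \src -> "latex reader saw: " ++ src

main :: IO ()
main = mapM_ (\f -> putStrLn (getReader f "input")) [minBound .. maxBound]
```

A string-keyed lookup, by contrast, has to return a Maybe (or fail) for names that are not in the table, so the "unknown reader" case moves from the lookup to the point where strings are parsed into ReaderFormat.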

The only other odd use of Format is here, where the HTML reader uses the "noteref" format to store raw HTML for later use. That may be refactorable, but we could also have

If it's just one loose end, then I think it would be good to figure out whether we can write it another way.

JATS tags:

I wasn't sure about the upshot of the discussion with @tarleb. Do we need these different jats formats to stand in an inclusion relation or not? (I would have thought that someone who explicitly writes jats_XXX in a raw attribute wants it to be rendered in just that output format; they could always write jats if they want to be indiscriminate.)

Just looking for the simplest thing that works!

despresc · 2020-09-10T19:36:02Z

Agreed, finding the simplest solution is best.

Just to lay out why I gave the new Format this complexity (and to repeat things I'm sure you know), I'm operating under a model of Format based on the html and tex family of format strings. In pandoc now, there are currently the automatic inclusions

tex -> latex -> beamer
tex -> context
html -> html4
html -> html5

You can have a latex raw element and a beamer raw element, and both will be included in beamer output, but only the latex one will be included in latex output. (And neither will be included in context output).

So the markdown writer would have to check the different html4 and html5 cases, I think, because it needs to know whether to render the raw element format tag as {=html}, {=html4}, or {=html5}. Otherwise information gets lost: an html5 raw element would get an html format tag, which would signal it can be included in html4, something that doesn't currently happen. Or an html tag would be inappropriately specialized.
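A rough sketch of the check described above, assuming raw elements carry a set of atomic formats and that generic formats such as Html are not atoms themselves. The function name rawAttributeTag and the reduced Format type are hypothetical.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

-- Hypothetical atomic formats only; no generic Html or TeX constructor.
data Format = Html4 | Html5 | LaTeX | Beamer
  deriving (Show, Eq, Ord)

-- Recover the raw-attribute tag that corresponds exactly to a set of
-- formats. Comparing whole sets avoids widening html5 to html (which
-- would also admit html4) and avoids narrowing html to one variant.
rawAttributeTag :: Set Format -> Maybe String
rawAttributeTag s
  | s == Set.fromList [Html4, Html5]  = Just "{=html}"
  | s == Set.singleton Html4          = Just "{=html4}"
  | s == Set.singleton Html5          = Just "{=html5}"
  | s == Set.fromList [LaTeX, Beamer] = Just "{=latex}"
  | s == Set.singleton Beamer         = Just "{=beamer}"
  | otherwise                         = Nothing

main :: IO ()
main = do
  print (rawAttributeTag (Set.fromList [Html4, Html5]))  -- Just "{=html}"
  print (rawAttributeTag (Set.singleton Html5))          -- Just "{=html5}"
  print (rawAttributeTag (Set.fromList [Html5, LaTeX]))  -- Nothing
```

The Nothing case is where a single tag cannot express the set, which is the situation the thread leaves open.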

Having the jats_, epub, and HTML slide formats in Format is an extension of Pandoc's current behaviour. It doesn't actually recognize any of them right now. To be consistent with the tex formats, I kept in Format all of the formats that are distinct enough that the writers can tell if they're writing them (so more than just Pandoc extension variants), and gave them implicit inclusion behaviours with listSubformats that are analogous to the latex -> beamer inclusion behaviour. That's why the jats_publishing -> jats_archiving inclusion is intended to happen, since the publishing tag set is a subset of the archiving tag set. But these implicit inclusions (or all of these formats) can be removed without any loss in current Pandoc behaviour, and it would make Format simpler (if a little more inconsistent in what it can express, in my opinion).

The noteref thing can probably be replaced with a Span with a suitable class or key-value pair. If allowing that kind of storage behaviour (in pandoc or a filter) isn't that important, then having Format be just an enumeration (instead of KnownFormat KnownFormat | CustomFormat Text) would be simpler.

jgm · 2020-09-11T01:53:40Z

I kept in Format all of the formats that are distinct enough that the writers can tell if they're writing them (so more than just Pandoc extension variants)

This seems a good principle.

jgm · 2020-09-11T01:59:51Z

I thought of a possible drawback of the use of Set to represent these things -- not sure how serious it is.
In native output, it's going to look very hairy if someone specifies e.g. "all but LaTeX and docx". It will render as a giant list, and it won't be perspicuous which ones are left out.
An alternative would be to make the format an algebraic data type, e.g.

data Format =
    BasicFormat
  | And Format Format
  | Or Format Format
  | Not Format

Then we could just write Not (Or LaTeX Docx). The matches function would have to evaluate this expression, which is straightforward enough.
With this approach, adding a Custom format wouldn't cause problems for the Boolean structure: just add

  | CustomFormat Text
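Evaluating such an expression could look roughly like this. It is a sketch only: the type names BasicFormat and FormatExpr are hypothetical, the list of basic formats is a small sample, and String stands in for Text.

```haskell
-- Hypothetical sample of atomic formats.
data BasicFormat = LaTeX | Docx | Html5
  deriving (Show, Eq)

-- The pattern language: atoms plus Boolean connectives.
data FormatExpr
  = Basic BasicFormat
  | Custom String
  | And FormatExpr FormatExpr
  | Or FormatExpr FormatExpr
  | Not FormatExpr
  deriving (Show, Eq)

-- Evaluate a pattern against the concrete format a writer is producing.
-- A custom format never equals a built-in writer's format.
matches :: BasicFormat -> FormatExpr -> Bool
matches f (Basic b)  = f == b
matches _ (Custom _) = False
matches f (And p q)  = matches f p && matches f q
matches f (Or p q)   = matches f p || matches f q
matches f (Not p)    = not (matches f p)

main :: IO ()
main = do
  let allButLatexAndDocx = Not (Or (Basic LaTeX) (Basic Docx))
  print allButLatexAndDocx                    -- stays short in Show output
  print (Html5 `matches` allButLatexAndDocx)  -- True
  print (LaTeX `matches` allButLatexAndDocx)  -- False
```

The expression is stored as written, so native output shows the exclusion directly instead of an enumerated set.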

despresc · 2020-09-11T13:58:26Z

In the current approach (from the other PR) with

data Formats = OneOfFormats (Set Format) | NoneOfFormats (Set Format)

something like that would be

not (LaTeX `or` OpenXml) 
  = NoneOfFormats (fromList [KnownFormat Beamer,KnownFormat LaTeX,KnownFormat OpenXml])

(Docx is no longer in KnownFormat). That's still fairly long, I suppose, with the KnownFormat/CustomFormat distinction.

Writing the algebra directly is conceptually simpler too. I'm not sure if it would be slower or faster, since the sets involved will tend to be pretty small.

despresc · 2020-09-11T14:46:57Z

For Writers.Markdown (and a couple of other writers for formats that support raw elements), one simple thing that might work if Formats is allowed on raw elements is to arrange KnownFormat so that for x, y :: KnownFormat, if x is a super-format of y, then x <= y in the KnownFormat order. That way, if something needs to pick a single format that matches a Formats, it can take the least and be assured that it will have picked a most general format, if not the most general format.

A test could be added to make sure that holds. The fact that KnownFormat wouldn't be alphabetized might possibly be a problem.
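One way the ordering idea and its test might look, as a sketch. It assumes generic formats such as TeX and Html remain in KnownFormat; the names subformats, superFormatsComeFirst and mostGeneral are hypothetical.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

-- Constructor order is chosen by hand so that super-formats come first;
-- the derived Ord instance then reflects that order.
data KnownFormat = TeX | LaTeX | Beamer | ConTeXt | Html | Html4 | Html5
  deriving (Show, Eq, Ord, Enum, Bounded)

-- Reduced sub-format table (each format is a sub-format of itself).
subformats :: KnownFormat -> [KnownFormat]
subformats TeX   = [TeX, LaTeX, Beamer, ConTeXt]
subformats LaTeX = [LaTeX, Beamer]
subformats Html  = [Html, Html4, Html5]
subformats f     = [f]

-- The property to test: if x is a super-format of y, then x <= y.
superFormatsComeFirst :: Bool
superFormatsComeFirst =
  and [ x <= y | x <- [minBound .. maxBound], y <- subformats x ]

-- Picking the least element yields a most general format in the set,
-- or Nothing when the set is empty.
mostGeneral :: Set KnownFormat -> Maybe KnownFormat
mostGeneral = Set.lookupMin

main :: IO ()
main = do
  print superFormatsComeFirst                          -- True
  print (mostGeneral (Set.fromList [Beamer, LaTeX]))   -- Just LaTeX
  print (mostGeneral Set.empty)                        -- Nothing
```

Reordering the constructors alphabetically would break the property for this table (for instance, Beamer would sort before LaTeX), which is the concern raised above.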

What would be the fallback if a raw element is encountered by Writers.Markdown and it has a Formats that doesn't match any format at all? Maybe just render it as a code block/inline, or decline to render it at all?

despresc · 2020-09-11T15:33:24Z

Oh, but that would only work if formats like Html were kept in KnownFormat. So maybe that isn't a good solution after all.

A new Format type enumerates exactly what formats Pandoc can recognize in some way (input, output, raw content). A new Formats type and related functions can be used in future IfFormatBlock and IfFormatInline blocks, and are used to implement the castsTo function that writers can use to determine when they can include raw content in their output.

The known reader and writer formats are now enumerated separately from Format, and Format is now much smaller.

The new CustomFormat will allow for filter writers to use raw elements with custom formats, and to use custom formats in conditionally-rendered elements. They are invisible to castsToKnown and knownMatch. The Formats type had to be modified to accommodate that change, and additional conversion and convenience functions were added. jats_articleauthoring is a super-format of jats_publishing. The listSubformats relation was modified to take this into account.

The new definition of Formats requires a more involved definition of 'and', so new tests ensure that 'and p q' behaves like it should. The other functions are defined in terms of 'and' and 'not' (which plainly does what it should), so they should work automatically.

The toKnownFormat and fromKnownFormat functions convert between text strings and KnownFormats. The ReaderFormat and WriterFormat types were removed. They can be added in a module in pandoc.
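For illustration, a case-insensitive conversion in both directions might be sketched like this. The signatures are assumptions (the thread does not show them), String stands in for Text, and the constructor list is a small sample.

```haskell
import Data.Char (toLower)
import qualified Data.Map as Map

-- Hypothetical sample of known formats.
data KnownFormat = Html4 | Html5 | LaTeX | JatsArchiving
  deriving (Show, Eq, Ord, Enum, Bounded)

-- Canonical lower-case name of each format.
fromKnownFormat :: KnownFormat -> String
fromKnownFormat Html4         = "html4"
fromKnownFormat Html5         = "html5"
fromKnownFormat LaTeX         = "latex"
fromKnownFormat JatsArchiving = "jats_archiving"

-- Reverse table derived from fromKnownFormat, so the two directions
-- cannot drift apart.
nameTable :: Map.Map String KnownFormat
nameTable =
  Map.fromList [ (fromKnownFormat f, f) | f <- [minBound .. maxBound] ]

-- Lower-case the input before lookup so that matching ignores case.
toKnownFormat :: String -> Maybe KnownFormat
toKnownFormat s = Map.lookup (map toLower s) nameTable

main :: IO ()
main = do
  print (toKnownFormat "LaTeX")     -- Just LaTeX
  print (toKnownFormat "HTML5")     -- Just Html5
  print (toKnownFormat "noteref")   -- Nothing
```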

despresc force-pushed the enumerate-format branch from bfbc5fc to d9f023a Compare September 4, 2020 18:36

despresc force-pushed the enumerate-format branch from d9f023a to fecbdda Compare September 4, 2020 18:59

despresc mentioned this pull request Sep 4, 2020

Make Format an enumerated type jgm/pandoc#547

Open

despresc force-pushed the enumerate-format branch from 9add1b7 to 80f9bd7 Compare September 6, 2020 21:42

despresc force-pushed the enumerate-format branch from 25ec7d9 to 9d26c55 Compare September 9, 2020 17:27

despresc added 7 commits September 18, 2020 15:28

Separate the ReaderFormat, WriterFormat types from Format

4894324


Fix test relating to irrelevance of custom formats in patterns

4fb4bb9

Add toKnownFormat, fromKnownFormat, remove Reader/Writer formats

1a3099c


Modify toKnownFormat to ignore case

862f957

despresc force-pushed the enumerate-format branch from 9d26c55 to 862f957 Compare September 18, 2020 21:56

This was referenced Jun 22, 2022

Extensions cannot be used with custom writers jgm/pandoc#8120

Closed

Use ADT to represent input formats jgm/pandoc#5118

Closed
