Write new Format and Formats types, some helper functions #78
base: master
Actually, if The reason is that in those elements, the roles in That means that this ifFormatBlock Html content would probably be interpreted as saying "render ifFormatBlock (Html `except` Html5) content to mean "render But this can be fixed. |
bfbc5fc to d9f023a
I updated the request. Now the idea is that if a writer for a concrete format
    TeX `matches` TeX = True
    TeX `matches` LaTeX = False
    LaTeX `matches` TeX = True
    LaTeX `matches` LaTeX = True
    LaTeX `matches` Beamer = False
    LaTeX `matches` (LaTeX `or` Html) = True
    LaTeX `matches` (TeX `except` Beamer) = True
    LaTeX `matches` (TeX `except` Markdown) = True
    LaTeX `matches` (LaTeX `and` Html) = False
    LaTeX `matches` not Html = True
There is also
    x `castsTo` Html5 -- see if the raw element can be included directly
    x `castsTo` LaTeX -- see if the raw element represents supported math
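For readers following along, the truth table above can be reproduced with a small set-based model. This is only a sketch, not the code in the pull request: the six-constructor `Format` type, the `subformats` hierarchy, and the combinator names `orF`, `andF`, `exceptF`, and `notF` (renamed to avoid clashing with the Prelude) are all assumptions made for illustration.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

-- A deliberately tiny, hypothetical slice of the format enumeration.
data Format = TeX | LaTeX | Beamer | Html | Html5 | Markdown
  deriving (Eq, Ord, Show, Enum, Bounded)

-- A pattern is modelled as the set of concrete formats it accepts.
type Formats = Set Format

-- Assumed hierarchy: a format together with its more specific sub-formats.
subformats :: Format -> Formats
subformats TeX   = Set.fromList [TeX, LaTeX, Beamer]
subformats LaTeX = Set.fromList [LaTeX, Beamer]
subformats Html  = Set.fromList [Html, Html5]
subformats f     = Set.singleton f

-- Pattern combinators are plain set operations.
orF, andF, exceptF :: Formats -> Formats -> Formats
orF     = Set.union
andF    = Set.intersection
exceptF = Set.difference

-- Complement relative to the (finite) universe of known formats.
notF :: Formats -> Formats
notF = Set.difference (Set.fromList [minBound .. maxBound])

-- A concrete format matches a pattern when the pattern's set contains it.
matches :: Format -> Formats -> Bool
matches = Set.member
```

In this model, ``LaTeX `matches` subformats TeX`` holds while ``TeX `matches` subformats LaTeX`` does not, which is the asymmetry the table describes.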
d9f023a to fecbdda
I like the concept here. The thing that is giving me pause is that some of these formats are really just aliases for another format with a particular bundle of extensions. E.g., So extension bundles that happen to have aliases get counted as Formats, but Formats can't represent arbitrary extension packages. That seems a bit odd to me. |
Maybe we don't have to solve the aliases/extensions problem in here. Not all of pandoc's format handling has to rely on the formats defined here. In jgm/pandoc#5118, my approach was to distinguish between |
That should be handled to some extent, if I understand correctly. Right now
    -- exactly = Formats . Set.fromList
    listSubformats Markdown = exactly
      [Markdown, MarkdownGithub, MarkdownMmd, MarkdownPhpExtra, MarkdownStrict]
    listSubformats MarkdownPhpExtra = exactly
      [Markdown, MarkdownGithub, MarkdownMmd, MarkdownPhpExtra, MarkdownStrict]
and
    Markdown `matches` MarkdownGithub = True
    Markdown `matches` (MarkdownGithub `except` MarkdownPhpExtra) = False
I was conservative and didn't include relationships I didn't see in
Oh, if you meant that |
Yes, I understand that the One approach would be not to include things like |
I understand now. It is a little awkward that this can't express things like The only reason they're there is that right now |
I think we can work things out so that we don't need to check for e.g. Anyway, just to record what I think are the three alternatives:
I'm leaning towards 2, but I'm not really sure.
(2) should also be fine as an alternative to what's done in this request, since it's nearly what's done now, unless preserving users' ability to use those variant I was leaning toward keeping |
If that were the case, then there would certainly need to be a couple of types for enumerating the reader and writer formats. There could even be a total The only other odd use of
    data Format = KnownFormat KnownFormat | CustomFormat Text
    data KnownFormat = {- what Format is now -}
like was previously proposed (the writers would simply drop a
Sorry, it stores an identifier in it, not raw HTML.
The following
I mention this because they aren't recognized as |
That comment of mine could have been clearer: the tag sets are a partial order, in that |
Interesting @tarleb, so there is a diamond in the This latest commit separates out these types from
    data ReaderFormat
    data KnownWriterFormat
    data WriterFormat = KnownWriterFormat KnownWriterFormat | CustomLuaWriter FilePath
I've kept
There's clearly a lot of overlap between |
The
    data KnownFormat = {- what's there now -}
    data Format = KnownFormat KnownFormat | CustomFormat Text
to accommodate
Edit: the type in the other pull request would work, I think:
    data Formats = OneOf (Set Format) | NoneOf (Set Format)
Apologies, it seems that last info of mine was wrong. I tried to remember which tags are supported by the JATS authoring set, but not the publishing tag set, but couldn't find any. Apparently I misremembered. In fact, Wikipedia says this about the authoring tag set: "Formally this model a subset of the Publishing model."
9add1b7 to 80f9bd7
The reader and writer enumeration types don't need to be in this package, if they stay split from |
Sorry that these commits are a bit messy. I can tidy them up at the end if you'd like.
That should be option (2) finished. No changes to |
25ec7d9 to 9d26c55
I don't think I understood this comment. I would have thought that the Markdown writer would render the thing if the set of formats included either Html4 or Html5. Why would it be necessary to check for both? A finite atomic boolean algebra (sets of atomic formats) would be the simplest representation to work with if we could manage it.
I think this is a good idea; we currently use strings in the
If it's just one loose end, then I think it would be good to figure out whether we can write it another way.
I wasn't sure about the upshot of the discussion with @tarleb. Do we need these different jats formats to stand in an inclusion relation or not? (I would have thought that someone who explicitly writes Just looking for the simplest thing that works! |
Agreed, finding the simplest solution is best. Just to lay out why I gave the new
You can have a So the markdown writer would have to check the different Having the The |
This seems a good principle.
I thought of a possible drawback of the use of Set to represent these things -- not sure how serious it is.
    data Format =
        BasicFormat
      | And Format Format
      | Or Format Format
      | Not Format
Then we could just write
    | CustomFormat Text
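To make the "write the algebra directly" idea concrete, here is a hedged sketch of how such a formula type could be evaluated against the format a writer is producing. The `KnownFormat` constructors, the `Target` type, the `satisfies` function, and the use of `String` for custom names are all hypothetical; the comment above only proposes the shape of the `Format` type.

```haskell
-- Hypothetical, cut-down set of atomic formats.
data KnownFormat = LaTeX | OpenXml | Html5
  deriving (Eq, Show)

-- A format expression: atoms combined with boolean connectives.
data Format
  = BasicFormat KnownFormat
  | CustomFormat String
  | And Format Format
  | Or Format Format
  | Not Format
  deriving (Eq, Show)

-- What a writer is actually producing: always an atom, never a formula.
data Target = KnownTarget KnownFormat | CustomTarget String
  deriving (Eq, Show)

-- Evaluate the formula by structural recursion; no sets are built.
satisfies :: Target -> Format -> Bool
satisfies (KnownTarget k)  (BasicFormat k')  = k == k'
satisfies (CustomTarget c) (CustomFormat c') = c == c'
satisfies t (And p q) = satisfies t p && satisfies t q
satisfies t (Or p q)  = satisfies t p || satisfies t q
satisfies t (Not p)   = not (satisfies t p)
satisfies _ _         = False  -- a known atom never equals a custom one
```

One consequence of this design is that `Not` needs no universe of formats to complement against, so custom format names cost nothing extra. The trade-off is that two equivalent formulas are no longer structurally equal.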
In the current approach (from the other PR) with
    data Formats = OneOfFormats (Set Format) | NoneOfFormats (Set Format)
something like that would be
    not (LaTeX `or` OpenXml)
      = NoneOfFormats (fromList [KnownFormat Beamer,KnownFormat LaTeX,KnownFormat OpenXml])
Writing the algebra directly is conceptually simpler too. I'm not sure if it would be slower or faster, since the sets involved will tend to be pretty small.
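For comparison, the set-based representation can be sketched as follows. This is an illustration under assumptions, not the code from either pull request: the four `KnownFormat` constructors, the names `notF`, `andF`, `orF`, and `member`, and the use of `String` in place of `Text` for custom formats are all made up here. It does follow the commit message's description that the other operations are derived from 'and' and 'not'.

```haskell
import qualified Data.Set as Set
import Data.Set (Set)

-- Hypothetical, cut-down enumeration of known formats.
data KnownFormat = LaTeX | Beamer | OpenXml | Html5
  deriving (Eq, Ord, Show)

-- String stands in for Text to keep the sketch dependency-free.
data Format = KnownFormat KnownFormat | CustomFormat String
  deriving (Eq, Ord, Show)

-- Either a finite set of accepted formats, or everything except a finite set.
data Formats = OneOfFormats (Set Format) | NoneOfFormats (Set Format)
  deriving (Eq, Show)

-- Complement just flips the constructor.
notF :: Formats -> Formats
notF (OneOfFormats s)  = NoneOfFormats s
notF (NoneOfFormats s) = OneOfFormats s

-- Intersection needs a case for each combination of constructors.
andF :: Formats -> Formats -> Formats
andF (OneOfFormats a)  (OneOfFormats b)  = OneOfFormats (Set.intersection a b)
andF (OneOfFormats a)  (NoneOfFormats b) = OneOfFormats (Set.difference a b)
andF (NoneOfFormats a) (OneOfFormats b)  = OneOfFormats (Set.difference b a)
andF (NoneOfFormats a) (NoneOfFormats b) = NoneOfFormats (Set.union a b)

-- Union via De Morgan, defined in terms of andF and notF.
orF :: Formats -> Formats -> Formats
orF p q = notF (notF p `andF` notF q)

member :: Format -> Formats -> Bool
member f (OneOfFormats s)  = Set.member f s
member f (NoneOfFormats s) = Set.notMember f s
```

With `latex = OneOfFormats (Set.fromList [KnownFormat LaTeX, KnownFormat Beamer])` and `openxml = OneOfFormats (Set.singleton (KnownFormat OpenXml))`, the expression ``notF (latex `orF` openxml)`` reduces to a `NoneOfFormats` over those three formats, which mirrors the example above. The `NoneOfFormats` constructor is what lets the complement stay finite even though custom format names are unbounded.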
For A test could be added to make sure that holds. The fact that What would be the fallback if a raw element is encountered by |
Oh, but that would only work if formats like |
A new Format type enumerates exactly what formats Pandoc can recognize in some way (input, output, raw content). A new Formats type and related functions can be used in future IfFormatBlock and IfFormatInline blocks, and are used to implement the castsTo function that writers can use to determine when they can include raw content in their output.
The known reader and writer formats are now enumerated separately from Format, and Format is now much smaller.
The new CustomFormat will allow for filter writers to use raw elements with custom formats, and to use custom formats in conditionally-rendered elements. They are invisible to castsToKnown and knownMatch. The Formats type had to be modified to accommodate that change, and additional conversion and convenience functions were added. jats_articleauthoring is a super-format of jats_publishing. The listSubformats relation was modified to take this into account.
The new definition of Formats requires a more involved definition of 'and', so new tests ensure that 'and p q' behaves like it should. The other functions are defined in terms of 'and' and 'not' (which plainly does what it should), so they should work automatically.
The toKnownFormat and fromKnownFormat functions convert between text strings and KnownFormats. The ReaderFormat and WriterFormat types were removed. They can be added in a module in pandoc.
9d26c55 to 862f957
Edit: The examples of Formats matching below are now wrong, and listMatches is now listSubformats (see this comment).
An initial Format and Formats design (discussed a little here and in the documentation below for the new Text.Pandoc.Format module) to resolve jgm/pandoc#547.
The Format type is straightforward. It just enumerates all of the input and output formats that pandoc currently recognizes, as well as the openxml format that the manual says to use instead of docx. These happen to be nearly all of the Format strings that pandoc deals with, but there are two others that I found (and aren't in the new Format yet):
- noteref (used in a RawInline internally in Readers.HTML)
- doc (used to trigger an "unknown reader" error in App.FormatHeuristics)
The Formats type and the matches function are intended to capture the fuzziness in matching particular formats, so that the writers can figure out when to include raw content:
The matches work this way because the writers are normally faced with something like RawBlock f str, know what they're writing (more or less), and want to know if str should be included. They can do this by asking if f `matches` LaTeX or f `matches` Html5 (the HTML writer would do both when writing Html5, the former to see if supported math should be included). More general Formats patterns can also be constructed:
This behaviour is defined in listMatches, which lists all the formats that will match the given format, i.e. it lists all of the super-formats (more general formats) of the given format. I tried to base listMatches on what pandoc currently does, so all of the Markdown* formats are equivalent to each other, Epub3 is a sub-format of Html and Html5, and so on. But that might be too conservative:
- commonmark* formats aren't recognized by any writer. Should these be equivalent (or otherwise related) to markdown*?
- docbook4 and docbook5 formats aren't recognized. Are these sub-formats of docbook?
- jats_* formats aren't recognized. I think they form a chain of sub-formats, all more specific than jats?
- asciidoctor format is not recognized. It might be equivalent to asciidoc?
There might be other cases I missed.
I haven't written it yet, but there might need to be a function that handles formats in the --to sense, as being possible aliases for a particular concrete format.