Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tidy HTML breaks result #88

Open
nachitox opened this issue May 29, 2023 · 2 comments · May be fixed by #150
Open

Tidy HTML breaks result #88

nachitox opened this issue May 29, 2023 · 2 comments · May be fixed by #150

Comments

@nachitox
Copy link

On version 0.11.6 when I run:

markdownify("<h2>\n\tHeadline 2 on new line\n</h2>", heading_style="ATX")

I expect this:

## Headline 2 on new line\n\n

but I get this:

'## \n Headline 2 on new line\n\n'

which I believe is a bug.

@5yato4ok
Copy link
Contributor

Yes, I believe a tiny fix should remove the bug. #89

@mirabilos
Copy link

mirabilos commented May 30, 2023

Possibly a duplicate of #31 (and see my comment there, maybe the code snippet helps?) Even untidy HTML has… issues with whitespace.

jsm28 added a commit to jsm28/python-markdownify that referenced this issue Oct 3, 2024
This fixes various issues relating to how input whitespace is handled
and how wrapping handles whitespace resulting from hard line breaks.

This PR uses a branch based on that for matthewwithanm#120 to avoid conflicts with
the fixes and associated test changes there.  My suggestion is thus
first to merge matthewwithanm#120 (which fixes two open issues), then to merge the
remaining changes from this PR.

Wrapping paragraphs has the effect of losing all newlines including
those from `<br>` tags, contrary to HTML semantics (wrapping should be
a matter of pretty-printing the output; input whitespace from the HTML
input should be normalized, but `<br>` should remain as a hard line
break).  To fix this, we need to wrap the portions of a paragraph
between hard line breaks separately.  For this to work, ensure that
when wrapping, all input whitespace is normalized at an early stage,
including turning newlines into spaces.  (Only ASCII whitespace is
handled this way; `\s` is not used as it's not clear Unicode
whitespace should get such normalization.)

When not wrapping, there is still too much input whitespace
preservation.  If the input contains a blank line, that ends up as a
paragraph break in the output, or breaks the header formatting when
appearing in a header tag, though in terms of HTML semantics such a
blank line is no different from a space.  In the case of an ATX
header, even a single newline appearing in the output breaks the
Markdown.  Thus, when not wrapping, arrange for input whitespace
containing at least one `\r` or `\n` to be normalized to a single
newline, and in the ATX header case, normalize to a space.

Fixes matthewwithanm#130
(probably, not sure exactly what the HTML input there is)

Fixes matthewwithanm#88
(a related case, anyway; the actual input in matthewwithanm#88 has already been fixed)
@jsm28 jsm28 linked a pull request Oct 3, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants