Leading spaces in values are automatically stripped? #313

Open
MerlijnWajer opened this issue Oct 10, 2022 · 3 comments


@MerlijnWajer

I ran into a problem parsing this file with xmltodict: https://archive.org/download/janus-34-scan-zapman/janus-34-scan-zapman_files.xml

The value of 'original' has its leading space stripped: it should be ' JANUS 34_Scan Zapman_chocr.html.gz', but it is turned into 'JANUS 34_Scan Zapman_chocr.html.gz'.

This is probably caused by the commit from this issue: #15

Given the above commit, it is not clear to me whether there is any way to keep leading spaces in an element's text. Is there a way to disable this behaviour?

Here's the relevant part from the file linked above:

<file name=" JANUS 34_Scan Zapman_hocr.html" source="derivative">
<hocr_char_to_word_module_version>1.1.0</hocr_char_to_word_module_version>
<hocr_char_to_word_hocr_version>1.1.15</hocr_char_to_word_hocr_version>
<ocr_parameters>-l fra</ocr_parameters>
<ocr_module_version>0.0.18</ocr_module_version>
<ocr_detected_script>Latin</ocr_detected_script>
<ocr_detected_script_conf>0.4311</ocr_detected_script_conf>
<ocr_detected_lang>fr</ocr_detected_lang>
<ocr_detected_lang_conf>1.0000</ocr_detected_lang_conf>
<format>hOCR</format>
<original> JANUS 34_Scan Zapman_chocr.html.gz</original>
<mtime>1664638619</mtime>
<size>2140105</size>
<md5>1596964e7b6e5aee5e6faedc6d3cb47b</md5>
<crc32>b0c6226b</crc32>
<sha1>07eca05572e97b5abb66fcba4252956ada5f7b10</sha1>
</file>
@MerlijnWajer

Ah, I think I figured out the solution: parse accepts a strip_whitespace=False keyword argument. It just wasn't documented.
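
For anyone landing here, a minimal sketch of the difference, assuming xmltodict is installed and that parse accepts strip_whitespace as described in this thread. The XML string is a cut-down, single-line version of the excerpt above, written without whitespace between tags so that only the leading space inside <original> is affected.

```python
import xmltodict

# Cut-down version of the <file> element from the excerpt above.
# No whitespace between tags, so the leading space inside <original>
# is the only whitespace the parser has to decide about.
xml = "<file><original> JANUS 34_Scan Zapman_chocr.html.gz</original></file>"

# Default behaviour: text values are stripped.
stripped = xmltodict.parse(xml)

# With strip_whitespace=False the text is kept as-is.
preserved = xmltodict.parse(xml, strip_whitespace=False)

print(repr(stripped["file"]["original"]))
print(repr(preserved["file"]["original"]))
```

One thing to watch for: on pretty-printed input, turning stripping off may also keep the newlines and indentation between child elements as text on the parent element, so the resulting dict can look different from the default output.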

@MerlijnWajer

Leaving this issue open for any discussion about the default behaviour. Silently truncating data seems problematic to me.

@bartolootrit

@MerlijnWajer Thank you for bringing this up.

Repository owner deleted a comment from javadev Jan 15, 2023