classification
Title: 2 XML parsing errors
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.3

process
Status: closed
Resolution: duplicate
Dependencies:
Superseder:
Assigned To:
Nosy List: peerjanssen
Priority: normal
Keywords:

Created on 2004-11-26 18:03 by peerjanssen, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (2)
msg23323 - Author: Peer Janssen (peerjanssen) Date: 2004-11-26 18:03
In an XML document generated by Trados Translator's
Workbench (a TMX V 1.1 Translation Memory), the Unicode
characters U+0001 (START OF HEADING, see
http://www.fileformat.info/info/unicode/char/0001/index.htm)
and U+201A (SINGLE LOW-9 QUOTATION MARK, see
http://www.fileformat.info/info/unicode/char/201a/index.htm)
produce errors when the file is parsed with
"xml.dom.minidom".

The first one (0001) produces this output:

Traceback (most recent call last):
  File "G:\_Prog\TMworks\domtree.py", line 7, in ?
    dom=parse(tm)
  File "C:\Python23\lib\xml\dom\minidom.py", line 1919, in parse
    return expatbuilder.parse(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 420, column 106

The second one (201A) produces this output:

Traceback (most recent call last):
  File "G:\_Prog\TMworks\domtree.py", line 7, in ?
    dom=parse(tm)
  File "C:\Python23\lib\xml\dom\minidom.py", line 1919, in parse
    return expatbuilder.parse(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python23\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 624, column 2

Deleting every occurrence of these two characters from the
document makes the parse succeed.

I don't see why these characters should cause any
problem, especially the quotation mark.
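
The first error appears consistent with the XML 1.0 specification, whose
Char production excludes U+0001, so a conforming parser is expected to
reject it. U+201A, by contrast, is a legal XML character, so the second
error probably has another cause (an encoding mismatch in the file is one
guess; the traceback does not confirm it). The following is only a minimal
sketch, assuming Python 3 rather than the Python 2.3 used above;
strip_illegal_xml_chars and parses_cleanly are hypothetical helper names,
and the sample strings are made up, not taken from the TMX file.

```python
import re
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# C0 control characters other than tab, LF and CR, plus U+FFFE and U+FFFF.
# These fall outside the XML 1.0 "Char" production. (Lone surrogates are
# also excluded by the spec but are not handled in this sketch.)
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def strip_illegal_xml_chars(text):
    """Return text with the characters matched above removed."""
    return _ILLEGAL_XML_CHARS.sub("", text)


def parses_cleanly(text):
    """Return True if minidom can parse text (encoded as UTF-8)."""
    try:
        parseString(text.encode("utf-8"))
        return True
    except ExpatError:
        return False


# Hypothetical sample documents, one per character from the report.
with_control = "<seg>a\x01b</seg>"    # contains U+0001
with_quote = "<seg>a\u201ab</seg>"    # contains U+201A

print(parses_cleanly(with_control))                           # False
print(parses_cleanly(with_quote))                             # True
print(parses_cleanly(strip_illegal_xml_chars(with_control)))  # True
```

Stripping the control characters before parsing is a workaround, not a fix
for whatever tool wrote them into the file.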
msg23324 - Author: Peer Janssen (peerjanssen) Date: 2004-11-27 14:05

This is a duplicate. I filed the same report again as

http://sourceforge.net/tracker/index.php?func=detail&aid=1074200&group_id=5470&atid=105470

because I could not find this one in the list after submitting
it. That was my mistake: I was new to the bug tracker
interface. Sorry.

I am closing this report as a duplicate and leaving the
other one open, because its title is more precise.
History
Date                 User         Action  Args
2022-04-11 14:56:08  admin        set     github: 41232
2004-11-26 18:03:04  peerjanssen  create