[osmosis-dev] osmosis wrongly claims to see UTF8 problem

Frederik Ramm
Hi,

   I have recently had osmosis in --rri mode refuse to apply an update
it had downloaded from OSM, claiming there was a UTF-8 error in the
file. I looked and looked, but the file was fine: it passed both UTF-8
and XML validity checks.
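
A UTF-8 validity check of this kind can be repeated independently of Xerces with the JDK's own strict decoder. The following is only a sketch for illustration, not something taken from the thread; the class name `Utf8Check` and its method are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    /** Returns true if the given bytes decode as strict UTF-8. */
    public static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Reads the whole file into memory, which is fine for a change file of this size.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + ": " + (isValidUtf8(data) ? "valid UTF-8" : "INVALID UTF-8"));
    }
}
```

If this reports the file as valid while the parser still rejects it, that points towards the parser rather than the data.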

I tried to isolate the line that gave me the "error" but isolating it
made the problem go away. The error only occurs when the 583379
preceding lines are included.

So I now have two .osc files, one with 583380 lines and one with 583379
lines:

$ wc -l x.osc y.osc
   583380 x.osc
   583379 y.osc

Their only difference is one line at the beginning of the longer file:

$ diff x.osc y.osc
2d1
<     <node id="4585086821" version="1" timestamp="2017-01-02T09:18:33Z"
uid="72020" user="Petr1868" changeset="44840247" lat="49.9957035"
lon="14.2460943"/>

But the longer one fails to process in osmosis, and the shorter one works:

$ osmosis --read-xml-change x.osc --write-null-change
Jan 11, 2017 10:19:41 AM org.openstreetmap.osmosis.core.Osmosis run
INFO: Osmosis Version 0.43.1
...
SEVERE: Thread for task 1-read-xml-change failed
org.openstreetmap.osmosis.core.OsmosisRuntimeException: Unable to parse
xml file x.osc.  publicId=(null), systemId=(null), lineNumber=583379,
columnNumber=90.
        at
org.openstreetmap.osmosis.xml.v0_6.XmlChangeReader.run(XmlChangeReader.java:114)

$ osmosis --read-xml-change y.osc --write-null-change
Jan 11, 2017 10:20:34 AM org.openstreetmap.osmosis.core.Osmosis run
INFO: Osmosis Version 0.43.1
...
Jan 11, 2017 10:20:35 AM org.openstreetmap.osmosis.core.Osmosis run
INFO: Total execution time: 1448 milliseconds.

Since the line which supposedly contains the "error" is identical in
both files, it can't really be an error (and the line does not contain
any non-ASCII characters).

Re-formatting the XML file with "xmlstarlet fo" or "xmlstarlet c14n"
makes the problem go away.

I've reproduced this bug on different machines with different Osmosis
versions. I've tried these java versions with identical results:

$ java -showversion
java version "1.7.0_121"
OpenJDK Runtime Environment (IcedTea 2.6.8) (7u121-2.6.8-1ubuntu0.14.04.1)

$ java -showversion
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)

I have uploaded the two .osc files here:

http://www.remote.org/frederik/tmp/osmosis-bug-try-read-xml-change-write-null-change-on-these-files-which-differ-only-by-one-line.zip

I'd be interested in any insights anyone has to share.

Bye
Frederik

--
Frederik Ramm  ##  eMail [hidden email]  ##  N49°00'09" E008°23'33"

_______________________________________________
osmosis-dev mailing list
[hidden email]
https://lists.openstreetmap.org/listinfo/osmosis-dev
Re: [osmosis-dev] osmosis wrongly claims to see UTF8 problem

Frederik Ramm
Hi,

On 01/11/2017 10:30 AM, Frederik Ramm wrote:
> SEVERE: Thread for task 1-read-xml-change failed

I was a bit over-eager in shortening the stack trace. Full detail:

org.openstreetmap.osmosis.core.OsmosisRuntimeException: Unable to parse
xml file x.osc.  publicId=(null), systemId=(null), lineNumber=583379,
columnNumber=90.
        at
org.openstreetmap.osmosis.xml.v0_6.XmlChangeReader.run(XmlChangeReader.java:114)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.xml.sax.SAXParseException; lineNumber: 583379;
columnNumber: 90; Invalid byte 2 of 4-byte UTF-8 sequence.
        at
org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown
Source)
        at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
Source)
        at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
        at
org.openstreetmap.osmosis.xml.v0_6.XmlChangeReader.run(XmlChangeReader.java:109)
        ... 1 more
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException:
Invalid byte 2 of 4-byte UTF-8 sequence.
        at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanLiteral(Unknown Source)
        at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
        at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanAttribute(Unknown
Source)
        at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown
Source)
        ... 11 more

--
Frederik Ramm  ##  eMail [hidden email]  ##  N49°00'09" E008°23'33"

Re: [osmosis-dev] osmosis wrongly claims to see UTF8 problem

Brett Henderson
If the file is valid then perhaps it's a bug in the Xerces parser bundled with Osmosis.  The JDK version you use shouldn't matter because I don't use its XML parser (Java bundles an ancient version of Xerces with more serious Unicode bugs).

I don't have any suggestions other than to check if there's a later version of Xerces available.  To change it, modify the following file:

Change this line:
dependencyVersionXerces=2.9.1

I see I added the following comment above that line, which explains why I haven't upgraded it yet.

# Remaining on 2.9.1 instead of 2.10.0 for now because the newer version
# depends on org.w3c.dom.ElementTraversal which is not being transitively
# included. This could be possibly be fixed by including a newer version
# of xml-apis but this hasn't been verified.

Perhaps it's currently using the JDK version of xml-apis, but we may need to explicitly include a later version of that as well.  Ugh.  As an aside, I think Java 9 is supposed to be fixing some of this bundled dependency mess and allowing a newer XML library to be included.
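
When swapping Xerces versions, it can help to confirm which SAX parser implementation the JVM actually picks up. The sketch below uses only standard JAXP calls and is run with the same classpath as Osmosis; the class name `WhichParser` is made up for this example and is not part of Osmosis.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class WhichParser {
    /** Describes the SAX parser class that JAXP selects and where it was loaded from. */
    public static String describe() throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Class<?> cls = parser.getClass();
        // Classes loaded by the JDK itself may have no code source at all.
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        String location = (src == null || src.getLocation() == null)
                ? "(no code source, probably the JDK's built-in parser)"
                : src.getLocation().toString();
        return cls.getName() + " from " + location;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe());
    }
}
```

A class name starting with `org.apache.xerces` would indicate the bundled Xerces, matching the stack trace above, while a `com.sun.org.apache.xerces.internal` name would indicate the JDK's internal copy.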

I'd offer to help but I just don't have time.  Osmosis isn't getting much love from me any more :-(

On Wed, 11 Jan 2017 at 20:33 Frederik Ramm <[hidden email]> wrote:

<snip>
Re: [osmosis-dev] osmosis wrongly claims to see UTF8 problem

Frederik Ramm
Brett,

   thank you for your comment. The issue is not an urgent one for me
since workarounds exist, and on the many osmosis-based OSM updating
machines I've been running continuously for years, this is only the
second time I have run into it. So it is a rare quirk, but of course I
would feel better if I knew where it came from.

I've re-built osmosis with Xerces 2.11.0 and this doesn't change the
situation.

Should I perhaps try and build a minimal "use Xerces to parse this XML
file" program, and if I can replicate the problem with that, file a bug
with Xerces? Or is the way in which Osmosis uses Xerces somehow special
so that a simple program like that would be very unlikely to trigger the
bug?

Bye
Frederik

--
Frederik Ramm  ##  eMail [hidden email]  ##  N49°00'09" E008°23'33"

Re: [osmosis-dev] osmosis wrongly claims to see UTF8 problem

Brett Henderson
Oops, lost this in my inbox :-(

On Thu, 12 Jan 2017 at 19:22 Frederik Ramm <[hidden email]> wrote:

<snip>
> I've re-built osmosis with Xerces 2.11.0 and this doesn't change the
> situation.
>
> Should I perhaps try and build a minimal "use Xerces to parse this XML
> file" program, and if I can replicate the problem with that, file a bug
> with Xerces? Or is the way in which Osmosis uses Xerces somehow special
> so that a simple program like that would be very unlikely to trigger the
> bug?

I think it'd be a great place to start and think it *should* trigger the bug.  But I'm not sure what we'd do about it :-)  Osmosis doesn't do anything special that I can think of.  It just uses the standard Java mechanisms to invoke XML parsing.
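
A minimal reproduction along these lines could look like the following. It is a sketch only: it assumes that calling `SAXParser.parse(InputStream, DefaultHandler)` through the standard JAXP factory is close enough to what Osmosis does (the stack trace above does go through `javax.xml.parsers.SAXParser.parse`), and the class name `MinimalSaxParse` is hypothetical. To exercise the bundled Xerces rather than the JDK's internal parser, the Xerces jar would need to be on the classpath.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class MinimalSaxParse {
    /** Parses the stream with the default JAXP SAX parser and returns the number of start tags seen. */
    public static long countElements(InputStream in) throws Exception {
        final long[] count = {0};
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // A parse failure surfaces as an uncaught SAXParseException with line and column.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println("Parsed OK, elements: " + countElements(in));
        }
    }
}
```

Running it once on x.osc and once on y.osc would show whether the failure reproduces outside Osmosis.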

One possible thing to try would be to use the XML parser used in the "fast" XML processor.  It uses XML stream parsing as opposed to SAX parsing (i.e. pull vs. push processing).
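
For comparison, a pull-style read using the standard StAX API might be sketched as below. This illustrates the pull approach in general and is not the code of the Osmosis "fast" reader; the class name `MinimalStaxParse` is hypothetical, and which StAX implementation gets used depends on the classpath.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class MinimalStaxParse {
    /** Pulls events from the stream one by one and returns the number of start tags seen. */
    public static long countElements(InputStream in) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        long count = 0;
        try {
            // The caller drives the parser (pull), instead of receiving callbacks (push).
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println("Parsed OK, elements: " + countElements(in));
        }
    }
}
```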

Brett
