TYP files and character encoding

Ticker Berkin
Hi

A couple of problems with typ-files and unicode.

With 'Codepage=65001', the labels in the mapnik.typ that is included
with the composite map end up as Unicode (UTF-8), but if the map is
codepage 1252, the bytes with the top bit set (i.e. the non-ASCII
characters) are simply displayed as if they were 1252.
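
To illustrate the first problem, here is a minimal sketch of what happens when UTF-8 label bytes are interpreted as code page 1252. The class and method names are made up for this example and are not part of mkgmap.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not mkgmap code.
final class MojibakeDemo {
    /** Encodes the text as UTF-8, then decodes those bytes as if they were windows-1252. */
    static String utf8ReadAs1252(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // "Caf\u00e9" is "Cafe" with an acute accent on the e. The accented
        // character becomes two bytes in UTF-8, and each byte is then shown
        // as a separate 1252 character.
        System.out.println(utf8ReadAs1252("Caf\u00e9"));
    }
}
```

Plain ASCII survives this round trip unchanged, which is why only the accented characters look wrong on the device.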

Removing the codepage statement from mapnik.txt and making fixes
elsewhere to ensure that the file is read correctly as utf-8 and then
generating a map with --code-page=1252, it gives the error:

SEVE: uk.me.parabola.imgfmt.MapFailedException
 ../svn/trunk/resources/typ-files/mapnik.txt:
 (thrown in TypCompiler.makeMap())
 TYP file cannot be written in code page 1252

Changing the exception handling in imgfmt/app/typ/TypElement.java, so
that makeLabelBlock() reads as
...
    CharBuffer cb = CharBuffer.wrap(tl.getText());
    try {
        // encode() throws if any character has no mapping in the target charset
        ByteBuffer buffer = encoder.encode(cb);
        out.put((byte) tl.getLang());
        out.put(buffer);
        out.put((byte) 0);
    } catch (CharacterCodingException ignore) {
        // Nothing has been written for this label yet, so it is simply skipped
//      ignore.printStackTrace();
        String name = encoder.charset().name();
        System.out.println("Cannot represent String=" +
            tl.getLang() + "," + tl.getText() +
            " in CodePage=" + name);
//      throw new TypLabelException(name);
    }
...

It gives output like:
Cannot represent String=21,Gara|e in CodePage=windows-1252
Cannot represent String=21,Obszar przemysBowy in CodePage=windows-1252
Cannot represent String=21,ZieleD in CodePage=windows-1252
Cannot represent String=21,Zaro[la in CodePage=windows-1252
Cannot represent String=21,MokradBa in CodePage=windows-1252
Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Zcie|ka rowerowa in CodePage=windows-1252
Cannot represent String=21,Wybrze|e in CodePage=windows-1252
Cannot represent String=21,Zcie|ka in CodePage=windows-1252
Cannot represent String=21,StrumieD in CodePage=windows-1252
Cannot represent String=21,Granica paDstwa in CodePage=windows-1252
Cannot represent String=21,Rzeka, KanaB in CodePage=windows-1252
Cannot represent String=21,StrumieD in CodePage=windows-1252
Cannot represent String=21,Ruroci^Eg in CodePage=windows-1252
Cannot represent String=21,Kabel wysokiego napi^Ycia in CodePage=windows-1252
Cannot represent String=21,Tor wy[cigowy in CodePage=windows-1252
Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Droga krajowa (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in CodePage=windows-1252
Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows-1252
Cannot represent String=21,Restauracja (AmerykaDska) in CodePage=windows-1252
Cannot represent String=21,Restauracja (ChiDska) in CodePage=windows-1252
Cannot represent String=21,Restauracja (Mi^Ydzynarodowa) in CodePage=windows-1252
Cannot represent String=21,Restauracja (WBoska) in CodePage=windows-1252
Cannot represent String=21,Restauracja (MeksykaDska) in CodePage=windows-1252
Cannot represent String=21,Restauracja (P^Eczki) in CodePage=windows-1252
Cannot represent String=21,Restauracja (WegetariaDska) in CodePage=windows-1252
Cannot represent String=21,Kr^Ygle in CodePage=windows-1252
Cannot represent String=21,Sklep odzie|owy in CodePage=windows-1252
Cannot represent String=21,Wypo|yczalnia samochod\363w in CodePage=windows-1252
Cannot represent String=21,Gara| in CodePage=windows-1252
Cannot represent String=21,Sprzeda| samochod\363w in CodePage=windows-1252
Cannot represent String=21,Sklep |eglarski in CodePage=windows-1252
Cannot represent String=21,S^Ed in CodePage=windows-1252
Cannot represent String=21,O[rodek kultury in CodePage=windows-1252
Cannot represent String=21,Wi^Yzienie in CodePage=windows-1252
Cannot represent String=21,Stra| po|arna in CodePage=windows-1252
Cannot represent String=21,SBupek in CodePage=windows-1252
Cannot represent String=21,PrzystaD in CodePage=windows-1252
Cannot represent String=21,L^Edowisko helikopterowe in CodePage=windows-1252
Cannot represent String=21,Wie|a in CodePage=windows-1252
Cannot represent String=21,yr\363dBo in CodePage=windows-1252
Cannot represent String=21,Pla|a in CodePage=windows-1252
Cannot represent String=21,Przyl^Edek in CodePage=windows-1252
Cannot represent String=21,SkaBa in CodePage=windows-1252

This makes sense, since codepage 1252 cannot represent Polish (language
code hex 0x15, decimal 21).
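
The same check can be reproduced outside mkgmap. This is only a sketch: the class and method names are invented for illustration, and it relies solely on the standard java.nio.charset API, whose encoder reports unmappable characters by default.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not mkgmap code.
final class Cp1252Check {
    /** Returns true if every character of text can be encoded in the named charset. */
    static boolean encodable(String text, String charsetName) {
        // A fresh encoder uses CodingErrorAction.REPORT, so encode() throws
        // rather than silently substituting '?' for unmappable characters.
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        try {
            encoder.encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Polish "Gara\u017c" contains z-with-dot-above, which 1252 lacks.
        System.out.println(encodable("Gara\u017c", "windows-1252"));
        // French "Caf\u00e9" only needs e-acute, which 1252 has.
        System.out.println(encodable("Caf\u00e9", "windows-1252"));
    }
}
```

Note that not every Polish letter fails: o-acute exists in 1252, so a label is rejected only if it contains at least one of the letters that 1252 lacks.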

NB: the non-ASCII characters in the output above were mangled by my cutting and pasting.

Checking the French, on my Garmin device, the type descriptions now display accents correctly.

Ticker

_______________________________________________
mkgmap-dev mailing list
[hidden email]
http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
Re: TYP files and character encoding

Gerd Petermann
Hi Ticker,

I think I understand now why we didn't have a default typ file ;)
If I got that right, I should revert the changes in r4395, and mkgmap should either refuse or warn loudly when a typ file with a different codepage is merged?
Or should we force the usage of unicode codepage?
Or is it possible to compile mapnik.txt with cp 1252 (or any other) in a way that only those lines which contain non-matching characters are ignored?

Gerd


________________________________________
From: mkgmap-dev <[hidden email]> on behalf of Ticker Berkin <[hidden email]>
Sent: Wednesday, 18 December 2019 19:46
To: mkgmap development
Subject: [mkgmap-dev] TYP files and character encoding

Re: TYP files and character encoding

Ticker Berkin
Hi Gerd

I think it is best to continue with the ideas for typ-files that:

1/ they can be in any character set and we just need a better way of
working out the correct one - see my posting earlier today.

2/ it can include as many languages as anyone can be bothered to add,
and so has to be in a character set that allows those languages to be
added, implying Unicode as the common one (more particularly, UTF-8)

3/ the codepage= statement should be redundant and ignored for
controlling the output character set, which should be taken from the
map, but its use for determining the input encoding might need to be
kept for a while for compatibility.

4/ the messages my hack generates should be turned into 1 warning or
information message per language, or maybe suppressed altogether. If
someone is generating a map with a character set that doesn't support a
particular language, they really won't care that the labels for
languages that can't be represented in their character set won't be
there.
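
Point 4/ could look roughly like the sketch below. Everything here is an assumption made for illustration: the Label class stands in for whatever mkgmap uses internally, the message wording is invented, and only the language code 0x15 for Polish is taken from the output earlier in the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not mkgmap code.
final class LabelSkipSummary {
    /** Stand-in for a TYP label: a language code plus the label text. */
    static final class Label {
        final int lang;
        final String text;

        Label(int lang, String text) {
            this.lang = lang;
            this.text = text;
        }
    }

    /** Counts, per language code, the labels that cannot be encoded in the charset. */
    static Map<Integer, Integer> countUnencodable(List<Label> labels, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        Map<Integer, Integer> skipped = new TreeMap<>();
        for (Label label : labels) {
            if (!encoder.canEncode(label.text)) {
                skipped.merge(label.lang, 1, Integer::sum);
            }
        }
        return skipped;
    }

    /** Builds one message per affected language instead of one per label. */
    static List<String> warnings(Map<Integer, Integer> skipped, Charset charset) {
        List<String> messages = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : skipped.entrySet()) {
            messages.add(String.format("Language 0x%02x: %d label(s) cannot be encoded in %s",
                    e.getKey(), e.getValue(), charset.name()));
        }
        return messages;
    }
}
```

With the Polish labels above and a 1252 map, this would produce a single line for language 0x15 rather than one line per label.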

Ticker

On Wed, 2019-12-18 at 19:08 +0000, Gerd Petermann wrote:

Re: TYP files and character encoding

Ticker Berkin
Hi Gerd

Attached is a patch that:

Doesn't use the 'CodePage=' command in the typ-file to determine output
character encoding of the typ-file, rather it uses the main map
encoding from the --code-page argument.

log.warn's any typ labels that can't be encoded in the --code-page,
rather than just giving up with message like:
> TYP file cannot be written in code page 1252

The message:
> WARNING: SortCode in TYP txt file different from command line setting
that was written directly to System.out is changed to a log.warn, and it
shouldn't happen anyway now.

For the moment, the 'CodePage=' command in the typ-file is, under some
circumstances, used to determine the encoding of the typ-file itself
and I've left this alone for compatibility with existing usage.
Sometime in January I'll provide a better method for this.

Ticker


On Wed, 2019-12-18 at 19:54 +0000, Ticker Berkin wrote:


Attachment: typCodePage.patch (6K)
Re: TYP files and character encoding

Ticker Berkin
Hi Gerd

I've updated this patch with changes to TypCompiler CharsetProbe:

1/ looks for a Unicode BOM in various encodings near the start of the file.
2/ looks for a line containing "-*- coding: charset -*-" near the start
of the file.
3/ retains the check for "CodePage=" coding for compatibility.
4/ in the absence of the above, sets the reading charset to utf-8 if
the file is valid utf-8, otherwise to Cp1252.
5/ fixes the bad character message from the scanner to say what the
charset really is rather than saying "utf-8" regardless.
6/ removes the logic that checks whether String... lines, read in the
charset it is currently trying, can be encoded in the presumed output
CodePage.
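
For readers without the patch to hand, the detection order in 1/ to 4/ can be sketched roughly as follows. This is not the patched CharsetProbe. The class name, the regular expressions, the 1024-byte window, the BOM check at offset 0 only, and the mapping of a CodePage number to a "windows-NNNN" charset name are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the probing order, not the mkgmap implementation.
final class CharsetProbeSketch {
    private static final Pattern CODING =
            Pattern.compile("-\\*-\\s*coding:\\s*([A-Za-z0-9][A-Za-z0-9._:+-]*)\\s*-\\*-");
    private static final Pattern CODEPAGE =
            Pattern.compile("(?im)^\\s*CodePage\\s*=\\s*(\\d+)");

    static Charset probe(byte[] data) {
        // 1/ byte order marks (this sketch only checks the very first bytes)
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // Decode the head as ISO-8859-1 so that every byte maps to some char.
        String head = new String(data, 0, Math.min(data.length, 1024),
                StandardCharsets.ISO_8859_1);

        // 2/ editor-style "-*- coding: charset -*-" line
        Matcher m = CODING.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }

        // 3/ legacy "CodePage=" statement (name mapping is an assumption)
        m = CODEPAGE.matcher(head);
        if (m.find()) {
            String cp = m.group(1);
            String name = cp.equals("65001") ? "UTF-8" : "windows-" + cp;
            if (Charset.isSupported(name)) return Charset.forName(name);
        }

        // 4/ fall back to utf-8 if the bytes are valid utf-8, else Cp1252
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : Charset.forName("windows-1252");
    }

    static boolean isValidUtf8(byte[] data) {
        try {
            // A new decoder reports malformed input by throwing.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The ordering matters: explicit markers (BOM, coding line, CodePage=) win over the utf-8 validity guess, so existing files that declare their encoding keep working.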

The final result of this patch should be that:

a/ No existing usage is broken
b/ 2 methods to indicate the charset/encoding of the file that are
commonly used by text editors can be used and are taken notice of.
Previously, just the UTF-8 BOM was detected.
c/ Typ files can, and should from now on, be written in utf-8
d/ labels for languages not supported in the --code-page of the output
img just generate a warning in mkgmap.log.x

Ticker


On Sat, 2019-12-21 at 16:11 +0000, Ticker Berkin wrote:

> Hi Gerd
>
> Attached is a patch that:
>
> Doesn't use the 'CodePage=' command in the typ-file to determine
> output
> character encoding of the typ-file, rather it uses the main map
> encoding from the --code-page argument.
>
> log.warn's any typ labels that can't be encoded in the --code-page,
> rather than just giving up with message like:
> > TYP file cannot be written in code page 1252
>
> The message:
> > WARNING: SortCode in TYP txt file different from command line
> > setting
> that was written direct to system.out is changed to a log.warn and it
> shouldn't happen anyway now
>
> For the moment, the 'CodePage=' command in the typ-file is, under
> some
> circumstances, used to determine the encoding of the typ-file itself
> and I've left this alone for compatibility with existing useage.
> Sometime in January I'll provide a better method for this
>  
> Ticker
>
>
> On Wed, 2019-12-18 at 19:54 +0000, Ticker Berkin wrote:
> > Hi Gerd
> >
> > I think it is best to continue with the ideas for typ-files that:
> >
> > 1/ they can be in any character set and we just need a better way
> > of
> > working out the correct one - see my posting earlier today.
> >
> > 2/ it can include as many languages as anyone can be bothered to
> > add,
> > and so has to be an a character set that allows the languages to be
> > added, implying unicode for a common one (more particulary, UTF-8)
> >
> > 3/ the codepage= statement should be redundant and ignored for
> > controlling the output character set, which should be taken from
> > the
> > map, but its use for determining the input coding might need to be
> > kept
> > for a while for compatability.
> >
> > 4/ the messages my hack generates should be turned into 1 warning
> > or
> > information message per language or maybe suppressed altogether. If
> > someone is generating a map with a character set that doesn't
> > support
> > a
> > particular language, they really won't care that that data for
> > other
> > languages that have an incompatible representation with their
> > language
> > won't be there.
> >
> > Ticker
> >
> > On Wed, 2019-12-18 at 19:08 +0000, Gerd Petermann wrote:
> > > Hi Ticker,
> > >
> > > I think I understand now why we didn't have a default typ file ;)
> > > If I got that right I should revert the changes in r4395 and
> > > mkgmap
> > > should not allow or warn loudly when a typ file with a different
> > > codepage is merged?
> > > Or should we force the usage of unicode codepage?
> > > Or is it possible to compile mapnik.txt with cp 1252 (or any
> > > other)
> > > in a way that only those lines which contain non-matching
> > > characters
> > > are ignored?
> > >
> > > Gerd

typCodePage_v2.patch (16K) Download Attachment

Re: TYP files and character encoding

Gerd Petermann
Hi Ticker,

thanks for the patch.

Please review TypCompiler.CharsetProbe: the BufferedReader br is not closed. Is that intended?

I see that we have a mix of "utf-8" and "UTF-8" in the mkgmap sources. I think it would be good to use StandardCharsets.UTF_8 where possible and unify the rest.
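To illustrate both points, here is a minimal sketch, not the actual mkgmap code. The class and method names are made up for this example. It reads through a BufferedReader inside try-with-resources, so the reader is closed on every exit path, and it shows that the two string spellings resolve to the same Charset as the constant.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class ReaderCloseSketch {

    // Returns the first line of a UTF-8 stream, or null if the stream is empty.
    static String firstLine(InputStream in) {
        // try-with-resources closes the reader (and the wrapped stream)
        // whether readLine() returns normally or throws.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Charset names are case-insensitive, so both spellings give the
    // same Charset object as the StandardCharsets constant.
    static boolean sameCharset() {
        return Charset.forName("utf-8").equals(StandardCharsets.UTF_8)
                && Charset.forName("UTF-8").equals(StandardCharsets.UTF_8);
    }
}
```

Using the constant also avoids the checked UnsupportedEncodingException that some of the older String-based constructors declare.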

Gerd

________________________________________
From: mkgmap-dev <[hidden email]> on behalf of Ticker Berkin <[hidden email]>
Sent: Monday, 13 January 2020 11:34
To: Development list for mkgmap
Subject: Re: [mkgmap-dev] TYP files and character encoding

Hi Gerd

I've updated this patch with changes to TypCompiler CharsetProbe:

1/ looks for a Unicode BOM in various encodings near the start of the file.
2/ looks for a line containing "-*- coding: charset -*-" near the start of
the file.
3/ retains the check for "CodePage=" for compatibility.
4/ in the absence of the above, sets the reading charset to utf-8 if
the file is valid utf-8, otherwise to Cp1252.
5/ fixes the bad character message from the scanner to say what the
charset really is rather than saying "utf-8" regardless.
6/ removes the logic that checks whether String... lines, read in the
charset it is currently trying, can be encoded in the presumed output
CodePage.
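
As a rough sketch of steps 1/ to 4/ (this is not the code from the patch), the probing order could look like the following. The class name, the 1024-byte search window, the regular expressions and the restriction to UTF-8 and UTF-16 BOMs at offset 0 are all assumptions made for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical probe, for illustration only.
final class CharsetProbeSketch {
    // Editor-style marker, e.g. "-*- coding: utf-8 -*-"
    private static final Pattern CODING =
            Pattern.compile("-\\*-\\s*coding:\\s*([A-Za-z0-9][\\w.-]*)\\s*-\\*-");
    // Legacy "CodePage=1252" statement at the start of a line, any letter case
    private static final Pattern CODEPAGE =
            Pattern.compile("(?im)^\\s*CodePage\\s*=\\s*(\\d+)");

    static Charset probe(byte[] data) {
        // 1/ byte order mark (only UTF-8 and UTF-16 are checked in this sketch)
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // ISO-8859-1 maps every byte to a char, so ASCII markers can be
        // searched for without knowing the real encoding yet.
        String head = new String(data, 0, Math.min(data.length, 1024),
                StandardCharsets.ISO_8859_1);

        // 2/ "-*- coding: charset -*-" line
        Matcher m = CODING.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }

        // 3/ "CodePage=" kept for compatibility; 65001 is the UTF-8 code page
        m = CODEPAGE.matcher(head);
        if (m.find()) {
            String cp = m.group(1);
            if (cp.equals("65001")) return StandardCharsets.UTF_8;
            if (Charset.isSupported("cp" + cp)) return Charset.forName("cp" + cp);
        }

        // 4/ fall back to utf-8 if the bytes are valid utf-8, else Cp1252
        return isValidUtf8(data) ? StandardCharsets.UTF_8
                                 : Charset.forName("windows-1252");
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    static boolean isValidUtf8(byte[] data) {
        try {
            // A strict decoder throws instead of substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The fallback in step 4/ works because bytes with the top bit set in a Cp1252 file rarely happen to form valid utf-8 sequences, while a pure ASCII file decodes identically in both.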

The final result of this patch should be that:

a/ No existing usage is broken.
b/ Two methods commonly used by text editors to indicate the
charset/encoding of a file are now recognised. Previously, only the
UTF-8 BOM was detected.
c/ Typ files can, and from now on should, be written in utf-8.
d/ Labels for languages not supported by the --code-page of the output
img just generate a warning in mkgmap.log.x
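
Point d/ could be sketched roughly as below. This is only an illustration and not the code in the patch. It collects the language codes whose labels cannot be encoded in the output charset, so that the caller can log one warning per language rather than abort. The class and method names are invented. Language code 0x15 (Polish) is taken from this thread, while 0x01 is assumed here to be French.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only.
final class LabelEncodingSketch {

    // Returns the language codes that have at least one label which
    // cannot be represented in the output charset.
    static Set<Integer> unencodableLanguages(Map<Integer, List<String>> labelsByLang,
                                             Charset output) {
        CharsetEncoder encoder = output.newEncoder();
        Set<Integer> skipped = new TreeSet<>();
        for (Map.Entry<Integer, List<String>> e : labelsByLang.entrySet()) {
            for (String text : e.getValue()) {
                if (!encoder.canEncode(text)) {
                    skipped.add(e.getKey()); // one entry per language
                    break;
                }
            }
        }
        return skipped;
    }
}
```

With windows-1252 as the output charset, a Polish label such as "Garaże" puts language 21 in the returned set, while a French label such as "Forêt" encodes without trouble. With utf-8 the set stays empty.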

Ticker


On Sat, 2019-12-21 at 16:11 +0000, Ticker Berkin wrote:

> Hi Gerd
>
> Attached is a patch that:
>
> Doesn't use the 'CodePage=' command in the typ-file to determine
> output
> character encoding of the typ-file, rather it uses the main map
> encoding from the --code-page argument.
>
> log.warn's any typ labels that can't be encoded in the --code-page,
> rather than just giving up with message like:
> > TYP file cannot be written in code page 1252
>
> The message:
> > WARNING: SortCode in TYP txt file different from command line
> > setting
> that was written directly to System.out is changed to a log.warn, and
> it shouldn't happen anyway now.
>
> For the moment, the 'CodePage=' command in the typ-file is, under
> some circumstances, used to determine the encoding of the typ-file
> itself, and I've left this alone for compatibility with existing usage.
> Sometime in January I'll provide a better method for this.
>
> Ticker

Re: TYP files and character encoding

Ticker Berkin
Hi Gerd

Here is an updated patch that closes the file, although I find many files
in mkgmap that don't have an explicit close(); I presume .finalize()
will close them eventually.

I'll do another patch for other text file handling, using
StandardCharsets where possible, fixing the TokenScanner message for bad
characters when the charset is not utf-8 and, if reasonable, allowing a
BOM even if the file is opened as utf-8 anyway.

Ticker


typCodePage_v3.patch (16K) Download Attachment

Re: TYP files and character encoding

Gerd Petermann
Hi Ticker,

yes, and every missing close() is a brain teaser ;)
We have a few places where a file is opened in one method and closed in another. This is likely to cause trouble in unit tests, especially on Windows.
Wherever possible we should use try-with-resources instead of Utils.closeFile(), and when a file is intentionally kept open we should add a comment like the one in SeaGenerator:
zipFile = new ZipFile(precompSeaDir); // don't close here!
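
For the case where a file has to stay open across calls, one possible pattern is sketched below. It is not taken from mkgmap, and the class name is invented. The owning class implements Closeable, carries the "don't close here" comment at the point where the resource is opened, and leaves it to the caller to wrap the owner in try-with-resources.

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical owner of a long-lived reader, for illustration only.
final class KeptOpenSketch implements Closeable {
    private final BufferedReader reader;

    KeptOpenSketch(Reader source) {
        this.reader = new BufferedReader(source); // don't close here! closed in close()
    }

    // Reads the next line from the reader that stays open between calls.
    String nextLine() {
        try {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would then write `try (KeptOpenSketch k = new KeptOpenSketch(reader)) { ... }`, which keeps the open and the close visible in one place even though the reads happen elsewhere.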

Gerd

> >
> > On Wed, 2019-12-18 at 19:54 +0000, Ticker Berkin wrote:
> > > Hi Gerd
> > >
> > > I think it is best to continue with the ideas for typ-files that:
> > >
> > > 1/ they can be in any character set and we just need a better way
> > > of
> > > working out the correct one - see my posting earlier today.
> > >
> > > 2/ it can include as many languages as anyone can be bothered to
> > > add,
> > > and so has to be an a character set that allows the languages to
> > > be
> > > added, implying unicode for a common one (more particulary, UTF
> > > -8)
> > >
> > > 3/ the codepage= statement should be redundant and ignored for
> > > controlling the output character set, which should be taken from
> > > the
> > > map, but its use for determining the input coding might need to
> > > be
> > > kept
> > > for a while for compatability.
> > >
> > > 4/ the messages my hack generates should be turned into 1 warning
> > > or
> > > information message per language or maybe suppressed altogether.
> > > If
> > > someone is generating a map with a character set that doesn't
> > > support
> > > a
> > > particular language, they really won't care that that data for
> > > other
> > > languages that have an incompatible representation with their
> > > language
> > > won't be there.
> > >
> > > Ticker
> > >
> > > On Wed, 2019-12-18 at 19:08 +0000, Gerd Petermann wrote:
> > > > Hi Ticker,
> > > >
> > > > I think I understand now why we didn't have a default typ file
> > > > ;)
> > > > If I got that right I should revert the changes in r4395 and
> > > > mkgmap
> > > > should not allow or warn loudly when a typ file with a
> > > > different
> > > > codepage is merged?
> > > > Or should we force the usage of unicode codepage?
> > > > Or is it possible to compile mapnik.txt with cp 1252 (or any
> > > > other)
> > > > in a way that only those lines which contain non-matching
> > > > characters
> > > > are ignored?
> > > >
> > > > Gerd
> > > >
> > > >
> > > > ________________________________________
> > > > Von: mkgmap-dev <[hidden email]> im
> > > > Auftrag
> > > > von Ticker Berkin <[hidden email]>
> > > > Gesendet: Mittwoch, 18. Dezember 2019 19:46
> > > > An: mkgmap development
> > > > Betreff: [mkgmap-dev] TYP files and character encoding
> > > >
> > > > Hi
> > > >
> > > > A couple of problems with typ-files and unicode.
> > > >
> > > > With 'Codepage=65001' the final contents of the labels in
> > > > mapnik.typ
> > > > that is included with the composite map is unicode, but if the
> > > > map
> > > > is
> > > > codepage 1252, the unicode characters with the top bit set are
> > > > simply
> > > > displayed as if in 1252.
> > > >
> > > > Removing the codepage statement from mapnik.txt and making
> > > > fixes
> > > > elsewhere to ensure that the file is read correctly as utf-8
> > > > and
> > > > then
> > > > generating a map with --code-page=1252, it gives the error:
> > > >
> > > > SEVE: uk.me.parabola.imgfmt.MapFailedException
> > > >  ../svn/trunk/resources/typ-files/mapnik.txt:
> > > >  (thrown in TypCompiler.makeMap())
> > > >  TYP file cannot be written in code page 1252
> > > >
> > > > Changing the exception handling in
> > > > imgfmt/app/typ/TypElement.java,
> > > > so
> > > > that makeLabelBlock() reads as
> > > > ...
> > > >     CharBuffer cb = CharBuffer.wrap(tl.getText());
> > > >     try {
> > > >         ByteBuffer buffer = encoder.encode(cb);
> > > >         out.put((byte) tl.getLang());
> > > >         out.put(buffer);
> > > >         out.put((byte) 0);
> > > >      }  catch (CharacterCodingException ignore) {
> > > > //        ignore.printStackTrace();
> > > >         String name = encoder.charset().name();
> > > >         System.out.println("Cannot represent String=" +
> > > >             tl.getLang() + "," + tl.getText() +
> > > >             " in CodePage=" + name);
> > > > //        throw newTypLabelException(name);
> > > >      }
> > > > ...
> > > >
> > > > It gives output like:
> > > > Cannot represent String=21,Gara|e in CodePage=windows-1252
> > > > Cannot represent String=21,Obszar przemysBowy in
> > > > CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,ZieleD in CodePage=windows-1252
> > > > Cannot represent String=21,Zaro[la in CodePage=windows-1252
> > > > Cannot represent String=21,MokradBa in CodePage=windows-1252
> > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in
> > > > CodePage=windows-1252
> > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in
> > > > CodePage=windows-1252
> > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in
> > > > CodePage=windows-1252
> > > > Cannot represent String=21,Zcie|ka rowerowa in CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Wybrze|e in CodePage=windows-1252
> > > > Cannot represent String=21,Zcie|ka in CodePage=windows-1252
> > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > Cannot represent String=21,Granica paDstwa in CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Rzeka, KanaB in CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > Cannot represent String=21,Ruroci^Eg in CodePage=windows-1252
> > > > Cannot represent String=21,Kabel wysokiego napi^Ycia in
> > > > CodePage=windows-1252
> > > > Cannot represent String=21,Tor wy[cigowy in CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik) in
> > > > CodePage=windows-1252
> > > > Cannot represent String=21,Droga krajowa (B^Ecznik) in
> > > > CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in
> > > > CodePage=windows-1252
> > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Restauracja (AmerykaDska) in
> > > > CodePage=windows-1252
> > > > Cannot represent String=21,Restauracja (ChiDska) in
> > > > CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Restauracja (Mi^Ydzynarodowa) in
> > > > CodePage=windows-1252
> > > > Cannot represent String=21,Restauracja (WBoska) in
> > > > CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Restauracja (MeksykaDska) in
> > > > CodePage=windows-1252
> > > > Cannot represent String=21,Restauracja (P^Eczki) in
> > > > CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Restauracja (WegetariaDska) in
> > > > CodePage=windows-1252
> > > > Cannot represent String=21,Kr^Ygle in CodePage=windows-1252
> > > > Cannot represent String=21,Sklep odzie|owy in CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Wypo|yczalnia samochod\363w in
> > > > CodePage=windows-1252
> > > > Cannot represent String=21,Gara| in CodePage=windows-1252
> > > > Cannot represent String=21,Sprzeda| samochod\363w in
> > > > CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Sklep |eglarski in CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,S^Ed in CodePage=windows-1252
> > > > Cannot represent String=21,O[rodek kultury in CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Wi^Yzienie in CodePage=windows-1252
> > > > Cannot represent String=21,Stra| po|arna in CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,SBupek in CodePage=windows-1252
> > > > Cannot represent String=21,PrzystaD in CodePage=windows-1252
> > > > Cannot represent String=21,L^Edowisko helikopterowe in
> > > > CodePage=windows
> > > > -1252
> > > > Cannot represent String=21,Wie|a in CodePage=windows-1252
> > > > Cannot represent String=21,yr\363dBo in CodePage=windows-1252
> > > > Cannot represent String=21,Pla|a in CodePage=windows-1252
> > > > Cannot represent String=21,Przyl^Edek in CodePage=windows-1252
> > > > Cannot represent String=21,SkaBa in CodePage=windows-1252
> > > >
> > > > Which makes sense if codepage 1252 doesn't handle Polish (hex
> > > > 0x15,
> > > > decimal 21).
> > > >
> > > > NB the non ascii characters in above are messed up by my
> > > > cutting
> > > > and
> > > > pasting.
> > > >
> > > > Checking the French, on my Garmin device, the type descriptions
> > > > now
> > > > display accents correctly.
> > > >
> > > > Ticker
> > > >
> > > > _______________________________________________
> > > > mkgmap-dev mailing list
> > > > [hidden email]
> > > > http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
> > > > _______________________________________________
> > > > mkgmap-dev mailing list
> > > > [hidden email]
> > > > http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
> > > _______________________________________________
> > > mkgmap-dev mailing list
> > > [hidden email]
> > _______________________________________________
> > mkgmap-dev mailing list
> > [hidden email]
> > http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
_______________________________________________
mkgmap-dev mailing list
[hidden email]
http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev

Re: TYP files and character encoding

Ticker Berkin
Hi Gerd

I'll attempt to get everything closed, either directly or with try () {}.

I couldn't work out from the documentation whether AutoCloseable objects nested as constructor parameters get closed, i.e. in:

try (Reader r = new BufferedReader(
         new InputStreamReader(
             new FileInputStream(filename),
             charset))) {
    processFile(r);
}

Do all 3 get closed, or do I need to split it into 3 declarations?

Ticker


Re: TYP files and character encoding

Gerd Petermann
Hi Ticker,

my understanding is that all are closed.
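
A small sketch that checks this for the standard java.io wrappers. try-with-resources itself only calls close() on the declared variable, but BufferedReader.close() and InputStreamReader.close() each close the object they wrap, so the innermost stream ends up closed too. TrackingStream and the method names are hypothetical helpers written for this illustration, and an in-memory stream stands in for the FileInputStream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class NestedCloseDemo {
    // Hypothetical helper: an in-memory stream that records whether close() was called.
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    private static TrackingStream sample() {
        return new TrackingStream("CodePage=65001\n".getBytes(StandardCharsets.UTF_8));
    }

    // Nested form: only 'r' is a declared resource, but closing it closes the
    // InputStreamReader, which in turn closes the innermost stream.
    static boolean innerClosedAfterNestedTry() throws IOException {
        TrackingStream in = sample();
        try (Reader r = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            r.read();
        }
        return in.closed;
    }

    // Split form: every resource is declared separately and closed in reverse
    // order, so the stream is closed even if a later constructor throws.
    static boolean innerClosedAfterSplitTry() throws IOException {
        TrackingStream in = sample();
        try (InputStream is = in;
             Reader isr = new InputStreamReader(is, StandardCharsets.UTF_8);
             BufferedReader br = new BufferedReader(isr)) {
            br.readLine();
        }
        return in.closed;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("nested: " + innerClosedAfterNestedTry());
        System.out.println("split:  " + innerClosedAfterSplitTry());
    }
}
```

Both methods return true. The one case where the forms differ is when a wrapper constructor throws, for example InputStreamReader given an unsupported charset name: in the nested form the already opened inner stream is then never assigned to a resource and is not closed, whereas the split form closes it.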

Gerd

________________________________________
Von: mkgmap-dev <[hidden email]> im Auftrag von Ticker Berkin <[hidden email]>
Gesendet: Dienstag, 14. Januar 2020 11:28
An: Development list for mkgmap
Betreff: Re: [mkgmap-dev] TYP files and character encoding

Hi Gerd

I'll attempt to get everything closed, either directly or with try ()
{}

I couldn't work out from the documentation if parameter nested
auto/closeable objects get closed, ie in:

try (reader r = new BufferedReader(
         new InputStreamReader(
             new FileInputStream(filename),
             charset))) {
    processFile(r);
    }

do all 3 get closed or do I need to split it into 3 declarations?

Ticker

On Tue, 2020-01-14 at 09:55 +0000, Gerd Petermann wrote:

> Hi Ticker,
>
> yes, and every missing close() is a brain teaser ;)
> We have a few places where files are opened and closed in a different
> method. This is likely to cause trouble in unit tests, esp. on
> Windows.
> Whereever possible we should use try-with-ressources instead of
> Utils.closeFile() and add a comment
> like in SeaGenerator line
> in zipFile = new ZipFile(precompSeaDir); // don't close here!
> when a file is intentionally kept open.
>
> Gerd
>
> ________________________________________
> Von: mkgmap-dev <[hidden email]> im Auftrag
> von Ticker Berkin <[hidden email]>
> Gesendet: Dienstag, 14. Januar 2020 10:43
> An: Development list for mkgmap
> Betreff: Re: [mkgmap-dev] TYP files and character encoding
>
> Hi Gerd
>
> Here is updated patch that closes the file, although I find many
> files
> in mkgmap that don't have explicit close(), but I presume .finalize()
> will close them eventually.
>
> I'll do another patch for other text file handling, using
> StandardCharset where possible and fixing TokenScanner message for
> bad
> characters if not utf-8 and, if reasonable, allowing a BOM even if
> the
> file is opened as utf-8 anyway.
>
> Ticker
>
> On Tue, 2020-01-14 at 08:21 +0000, Gerd Petermann wrote:
> > Hi Ticker,
> >
> > thanks for the patch.
> >
> > Please review TypCompiler.CharsetProbe.  BufferedReader br is not
> > closed. Is that intended?
> >
> > I see that we have a mix of "utf-8" and "UTF-8" in the mkgmap
> > sources. I think it would be good to use StandardCharsets.UTF_8
> > where
> > possible
> > and unify the rest.
> >
> > Gerd
> >
> > ________________________________________
> > Von: mkgmap-dev <[hidden email]> im Auftrag
> > von Ticker Berkin <[hidden email]>
> > Gesendet: Montag, 13. Januar 2020 11:34
> > An: Development list for mkgmap
> > Betreff: Re: [mkgmap-dev] TYP files and character encoding
> >
> > Hi Gerd
> >
> > I've updated this patch with changes to TypCompiler CharsetProbe:
> >
> > 1/ looks for unicode BOM in various encodings near start of file.
> > 2/ looks for line containing "-*- coding: charset -*-" near start
> > of
> > the file.
> > 3/ retains the check for "CodePage=" coding for compatibility.
> > 4/ in the absence of the above, sets the reading charset to utf-8
> > if
> > the file is valid utf-8, otherwise to Cp1252.
> > 5/ fixes the bad character message from the scanner to say what the
> > charset really is rather than saying "uft-8" regardless.
> > 6/ removes the logic to that checks if String... lines, read in the
> > charset it is currently trying, can be encoded in the presumed
> > output
> > CodePage.
> >
> > The final result of this patch should be that:
> >
> > a/ No existing usage is broken
> > b/ 2 methods to indicate the charset/encoding of the file that are
> > commonly used by text editors can be used and are taken notice of.
> > Previously, just the UTF-8 BOM was detected.
> > c/ Typ files can, and should from now on, be written in utf-8
> > d/ labels for languages not supported in the --code-page of the
> > output
> > img just generate a warning in mkgmap.log.x
> >
> > Ticker
> >
> >
> > On Sat, 2019-12-21 at 16:11 +0000, Ticker Berkin wrote:
> > > Hi Gerd
> > >
> > > Attached is a patch that:
> > >
> > > Doesn't use the 'CodePage=' command in the typ-file to determine
> > > output
> > > character encoding of the typ-file, rather it uses the main map
> > > encoding from the --code-page argument.
> > >
> > > log.warn's any typ labels that can't be encoded in the --code
> > > -page,
> > > rather than just giving up with message like:
> > > > TYP file cannot be written in code page 1252
> > >
> > > The message:
> > > > WARNING: SortCode in TYP txt file different from command line
> > > > setting
> > > that was written direct to system.out is changed to a log.warn
> > > and
> > > it
> > > shouldn't happen anyway now
> > >
> > > For the moment, the 'CodePage=' command in the typ-file is, under
> > > some
> > > circumstances, used to determine the encoding of the typ-file
> > > itself
> > > and I've left this alone for compatibility with existing useage.
> > > Sometime in January I'll provide a better method for this
> > >
> > > Ticker
> > >
> > >
> > > On Wed, 2019-12-18 at 19:54 +0000, Ticker Berkin wrote:
> > > > Hi Gerd
> > > >
> > > > I think it is best to continue with the ideas for typ-files
> > > > that:
> > > >
> > > > 1/ they can be in any character set and we just need a better
> > > > way
> > > > of
> > > > working out the correct one - see my posting earlier today.
> > > >
> > > > 2/ it can include as many languages as anyone can be bothered
> > > > to
> > > > add,
> > > > and so has to be an a character set that allows the languages
> > > > to
> > > > be
> > > > added, implying unicode for a common one (more particulary, UTF
> > > > -8)
> > > >
> > > > 3/ the codepage= statement should be redundant and ignored for
> > > > controlling the output character set, which should be taken
> > > > from
> > > > the
> > > > map, but its use for determining the input coding might need to
> > > > be
> > > > kept
> > > > for a while for compatability.
> > > >
> > > > 4/ the messages my hack generates should be turned into 1
> > > > warning
> > > > or
> > > > information message per language or maybe suppressed
> > > > altogether.
> > > > If
> > > > someone is generating a map with a character set that doesn't
> > > > support
> > > > a
> > > > particular language, they really won't care that that data for
> > > > other
> > > > languages that have an incompatible representation with their
> > > > language
> > > > won't be there.
> > > >
> > > > Ticker
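
Point 4/ above (one message per language rather than one per label) could be sketched roughly as follows. This is only an illustration, not mkgmap code: the class LabelEncodingCheck, its nested Label type and the method unencodable() are hypothetical names, and the French language code 0x01 is used purely as an example value. Only the Polish code 0x15 and the charset name windows-1252 come from the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not part of mkgmap): groups unencodable labels by language. */
final class LabelEncodingCheck {

    /** Minimal stand-in for a TYP label: a language code plus its text. */
    static final class Label {
        final int lang;
        final String text;

        Label(int lang, String text) {
            this.lang = lang;
            this.text = text;
        }
    }

    /** Returns language code -> texts that the given charset cannot represent. */
    static Map<Integer, List<String>> unencodable(List<Label> labels, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        Map<Integer, List<String>> byLang = new TreeMap<>();
        for (Label label : labels) {
            if (!encoder.canEncode(label.text)) {
                byLang.computeIfAbsent(label.lang, k -> new ArrayList<>()).add(label.text);
            }
        }
        return byLang;
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        List<Label> labels = new ArrayList<>();
        labels.add(new Label(0x01, "For\u00EAt"));     // e-circumflex exists in 1252
        labels.add(new Label(0x15, "Gara\u017Ce"));    // z-with-dot-above does not
        labels.add(new Label(0x15, "Strumie\u0144"));  // n-acute does not
        // One summary line per language instead of one line per label.
        unencodable(labels, cp1252).forEach((lang, texts) ->
            System.out.println("Language " + lang + ": " + texts.size()
                + " label(s) cannot be represented in " + cp1252.name()));
    }
}
```

The labels that do fit the charset are left untouched, so a 1252 map would still get its French or English descriptions.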
> > > >
> > > > On Wed, 2019-12-18 at 19:08 +0000, Gerd Petermann wrote:
> > > > > Hi Ticker,
> > > > >
> > > > > I think I understand now why we didn't have a default typ
> > > > > file ;)
> > > > > If I got that right I should revert the changes in r4395 and
> > > > > mkgmap should not allow or warn loudly when a typ file with a
> > > > > different codepage is merged?
> > > > > Or should we force the usage of unicode codepage?
> > > > > Or is it possible to compile mapnik.txt with cp 1252 (or any
> > > > > other) in a way that only those lines which contain
> > > > > non-matching characters are ignored?
> > > > >
> > > > > Gerd
> > > > >
> > > > >
> > > > > ________________________________________
> > > > > Von: mkgmap-dev <[hidden email]> im
> > > > > Auftrag
> > > > > von Ticker Berkin <[hidden email]>
> > > > > Gesendet: Mittwoch, 18. Dezember 2019 19:46
> > > > > An: mkgmap development
> > > > > Betreff: [mkgmap-dev] TYP files and character encoding
> > > > >
> > > > > Hi
> > > > >
> > > > > A couple of problems with typ-files and unicode.
> > > > >
> > > > > With 'Codepage=65001' the final contents of the labels in
> > > > > mapnik.typ
> > > > > that is included with the composite map is unicode, but if
> > > > > the
> > > > > map
> > > > > is
> > > > > codepage 1252, the unicode characters with the top bit set
> > > > > are
> > > > > simply
> > > > > displayed as if in 1252.
> > > > >
> > > > > Removing the codepage statement from mapnik.txt and making
> > > > > fixes
> > > > > elsewhere to ensure that the file is read correctly as utf-8
> > > > > and
> > > > > then
> > > > > generating a map with --code-page=1252, it gives the error:
> > > > >
> > > > > SEVE: uk.me.parabola.imgfmt.MapFailedException
> > > > >  ../svn/trunk/resources/typ-files/mapnik.txt:
> > > > >  (thrown in TypCompiler.makeMap())
> > > > >  TYP file cannot be written in code page 1252
> > > > >
> > > > > Changing the exception handling in
> > > > > imgfmt/app/typ/TypElement.java,
> > > > > so
> > > > > that makeLabelBlock() reads as
> > > > > ...
> > > > >     CharBuffer cb = CharBuffer.wrap(tl.getText());
> > > > >     try {
> > > > >         ByteBuffer buffer = encoder.encode(cb);
> > > > >         out.put((byte) tl.getLang());
> > > > >         out.put(buffer);
> > > > >         out.put((byte) 0);
> > > > >      }  catch (CharacterCodingException ignore) {
> > > > > //        ignore.printStackTrace();
> > > > >         String name = encoder.charset().name();
> > > > >         System.out.println("Cannot represent String=" +
> > > > >             tl.getLang() + "," + tl.getText() +
> > > > >             " in CodePage=" + name);
> > > > > //        throw newTypLabelException(name);
> > > > >      }
> > > > > ...
> > > > >
> > > > > It gives output like:
> > > > > Cannot represent String=21,Gara|e in CodePage=windows-1252
> > > > > Cannot represent String=21,Obszar przemysBowy in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,ZieleD in CodePage=windows-1252
> > > > > Cannot represent String=21,Zaro[la in CodePage=windows-1252
> > > > > Cannot represent String=21,MokradBa in CodePage=windows-1252
> > > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in
> > > > > CodePage=windows-1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik)
> > > > > in
> > > > > CodePage=windows-1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik)
> > > > > in
> > > > > CodePage=windows-1252
> > > > > Cannot represent String=21,Zcie|ka rowerowa in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Wybrze|e in CodePage=windows-1252
> > > > > Cannot represent String=21,Zcie|ka in CodePage=windows-1252
> > > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > > Cannot represent String=21,Granica paDstwa in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Rzeka, KanaB in CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,StrumieD in CodePage=windows-1252
> > > > > Cannot represent String=21,Ruroci^Eg in CodePage=windows-1252
> > > > > Cannot represent String=21,Kabel wysokiego napi^Ycia in
> > > > > CodePage=windows-1252
> > > > > Cannot represent String=21,Tor wy[cigowy in CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Droga szybkiego ruchu  (B^Ecznik)
> > > > > in
> > > > > CodePage=windows-1252
> > > > > Cannot represent String=21,Droga krajowa (B^Ecznik) in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Droga wojew\363dzka (B^Ecznik) in
> > > > > CodePage=windows-1252
> > > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Wie[ (>5 tys.) in CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Restauracja (AmerykaDska) in
> > > > > CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (ChiDska) in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Restauracja (Mi^Ydzynarodowa) in
> > > > > CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (WBoska) in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Restauracja (MeksykaDska) in
> > > > > CodePage=windows-1252
> > > > > Cannot represent String=21,Restauracja (P^Eczki) in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Restauracja (WegetariaDska) in
> > > > > CodePage=windows-1252
> > > > > Cannot represent String=21,Kr^Ygle in CodePage=windows-1252
> > > > > Cannot represent String=21,Sklep odzie|owy in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Wypo|yczalnia samochod\363w in
> > > > > CodePage=windows-1252
> > > > > Cannot represent String=21,Gara| in CodePage=windows-1252
> > > > > Cannot represent String=21,Sprzeda| samochod\363w in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Sklep |eglarski in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,S^Ed in CodePage=windows-1252
> > > > > Cannot represent String=21,O[rodek kultury in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Wi^Yzienie in CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Stra| po|arna in CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,SBupek in CodePage=windows-1252
> > > > > Cannot represent String=21,PrzystaD in CodePage=windows-1252
> > > > > Cannot represent String=21,L^Edowisko helikopterowe in
> > > > > CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,Wie|a in CodePage=windows-1252
> > > > > Cannot represent String=21,yr\363dBo in CodePage=windows-1252
> > > > > Cannot represent String=21,Pla|a in CodePage=windows-1252
> > > > > Cannot represent String=21,Przyl^Edek in CodePage=windows
> > > > > -1252
> > > > > Cannot represent String=21,SkaBa in CodePage=windows-1252
> > > > >
> > > > > Which makes sense if codepage 1252 doesn't handle Polish (hex
> > > > > 0x15,
> > > > > decimal 21).
> > > > >
> > > > > NB the non ascii characters in above are messed up by my
> > > > > cutting
> > > > > and
> > > > > pasting.
> > > > >
> > > > > Checking the French, on my Garmin device, the type
> > > > > descriptions
> > > > > now
> > > > > display accents correctly.
> > > > >
> > > > > Ticker
> > > > >
> > > > > _______________________________________________
> > > > > mkgmap-dev mailing list
> > > > > [hidden email]
> > > > > http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
> > > > > _______________________________________________
> > > > > mkgmap-dev mailing list
> > > > > [hidden email]
> > > > > http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
> > > > _______________________________________________
> > > > mkgmap-dev mailing list
> > > > [hidden email]
> > > _______________________________________________
> > > mkgmap-dev mailing list
> > > [hidden email]
> > > http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
_______________________________________________
mkgmap-dev mailing list
[hidden email]
http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
_______________________________________________
mkgmap-dev mailing list
[hidden email]
http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev

Re: TYP files and character encoding

Ticker Berkin
In reply to this post by Ticker Berkin
Hi Gerd

I've just noticed that a change to a function profile stopped a test
from compiling, so here is a patch for that

Ticker

On Tue, 2020-01-14 at 09:43 +0000, Ticker Berkin wrote:

> Hi Gerd
>
> Here is an updated patch that closes the file, although I find many
> files in mkgmap that don't have explicit close(), but I presume
> .finalize() will close them eventually.
>
> I'll do another patch for other text file handling, using
> StandardCharsets where possible and fixing the TokenScanner message
> for bad characters if not utf-8 and, if reasonable, allowing a BOM
> even if the file is opened as utf-8 anyway.
>
> Ticker
>
> On Tue, 2020-01-14 at 08:21 +0000, Gerd Petermann wrote:
> > Hi Ticker,
> >
> > thanks for the patch.
> >
> > Please review TypCompiler.CharsetProbe.  BufferedReader br is not
> > closed. Is that intended?
> >
> > I see that we have a mix of "utf-8" and "UTF-8" in the mkgmap
> > sources. I think it would be good to use StandardCharsets.UTF_8
> > where possible and unify the rest.
> >
> > Gerd
> >
> > ________________________________________
> > From: mkgmap-dev <[hidden email]> on behalf of
> > Ticker Berkin <[hidden email]>
> > Sent: Monday, 13 January 2020 11:34
> > To: Development list for mkgmap
> > Subject: Re: [mkgmap-dev] TYP files and character encoding
> >
> > Hi Gerd
> >
> > I've updated this patch with changes to TypCompiler CharsetProbe:
> >
> > 1/ looks for unicode BOM in various encodings near start of file.
> > 2/ looks for a line containing "-*- coding: charset -*-" near the
> > start of the file.
> > 3/ retains the check for "CodePage=" coding for compatibility.
> > 4/ in the absence of the above, sets the reading charset to utf-8
> > if the file is valid utf-8, otherwise to Cp1252.
> > 5/ fixes the bad character message from the scanner to say what the
> > charset really is rather than saying "utf-8" regardless.
> > 6/ removes the logic that checks if String... lines, read in the
> > charset it is currently trying, can be encoded in the presumed
> > output CodePage.
> >
> > The final result of this patch should be that:
> >
> > a/ No existing usage is broken
> > b/ 2 methods to indicate the charset/encoding of the file that are
> > commonly used by text editors can be used and are taken notice of.
> > Previously, just the UTF-8 BOM was detected.
> > c/ Typ files can, and should from now on, be written in utf-8
> > d/ labels for languages not supported in the --code-page of the
> > output img just generate a warning in mkgmap.log.x
> >
> > Ticker
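
The detection order in steps 1/ to 4/ of the quoted message can be illustrated with a small sketch. It is not the TypCompiler.CharsetProbe source: CharsetGuess and guess() are hypothetical names, the BOM is only checked at the very start of the data, UTF-32 BOMs are left out, and the legacy "CodePage=" check (step 3/) is only marked with a comment.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; the real CharsetProbe may work differently. */
final class CharsetGuess {
    // Matches an editor-style marker such as "-*- coding: utf-8 -*-".
    private static final Pattern CODING =
        Pattern.compile("-\\*-\\s*coding:\\s*([A-Za-z0-9][\\w.-]*)\\s*-\\*-");

    static Charset guess(byte[] data) {
        // 1/ byte order mark at the start of the data
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // 2/ coding marker near the start; ISO-8859-1 maps every byte to
        // one char, so the ASCII marker can be searched without decoding errors
        String head = new String(data, 0, Math.min(data.length, 1024),
            StandardCharsets.ISO_8859_1);
        Matcher m = CODING.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }

        // 3/ the legacy "CodePage=" check would go here (omitted in this sketch)

        // 4/ strict UTF-8 decode; fall back to cp1252 if the bytes are not valid UTF-8
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return Charset.forName("windows-1252");
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Pure ASCII input is valid UTF-8, so a file without any of the markers and without high-bit bytes would be read as utf-8 in this sketch.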

Attachment: typCodePage-test.patch (1K)

Re: TYP files and character encoding

Ticker Berkin
Hi Gerd

Here is typCodePage_v4 that uses try () in both CharsetProbe and
compile, getting rid of Utils.closeFile().
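
For readers unfamiliar with the construct, the "try ()" form mentioned here is Java's try-with-resources. A minimal sketch of the pattern follows; FirstLineReader and firstLine() are hypothetical names and are not taken from the patch. The IOException is wrapped in an UncheckedIOException only to keep the example short.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

/** Hypothetical example class; not taken from typCodePage_v4.patch. */
final class FirstLineReader {

    /**
     * Reads the first line of the stream. Both resources declared in the
     * try header are closed when the block exits, in reverse order, whether
     * it completes normally or an exception is thrown.
     */
    static String firstLine(InputStream source, Charset charset) {
        try (InputStream in = source;
             BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that this sketch also closes the stream passed in by the caller, which is the point of the demonstration but may not suit every call site.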

The patch includes the change in typCodePage-test.patch

My javac doesn't seem to have an option to detect unused imports, but
when I run it with -Xlint I get a variety of errors - I've attached the
log.

Ticker

On Fri, 2020-01-17 at 11:13 +0000, Gerd Petermann wrote:

> Hi Ticker,
>
> I use Eclipse with customized settings in Preferences -> Java ->
> Compiler -> Errors/Warnings as well as the SonarLint plugin.
>
> My understanding is that the InputStream is only closed if everything
> goes well. The nature of unit tests is that they produce "special
> cases".
> Try-with-resources was introduced to handle this.
> Maybe you can post a v4 which uses try-with-resources in class
> CharsetProbe (as the unpatched version does)?
>
> Gerd
>
>
> ________________________________________
> From: Ticker Berkin <[hidden email]>
> Sent: Friday, 17 January 2020 11:54
> To: Gerd Petermann
> Subject: Re: AW: AW: [mkgmap-dev] TYP files and character encoding
>
> Hi Gerd
>
> I have another patch almost ready for the StandardCharsets utf8 / try
> (with-resources) etc, but this is quite wide-ranging and unrelated to
> the need for the typCodePage patch.
>
> I wanted to get the typCodePage patches committed first so that I can
> get on with mapnik.txt patches and also improve some bits of
> TypCompiler as the last part of the utf8 patch.
>
> My compilation system doesn't warn about unused imports - what
> options/tool do you use for this?
>
> Concerning the close, the FileInputStream is closed, which should
> release any OS file handle; it's just the InputStreamReader &
> BufferedReader that aren't, but these are just java data structures.
>
> Ticker
>
> On Fri, 2020-01-17 at 10:27 +0000, Gerd Petermann wrote:
> > Hi Ticker,
> >
> > ah, seems I got another post wrong. I thought you'd work on a
> > typCodePage_v4.patch which would use try-with-resources.
> > With typCodePage_v3.patch  and typCodePage-test.patch I still see
> > some warnings for TypCompiler:
> > - unused imports
> > - br is not closed (line 232)
> >
> > Gerd
> >
> > ________________________________________
> > From: Ticker Berkin <[hidden email]>
> > Sent: Friday, 17 January 2020 10:17
> > To: Gerd Petermann
> > Subject: Re: AW: [mkgmap-dev] TYP files and character encoding
> >
> > Hi Gerd
> >
> > Yes. Sorry - I didn't explain it at all well. This needs to be
> > applied
> > at the same time as typCodePage_v3.patch from 14-Jan
> >
> > Ticker
> >
> > On Fri, 2020-01-17 at 08:48 +0000, Gerd Petermann wrote:
> > > Hi Ticker,
> > >
> > > I don't understand this patch. Do I have to use it in combination
> > > with another one?
> > >
> > > Gerd
> > >
> > > ________________________________________
> > > From: mkgmap-dev <[hidden email]> on behalf of
> > > Ticker Berkin <[hidden email]>
> > > Sent: Friday, 17 January 2020 00:02
> > > To: Development list for mkgmap
> > > Subject: Re: [mkgmap-dev] TYP files and character encoding
> > >
> > > Hi Gerd
> > >
> > > I've just noticed that a change to a function profile stopped a
> > > test
> > > from compiling, so here is a patch for that
> > >
> > > Ticker


Attachments: lint.log (9K), typCodePage_v4.patch (18K)

StandardCharsets and try (with-resources)

Ticker Berkin
In reply to this post by Gerd Petermann
Hi Gerd

Attached patch

- uses StandardCharsets.* where possible.

- notes some usage of the Java locale-dependent default charset.

- changes a couple of these to force utf-8 instead.

- if the --read-config file gives decoding errors, names the charset used
to read the file (i.e. the default charset) instead of 'utf-8' in the
error message.

- accepts/ignores unicode BOM in more files

- uses try (open...) {} where possible in files changed for the above
reasons.
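
As an illustration of the first and fifth items (StandardCharsets plus tolerating a BOM), a reader along these lines would do. It is a sketch under my own assumptions: Utf8Lines and readLines() are hypothetical names and the code in utf8.patch may handle the BOM differently. It relies on the fact that Java's UTF-8 decoder does not strip a leading BOM but delivers it as the character U+FEFF.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not the code from utf8.patch. */
final class Utf8Lines {

    /** Reads all lines as UTF-8, silently skipping a leading byte order mark. */
    static List<String> readLines(InputStream source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(source, StandardCharsets.UTF_8))) {
            br.mark(1);
            if (br.read() != '\uFEFF') {
                br.reset(); // first character is real data (or end of file): put it back
            }
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Using the StandardCharsets constant instead of the string "utf-8" also removes the need to handle UnsupportedEncodingException.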

There is some code in mkgmap/srt/SrtTextReader.java:sortForCodepage()
that I don't understand; it would appear to get into a recursive loop
on IOException.

Ticker

On Tue, 2020-01-14 at 09:55 +0000, Gerd Petermann wrote:

> Hi Ticker,
>
> yes, and every missing close() is a brain teaser ;)
> We have a few places where files are opened and closed in a different
> method. This is likely to cause trouble in unit tests, esp. on
> Windows.
> Wherever possible we should use try-with-resources instead of
> Utils.closeFile() and add a comment like the one in SeaGenerator at
> zipFile = new ZipFile(precompSeaDir); // don't close here!
> when a file is intentionally kept open.
>
> Gerd

Attachment: utf8.patch (30K)

Re: TYP files and character encoding

Gerd Petermann
In reply to this post by Ticker Berkin
Hi Ticker,

thanks, I've committed the patch with r4423, please check my svn log message.

Gerd



Re: StandardCharsets and try (with-resources)

Gerd Petermann
In reply to this post by Ticker Berkin
Hi Ticker,

- I think there is a small change in the handling of lines in OsmMapDataSource.readDeleteTagsFile. The old code used
line = line.trim();
This is missing now. Is that intended?

- I also don't understand the line with your comment "// ??? I don't understand this" . Looks like an endless recursive call?

- You sometimes replaced FileReader, but not in CombinedStyleFileLoader. Why not?

We have a few places where we read files which use "#" for comment lines.  Would it help to create a class for that?

I made a few minor mods, see attachment.

Gerd

________________________________________
From: mkgmap-dev <[hidden email]> on behalf of Ticker Berkin <[hidden email]>
Sent: Friday, 17 January 2020 13:53
To: Development list for mkgmap
Subject: [mkgmap-dev] StandardCharsets and try (with-resources)

Hi Gerd

Attached patch

- uses StandardCharsets.* where possible.

- notes some uses of the JVM's locale-dependent default charset.

- changed a couple of these to force utf-8 instead.

- if --read-config file gives decoding errors, names the charset used
to read the file (ie DefaultCharset) instead of 'utf-8' in the error
message.

- accepts/ignores unicode BOM in more files

- uses try (open...) {} where possible in files changed for the above
reasons.
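
As an illustration of the points in this list (not the code from the attached patch), here is a minimal sketch of reading a text file with an explicit StandardCharsets.UTF_8, tolerating a leading BOM, and letting try-with-resources close the reader. The class name Utf8LineReader and the method readLines are hypothetical and do not exist in mkgmap.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of mkgmap.
class Utf8LineReader {
    // Java's UTF-8 decoder does not strip a byte order mark;
    // it shows up as this character at the start of the first line.
    private static final char BOM = '\uFEFF';

    /** Reads all lines of a UTF-8 file, dropping a leading BOM if present. */
    static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader (and the underlying stream)
        // on normal exit and when readLine() throws.
        try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                if (first && !line.isEmpty() && line.charAt(0) == BOM) {
                    line = line.substring(1); // ignore the BOM
                }
                first = false;
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Naming the charset explicitly means the result no longer depends on the platform default, which is the failure mode described for option files read on systems whose default charset is not utf-8.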

There is some code in mkgmap/srt/SrtTextReader.java:sortForCodepage()
that I don't understand; it would appear to get into a recursive loop
on IOException.

Ticker

On Tue, 2020-01-14 at 09:55 +0000, Gerd Petermann wrote:

> Hi Ticker,
>
> yes, and every missing close() is a brain teaser ;)
> We have a few places where files are opened and closed in a different
> method. This is likely to cause trouble in unit tests, esp. on
> Windows.
> Wherever possible we should use try-with-resources instead of
> Utils.closeFile(), and add a comment like the one in SeaGenerator,
> zipFile = new ZipFile(precompSeaDir); // don't close here!
> when a file is intentionally kept open.
>
> Gerd
> ________________________________________
> From: mkgmap-dev <[hidden email]> on behalf of
> Ticker Berkin <[hidden email]>
> Sent: Tuesday, 14 January 2020 10:43
> To: Development list for mkgmap
> Subject: Re: [mkgmap-dev] TYP files and character encoding
>
> Hi Gerd
>
> Here is updated patch that closes the file, although I find many files
> in mkgmap that don't have explicit close(), but I presume .finalize()
> will close them eventually.
>
> I'll do another patch for other text file handling, using
> StandardCharsets where possible and fixing TokenScanner message for bad
> characters if not utf-8 and, if reasonable, allowing a BOM even if the
> file is opened as utf-8 anyway.
>
> Ticker
>
> On Tue, 2020-01-14 at 08:21 +0000, Gerd Petermann wrote:
> > Hi Ticker,
> >
> > thanks for the patch.
> >
> > Please review TypCompiler.CharsetProbe.  BufferedReader br is not
> > closed. Is that intended?
> >
> > I see that we have a mix of "utf-8" and "UTF-8" in the mkgmap
> > sources. I think it would be good to use StandardCharsets.UTF_8
> > where possible and unify the rest.


utf8-v2.patch (32K) Download Attachment

Re: TYP files and character encoding

Ticker Berkin
In reply to this post by Gerd Petermann
Hi Gerd

That looks fine.

I'll add something to ./doc/typ-compiler.txt sometime soon that
describes the source/destination charset/codepage behaviour.

Ticker  


On Fri, 2020-01-17 at 15:21 +0000, Gerd Petermann wrote:

> Hi Ticker,
>
> thanks, I've committed the patch with 4423, please check my svn
> log message.
>
> Gerd
>
>
> ________________________________________
> From: mkgmap-dev <[hidden email]> on behalf of
> Ticker Berkin <[hidden email]>
> Sent: Friday, 17 January 2020 13:43
> To: mkgmap development
> Subject: Re: [mkgmap-dev] TYP files and character encoding
>
> Hi Gerd
>
> Here is typCodePage_v4 that uses try () in both CharsetProbe and
> compile, getting rid of Utils.closeFile().
>
> The patch includes the change in typCodePage-test.patch
>
> My javac doesn't seem to have an option to detect unused imports, but
> when I run it with -Xlint I get a variety of errors - I've attached
> the log.
>
> Ticker
>
> On Fri, 2020-01-17 at 11:13 +0000, Gerd Petermann wrote:
> > Hi Ticker,
> >
> > I use Eclipse with customized settings in Preferences -> Java ->
> > Compiler -> Errors/Warnings as well as the SonarLint plugin.
> >
> > My understanding is that the InputStream is only closed if everything
> > goes well. The nature of unit tests is that they produce "special
> > cases".
> > try-with-resources was introduced to handle this.
> > Maybe you can post a v4 which uses try-with-resources in class
> > CharsetProbe (as the unpatched version does)?
> >
> > Gerd
> >
> >
> > ________________________________________
> > From: Ticker Berkin <[hidden email]>
> > Sent: Friday, 17 January 2020 11:54
> > To: Gerd Petermann
> > Subject: Re: AW: AW: [mkgmap-dev] TYP files and character encoding
> >
> > Hi Gerd
> >
> > I have another patch almost ready for the StandardCharset utf8 / try
> > (with-resources) etc, but this is quite wide-ranging and unrelated to
> > the need for the typCodePage patch.
> >
> > I wanted to get the typCodePage patches committed first so that I can
> > get on with mapnik.txt patches and also improve some bits of
> > TypCompiler as the last part of the utf8 patch.
> >
> > My compilation system doesn't warn about unused imports - what
> > options/tool do you use for this?
> >
> > Concerning the close, the FileInputStream is closed, which should
> > release any OS file handle; it's just the InputStreamReader &
> > BufferedReader that aren't, but these are just Java data structures.
> >
> > Ticker
> >
> > On Fri, 2020-01-17 at 10:27 +0000, Gerd Petermann wrote:
> > > Hi Ticker,
> > >
> > > ah, seems I got another post wrong. I thought you'd work on a
> > > typCodePage_v4.patch which would use try-with-resources.
> > > With typCodePage_v3.patch  and typCodePage-test.patch I still see
> > > some warnings for TypCompiler:
> > > - unused imports
> > > - br is not closed (line 232)
> > >
> > > Gerd
> > >
> > > ________________________________________
> > > From: Ticker Berkin <[hidden email]>
> > > Sent: Friday, 17 January 2020 10:17
> > > To: Gerd Petermann
> > > Subject: Re: AW: [mkgmap-dev] TYP files and character encoding
> > >
> > > Hi Gerd
> > >
> > > Yes. Sorry - I didn't explain it at all well. This needs to be
> > > applied
> > > at the same time as typCodePage_v3.patch from 14-Jan
> > >
> > > Ticker
> > >
> > > On Fri, 2020-01-17 at 08:48 +0000, Gerd Petermann wrote:
> > > > Hi Ticker,
> > > >
> > > > I don't understand this patch. Do I have to use it in
> > > > combination
> > > > with another one?
> > > >
> > > > Gerd
> > > >
> > > > ________________________________________
> > > > From: mkgmap-dev <[hidden email]> on behalf of
> > > > Ticker Berkin <[hidden email]>
> > > > Sent: Friday, 17 January 2020 00:02
> > > > To: Development list for mkgmap
> > > > Subject: Re: [mkgmap-dev] TYP files and character encoding
> > > >
> > > > Hi Gerd
> > > >
> > > > I've just noticed that a change to a function signature stopped a
> > > > test from compiling, so here is a patch for that
> > > >
> > > > Ticker

Re: StandardCharsets and try (with-resources)

Ticker Berkin
In reply to this post by Gerd Petermann
Hi Gerd

The line.trim() deletion wasn't intended - I'll put it back.

I think it best to change the IOException handling in sortForCodepage()
to throw an ExitException. Maybe the intent was to return some default
"Sort", ie sortForCodepage(1252), but this seems wrong.
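
A rough sketch of the shape of that change, under stated assumptions: the real SrtTextReader code is not shown in this thread, so SortLoaderSketch, SortSource and the nested ExitException below are hypothetical stand-ins (mkgmap has its own ExitException class). It only illustrates replacing a retry-by-recursion in the catch block with a thrown exception.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration; names and structure are assumptions, not mkgmap code.
class SortLoaderSketch {
    /** Stand-in for mkgmap's ExitException. */
    static class ExitException extends RuntimeException {
        ExitException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    /** Hypothetical source that opens the sort description for a code page. */
    interface SortSource {
        InputStream open(int codepage) throws IOException;
    }

    static String sortForCodepage(int codepage, SortSource source) {
        try (InputStream in = source.open(codepage)) {
            in.read(); // stands in for parsing the sort description
            return "cp" + codepage;
        } catch (IOException e) {
            // Calling sortForCodepage(codepage, source) again here would fail
            // the same way and recurse until the stack overflows.
            throw new ExitException(
                    "Cannot read sort description for code page " + codepage, e);
        }
    }
}
```

Failing fast keeps the original IOException as the cause, so the user sees why the sort description could not be read instead of a stack overflow.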

I started looking at CombinedStyleFileLoader. It does its Input and
Output in the default charset and I don't know if anyone uses it
anymore, but I didn't want to change any of its behaviour, so I thought
best not to touch it.

Regarding a new class for files that use '#' for comments: some of these
already use TokenScanner, which can be configured. The only other one
that a quick grep finds is the character transliteration tables, so I
don't think it is worth it at the moment.
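
For reference, the kind of helper under discussion could be as small as the sketch below. It is hypothetical (CommentedLineReader is not an mkgmap class) and assumes that only whole-line comments starting with '#' need to be skipped; it also applies the line.trim() step mentioned above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of mkgmap.
class CommentedLineReader {
    /**
     * Returns the trimmed, non-empty lines of the source, skipping lines
     * whose first non-blank character is '#'. Closes the source when done.
     */
    static List<String> readEntries(Reader source) throws IOException {
        List<String> entries = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // blank line or comment
                }
                entries.add(line);
            }
        }
        return entries;
    }
}
```

Taking a Reader rather than a file name leaves the charset decision with the caller, for example a reader opened with StandardCharsets.UTF_8.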

Ticker

On Fri, 2020-01-17 at 16:20 +0000, Gerd Petermann wrote:

> Hi Ticker,
>
> - I think there is a small change in the handling of lines in
> OsmMapDataSource.readDeleteTagsFile. The old code used
> line = line.trim();
> This is missing now. Is that intended?
>
> - I also don't understand the line with your comment "// ??? I don't
> understand this" . Looks like an endless recursive call?
>
> - You sometimes replaced FileReader, but not in
> CombinedStyleFileLoader. Why not?
>
> We have a few places where we read files which use "#" for comment
> lines.  Would it help to create a class for that?
>
> I made a few minor mods, see attachment.
>
> Gerd

Re: StandardCharsets and try (with-resources)

Ticker Berkin
Hi Gerd

Here is new version of patch with line.trim() restored and exception
thrown.

@mike - It is likely that this will fix your problem with the display
of option text with non-ascii characters; with the previous code, mkgmap
*read* the text incorrectly unless your local charset was utf-8.

Ticker

On Fri, 2020-01-17 at 17:04 +0000, Ticker Berkin wrote:

> Hi Gerd
>
> The line.trim() deletion wasn't intended - I'll put it back.
>
> I think it best to change the IOException handling in sortForCodepage()
> to throw an ExitException. Maybe the intent was to return some default
> "Sort", ie sortForCodepage(1252), but this seems wrong.
>
> I started looking at CombinedStyleFileLoader. It does its Input and
> Output in the default charset and I don't know if anyone uses it
> anymore, but I didn't want to change any of its behaviour, so I
> thought best not to touch it.
>
> Regarding a new class for files that use '#' for comments: some of these
> already use TokenScanner, which can be configured. The only other one
> that a quick grep finds is the character transliteration tables, so I
> don't think it is worth it at the moment.
>
> Ticker
_______________________________________________
mkgmap-dev mailing list
[hidden email]
http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev

utf8_v3.patch (32K) Download Attachment