Nightly 20070202 and new encoding switch

Jonathan Perret Jonathan.Perret at augure.com
Mon Feb 5 08:16:32 EST 2007


Dirk:
> 1.) different encodings: This one should be solved with the encoding
> attribute, but while playing with this, I still had problems outputting
> characters that are allowed in one codepage, but discouraged by the XML
> standard. See
> http://www.w3.org/TR/REC-xml/#charsets where some characters that are
> still allowed in the windows-1252 codepage are discouraged in XML,
> esp. most of the characters in the band [x80-x9f].

Please note that XML discourages or forbids certain Unicode code points,
not bytes in specific codepages. Specifically, windows-1252 does not
map any of its assigned bytes to a code point in the range [0x80-0x9F]
(see http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx).
For example, byte 0x80 in windows-1252 maps to U+20AC (Euro sign).
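
To check the mapping yourself, here is a small sketch in Python (not
vss2svn's Perl). It assumes Python's built-in "cp1252" codec follows the
Microsoft table; the function and variable names are made up for
illustration.

```python
def cp1252_high_band():
    """Map each byte 0x80-0x9F to its Unicode code point, or None if undefined."""
    mapping = {}
    for b in range(0x80, 0xA0):
        try:
            mapping[b] = ord(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            mapping[b] = None  # this codec has no assignment for the byte
    return mapping


mapping = cp1252_high_band()

# Bytes the codec refuses to decode at all.
undefined = sorted(b for b, cp in mapping.items() if cp is None)

# Bytes whose code point falls in the range XML discourages (U+0080-U+009F).
in_c1 = [b for b, cp in mapping.items() if cp is not None and 0x80 <= cp <= 0x9F]

print(f"0x80 -> U+{mapping[0x80]:04X}")
print("undefined bytes:", [f"0x{b:02X}" for b in undefined])
print("bytes mapped into U+0080-U+009F:", in_c1)
```

With this codec, the bytes that cannot be decoded are the same five that
the quoted vss2svn expression replaces (0x81, 0x8D, 0x8F, 0x90, 0x9D).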

> 2.) real garbaged content: real garbage is hard to detect and therefore
> ssphys only filters so-called control characters, as determined by the
> "iscntrl" function. This decision is based upon the current locale. A
> few other characters are filtered by vss2svn after streaming in the XML
> output:
>
>  >    $gSysOut =~ s/\x00//g; # remove null bytes
>  >    $gSysOut =~ s/.\x08//g; # yes, I've seen VSS store backspaces in names!
>  >    # allow all characters in the windows-1252 codepage: see
>  >    # http://de.wikipedia.org/wiki/Windows-1252
>  >    $gSysOut =~ s/[\x00-\x09\x11\x12\x14-\x1F\x81\x8D\x8F\x90\x9D]/_/g; # just to be sure

I would hazard that removing just [\x00-\x09\x11\x12\x14-\x1F] should
be safe enough for any windows codepage.
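
An alternative would be to filter on code points after decoding, rather
than on raw bytes. The following is only a sketch in Python, not what
vss2svn or ssphys actually do: the function name, the "_" replacement
and the cp1252 default are my assumptions. Note that the XML 1.0 Char
production permits only #x9, #xA and #xD below #x20, so the sketch
follows that rule instead of the byte class quoted above.

```python
import re

# Complement of the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def sanitize_for_xml(raw: bytes, encoding: str = "cp1252") -> str:
    """Decode raw bytes, then replace code points XML 1.0 forbids with '_'.

    Bytes the codec cannot decode become U+FFFD (errors="replace"),
    which is itself a legal XML character.
    """
    text = raw.decode(encoding, errors="replace")
    return _XML_ILLEGAL.sub("_", text)


if __name__ == "__main__":
    sample = b"a\x00b\x08c\x80\tz"
    print(repr(sanitize_for_xml(sample)))
```

Because the check happens after decoding, the same function should work
for other Windows codepages by passing a different encoding name.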

Cheers,
--jonathan