Nightly 20070202 and new encoding switch
Jonathan Perret
Jonathan.Perret at augure.com
Mon Feb 5 08:16:32 EST 2007
Dirk:
> 1.) different encodings: This one should be solved with the encoding
> attribute, but while playing with this, I still had problems outputting
> characters that are allowed in one codepage, but discouraged by the XML
> standard. See
> http://www.w3.org/TR/REC-xml/#charsets where some characters that are
> still allowed in the windows-1252 codepage are discouraged in XML,
> esp. most of the characters in the band [x80-x9f].
Please note that XML discourages or forbids certain Unicode codepoints,
not bytes in specific codepages. Specifically, windows-1252 does not
map any byte to a codepoint in the range [U+0080-U+009F]
(see http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx).
For example, byte 0x80 in windows-1252 maps to U+20AC (Euro sign).
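
The byte-versus-codepoint distinction can be checked with a small sketch. It is written in Python rather than vss2svn's Perl, and it assumes Python's built-in "cp1252" codec is an acceptable stand-in for the Microsoft table linked above. The function name cp1252_high_band is made up for this illustration.

```python
# Sketch: where do windows-1252 bytes 0x80-0x9F land in Unicode?
# Relies on Python's built-in "cp1252" codec, not on ssphys or vss2svn.

def cp1252_high_band():
    """Map each byte in 0x80-0x9F to its Unicode codepoint, or None if undefined."""
    mapping = {}
    for byte in range(0x80, 0xA0):
        try:
            mapping[byte] = ord(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # The codec leaves this byte unassigned.
            mapping[byte] = None
    return mapping


if __name__ == "__main__":
    for byte, codepoint in cp1252_high_band().items():
        shown = "undefined" if codepoint is None else f"U+{codepoint:04X}"
        print(f"0x{byte:02X} -> {shown}")
```

With this codec, every assigned byte in the band decodes to a codepoint outside [U+0080-U+009F], and the few unassigned bytes decode to nothing at all.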
> 2.) real garbaged content: real garbage is hard to detect and therefore
> ssphys does only filter so called control characters determined by the
> "iscntrl" function. This decision is based upon the current locale. A
> few other characters are filtered by vss2svn after streaming in the XML
> output
>
> > $gSysOut =~ s/\x00//g; # remove null bytes
> > $gSysOut =~ s/.\x08//g; # yes, I've seen VSS store backspaces in names!
> > # allow all characters in the windows-1252 codepage: see http://de.wikipedia.org/wiki/Windows-1252
> > $gSysOut =~ s/[\x00-\x09\x11\x12\x14-\x1F\x81\x8D\x8F\x90\x9D]/_/g; # just to be sure
I would hazard that removing just [\x00-\x09\x11\x12\x14-\x1F] should
be safe enough for any windows codepage.
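
As a rough illustration of that narrower filter, here is a Python sketch (not the Perl that vss2svn actually uses). The character class is copied verbatim from the line above. The function name replace_control_bytes and the choice to operate on raw bytes are assumptions made for the example.

```python
import re

# Character class copied verbatim from the message; everything else is illustrative.
_CONTROL_BYTES = re.compile(rb"[\x00-\x09\x11\x12\x14-\x1F]")


def replace_control_bytes(raw: bytes) -> bytes:
    """Replace each byte matched by the class with an underscore.

    Bytes in 0x80-0x9F are deliberately left alone, since in windows-1252
    they stand for printable characters such as the Euro sign.
    """
    return _CONTROL_BYTES.sub(b"_", raw)


if __name__ == "__main__":
    print(replace_control_bytes(b"name\x01with\x80junk"))
```

Run before the bytes are decoded, this leaves the high band for the codepage conversion to handle.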
Cheers,
--jonathan