trouble with demo.json validation #1

tschaub · 2012-02-28T00:23:49Z

I'm trying to write some tests for a browser implementation that use the demo.json described in the spec. I'm seeing trouble once I hit row 215, col 222 - the 55262nd id. If I understand right, this should be "encoded" as 55296. I notice that some parsers mention 55296 to 57343 as the range reserved for UTF-16 surrogates, which cannot be converted to UTF-8.

I'm serving up my tests (with <meta http-equiv="content-type" content="text/html; charset=UTF-8">) and demo.json with Apache to Chrome 17 (same behavior on Firefox 10). Thanks for any hints on what might be up. I'm not entirely confident this is UTF-8 through and through.
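
A minimal sketch of the arithmetic above, assuming the UTFGrid id encoding works as the spec describes (add 32, then skip the codes for `"` and `\`). The function names are made up for illustration and are not part of any library:

```javascript
// Hypothetical helpers illustrating the UTFGrid id <-> code mapping.
function encodeId(id) {
    var code = id + 32;          // skip the control characters
    if (code >= 34) { code++; }  // skip '"' (34)
    if (code >= 92) { code++; }  // skip '\' (92)
    return code;
}

function decodeCode(code) {
    if (code >= 93) { code--; }
    if (code >= 35) { code--; }
    return code - 32;
}

// True when the code falls in the UTF-16 surrogate range 55296 to 57343.
function isSurrogate(code) {
    return code >= 0xD800 && code <= 0xDFFF;
}

console.log(encodeId(55262));              // 55296 (0xD800)
console.log(isSurrogate(encodeId(55262))); // true
console.log(isSurrogate(encodeId(55261))); // false
```

Under these assumptions, id 55262 is the first one whose encoded value lands in the surrogate range, which matches where the trouble starts.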

tschaub · 2012-03-01T01:38:42Z

I've put together a basic Jasmine test spec to demonstrate the issue I'm seeing. Note that this is a fork of the mapbox/mbtiles-spec repo with the demo.json referenced in the latest UTFGrid spec.

I couldn't find any other UTFGrid-related tests for the client. Let me know if I've missed some - seeing working tests would help figure out what might be going wrong on my side.

Thanks.

springmeyer · 2012-03-01T02:50:23Z

@tschaub - thanks for this report. Nothing immediately comes to mind about why this is failing. It's certainly possible it is a problem with the demo.json. I should have some time next week to dig into this a bit more.

/cc @kkaefer - any thoughts?

saik0 · 2013-06-05T11:35:16Z

Surrogates are invalid UTF-8. From RFC 3629:

The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16
encoding form (as surrogate pairs) and do not directly represent
characters.
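
To see the prohibition in practice, here is a small sketch that assumes an environment with the standard `TextEncoder` (current browsers and Node.js). `utf8Bytes` is a hypothetical helper name. A lone surrogate has no UTF-8 form, so the encoder substitutes U+FFFD:

```javascript
// TextEncoder always produces UTF-8. A lone surrogate cannot be encoded,
// so it is replaced with U+FFFD (bytes EF BF BD).
var encoder = new TextEncoder();

function utf8Bytes(str) {
    return Array.from(encoder.encode(str));
}

console.log(utf8Bytes("\uD7FF"));       // [237, 159, 191]      last code point before the surrogates
console.log(utf8Bytes("\uD800"));       // [239, 191, 189]      lone surrogate became U+FFFD
console.log(utf8Bytes("\uD83D\uDE00")); // [240, 159, 152, 128] a valid pair encodes as one code point
```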

One way to deal with it would be to treat the strings as UTF-16 and decode them into an array of Numbers. We would then be able to use the entire Unicode range of 0 to 0x10FFFF (minus characters that are invalid in JSON).

saik0 · 2013-06-05T11:52:27Z

Something like this, though with saner error handling:

// Decode a JavaScript (UTF-16) string into an array of Unicode code points.
// Returns false if the string contains an unpaired surrogate.
function utf16ToUnicode(str) {
    var utf32 = 0,
        isPair = false,
        out = [],
        len = str.length;

    for (var i = 0, code; i < len; i++) {
        code = str.charCodeAt(i);
        if (!isPair) {
            if ((code & 0xFC00) === 0xD800) {
                // High surrogate: start of a new pair
                utf32 = ((code & 0x3FF) << 10) + 0x10000;
                isPair = true;
            } else if ((code & 0xFC00) === 0xDC00) {
                // Unexpected low surrogate
                return false;
            } else {
                // BMP code point, pass straight through
                out.push(code);
            }
        } else {
            // When isPair is true, we expect the low half of a surrogate pair
            if ((code & 0xFC00) === 0xDC00) {
                // Legal low surrogate
                utf32 |= (code & 0x3FF);
                out.push(utf32);
            } else {
                // Incomplete surrogate pair
                return false;
            }
            utf32 = 0;
            isPair = false;
        }
    }
    // A high surrogate at the very end of the string never got its low half
    if (isPair) {
        return false;
    }
    return out;
}

Edit: Fixed decoding bug
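
For comparison, a shorter sketch of the same idea that assumes an ES2015-capable engine, where string iteration and `codePointAt` already combine surrogate pairs. `codePointsOrFalse` is a hypothetical name, and this is an alternative to the hand-rolled loop above rather than a replacement for it:

```javascript
// Returns an array of code points, or false if the string
// contains an unpaired surrogate.
function codePointsOrFalse(str) {
    const out = [];
    for (const ch of str) {               // iterates by code point, not code unit
        const cp = ch.codePointAt(0);
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false;                 // a lone surrogate survived iteration
        }
        out.push(cp);
    }
    return out;
}

console.log(codePointsOrFalse("a\uD83D\uDE00")); // [97, 128512]
console.log(codePointsOrFalse("\uD800"));        // false
```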
