trouble with demo.json validation #1

Open
tschaub opened this issue Feb 28, 2012 · 4 comments

Comments

tschaub commented Feb 28, 2012

I'm trying to write some tests for a browser implementation that use the demo.json described in the spec. I'm seeing trouble once I hit row 215, col 222 - the 55262nd id. If I understand correctly, this should be "encoded" as 55296. I notice that some parsers mention 55296 to 57343 (U+D800 to U+DFFF) as the range reserved for UTF-16 surrogates, which cannot be converted to UTF-8.
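To make the arithmetic behind that observation concrete, here is a minimal sketch. It assumes the id-to-codepoint rule as I understand it from the UTFGrid spec (add 32, then skip past 34 and 92); the function names `encodeId`, `decodeId`, and `isSurrogate` are hypothetical and not taken from any existing client.

```javascript
// Assumed UTFGrid rule: add 32, then skip '"' (34) and '\' (92),
// the two characters that would need escaping inside a JSON string.
function encodeId(id) {
    var code = id + 32;
    if (code >= 34) code += 1;
    if (code >= 92) code += 1;
    return code;
}

// Inverse of encodeId under the same assumption.
function decodeId(code) {
    if (code >= 93) code -= 1;
    if (code >= 35) code -= 1;
    return code - 32;
}

// True for code units in the UTF-16 surrogate range U+D800..U+DFFF.
function isSurrogate(code) {
    return code >= 0xD800 && code <= 0xDFFF;
}

var encoded = encodeId(55262);
console.log(encoded, isSurrogate(encoded)); // 55296 true
```

Under that rule, id 55261 maps to 55295 (U+D7FF), the last code point below the surrogate range, so 55262 is the first id that lands on a surrogate.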

I'm serving up my tests (with <meta http-equiv="content-type" content="text/html; charset=UTF-8">) and demo.json with Apache to Chrome 17 (same behavior on Firefox 10). Thanks for any hints on what might be up. I'm not entirely confident this is UTF-8 through and through.

tschaub commented Mar 1, 2012

I've put together a basic Jasmine test spec to demonstrate the issue I'm seeing. Note that this is a fork of the mapbox/mbtiles-spec repo with the demo.json referenced in the latest UTFGrid spec.

I couldn't find any other UTFGrid related tests for the client. Let me know if I've missed some - seeing working tests would help figure out what might be going wrong on my side.

Thanks.

@springmeyer

@tschaub - thanks for this report. Nothing immediately comes to mind about why this is failing. It's certainly possible that the problem is in demo.json itself. I should have some time next week to dig into this a bit more.

/cc @kkaefer - any thoughts?

saik0 commented Jun 5, 2013

Surrogates are invalid UTF-8. From RFC 3629:

The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16
encoding form (as surrogate pairs) and do not directly represent
characters.

One way to deal with it would be to treat the strings as UTF-16 and decode them into an array of Numbers. We would then be able to use the entire Unicode range of 0 to 0x10FFFF (minus the characters that are invalid in JSON).
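The restriction quoted above can be observed directly in a JavaScript engine. The following is a small sketch, not part of any UTFGrid client: it relies on the standard behavior that `encodeURIComponent` throws a `URIError` when asked to UTF-8 encode an unpaired surrogate. The helper name `isUtf8Encodable` is hypothetical.

```javascript
// Returns true if str can be UTF-8 encoded, i.e. it has no unpaired surrogate.
// encodeURIComponent performs UTF-8 encoding and throws URIError on lone surrogates.
function isUtf8Encodable(str) {
    try {
        encodeURIComponent(str);
        return true;
    } catch (e) {
        if (e instanceof URIError) return false;
        throw e; // anything else is unexpected, so rethrow
    }
}

var lone = String.fromCharCode(0xD800);         // unpaired high surrogate (55296)
var pair = String.fromCharCode(0xD83D, 0xDE00); // valid surrogate pair for U+1F600

console.log(isUtf8Encodable(lone)); // false
console.log(isUtf8Encodable(pair)); // true
```

A properly paired high and low surrogate encodes fine because together they stand for one code point above U+FFFF; only the bare code unit has no UTF-8 form.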

saik0 commented Jun 5, 2013

Something like this, with saner error handling

function utf16ToUnicode (str) {
    var utf32 = 0,
        isPair = false,
        out = [],
        len = str.length;

    for(var i = 0, code; i < len; i++) {
        code = str.charCodeAt(i);
        if (!isPair) {
            if ((code  & 0xFC00) == 0xD800) {
                // High surrogate of new pair sequence
                utf32 = ((code & 0x3ff) << 10) + 0x10000;
                isPair = true;
            } else if ((code & 0xFC00) == 0xDC00) {
                // Unexpected Low Surrogate
                return false;
            } else {
                // BMP code point, pass straight through
                out.push(code);
            }
        } else {
            // When isPair is true, we expect a continuation of a surrogate pair
            if ((code & 0xFC00) == 0xDC00) {
                // Legal low surrogate
                utf32 |= (code & 0x3FF);
                out.push(utf32);
            } else {
                // Incomplete surrogate pair
                return false;
            }
            utf32 = 0;
            isPair = false;
        }
    }
    if (isPair) {
        // Input ended on a high surrogate with no matching low surrogate
        return false;
    }
    return out;
}

Edit: Fixed decoding bug
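For comparison, the same decoding can be sketched more compactly on engines that provide `String.prototype.codePointAt` (ES2015 and later, which is an assumption here since it postdates this thread). The name `toCodePoints` is hypothetical; it mirrors the return convention of `utf16ToUnicode` above, giving an array of code points or `false` on an unpaired surrogate.

```javascript
// Decode a UTF-16 string into an array of Unicode code points.
// Returns false if the string contains an unpaired surrogate.
function toCodePoints(str) {
    var out = [];
    for (var i = 0; i < str.length; i++) {
        var cp = str.codePointAt(i);
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            // codePointAt returns the bare code unit when a surrogate is unpaired
            return false;
        }
        out.push(cp);
        if (cp > 0xFFFF) {
            i++; // a pair occupies two code units, so skip the low surrogate
        }
    }
    return out;
}

console.log(toCodePoints("a\uD83D\uDE00b")); // [ 97, 128512, 98 ]
console.log(toCodePoints("\uD800"));         // false
```

Returning `false` keeps the sketch comparable with the function above; throwing an error or substituting U+FFFD would be equally reasonable choices for a real client.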
