Does not work with structured arrays #42
Here is a rough idea of how to parse a structured array. It departs a bit from how the library does things, but this stand-alone example worked with a structured array I was working with. Maybe it will be helpful for others.

const dtypes = {
  u1: {
    bytesPerElement: Uint8Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint8'
  },
  u2: {
    bytesPerElement: Uint16Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint16'
  },
  i1: {
    bytesPerElement: Int8Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt8'
  },
  i2: {
    bytesPerElement: Int16Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt16'
  },
  u4: {
    // unsigned 32-bit
    bytesPerElement: Uint32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint32'
  },
  i4: {
    bytesPerElement: Int32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt32'
  },
  u8: {
    bytesPerElement: BigUint64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getBigUint64'
  },
  i8: {
    bytesPerElement: BigInt64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getBigInt64'
  },
  f4: {
    bytesPerElement: Float32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getFloat32'
  },
  f8: {
    bytesPerElement: Float64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getFloat64'
  },
};
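// For context, the header that the regexes below pick apart looks roughly like
// this for a one-dimensional structured array (a representative example, not
// taken verbatim from the issue):
//   {'descr': [('id', '<u4'), ('value', '<f8')], 'fortran_order': False, 'shape': (100,), }
// i.e. 'descr' is a list of (name, format) pairs rather than a single dtype string.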
function parse(arrayBufferContents) {
  const dv = new DataView(arrayBufferContents);
  // bytes 8-9 hold the header length (little-endian); data starts right after the header
  const headerLength = dv.getUint16(8, true);
  const offsetBytes = 10 + headerLength;
  const hcontents = new TextDecoder("utf-8").decode(
    new Uint8Array(arrayBufferContents.slice(10, 10 + headerLength))
  );
  // fortranOrder is captured but unused; 1-D structured arrays are unaffected by it
  const [, descr, fortranOrder, shape] = hcontents.match(
    /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
  );
  const columns = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
    ([, columnName, endianess, dtype]) => ({
      columnName,
      littleEndian: endianess === "<",
      bytesPerElement: dtypes[dtype].bytesPerElement,
      dvFn: (...args) => dv[dtypes[dtype].dvFnName](...args),
    })
  );
  const numRows = Number(shape.match(/\((\d+),\)/)[1]);
  // one row occupies the sum of all column widths
  const stride = columns
    .map((c) => c.bytesPerElement)
    .reduce((sum, numBytes) => sum + numBytes, 0);
  const data = [];
  let i, c, offset, dataIdx, row;
  const numColumns = columns.length;
  for (i = 0; i < numRows; i++) {
    offset = 0;
    row = {};
    // row i starts at the end of the header plus i whole rows
    dataIdx = offsetBytes + i * stride;
    for (c = 0; c < numColumns; c++) {
      row[columns[c].columnName] = columns[c].dvFn(dataIdx + offset, columns[c].littleEndian);
      offset += columns[c].bytesPerElement;
    }
    data.push(row);
  }
  return data;
}
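A minimal usage sketch (the URL and column names are hypothetical, just to show the shape of the output):

fetch("/data/out.npy")
  .then((resp) => resp.arrayBuffer())
  .then((buf) => {
    const rows = parse(buf);
    // each row comes back as an object keyed by column name,
    // e.g. { id: 1, value: 3.14 }
    console.log(rows[0]);
  });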
Wow, thank you for the thoughtful comments @jeffpeck10x!! Would you be interested in turning this into a PR? Otherwise I'm happy to incorporate these suggestions in my next dev push :)
Thanks @j6k4m8. I don't think I have the time to generalize this into a more complete approach. I did end up taking this a little farther, although I ultimately used Apache Arrow for my use-case. That said, here is some helpful code to get this started. The following function should reliably pull the column descriptions and row count out of the header:

function parseHeaderContents(hcontents) {
  const [, descr, fortranOrder, shape] = hcontents.match(
    /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
  );
  // note: this version assumes the dtypes table also maps each format to its
  // TypedArray constructor, e.g. f8: { TypedArray: Float64Array, dvFnName: 'getFloat64' }
  const columns = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
    ([, columnName, endianess, dtype]) => ({
      columnName,
      littleEndian: endianess === "<",
      dvFnName: dtypes[dtype].dvFnName,
      TypedArray: dtypes[dtype].TypedArray,
    })
  );
  const numRows = Number(shape.match(/\((\d+),\)/)[1]);
  return { columns, numRows };
}

I ended up writing the following:

function parse(arrayBufferContents) {
  const dv = new DataView(arrayBufferContents);
  const headerLength = dv.getUint16(8, true);
  const offsetBytes = 10 + headerLength;
  const hcontents = new TextDecoder("utf-8").decode(
    new Uint8Array(arrayBufferContents.slice(10, 10 + headerLength))
  );
  const { columns, numRows } = parseHeaderContents(hcontents);
  const columnNames = columns.map((c) => c.columnName);
  // cumulative byte offsets of each column within a row: [0, w0, w0+w1, ...];
  // the final total popped off the end is the row stride
  const offsets = columns
    .map((c) => c.TypedArray.BYTES_PER_ELEMENT)
    .reduce((arr, v, i) => [...arr, v + arr[i]], [0]);
  const stride = offsets.pop();
  const dvFnNames = columns.map((c) => c.dvFnName);
  // one DataView per column, each anchored at that column's first element
  const dataViews = offsets.map(
    (offset) => new DataView(arrayBufferContents, offsetBytes + offset)
  );
  const dataViewGetters = dataViews.map((view, i) => view[dvFnNames[i]].bind(view));
  // one pre-allocated TypedArray per column (columnar output, not row objects)
  const data = columnNames.reduce(
    (obj, columnName, i) => ({
      ...obj,
      [columnName]: new columns[i].TypedArray(numRows),
    }),
    {}
  );
  for (let j = 0; j < columnNames.length; j++) {
    const columnName = columnNames[j];
    const getter = dataViewGetters[j];
    const column = data[columnName];
    for (let i = 0; i < numRows; i++) {
      column[i] = getter(i * stride, columns[j].littleEndian);
    }
  }
  return data;
}

I think there are ideas in there that can be used to generalize the approach.
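One note on the sketch above: it assumes a dtypes table that, unlike the one in my first comment, also carries the TypedArray constructors. Something along these lines (abbreviated):

const dtypes = {
  u1: { TypedArray: Uint8Array, dvFnName: 'getUint8' },
  i4: { TypedArray: Int32Array, dvFnName: 'getInt32' },
  f4: { TypedArray: Float32Array, dvFnName: 'getFloat32' },
  f8: { TypedArray: Float64Array, dvFnName: 'getFloat64' },
  // ...and likewise for u2, i1, i2, u4, u8, i8
};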
I know this all goes in a slightly different direction than your library, but using your library as the base gave me a really good head start, just to even realize that I could use DataViews and offsets, and where to find the metadata in the headers. So if any of this helps improve the versatility of your library, great! And full circle. Don't feel any pressure to do this, though. In my particular use-case, as mentioned, I am all set with Apache Arrow IPC (I ultimately needed to send this data between two processes, so it's more suited anyway).
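For anyone curious about that route, here is a minimal sketch of an Arrow IPC round trip with the apache-arrow npm package (the columns and values are made up for illustration):

import { tableFromArrays, tableToIPC, tableFromIPC } from "apache-arrow";

// build a columnar table directly from typed arrays
const table = tableFromArrays({
  id: new Uint32Array([1, 2, 3]),
  value: new Float64Array([0.5, 1.5, 2.5]),
});

// serialize to Arrow's IPC format, e.g. to hand to another process
const ipcBytes = tableToIPC(table);

// ...and reconstruct it on the receiving side
const received = tableFromIPC(ipcBytes);
console.log(received.getChild("value").toArray()); // Float64Array [0.5, 1.5, 2.5]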
.npy files can contain structured arrays, as described here, but these fail to open with an error.

Example: in Python, create the structured array, serve the out.npy file, and then try to load that out.npy file.