
grouping considering existing and non existing indices #12

Open

ttsesm opened this issue Jun 30, 2020 · 4 comments

@ttsesm commented Jun 30, 2020

Hi guys, I am trying to use your library for my project but I am stuck.

I have a couple of numpy arrays:

orig = [[28021.22333333,  6585.53333333,     0. ],
 [28021.22333333,  6585.53333333,     0.        ],
 [26723.52333333,  6587.48666667,     0.        ],
 [26723.52333333,  6587.48666667,     0.        ],
 [26063.11,       13089.56,           0.        ],
 [26063.11,       13089.56,           0.        ],
 [27424.91,       13091.4,            0.        ],
 [27424.91,       13091.4,            0.        ],
 [28833.60333333, 12641.65333333,     0.        ],
 [28833.60333333, 12641.65333333,     0.        ],
 [26125.33,        7954.18166667,     0.        ],
 [26125.33,        7954.18166667,     0.        ],
 [26121.29666667,  7956.72633333,     0.        ],
 [26121.29666667,  7956.72633333,     0.        ],
 [26116.26,        7957.80833333,     0.        ],
 [26116.26,        7957.80833333,     0.        ],
 [26110.98333333,  7957.263,          0.        ],
 [26110.98333333,  7957.263,          0.        ],
 [26106.27,        7955.17333333,     0.        ],
 [26106.27,        7955.17333333,     0.        ],
 [26102.84,        7951.85733333,     0.        ],
 [26102.84,        7951.85733333,     0.        ]]

and

idxs = [ 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 20, 21]

tri = [731, 703, 703, 731, 731, 731, 731, 693, 673, 699, 689, 731, 727, 731, 731, 731, 731, 731, 730]

pnts = [[28035.61081192,  6657.82528209,  2800.  ],
 [27951.42292993,  6561.84728091,  2800.        ],
 [28076.63625815,  6536.92743701,  2800.        ],
 [28139.0775588,   6773.36600593,  2800.        ],
 [27990.76839321,  6805.17674429,  2800.        ],
 [27856.70943257,  6734.2138896,   2800.        ],
 [27799.62835447,  6593.68175023,  2800.        ],
 [27846.23402973,  6449.33687603,  2800.        ],
 [27974.71914494,  6368.71983786,  2800.        ],
 [28124.96408673,  6389.55224384,  2800.        ],
 [28226.66757706,  6502.08637967,  2800.        ],
 [28232.24142249,  6653.66627254,  2800.        ],
 [28382.4101748,  6673.10904354,  2800.        ],
 [28315.56368133,  6812.44564901,  2800.        ],
 [28197.8230677,   6912.54705367,  2800.        ],
 [28049.54675563,  6956.10481526,  2800.        ],
 [27896.37306654,  6935.58740108,  2800.        ],
 [27764.78712281,  6854.54245845,  2800.        ],
 [27677.54132953,  6726.98339422,  2800.        ]]

Initially I wanted to group the values in idxs, tri and pnts based on the values of idxs, which are indices into the rows of orig, so that entries referring to the same row of orig end up in the same group. For example, I would like to get:

idxs = [[0,1], [2,3], [4,5], [7], [8,9], [10,11], [12,13], [14,15], [17], [18], [20,21]]

tri = [[731, 703], [703, 731], [731, 731], [731], [693, 673], [699, 689], [731, 727], [731, 731], [731], [731], [731, 730]]

and

pnts = [[[28035.61081192,  6657.82528209,  2800.  ],
     [27951.42292993,  6561.84728091,  2800.        ]],
     [[28076.63625815,  6536.92743701,  2800.        ],
     [28139.0775588,   6773.36600593,  2800.        ]],
     [[27990.76839321,  6805.17674429,  2800.        ],
     [27856.70943257,  6734.2138896,   2800.        ]],
     [[27799.62835447,  6593.68175023,  2800.        ]],
     [[27846.23402973,  6449.33687603,  2800.        ],
     [27974.71914494,  6368.71983786,  2800.        ]],
     [[28124.96408673,  6389.55224384,  2800.        ],
     [28226.66757706,  6502.08637967,  2800.        ]],
     [[28232.24142249,  6653.66627254,  2800.        ],
     [28382.4101748,  6673.10904354,  2800.        ]],
     [[28315.56368133,  6812.44564901,  2800.        ],
     [28197.8230677,   6912.54705367,  2800.        ]],
     [[28049.54675563,  6956.10481526,  2800.        ]],
     [[27896.37306654,  6935.58740108,  2800.        ]],
     [[27764.78712281,  6854.54245845,  2800.        ],
     [27677.54132953,  6726.98339422,  2800.        ]]]

Also, imagine that in the end I would have to apply the same operation to corresponding matrices with a few million entries. In any case, I was able to group my output by applying the following:

import numpy_indexed as npi

# orig, idxs, tri and pnts are the numpy arrays listed above
eq = npi.group_by(orig[idxs])
print(eq.split(idxs))
print(eq.split(tri))
print(eq.split(pnts))

However, I have two questions:

  1. Is it possible to get the output sorted right away from eq.split(), without the need for an extra step?
  2. eq.split() groups the data based on the indices present in idxs. However, if one of the rows of orig had no index in idxs (the corresponding values in tri and pnts would also be missing), how could I put an empty list at the corresponding position, so that my output arrays keep one entry per unique row of orig?

e.g. if 17 were not in my initial idxs array, my output array should be: [[0, 1], [2, 3], [4, 5], [7], [8, 9], [10, 11], [12, 13], [14, 15], [], [18], [20, 21]]

In practice, I could get the non-existing indices with:

start_idxs = np.arange(orig.shape[0])                      # all row indices of orig
no_existing_idxs = start_idxs[~np.isin(start_idxs, idxs)]  # row indices missing from idxs

but I am not sure whether this would help somehow.
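For illustration, this is the kind of padding I have in mind, written as a plain-numpy sketch (untested; the helper names are my own). It builds one slot per unique row of orig, in order of first appearance, and leaves [] for rows that never occur in idxs:

import numpy as np

idxs = np.asarray(idxs)

# id of the unique row that each row of orig belongs to (np.unique numbers them in sorted order)
_, first, inv = np.unique(orig, axis=0, return_index=True, return_inverse=True)

# renumber the ids so that they follow the order of first appearance in orig,
# which is the order of the desired output ([[0, 1], [2, 3], ...])
order = np.argsort(first)
rank = np.empty_like(order)
rank[order] = np.arange(len(order))
labels = rank[inv.ravel()]          # labels[i] = output slot of row i of orig

# one slot per unique row of orig; rows never referenced by idxs stay []
padded = [idxs[labels[idxs] == k].tolist() for k in range(len(order))]

With the example data this should give exactly the sorted output above, and the slot of a missing row (e.g. 17) would stay [].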

@EelcoHoogendoorn (Owner)

1: Define "sorted". I suppose you mean sorting by the values in each split group; but what does that mean when splitting a multidimensional array in the first place?
2: I don't think there is a simpler solution than doing it yourself. If you look inside the implementation of npi.split, it inherits all its behavior from np.split.

I'll add that, as a rule of thumb, you really want to avoid using split, or jagged arrays in numpy generally. Unless the downstream processing steps truly demand this format, it is usually more elegant, and always more performant, to find a way to do your operations on the array as a whole. And it's almost always possible!
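As a rough illustration of that style with the data above (a sketch of mine, not something from the library docs): compute one integer group label per element up front, then express per-group work as flat vectorized operations instead of loops over split pieces:

import numpy as np

# one label per entry of idxs, identifying which unique row of orig[idxs] it belongs to
labels = np.unique(orig[idxs], axis=0, return_inverse=True)[1].ravel()

# per-group reductions then stay fully vectorized:
group_sizes = np.bincount(labels)             # how many entries each group has
tri_sums = np.bincount(labels, weights=tri)   # e.g. the per-group sum of tri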

@ttsesm (Author) commented Jul 1, 2020

  1. No, I mean sorting the values as a list. When I create the grouping with group_by and use split(), I get:
>> print(eq.split(idxs))
[array([0, 1]), array([20, 21]), array([8, 9]), array([14, 15]), array([2, 3]), array([17]), array([12, 13]), array([18]), array([4, 5]), array([7]), array([10, 11])]

So, as you see, the values are sorted within each group, but the groups themselves are not sorted within the list. To sort them as a list I need an extra step with a loop:

>> print(sorted([l.tolist() for l in eq.split(idxs)]))
[[0, 1], [2, 3], [4, 5], [7], [8, 9], [10, 11], [12, 13], [14, 15], [17], [18], [20, 21]]

So the question was whether I could avoid this extra step.

  2. Sure, avoiding split() might be more proper, but I am not sure how to do it without it; that is why I am trying to split the problem into steps. Actually, as you correctly guessed, this is an intermediate step towards what I finally want to achieve. I guess it would be more intuitive to provide the whole pipeline.

Imagine that I have orig as follows:

orig = np.array([[28021.22333333,  6585.53333333,     0. ],
 [28021.22333333,  6585.53333333,     0.        ],
 [28021.22333333,  6585.53333333,     0.        ],
 [26723.52333333,  6587.48666667,     0.        ],
 [26723.52333333,  6587.48666667,     0.        ],
 [26723.52333333,  6587.48666667,     0.        ],
 [26063.11,       13089.56,           0.        ],
 [26063.11,       13089.56,           0.        ],
 [26063.11,       13089.56,           0.        ],
 [27424.91,       13091.4,            0.        ],
 [27424.91,       13091.4,            0.        ],
 [27424.91,       13091.4,            0.        ],
 [28833.60333333, 12641.65333333,     0.        ],
 [28833.60333333, 12641.65333333,     0.        ],
 [28833.60333333, 12641.65333333,     0.        ],
 [26125.33,        7954.18166667,     0.        ],
 [26125.33,        7954.18166667,     0.        ],
 [26125.33,        7954.18166667,     0.        ],
 [26121.29666667,  7956.72633333,     0.        ],
 [26121.29666667,  7956.72633333,     0.        ],
 [26121.29666667,  7956.72633333,     0.        ],
 [26116.26,        7957.80833333,     0.        ],
 [26116.26,        7957.80833333,     0.        ],
 [26116.26,        7957.80833333,     0.        ],
 [26110.98333333,  7957.263,          0.        ],
 [26110.98333333,  7957.263,          0.        ],
 [26110.98333333,  7957.263,          0.        ],
 [26106.27,        7955.17333333,     0.        ],
 [26106.27,        7955.17333333,     0.        ],
 [26106.27,        7955.17333333,     0.        ],
 [26102.84,        7951.85733333,     0.        ],
 [26102.84,        7951.85733333,     0.        ],
 [26102.84,        7951.85733333,     0.        ]])

Then I am creating a matrix m based on the number of unique rows in orig (imagine that each unique row could repeat n times in orig; here it is 3 times). So in this case, since I have 11 unique rows, m is an 11x11 array initialized with zeros:

dim = np.unique(orig, axis=0).shape[0]   # number of unique rows (11 here)
m = np.zeros((dim, dim))

Then I have idxs, which contains (already sorted) row indices of orig, but with some values missing. Something like:
idxs = np.array([ 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 20, 21, 22, 23, 27, 28, 29, 30, 31, 32])

and the corresponding tri array, whose values correspond to some of the unique-row indices of orig, or, if you prefer, to the indices of my square matrix m; these values can repeat:

tri = np.array([ 0, 0, 0, 1, 2, 2, 1, 1, 5, 6, 6, 7, 8, 9, 3, 4, 3, 10, 10, 10, 9, 8, 9, 7, 1, 3])

Now what I want to do is group the values in tri, bin (count) them per group, and put the resulting counts at the corresponding positions in the m matrix. For instance, the first group of tri is [0, 0, 0], so m[0, 0] should become 3.

The way I am doing it at the moment is the following:

First, I find the indices that are missing from idxs:

full_idxs = np.arange(orig.shape[0])                  # all row indices of orig
missing_idxs = full_idxs[~np.isin(full_idxs, idxs)]   # row indices not present in idxs
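For the example above this gives:

>> missing_idxs
array([ 6, 16, 17, 19, 24, 25, 26])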

Then I am labeling the missing indices in tri as -1:

# the subtraction converts positions in the full array into positions in the shorter tri
tri = np.insert(tri, missing_idxs - np.arange(len(missing_idxs)), -1)

Then I reshape it so that there is one row per unique row of orig:

tri = tri.reshape(dim, -1)
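For the example data above, this gives one row of three entries per group, with -1 marking the missing positions:

>> tri
array([[ 0,  0,  0],
       [ 1,  2,  2],
       [-1,  1,  1],
       [ 5,  6,  6],
       [ 7,  8,  9],
       [ 3, -1, -1],
       [ 4, -1,  3],
       [10, 10, 10],
       [-1, -1, -1],
       [ 9,  8,  9],
       [ 7,  1,  3]])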

and finally applying the row-wise binning and mapping:

rowidx, colidx = np.indices(tri.shape)                          # rowidx labels every entry with its group (row) index
(cols, rows), B = npi.count((tri.flatten(), rowidx.flatten()))  # count the unique (value, row) pairs

# drop the -1 placeholders that were introduced for the missing indices
negative_idxs = np.where(cols < 0)
cols = np.delete(cols, negative_idxs)
rows = np.delete(rows, negative_idxs)
B = np.delete(B, negative_idxs)

# assign values to the corresponding position in matrix `m`
m[rows, cols] = B
>> m
array([[3., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 2., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 2., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 2., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 3.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 2., 0.],
       [0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0.]])

The idea behind the procedure above was to see whether I could simplify some of these steps with group_by, instead of inserting the missing indices, which is costly considering that I would have to do this on arrays with a few million values.
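For instance, something along these lines is roughly what I am hoping to end up with: an untested plain-numpy sketch that fills the same m directly from the original idxs and tri (before the np.insert step), with the groups numbered in order of first appearance in orig so that they match the block order of the reshape above:

import numpy as np

# label each row of orig with its group id, numbered in order of first appearance
_, first, inv = np.unique(orig, axis=0, return_index=True, return_inverse=True)
order = np.argsort(first)
rank = np.empty_like(order)
rank[order] = np.arange(len(order))
row_labels = rank[inv.ravel()]

# count (group, tri value) pairs directly; no -1 placeholders needed
m2 = np.zeros((dim, dim))
np.add.at(m2, (row_labels[idxs], tri), 1)   # m2 should match the m built above

This would avoid the insert/reshape/delete steps entirely.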

Apologies for the long post.

@EelcoHoogendoorn (Owner)

Sorry; I gave it 10 minutes but I cannot figure out what it is you intend to achieve. Conceptually, what does the matrix m represent?

@ttsesm (Author) commented Jul 5, 2020

It is the per-row binning (counting) result of tri, after tri has been grouped based on orig and reshaped to a 2D matrix.
