
grouping considering existing and non existing indices #12

Open

ttsesm opened this issue Jun 30, 2020 · 4 comments

@ttsesm commented Jun 30, 2020

Hi guys, I am trying to use your library for my project but I am stuck.

I have a couple of numpy arrays:

orig = [[28021.22333333,  6585.53333333,     0. ],
 [28021.22333333,  6585.53333333,     0.        ],
 [26723.52333333,  6587.48666667,     0.        ],
 [26723.52333333,  6587.48666667,     0.        ],
 [26063.11,       13089.56,           0.        ],
 [26063.11,       13089.56,           0.        ],
 [27424.91,       13091.4,            0.        ],
 [27424.91,       13091.4,            0.        ],
 [28833.60333333, 12641.65333333,     0.        ],
 [28833.60333333, 12641.65333333,     0.        ],
 [26125.33,        7954.18166667,     0.        ],
 [26125.33,        7954.18166667,     0.        ],
 [26121.29666667,  7956.72633333,     0.        ],
 [26121.29666667,  7956.72633333,     0.        ],
 [26116.26,        7957.80833333,     0.        ],
 [26116.26,        7957.80833333,     0.        ],
 [26110.98333333,  7957.263,          0.        ],
 [26110.98333333,  7957.263,          0.        ],
 [26106.27,        7955.17333333,     0.        ],
 [26106.27,        7955.17333333,     0.        ],
 [26102.84,        7951.85733333,     0.        ],
 [26102.84,        7951.85733333,     0.        ]]

and

idxs = [ 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 20, 21]

tri = [731, 703, 703, 731, 731, 731, 731, 693, 673, 699, 689, 731, 727, 731, 731, 731, 731, 731, 730]

pnts = [[28035.61081192,  6657.82528209,  2800.  ],
 [27951.42292993,  6561.84728091,  2800.        ],
 [28076.63625815,  6536.92743701,  2800.        ],
 [28139.0775588,   6773.36600593,  2800.        ],
 [27990.76839321,  6805.17674429,  2800.        ],
 [27856.70943257,  6734.2138896,   2800.        ],
 [27799.62835447,  6593.68175023,  2800.        ],
 [27846.23402973,  6449.33687603,  2800.        ],
 [27974.71914494,  6368.71983786,  2800.        ],
 [28124.96408673,  6389.55224384,  2800.        ],
 [28226.66757706,  6502.08637967,  2800.        ],
 [28232.24142249,  6653.66627254,  2800.        ],
 [28382.4101748,  6673.10904354,  2800.        ],
 [28315.56368133,  6812.44564901,  2800.        ],
 [28197.8230677,   6912.54705367,  2800.        ],
 [28049.54675563,  6956.10481526,  2800.        ],
 [27896.37306654,  6935.58740108,  2800.        ],
 [27764.78712281,  6854.54245845,  2800.        ],
 [27677.54132953,  6726.98339422,  2800.        ]]

Initially I wanted to group the values in idxs, tri and pnts based on the values of idxs, which are indices into the rows of orig, so that entries referring to the same row of orig end up in the same group. For example, I would like to get:

idxs = [[0,1], [2,3], [4,5], [7], [8,9], [10,11], [12,13], [14,15], [17], [18], [20,21]]

tri = [[731, 703], [703, 731], [731, 731], [731], [693, 673], [699, 689], [731, 727], [731, 731], [731], [731], [731, 730]]

and

pnts = [[[28035.61081192,  6657.82528209,  2800.  ],
     [27951.42292993,  6561.84728091,  2800.        ]],
     [[28076.63625815,  6536.92743701,  2800.        ],
     [28139.0775588,   6773.36600593,  2800.        ]],
     [[27990.76839321,  6805.17674429,  2800.        ],
     [27856.70943257,  6734.2138896,   2800.        ]],
     [[27799.62835447,  6593.68175023,  2800.        ]],
     [[27846.23402973,  6449.33687603,  2800.        ],
     [27974.71914494,  6368.71983786,  2800.        ]],
     [[28124.96408673,  6389.55224384,  2800.        ],
     [28226.66757706,  6502.08637967,  2800.        ]],
     [[28232.24142249,  6653.66627254,  2800.        ],
     [28382.4101748,  6673.10904354,  2800.        ]],
     [[28315.56368133,  6812.44564901,  2800.        ],
     [28197.8230677,   6912.54705367,  2800.        ]],
     [[28049.54675563,  6956.10481526,  2800.        ]],
     [[27896.37306654,  6935.58740108,  2800.        ]],
     [[27764.78712281,  6854.54245845,  2800.        ],
     [27677.54132953,  6726.98339422,  2800.        ]]]

Also, imagine that in the end I would have to apply the same operation to corresponding matrices with a few million entries. In any case, I was able to group my output by applying the following:

import numpy_indexed as npi

# orig, idxs, tri and pnts are the numpy arrays listed above
eq = npi.group_by(orig[idxs])
print(eq.split(idxs))
print(eq.split(tri))
print(eq.split(pnts))

However, I have two questions:

  1. Is it possible to get the output sorted right away from eq.split(), without the need for an extra step?
  2. eq.split() groups the data based on the indices present in idxs. However, if one of the rows of orig had no index in idxs (the corresponding values in tri and pnts would also be missing), how could I put an empty list at the corresponding position, so that my output arrays keep one entry per unique row of orig?

e.g. if 17 were not in my initial idxs array, my output array should be: [[0, 1], [2, 3], [4, 5], [7], [8, 9], [10, 11], [12, 13], [14, 15], [], [18], [20, 21]]

In practice, I could get the non-existing indices with:

start_idxs = np.arange(orig.shape[0])                      # all row indices of orig
no_existing_idxs = start_idxs[~np.isin(start_idxs, idxs)]  # row indices missing from idxs

but I am not sure whether this would help somehow.
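For illustration, this is the kind of padding I have in mind, written as a plain-numpy sketch (untested; the helper names are my own). It builds one slot per unique row of orig, in order of first appearance, and leaves [] for rows that never occur in idxs:

import numpy as np

idxs = np.asarray(idxs)

# id of the unique row that each row of orig belongs to (np.unique numbers them in sorted order)
_, first, inv = np.unique(orig, axis=0, return_index=True, return_inverse=True)

# renumber the ids so that they follow the order of first appearance in orig,
# which is the order of the desired output ([[0, 1], [2, 3], ...])
order = np.argsort(first)
rank = np.empty_like(order)
rank[order] = np.arange(len(order))
labels = rank[inv.ravel()]          # labels[i] = output slot of row i of orig

# one slot per unique row of orig; rows never referenced by idxs stay []
padded = [idxs[labels[idxs] == k].tolist() for k in range(len(order))]

With the example data this should give exactly the sorted output above, and the slot of a missing row (e.g. 17) would stay [].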

@EelcoHoogendoorn (Owner)

1: Define "sorted". I suppose you mean sorting by the values in each split group; but what does that mean when splitting a multidimensional array in the first place?
2: I don't think there is a simpler solution than doing it yourself. If you look inside the implementation of npi.split, it inherits all its behavior from np.split.

I'll add that, as a rule of thumb, you really want to avoid using split, or jagged arrays in numpy generally. Unless the downstream processing steps truly demand this format, it is usually more elegant, and always more performant, to find a way to do your operations on the array as a whole. And it's almost always possible!
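As a rough illustration of that style with the data above (a sketch of mine, not something from the library docs): compute one integer group label per element up front, then express per-group work as flat vectorized operations instead of loops over split pieces:

import numpy as np

# one label per entry of idxs, identifying which unique row of orig[idxs] it belongs to
labels = np.unique(orig[idxs], axis=0, return_inverse=True)[1].ravel()

# per-group reductions then stay fully vectorized:
group_sizes = np.bincount(labels)             # how many entries each group has
tri_sums = np.bincount(labels, weights=tri)   # e.g. the per-group sum of tri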

@ttsesm (Author) commented Jul 1, 2020

  1. No, I mean sorting the values as a list. When I create the grouping with group_by and use split(), I get:
>> print(eq.split(idxs))
[array([0, 1]), array([20, 21]), array([8, 9]), array([14, 15]), array([2, 3]), array([17]), array([12, 13]), array([18]), array([4, 5]), array([7]), array([10, 11])]

So, as you see, the values are sorted within each group, but the groups themselves are not sorted within the list. To sort them as a list I need an extra step with a loop:

>> print(sorted([l.tolist() for l in eq.split(idxs)]))
[[0, 1], [2, 3], [4, 5], [7], [8, 9], [10, 11], [12, 13], [14, 15], [17], [18], [20, 21]]

So the question was whether I could avoid this extra step.

  2. Sure, avoiding split() might be more proper, but I am not sure how to do it without it; that is why I am trying to split the problem into steps. Actually, as you correctly guessed, this is an intermediate step towards what I finally want to achieve. I guess it would be more intuitive to provide the whole pipeline.

Imagine that I have orig as follows:

orig = np.array([[28021.22333333,  6585.53333333,     0. ],
 [28021.22333333,  6585.53333333,     0.        ],
 [28021.22333333,  6585.53333333,     0.        ],
 [26723.52333333,  6587.48666667,     0.        ],
 [26723.52333333,  6587.48666667,     0.        ],
 [26723.52333333,  6587.48666667,     0.        ],
 [26063.11,       13089.56,           0.        ],
 [26063.11,       13089.56,           0.        ],
 [26063.11,       13089.56,           0.        ],
 [27424.91,       13091.4,            0.        ],
 [27424.91,       13091.4,            0.        ],
 [27424.91,       13091.4,            0.        ],
 [28833.60333333, 12641.65333333,     0.        ],
 [28833.60333333, 12641.65333333,     0.        ],
 [28833.60333333, 12641.65333333,     0.        ],
 [26125.33,        7954.18166667,     0.        ],
 [26125.33,        7954.18166667,     0.        ],
 [26125.33,        7954.18166667,     0.        ],
 [26121.29666667,  7956.72633333,     0.        ],
 [26121.29666667,  7956.72633333,     0.        ],
 [26121.29666667,  7956.72633333,     0.        ],
 [26116.26,        7957.80833333,     0.        ],
 [26116.26,        7957.80833333,     0.        ],
 [26116.26,        7957.80833333,     0.        ],
 [26110.98333333,  7957.263,          0.        ],
 [26110.98333333,  7957.263,          0.        ],
 [26110.98333333,  7957.263,          0.        ],
 [26106.27,        7955.17333333,     0.        ],
 [26106.27,        7955.17333333,     0.        ],
 [26106.27,        7955.17333333,     0.        ],
 [26102.84,        7951.85733333,     0.        ],
 [26102.84,        7951.85733333,     0.        ],
 [26102.84,        7951.85733333,     0.        ]])

Then I am creating a matrix m based on the number of unique rows in orig (imagine that each unique row could repeat n times in orig; here it is 3 times). So in this case, since I have 11 unique rows, m is an 11x11 array initialized with zeros:

dim = np.unique(orig, axis=0).shape[0]   # number of unique rows (11 here)
m = np.zeros((dim, dim))

Then I have idxs, which contains (already sorted) row indices of orig, but with some values missing. Something like:
idxs = np.array([ 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 20, 21, 22, 23, 27, 28, 29, 30, 31, 32])

and the corresponding tri array, whose values correspond to some of the unique-row indices of orig, or, if you prefer, to the indices of my square matrix m; these values can repeat:

tri = np.array([ 0, 0, 0, 1, 2, 2, 1, 1, 5, 6, 6, 7, 8, 9, 3, 4, 3, 10, 10, 10, 9, 8, 9, 7, 1, 3])

Now what I want to do is group the values in tri, bin (count) them per group, and put the resulting counts at the corresponding positions in the m matrix. For instance, the first group of tri is [0, 0, 0], so m[0, 0] should become 3.

The way I am doing it at the moment is the following:

First, I find the indices that are missing from idxs:

full_idxs = np.arange(orig.shape[0])                  # all row indices of orig
missing_idxs = full_idxs[~np.isin(full_idxs, idxs)]   # row indices not present in idxs
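For the example above this gives:

>> missing_idxs
array([ 6, 16, 17, 19, 24, 25, 26])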

Then I am labeling the missing indices in tri as -1:

# the subtraction converts positions in the full array into positions in the shorter tri
tri = np.insert(tri, missing_idxs - np.arange(len(missing_idxs)), -1)

Then I reshape it so that there is one row per unique row of orig:

tri = tri.reshape(dim, -1)
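For the example data above, this gives one row of three entries per group, with -1 marking the missing positions:

>> tri
array([[ 0,  0,  0],
       [ 1,  2,  2],
       [-1,  1,  1],
       [ 5,  6,  6],
       [ 7,  8,  9],
       [ 3, -1, -1],
       [ 4, -1,  3],
       [10, 10, 10],
       [-1, -1, -1],
       [ 9,  8,  9],
       [ 7,  1,  3]])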

and finally applying the row-wise binning and mapping:

rowidx, colidx = np.indices(tri.shape)                          # rowidx labels every entry with its group (row) index
(cols, rows), B = npi.count((tri.flatten(), rowidx.flatten()))  # count the unique (value, row) pairs

# drop the -1 placeholders that were introduced for the missing indices
negative_idxs = np.where(cols < 0)
cols = np.delete(cols, negative_idxs)
rows = np.delete(rows, negative_idxs)
B = np.delete(B, negative_idxs)

# assign values to the corresponding position in matrix `m`
m[rows, cols] = B
>> m
array([[3., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 2., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 2., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 2., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 3.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 2., 0.],
       [0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0.]])

The idea behind the procedure above was to see whether I could simplify some of these steps with group_by, instead of inserting the missing indices, which is costly considering that I would have to do this on arrays with a few million values.
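For instance, something along these lines is roughly what I am hoping to end up with: an untested plain-numpy sketch that fills the same m directly from the original idxs and tri (before the np.insert step), with the groups numbered in order of first appearance in orig so that they match the block order of the reshape above:

import numpy as np

# label each row of orig with its group id, numbered in order of first appearance
_, first, inv = np.unique(orig, axis=0, return_index=True, return_inverse=True)
order = np.argsort(first)
rank = np.empty_like(order)
rank[order] = np.arange(len(order))
row_labels = rank[inv.ravel()]

# count (group, tri value) pairs directly; no -1 placeholders needed
m2 = np.zeros((dim, dim))
np.add.at(m2, (row_labels[idxs], tri), 1)   # m2 should match the m built above

This would avoid the insert/reshape/delete steps entirely.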

Apologies for the long post.

@EelcoHoogendoorn (Owner)

Sorry; I gave it 10 minutes but I cannot figure out what it is you intend to achieve. Conceptually, what does the matrix m represent?

@ttsesm (Author) commented Jul 5, 2020

It is the per-row binning (counting) result of tri, after tri has been grouped based on orig and reshaped to a 2D matrix.
