Add more tests #96

Merged

alkino merged 37 commits into master from add_tests on Jul 15, 2020
Conversation

@alkino (Member) commented Jun 17, 2020

Still to do:

  • do the same for Spikes
  • tests for ReportReader too
  • test inverted Selection
  • test data in python
  • add a function to get node_ids

@tomdele (Contributor) commented Jun 18, 2020

> For my part, I consider these implementation details, so any value is good since you asked for invalid ranges.

Not sure what you mean here.

@alkino (Member, Author) commented Jun 18, 2020

> > For my part, I consider these implementation details, so any value is good since you asked for invalid ranges.
>
> Not sure what you mean here.

I mean, do we really care about bad user inputs and add a lot of tests for them in the code, or is the user responsible for giving bad values?

@alkino (Member, Author) commented Jun 18, 2020

In the end, I have no idea what the output should be for those limit cases; users should define it, and I can adapt.

@tomdele (Contributor) commented Jun 18, 2020

> I mean, do we really care about bad user inputs and add a lot of tests for them in the code, or is the user responsible for giving bad values?

There are two things here, to me:

  • bad inputs from users: we must accept that the user is not necessarily an expert in coding, or sometimes has no idea what a good selection or time range is. We should help them find their errors directly through the code, usually via proper error throwing or warnings. This will save us a lot of painful support time in the end. You are somewhat protected in HPC, but on average I have maybe between 2 and 5 different scientists asking me questions every day about circuit / simulation reading (and we have already spent tons of hours testing and documenting everything in bluepy or bluepysnap and in most of the NSE products).
  • implicit behaviors: these need to be defined somewhere (usually in the doc). A user cannot find all the implicit behaviors on their own and be sure they are not bugs. How can someone know that there is no simulation with negative times, or that the times are swapped if t_start > t_stop? And these implicit behaviors are part of the API and should remain the same across versions (or the change should be mentioned in a changelog).

Having the tests ensures that the implicit behaviors stay the same when the implementation changes.
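
A minimal pytest sketch of such an implicit-behavior test, assuming the swapped-times behavior described above; the file name, population name, and keyword arguments are hypothetical:

import numpy.testing as npt
from libsonata import SomaReportReader  # reader class named elsewhere in this PR

def test_swapped_time_range():
    pop = SomaReportReader("somas.h5")["All"]  # hypothetical file / population
    # implicit behavior under discussion: t_start and t_stop are swapped
    # when t_start > t_stop, so both calls should return the same frame
    swapped = pop.get(tstart=0.2, tstop=0.1)
    normal = pop.get(tstart=0.1, tstop=0.2)
    npt.assert_allclose(swapped.data, normal.data)
    npt.assert_allclose(swapped.times, normal.times)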

@tomdele (Contributor) commented Jun 18, 2020

After all, I have no idea about the output for those limit cases, users should define it, I can adapt.

I put some of the limit-case behaviors here:
#84

For the rest, I think this is more about you choosing what seems reasonable or not. Then, if someone complains, you can change it.

@tomdele (Contributor) commented Jun 19, 2020

About the returned .data on the Python side, it is mandatory to return a numpy array. @WeinaJi found out that the DataFrame constructor is utterly slow with lists (back when we discussed this, I had only tried with numpy arrays):

# fake element-reader-like data:
import numpy as np
import pandas as pd

times = [1, 2]
ids = [(i, j) for i in range(600) for j in range(300)]   # 180000 elements
data = np.random.random((2, len(ids)))                   # 2*180000 values
data_list = data.tolist()


%timeit pd.DataFrame(data=data_list, columns=pd.MultiIndex.from_tuples(ids), index=times)
6.68 s ± 42.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pd.DataFrame(data=data, columns=pd.MultiIndex.from_tuples(ids), index=times)
21.7 ms ± 481 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Also, I tried the .ids output options with the same data:

ids = [(i, j) for i in range(600) for j in range(300)]
ids_array = np.array([[i, j] for i in range(600) for j in range(300)]).T

%timeit pd.MultiIndex.from_tuples(ids)                                                                                                                                                         
20.6 ms ± 64.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.MultiIndex.from_arrays(ids_array)                                                                                                                                                   
1.93 ms ± 6.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The difference is massive.

So, to summarize:

times = [1, 2]
ids = [(i, j) for i in range(600) for j in range(300)]
ids_array = np.array([[i, j] for i in range(600) for j in range(300)]).T
data = np.random.random((2, len(ids)))
data_list = data.tolist()

# current version
%timeit pd.DataFrame(data=data_list, columns=pd.MultiIndex.from_tuples(ids), index=times)
7.08 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# full numpy
%timeit pd.DataFrame(data=data, columns=pd.MultiIndex.from_arrays(ids_array), index=times)
2.39 ms ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So basically, if you can provide numpy arrays for the data, the ids, and the times, the performance gain on the Python / DataFrame side will be HUGE.
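
For illustration, a hypothetical helper putting that fast path together; the .data / .ids / .times attribute names follow this thread, but treating them as numpy arrays is the assumption this change is working toward:

import numpy as np
import pandas as pd

def frame_to_dataframe(frame):
    # frame.ids assumed to be an (N, 2) array of (node_id, element_id) pairs
    columns = pd.MultiIndex.from_arrays(np.asarray(frame.ids).T)
    # no per-element Python lists anywhere, so construction stays in the ms range
    return pd.DataFrame(data=np.asarray(frame.data),
                        columns=columns,
                        index=np.asarray(frame.times))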

Comment on lines 267 to 268:

double start = tstart.value_or(tstart_);
double stop = tstop.value_or(tstop_);

Contributor commented:

const ?

Contributor commented:

ping

@tomdele (Contributor) commented Jun 23, 2020

The changes are really nice!

Some comments:

  • it would be nice to have the same "optional" API for the spikes
  • missing tests similar to SomaReportReader limits for ElementReader
  • the selection does not raise with negative numbers or out-of-range ids
  • uint32_t --> ElementID
  • on the Python side, it would be nice to throw a LibsonataError instead of a generic RuntimeError
  • add tests to check whether get(node_ids=[1, 2]) != get(node_ids=[2, 1]). This is important for me to know whether the ids are always ordered or not; it matters when converting to a DataFrame (I need to know whether I should call .sort(axis=1) or not). See the sketch after this list.
  • some tests are disabled for the moment and should be enabled on the C++ side
  • in Python, tests using .data are needed. You can use numpy.testing.assert_allclose to compare the floats more easily.
  • having a function ReportReader<T>::Population::nodeIds() (node_ids() in Python) would also be useful. This function should just access population/mapping/node_ids and return a vector from the dataset (a numpy array in Python).
  • numpy outputs (but probably for another PR)
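
A minimal sketch of the ordering test suggested above; the file and population names are hypothetical, and the assertions encode the assumption that ids always come back sorted:

import numpy.testing as npt
from libsonata import ElementReportReader  # name assumed for the element reader

def test_get_is_order_independent():
    pop = ElementReportReader("elements.h5")["All"]  # hypothetical file / population
    a = pop.get(node_ids=[1, 2])
    b = pop.get(node_ids=[2, 1])
    # if the ids are always ordered, both requests yield identical frames
    npt.assert_array_equal(a.ids, b.ids)
    npt.assert_allclose(a.data, b.data)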

@alkino (Member, Author) commented Jun 23, 2020

First comment updated, trying to summarize everything asked for in this patch.

@alkino (Member, Author) commented Jun 24, 2020

@tomdele inverted Selection is already tested and throws.

@alkino (Member, Author) commented Jul 9, 2020

It works for me

@tomdele (Contributor) commented Jul 9, 2020

Actually, I have this:

In [1]: from libsonata import SpikeReader
In [2]: a = SpikeReader("spikes.h5")
In [3]: b = a["default"]
In [4]: b.get(node_ids=[])
Out[4]: [(2, 0.1), (0, 0.2), (1, 0.3), (2, 0.7), (0, 1.3)]

Maybe I messed up?

@tomdele (Contributor) commented Jul 9, 2020

Just add the same tests as the Frame version, and then we will be sure!
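
For instance, a sketch mirroring the Frame-version test; whether an empty selection should return no spikes is exactly the open question above, so the expected value here is an assumption:

from libsonata import SpikeReader

def test_spikes_empty_selection():
    pop = SpikeReader("spikes.h5")["default"]
    # assumption: an empty node_ids selection yields no spikes,
    # not the full list shown in the session above
    assert pop.get(node_ids=[]) == []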

tomdele previously approved these changes Jul 10, 2020
Blanco Alonso Jorge and others added 4 commits July 14, 2020 12:30

We finally solve the problem where the array data is stored somewhere else and we want to return only a numpy array wrapping it.

Co-authored-by: Fernando Pereira <[email protected]>
@alkino (Member, Author) commented Jul 15, 2020

So I think we can merge it with the fix from @ferdonline?

@tomdele (Contributor) commented Jul 15, 2020

Can you just add a simple test to check the double access to the data and the times, so we can't regress on this?
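
A minimal sketch of that regression test; the file and population names are hypothetical:

import numpy as np
import numpy.testing as npt
from libsonata import SomaReportReader

def test_data_and_times_double_access():
    frame = SomaReportReader("somas.h5")["All"].get()
    data_first = np.array(frame.data, copy=True)
    times_first = np.array(frame.times, copy=True)
    # a second access must still return the same values (no dangling buffer)
    npt.assert_allclose(frame.data, data_first)
    npt.assert_allclose(frame.times, times_first)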

@alkino (Member, Author) commented Jul 15, 2020

It's done.

@tomdele (Contributor) commented Jul 15, 2020

For the times, not the data. If you want to change the bindings one day for a matrix, we don't want to regress.

@alkino (Member, Author) commented Jul 15, 2020

@alkino (Member, Author) commented Jul 15, 2020

Ok, let's go

@tomdele (Contributor) commented Jul 15, 2020

I am good

@alkino alkino merged commit 5352f8e into master Jul 15, 2020
@alkino alkino deleted the add_tests branch July 15, 2020 16:06