Adam 64 #50
base: main
Conversation
…ing garbage results though. needs cleanup
I have an overall question on the scope. Do we want to eventually be able to do something like the following?
from precovery.main import precover
results = precover(..., algorithm="brute-force")
I personally think it would be useful to have the precovery algorithm options mapped to a common interface like that. How feasible is it to take what you have here and enable something like that? If we do have a generalized interface it becomes very easy to do a brute-force vs nelson-force (we need a name) comparison which is what we will want to do.
By extension, that would also mean enabling frame candidates with the brute-force code and probably re-organizing the code a bit to enable that level of generalization. This might put us in the realm of a higher-level design discussion. What do you think?
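For what it's worth, the dispatch could be as thin as the sketch below. Everything here is hypothetical — the private function names, the registry, and the `precover` signature are made up to illustrate the shape, not the current API:

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the two search implementations; the real
# functions would live in precovery.brute_force and the indexed search code.
def _precover_brute_force(orbits, db, tolerance):
    ...

def _precover_nelson(orbits, db, tolerance):
    ...

_ALGORITHMS: Dict[str, Callable] = {
    "brute-force": _precover_brute_force,
    "nelson": _precover_nelson,
}

def precover(orbits, db, tolerance=1.0, algorithm="nelson"):
    # Dispatch on the algorithm name so both searches share one entry point,
    # which makes a brute-force vs. nelson-force comparison a one-line change.
    try:
        search = _ALGORITHMS[algorithm]
    except KeyError:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    return search(orbits, db, tolerance)
```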
In the meantime, I think it's become even more important to get the unit test we built last week cleaned up so I will do that today and submit a PR. Ideally, before we merge this work we will want to run that test against the brute-force search as well.
I'm pretty excited about this. I'm looking forward to being able to delete large chunks of code in THOR and this is a huge step forward in that.
precovery/utils.py
Outdated
    return chunk_size


def _initWorker():
As the original author: this function should be deleted and any reference to it destroyed and purged. :)
I originally added this function so I could keyboard-interrupt long-running multiprocessed jobs. What you can do instead is launch a pool, catch the KeyboardInterrupt exception in the parent, and terminate the pool. Ignoring signals is too dangerous to be worth the convenience here, though, so we should remove this and pretend it never existed.
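The alternative pattern described above could look like this sketch (a hypothetical helper, not current precovery code): let workers receive SIGINT normally, catch the resulting KeyboardInterrupt in the parent, and tear the pool down there.

```python
import multiprocessing as mp

def run_pool(func, values, processes=2):
    # Hypothetical wrapper: map work across a pool without installing any
    # signal handlers in the workers.
    pool = mp.Pool(processes)
    try:
        return pool.map(func, values)
    except KeyboardInterrupt:
        # Ctrl-C arrived in the parent: terminate the workers instead of
        # leaving them orphaned, then re-raise so the caller sees the interrupt.
        pool.terminate()
        raise
    finally:
        pool.close()
        pool.join()
```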
I left some things in from the THOR implementation, but I'll pull this part out. The reasoning for why it was there makes total sense.
precovery/healpix_geom.py
Outdated
@@ -7,3 +7,13 @@ def radec_to_healpixel(
    ra: npt.NDArray[np.float64], dec: npt.NDArray[np.float64], nside: int
) -> npt.NDArray[np.int64]:
    return hp.ang2pix(nside, ra, dec, nest=True, lonlat=True).astype(np.int64)


def radec_to_healpixel_with_neighbors(
Nice! This is a very useful addition.
@ntellis Can we make this an optional keyword include_neighbors argument to radec_to_healpixel? I
precovery/brute_force.py
Outdated
    return attributions


def attributeObservations(
I'd suggest renaming this to attribute_observations. THOR inherited LSST's use of camel case, but ideally let's match the rest of the precovery code and go with underscores. The new version of THOR is headed that direction as well.
cc @akoumjian: function naming going forward might be a good style guide addition for us to adopt at some point.
Agreed with the casing. I'm in favor of sticking to underscore casing for Python, camelCase for JS. Good to know new THOR is moving that way
Yes, although rather immaterial, snake_case is the way to go.
precovery/residuals.py
Outdated
__all__ = ["calcResiduals", "calcSimpleResiduals", "calcProbabilisticResiduals"]


def calcResiduals(
Same renaming here. Let's go with underscores to make it all consistent.
precovery/utils.py
Outdated
    return


def _checkParallel(num_jobs, parallel_backend):
_checkParallel is another THOR artifact that is best dropped, I think. The use case here isn't substantial enough to merit keeping this function in its current form, since it would be another thing we have to unit test.
Instead, I'd assume we are going to be using Python's multiprocessing moving forward, remove the ray component, and then move the calculation of the number of cores to use into the attribute_observations function, or write a new function altogether that does the same.
I know we also do a "number of cores to use" calculation in the multiprocess indexing code, so there could be some use for a generalized utility function. For example, if num_workers=None then it defaults to using all available threads, or if a user passes an int then it uses that number of threads.
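A generalized helper along those lines could be as small as this sketch (the name `calc_num_workers` and the clamping behavior are assumptions, not existing precovery code):

```python
import multiprocessing as mp
from typing import Optional

def calc_num_workers(num_workers: Optional[int] = None) -> int:
    # Hypothetical shared utility: None means "use every available core";
    # an explicit int is clamped to at least 1 worker.
    if num_workers is None:
        return mp.cpu_count()
    return max(1, int(num_workers))
```

Both the brute-force attribution code and the multiprocess indexing code could then call the same function instead of each doing its own core count.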
Good point - I think for this implementation in particular we can just stick with multiprocessing. I had taken out references to the other backends in attribute_observations, but realize they were left in in this function. Will reorganize as you suggest.
This does differ from nelson-force in one respect: we really want single machines with many cores, because of the constraint of pulling data out of the indexed db. It doesn't scale horizontally as well for that reason, unless broken up in time.
precovery/brute_force.py
Outdated
    ephemeris = pd.concat(ephemeris_dfs, ignore_index=True)
    unique_frames = pd.concat(frame_dfs, ignore_index=True).drop_duplicates()

    # This filter is very inefficient - there is surely a better way to do this
To make this more efficient I'd delegate this to SQL altogether and add another convenience function. This PR already adds some very useful convenience functions to the FrameIndex class. Maybe adding one that returns all frames with a specific healpixel, mjd, and obscode is another one that will be useful.
Actually, I'd imagine that a function like this is necessary, especially in the context of the HEALPix down/upsampling that Tanay's been working on. We will want to be able to query for all frames that correspond to one HEALPixel value at a given mjd and obscode, so that we can pull those frames in and then check them for matches. In the context of resampling nside, you could imagine adding an nside_search parameter to this convenience function. If nside_search is not equal to the nside at indexing time, then you resample pixels and return the correct frames for the given nside_search. Of course, this is out of scope for this PR, but having part of this function already included might be super useful.
Please feel free to push back here given you have more knowledge of what Tanay's code looks like.
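A sketch of what delegating the filter to SQL might look like (the schema, column names, and function name here are made up for illustration; the real FrameIndex goes through SQLAlchemy, while this uses the stdlib sqlite3 module just to show the query shape):

```python
import sqlite3

# Toy frames table standing in for the real FrameIndex schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE frames (obscode TEXT, mjd REAL, healpixel INTEGER)")
con.executemany(
    "INSERT INTO frames VALUES (?, ?, ?)",
    [("I41", 59000.1, 42), ("I41", 59000.1, 43), ("W84", 59000.1, 42)],
)

def frames_for_healpixels(con, obscode, mjd, healpixels):
    # Let the database do the filtering instead of a pandas merge in memory.
    placeholders = ",".join("?" * len(healpixels))
    query = (
        "SELECT obscode, mjd, healpixel FROM frames "
        f"WHERE obscode = ? AND mjd = ? AND healpixel IN ({placeholders})"
    )
    return con.execute(query, (obscode, mjd, *healpixels)).fetchall()
```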
I think your sense here is correct. I was hesitant to add too much bloat to the FrameIndex class in the form of functions convenient only for brute force. But I agree that this is a case where that function could have utility outside of brute force as well.
I'll see what could meet the requirements of both the HEALPix downsampling and this, and rewrite.
Seconded on pushing to SQL. Let the database handle filtering heuristics.
precovery/brute_force.py
Outdated
[
    pd.DataFrame.from_dict(
        {
            "mjd_utc": [x[0] for y in range(9)],
What is range(9) doing here exactly? Might be worth adding a comment since it's a hard-coded list length.
Good catch, this is unclear. A healpixel plus its neighbors is a set of 9 healpixels.
""" | ||
|
||
# Find all the mjd we need to propagate to | ||
all_frame_mjd = precovery_db.frames.idx.unique_frame_times() |
You've probably already thought about this but you can solve your comment on line 54 by changing this function to return tuples (or similar) of obscode and mjds for that obscode. Then you can remove the obscode kwarg altogether.
Yep - in my local branch now, will push a fix.
This does break down a little with per-obs MJDs, though.
Ah yes, excellent point. We might be able to get around that later on by loading in the frames and then adding a check: if frame.mjd != obs.mjd, then propagate to the correct epoch. Not sure how we would generalize it so that nelson-force and brute-force can share the same logic and code.
Thanks for the detailed review @moeyensj. Will incorporate and update. I think you're absolutely right that we will ultimately want to be able to call precovery with a method as you describe, probably written to default to "nelson-force" so as not to break anything current.
One limitation that makes just swapping out the mode difficult is that the brute-force method is a huge memory hog. Doing the propagation in advance and only pulling intersecting and neighboring frames helps. It's also not nearly as bad at higher nside (pulling fewer observations per ephemeris), but with many orbits over 7 years, the algorithm as written could still demand loading some tens of GB of observations into memory. Just something to be aware of. This could be avoided by setting a max memory usage and batch processing, but that isn't written in as of now - an optimization for the future. Then again, we may not be running this algorithm at huge scale repeatedly - I think mostly as a consistency check for the nelson method - so this may not be an issue.
precovery/brute_force.py
Outdated
    attribution_dfs = []

    # prepare ephemeris and observation dictionaries
    ephemeris, observations = get_observations_and_ephemerides_for_orbits(
I may be completely misunderstanding this, so I apologize if I'm off.
Can this function be broken up into separate functions? The process seems to be:
- orbits -> ephemerides
- ephemerides -> all unique intersecting frames + neighbors
- frames -> get all observations
The reason being that I believe most of these are re-usable and consistent and may simply require parameters to adjust results based on brute-force vs. nelson-force, etc. Here is what I am thinking this looks like alternatively,
frame_times = get_frame_times_in_range(mjd_start, mjd_end)
ephemerides = ephemerides_from_orbits(orbits, epochs, obscode=obscode)
frames_to_search = intersecting_frames(ephemerides, neighbors=True) # Default is False
observations = precovery_db.extract_observations_by_frames(frames_to_search)
I think each of these steps is usable in multiple cases, especially with using optional parameters.
This makes each individual piece more testable and gives back simpler return values.
for frame in frames:
    inc_arr = np.empty((0, 5), float)
    obs_ids = np.empty((0, 1), object)
    for obs in self.frames.iterate_observations(frame):
Can you summarize what's going on with this loop?
…orce to use smaller, testable functions. still todo: add obscode filter to get_frame_times
…tending PrecoveryCandidate, as well as a sorted df of the same.
…g, extends test_precovery_index to run on both algorithms
The brute force algorithm has now been refactored to return PrecoveryCandidates (or a subclass with the probabilistic extras included). However, it now breaks on per-obs MJDs, as prophesied. As written, the algorithm:
This breaks down because the calculated ephemerides no longer match the observation MJDs. Another change is needed: propagate from the "frame midpoint" to the MJD of each observation. This implementation is tending towards nelson-force as time goes on.
for ephemeris_visit, observations_visit in zip(
    ephemeris_visits, observations_visits
):
    assert len(ephemeris_visit["mjd_utc"].unique()) == 1
    assert len(observations_visit["mjd_utc"].unique()) == 1
    assert (
        observations_visit["mjd_utc"].unique()[0]
        == ephemeris_visit["mjd_utc"].unique()[0]
    )
This is where a major refactor takes place in the case of per-obs MJDs. Not sure the BallTree implementation makes sense in the context where each obs has its own coords_predicted_latlon. Not much of a tree in that case.