Add future work documentation #87

Open
wants to merge 34 commits into base: main

Conversation


@nielsleadholm nielsleadholm commented Dec 2, 2024

Adds descriptions for many of the outstanding "future-work" areas of our documentation. This includes a few new sections:

  • simple-cross-modal-policy
  • improve-handling-of-symmetry
  • reuse-hypothesis-testing-points
  • support-scale-invariance

I've also added these to the overview document.

"Bottom-up distant agent policies" was removed because it was a duplicate, while a few others appear "removed" because of changes to their names/fixes to typos.

A few of the ones that I haven't done, as I was a bit unsure what we had previously discussed, are:

  • Improve Bounded Evidence Performance
  • Make it Possible to Store Multiple Feature Maps on one Graph
  • Less Dependency on First Observation
  • Deal with Incomplete Models

@scottcanoe and @hlee9212 tagging you for any thoughts you want to add and as this also gives an overview of some of the things we can work on soon.

@tristanls tristanls added the documentation (Improvements or additions to documentation) and triaged (This issue or pull request was triaged) labels on Dec 2, 2024
@@ -1,3 +1,5 @@
---
title: Use Models with Less Points
Contributor

I hate to be that guy, but "Fewer" is also an option here.

Contributor Author

Haha fair point, thanks

Contributor

@scottcanoe scottcanoe left a comment

Looks great, I'm excited to get to work on these!

Contributor

@vkakerbeck vkakerbeck left a comment

Very nice! Thanks a lot for outlining all those :)
I left a bunch of detailed comments, some are definitely open to discussion. I just thought this would be a good place to have those discussions and nail down what exactly we mean by these items.
Overall, reading all these made me excited to jump into research again!

@@ -1,4 +1,9 @@
---
title: Add Associative Connections
Contributor

For this one I was actually thinking of associative connections like between the vision model of a car and the sound a car makes, and the word "car", etc. I was thinking these would be analogous to lateral voting connections. What you describe here would go under "Add Top-Down Connections".

Contributor Author

Ah ok that makes sense, I was confused by the potential duplication (I think because I focused on the term "hierarchy" in the cmp-hierarchy grouping).

With that cleared up, I wonder if it's a bit of a duplication of "Generalize Voting to Associative Connections" --> my temptation would be to keep that one, and add the point that this should enable associating e.g. sound objects with physical objects (i.e. where their models may not both be 3D), and get rid of "Add Associative Connections" under cmp-hierarchy. What do you think @vkakerbeck ?

Contributor

Wow yes! That only now clicked for me that those two are basically the same. It's kind of cool that we can solve both of these with the same solution. I think I had added this one under hierarchy because the first time I thought about these was in the context of modeling language and grounding it in physical models of objects. But I think we should just remove this one and expand on the one under voting like you suggest. Maybe add the "abstract" or "num_steps" label to it.

Contributor Author

Ok nice, yeah sounds good!

@@ -1,3 +1,5 @@
---
title: Add Top-Down Connections
---

One of the main roles of top-down connections is the associative recall and prediction outlined in [Associative Connections](add-associative-connections.md). However, top-down projections can also support decomposing goal-states into specific sub-goals, as discussed in [Decomposing Goal States](../motor-system-improvements/decompose-goals-into-subgoals-communicate.md).
Contributor

Related to the comment above, I would use the description you wrote out for the previous topic here. I wouldn't think of the goal states as the top-down connections. Those belong in the motor section, specifically "Decompose Goals into Subgoals & Communicate"


As we introduce hierarchy and leverage more unsupervised learning, representations will emerge at different levels of the system that may not correspond to any labels present in our datasets. For example, handles, or the head of a spoon, may emerge as object-representations in low-level LMs, even though the dataset only recognizes labels like "mug" and "spoon".

One approach to measure the "correctness" of representations in this setting might be how well a predicted representation aligns with the outside world. For example, while LMs are not designed to be used as generative models, we could visualize how well an inferred object graph maps onto the object actually present in the world. Quantifying such alignment might leverage measures such as differences in point-clouds. This would provide some evidence of how well the learned decomposition of objects corresponds to the actual objects present in the world.
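
As a rough illustration of the point-cloud comparison alluded to above, a symmetric Chamfer-style distance between the points of an inferred object graph and points sampled from the ground-truth object could be computed along these lines (a minimal numpy sketch; the function and input arrays are illustrative, not code from the Monty repository):

```python
import numpy as np

def chamfer_distance(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds of shape (N, 3) and (M, 3).

    For every point in one cloud, find the nearest point in the other cloud,
    then average those nearest-neighbor distances in both directions. Lower
    values mean the inferred graph aligns better with the real object.
    """
    # Pairwise Euclidean distances, shape (N, M).
    dists = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean() + dists.min(axis=0).mean())

# Hypothetical inputs: points from an inferred object graph vs. points
# sampled from the object actually present in the world.
inferred_points = np.random.rand(200, 3)
ground_truth_points = np.random.rand(500, 3)
print(chamfer_distance(inferred_points, ground_truth_points))
```

Other set-to-set distances (e.g. earth mover's distance) would serve the same purpose; the aim is only a quantitative proxy for how well the learned decomposition maps onto the object actually present in the world.
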
Contributor

I'm not sure we actually need to measure this. If we model and recognize compositional objects I would assume that just the outputs of the highest level LMs would be enough to judge how well the system does on those compositional datasets. Maybe we would want to measure additional things like number of graphs learned at lower levels etc (which we already do). We can leave it here as an additional suggestion but I think when we start taking a crack at the compositional dataset this wouldn't be the first thing I would start with.
Another point would be that in our compositional dataset we know what the sub-objects are (forks, knives, spoons, ...) and we know the compositional objects (set dinner table, ...). Somehow we want the system to learn these. That's what I meant by "Figure out supervision". So for instance, should we show the sub-objects first and give labels for those to all LMs, then show the compositional scenes and give labels to all? What is the desired outcome? Do we want lower-level LMs to learn rough models of the scenes? Do we want higher-level LMs to learn models of the cutlery as well? I would add a lot more around that in this section.

Contributor Author

Good point, I've added those items to the start.

@@ -1,3 +1,11 @@
---
title: Send Similarity Encoding Object ID to Next Level & Test
---

We have implemented the ability to encode object IDs using sparse-distributed representations (SDRs), and in particular can use this as a way of capturing similarity and dissimilarity between objects. Using such encodings in learned [Associative Connections](add-associative-connections.md), we should observe a degree of natural generalization when recognizing compositional objects.
Contributor

I'm not sure we are interpreting the term "associative connections" in Monty the same way. When I wrote that I meant associations between object IDs that co-occur (basically voting), not hierarchical connections. Since those are spatially a lot more constrained I wouldn't think of them the same way. Why would we need learned associative connections to see the effect of similarity encodings?

Contributor Author

Yeah I've changed this to Hierarchical Connections, per the earlier discussion.


For example, assume a Monty system learns a dinner table setting with normal cutlery and plates. Separately, the system learns about medieval instances of cutlery and plates, but never sees them arranged in a dinner table setting. Based on the similarity of the medieval cutlery objects to their modern counterparts, the objects should have considerable overlap in their SDR encodings.

If the system was to then see a medieval dinner table setting for the first time, it should be able to recognize the arrangement as a dinner-table setting with reasonable confidence, even if the constituent objects are somewhat different from those present when the compositional object was first learned.
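
To make the generalization argument concrete, SDR overlap is simply the number of shared active bits. A minimal sketch follows; the SDR size, sparsity, and object names are invented for illustration and are not Monty's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
SDR_SIZE, N_ACTIVE = 2048, 40  # example dimensions, not Monty's real configuration

def random_sdr() -> np.ndarray:
    """A random sparse binary vector standing in for an object ID encoding."""
    sdr = np.zeros(SDR_SIZE, dtype=np.uint8)
    sdr[rng.choice(SDR_SIZE, N_ACTIVE, replace=False)] = 1
    return sdr

def similar_sdr(base: np.ndarray, n_shared: int) -> np.ndarray:
    """An SDR sharing `n_shared` active bits with `base` (a 'similar' object)."""
    sdr = np.zeros(SDR_SIZE, dtype=np.uint8)
    sdr[rng.choice(np.flatnonzero(base), n_shared, replace=False)] = 1
    sdr[rng.choice(np.flatnonzero(base == 0), N_ACTIVE - n_shared, replace=False)] = 1
    return sdr

def overlap(a: np.ndarray, b: np.ndarray) -> int:
    """Similarity between two SDRs = number of shared active bits."""
    return int(np.sum(a & b))

modern_spoon = random_sdr()
medieval_spoon = similar_sdr(modern_spoon, n_shared=30)  # encoded as a similar object
unrelated_object = random_sdr()

print(overlap(modern_spoon, medieval_spoon))    # 30: high overlap
print(overlap(modern_spoon, unrelated_object))  # near 0 for sparse random SDRs
```

A higher-level LM that learned the dinner-table arrangement in terms of the modern cutlery encodings would therefore still receive substantially matching input bits when the medieval cutlery is recognized.
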
Contributor

Could be nice to include images of these two scenes here for better visualization

Contributor Author

Good point! Adding

@@ -1,3 +1,9 @@
---
title: Detect Local and Global Flow
---

Our general view is that there are two sources of flow processed by cortical columns. A larger receptive field sensor helps to estimate global flow, where flow here will be particularly pronounced if the whole object is moving, or the sensor itself is moving. A small receptive-field sensor patch corresponds to the channel by which the primary sensory features (e.g. point-normal, color) arrive. If flow is detected here, but not in the more global channel, then it is likely that just part of the object is moving.
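
As a toy illustration of how the two channels might be combined (how the flow vectors themselves are estimated is left abstract, and the function name and threshold are invented for this sketch rather than a proposed implementation):

```python
import numpy as np

def interpret_flow(local_flow: np.ndarray, global_flow: np.ndarray,
                   threshold: float = 0.1) -> str:
    """Toy rule combining flow from the small patch and the wide-field sensor.

    local_flow:  mean flow vector inside the small, feature-carrying patch.
    global_flow: mean flow vector over the large receptive field.
    """
    local_moving = np.linalg.norm(local_flow) > threshold
    global_moving = np.linalg.norm(global_flow) > threshold

    if local_moving and not global_moving:
        return "part of the object is moving"          # local but not global flow
    if local_moving and global_moving:
        return "whole object or the sensor is moving"  # ambiguous from flow alone
    return "no significant motion"

print(interpret_flow(np.array([0.4, 0.0]), np.array([0.01, 0.0])))
```

As the review comment below points out, the flow signal alone cannot separate "whole object moving" from "sensor moving", which is why that branch stays ambiguous here.
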
Contributor

I would make the distinction more clear:
local flow - object is moving
global flow - sensor is moving

We should also mention that these may not be detectable with the same sensor (a small patch can't distinguish between object and sensor movement, since for the patch both would appear as global flow).

Contributor Author

Ok yeah that's what I was trying to get at (the uncertainty depending on the size), will try rewording it.


In the short term, we would like to extract richer features, such as using HTM's spatial-pooler or Local Binary Patterns for visual features, or processing depth information within a patch to approximate tactile texture.
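
For instance, a basic 8-neighbor Local Binary Pattern code for a grayscale patch can be computed in a few lines. This is only a sketch; the patch size and the histogram-as-feature choice are illustrative assumptions, not a committed design:

```python
import numpy as np

def lbp_codes(patch: np.ndarray) -> np.ndarray:
    """Basic 8-neighbor Local Binary Pattern codes for a 2D grayscale patch.

    Each interior pixel gets an 8-bit code with one bit per neighbor, set when
    that neighbor is at least as bright as the center pixel.
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]  # clockwise from top-left
    h, w = patch.shape
    center = patch[1:-1, 1:-1]
    codes = np.zeros(center.shape, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = patch[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes = codes + ((neighbor >= center).astype(np.int32) << bit)
    return codes

patch = np.random.rand(16, 16)  # stand-in for the intensity channel of a small patch
texture_descriptor = np.bincount(lbp_codes(patch).ravel(), minlength=256)
```

Note that this plain variant is not rotation invariant; rotation-invariant LBP variants map each code to the minimum over its circular bit rotations, which matters for the point raised in the comment below about viewing the same location from different angles.
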

In the longer term, given the "sub-cortical" nature of this sensory processing, we might also consider neural-network-based feature extraction, such as shallow convolutional neural networks; however, please see [our FAQ on why Monty does not currently use deep learning](../../how-monty-works/faq-monty.md#why-does-monty-not-make-use-of-deep-learning).
Contributor

It would also be worth mentioning that extracted features should be rotation invariant. So if we look at the same location on an object from different angles, the extracted feature should be the same. This is not a given with neural networks or many other approaches.


However, a more general formulation might be to use displacements as the core spatial information in the CMP, such that a specific location (in body-centric coordinates or otherwise) is never communicated outside of an LM or sensor module.

Such an approach might align well with adding information about flow (see [Detect Local and Global Flow](../sensor-module-improvements/detect-local-and-global-flow.md)), modeling moving objects (see [Deal With Moving Objects](../learning-module-improvements/deal-with-moving-objects.md)), and supporting abstract movements like the transition from grandchild to grandparent.
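
A minimal sketch of what this could mean at the sensor-module boundary is shown below. The class name and message format are invented for illustration, and the sketch deliberately sidesteps the voting and motor-command difficulties raised in the review comment that follows:

```python
import numpy as np

class DisplacementOnlySensorModule:
    """Sketch of a sensor module that never emits an absolute location.

    Only the displacement since the previous observation (plus features) is
    placed on the outgoing message; the absolute pose stays private to the
    module. Class name and message format are hypothetical.
    """

    def __init__(self):
        self._last_location = None  # internal state, never communicated

    def observe(self, location: np.ndarray, features: dict):
        if self._last_location is None:
            self._last_location = location
            return None  # nothing to send until we have two observations
        displacement = location - self._last_location
        self._last_location = location
        return {"displacement": displacement, "features": features}

sm = DisplacementOnlySensorModule()
sm.observe(np.array([0.0, 0.0, 0.0]), {"hsv": (0.1, 0.8, 0.9)})
msg = sm.observe(np.array([0.01, 0.0, 0.02]), {"hsv": (0.1, 0.8, 0.9)})
print(msg["displacement"])  # only relative movement leaves the module
```
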
Contributor

This approach will be very tricky since, if we don't get the location of the sensor relative to the body, it is almost impossible to output anything in a common reference frame and therefore vote or send other outputs. We could do basic voting on object ID and rely on colocation of receptive fields in the hierarchy, but motor commands will also be pretty much impossible this way. We could mention that this could allude to the difference between the where and what pathway, but also that this is not something we plan to implement, merely a possibility we want to investigate further.

Currently, voting relies on all learning modules sharing the same object ID for any given object, as a form of supervised learning signal. Thanks to this, they can vote on this particular ID when communicating with one another.

However, in the setting of unsupervised learning, the object ID that is associated with any given model is unique to the parent LM. As such, we need to organically learn the mapping between the object IDs that occur together across different LMs, such that voting can function without any supervised learning signal.
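
One simple way to learn such a mapping organically would be to count which locally assigned IDs co-occur whenever two LMs are sensing the same object, and to translate IDs through those counts at voting time. The sketch below is purely illustrative; the class and the ID naming scheme are invented and not part of the Monty codebase:

```python
from collections import defaultdict

class IdAssociation:
    """Sketch of learning which locally assigned object IDs co-occur across two LMs.

    Whenever both LMs converge while sensing the same object, the pair of IDs
    is counted. At voting time, an LM can translate the other LM's ID through
    the learned counts instead of relying on a shared, supervised label.
    """

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def record_cooccurrence(self, id_in_lm_a: str, id_in_lm_b: str) -> None:
        self.counts[id_in_lm_a][id_in_lm_b] += 1

    def translate(self, id_in_lm_a: str):
        """Most frequently co-occurring ID in LM B for a given ID in LM A."""
        candidates = self.counts.get(id_in_lm_a)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

assoc = IdAssociation()
for _ in range(5):
    assoc.record_cooccurrence("lm_a/object_3", "lm_b/object_17")
assoc.record_cooccurrence("lm_a/object_3", "lm_b/object_2")  # one noisy episode
print(assoc.translate("lm_a/object_3"))  # -> "lm_b/object_17"
```
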
Contributor

You could also mention that the brain has to do the same thing. It could not use a globally consistent SDR representation for each object. The neurons just do associative learning and a cortical column has no idea what the incoming spikes mean.

@vkakerbeck
Contributor

Oh, lastly, it would be good if you can update this sheet https://docs.google.com/spreadsheets/d/10b0FR9YdFYqfhIiGMpZsjmN2OAbNAjp4m_hLBCV161I/edit?gid=0#gid=0 with the topics you added/renamed so they match exactly

Just jotting down here that I'm interested in this direction. :)

This always reminded me of the problem of multi-label classification (https://paperswithcode.com/task/multi-label-classification). It might be worth looking into some off-the-shelf model that can attach multiple labels, or even multiple attributes / affordances.

Just wondering, why do we want fewer points? As in, is it for speed purposes only?

I'm asking because I think Viviane said that we chose to use graphs to model objects, but we don't necessarily believe it is how the brain stores them.

A thought I had around this was modeling objects as a set of representation vectors of sub-objects, like:

cup = {vector for handle, vector for body}
spoon = {vector for grabbing part, vector for curved part}
fork = {vector for grabbing part (identical to spoon), vector for curved part}

But if this has a related goal that we are trying to do, I'm all down. :)

Contributor

Do you mean like we will do when we have hierarchy? Basically in the higher-level LM, the objects modeled in the lower-level LM become features in that graph? So the switch, light bulb and lamp shade modeled in the LLLM become features in the lamp model in the HLLM?
One thing I just want to make sure you are not forgetting (you probably aren't but since you didn't mention it in your comment) is that we always model features AT LOCATIONS. And the relative locations of features are actually more important than the features themselves (think of how you can easily recognize a face made of fruits but wouldn't call a random assortment of noses and eyes a face). We always use reference frames, not just a bag of features.

As for using fewer points: Yes, this would be an efficiency and generalization point. At the very beginning we would store every point in our models, which would give us perfect accuracy but was slow. So over time we figured out several ways to use fewer points (feature change SM, graph_delta_thresholds, ...). These give us significant efficiency gains with little loss in accuracy, and there are still a few more ways we can make our models more efficient by using even fewer points. This also relates a bit to hierarchy, since in the larger objects composed of sub-objects we would likely want to store fewer points.
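
For intuition, the simplest form of this kind of sparsification keeps a new point only if it is sufficiently far from every point already stored. The actual mechanisms mentioned above (feature-change SM, graph_delta_thresholds) are richer than this, so the snippet below is only a toy stand-in:

```python
import numpy as np

def sparsify_points(points: np.ndarray, min_distance: float) -> np.ndarray:
    """Greedy subsampling: keep a point only if it is at least `min_distance`
    away from every point that has already been kept."""
    kept = [points[0]]
    for p in points[1:]:
        if min(np.linalg.norm(p - q) for q in kept) >= min_distance:
            kept.append(p)
    return np.array(kept)

dense = np.random.rand(2000, 3)        # every observed location on an object
sparse = sparsify_points(dense, 0.05)  # far fewer points, similar coverage
print(len(dense), "->", len(sparse))
```
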

Wish I could give a 👍 like we are requesting features, haha. 🤣 This one would be very useful!

@hlee9212 hlee9212 left a comment

Thanks so much Niels! I'm really amazed at how many explanations you've added across all the future directions, it's mind-boggling. 🤯

LGTM on PR. Will the next step be trying to prioritize these somehow?

@vkakerbeck
Contributor

We have all of them listed out here https://docs.google.com/spreadsheets/d/10b0FR9YdFYqfhIiGMpZsjmN2OAbNAjp4m_hLBCV161I/edit?gid=0#gid=0 and some are already grouped into our next milestones (below the table). We will likely create 1-2 new, intermediate milestones to prepare for the Heterarchy pt. 2 milestone (and eventually hierarchical goal policies).

@nielsleadholm
Contributor Author

Thanks for the helpful comments @vkakerbeck and @hlee9212 ! Those should all be addressed now but let me know if there are any further changes you want. I'll also now double check the overview spreadsheet and make sure all the cells are updated to match here.

@nielsleadholm
Contributor Author

@vkakerbeck I've also updated some of the hashtags where I felt some were missing.

@nielsleadholm
Contributor Author

Lastly I've gone through and made sure the names for future-work sections are consistent across the individual articles, the header .md files, and the overview sheet.
