A TensorFlow implementation of the models described in Unsupervised Learning for Physical Interaction through Video Prediction (Finn et al., 2016).
This video prediction model, which is optionally conditioned on actions, predicts future video by internally predicting how to transform the last image (which may itself have been predicted) into the next image. As a result, it can reuse appearance information from previous frames and can better generalize to objects not seen in the training set. Some example predictions on novel objects are shown below:
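To make the idea of "transforming the last image" concrete, here is a toy 1D sketch in plain Python. It is not the code in this repository: the function names, the 1D "frames", and the hand-written kernels and mask logits are all assumptions made for illustration. In the real model, a network predicts the kernels and masks from the input frames. The sketch applies each kernel to the previous frame and then blends the results with per-pixel softmax masks.

```python
import math

def apply_kernel(frame, kernel):
    """Apply a normalized kernel to a 1D frame (zero padding at the edges)."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(frame)):
        total = 0.0
        for j, weight in enumerate(kernel):
            src = i + j - radius
            if 0 <= src < len(frame):
                total += weight * frame[src]
        out.append(total)
    return out

def softmax(values):
    """Numerically stable softmax over a short list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next(frame, kernels, mask_logits):
    """Blend the transformed copies of `frame` using per-pixel masks."""
    candidates = [apply_kernel(frame, k) for k in kernels]
    next_frame = []
    for i in range(len(frame)):
        weights = softmax([logits[i] for logits in mask_logits])
        next_frame.append(sum(w * c[i] for w, c in zip(weights, candidates)))
    return next_frame

frame = [0.0, 1.0, 0.0, 0.0]
kernels = [
    [0.0, 1.0, 0.0],  # identity: keep each pixel where it is
    [1.0, 0.0, 0.0],  # copy from the left neighbour: shift right by one
]
# Hand-picked logits that strongly prefer the shifted copy at every pixel.
mask_logits = [[-20.0] * 4, [20.0] * 4]
next_frame = predict_next(frame, kernels, mask_logits)
print([round(v, 6) for v in next_frame])
```

Because the output is built from pixels of the previous frame, the bright pixel keeps its value while moving, which is the sense in which appearance information is reused.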
When the model is conditioned on actions, it changes its predictions based on the passed-in action. Here we show the model's predictions in response to varying the magnitude of the passed-in actions, from small to large:
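Because the model is trained with an L2 objective, it represents uncertainty as blur: when several futures are plausible, the prediction that minimizes the expected L2 error is their pixel-wise average.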
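A small worked example may help show why an L2 objective favors blurry predictions. The two "futures" below are made-up 1D frames, not data from the push dataset. The sketch compares the expected loss of committing to one sharp future with the expected loss of predicting the average of both.

```python
def l2_loss(prediction, target):
    """Sum of squared pixel differences."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target))

# Two equally likely futures: the object ends up at pixel 1 or at pixel 2.
future_a = [0.0, 1.0, 0.0, 0.0]
future_b = [0.0, 0.0, 1.0, 0.0]

def expected_loss(prediction):
    """Expected L2 loss when each future occurs with probability 0.5."""
    return 0.5 * l2_loss(prediction, future_a) + 0.5 * l2_loss(prediction, future_b)

sharp = future_a                                             # commit to one outcome
blurry = [(a + b) / 2 for a, b in zip(future_a, future_b)]   # average of both

print(expected_loss(sharp))   # 1.0
print(expected_loss(blurry))  # 0.5
```

The averaged frame has half the expected loss of the sharp guess, so a model trained this way is pushed toward spreading intensity over all plausible locations.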
- TensorFlow (see tensorflow.org for installation instructions)
- spatial_transformer model in tensorflow/models, for the spatial transformer predictor (STP).
The data used to train this model is located here.
To download the robot data, run the following.
./download_data.sh
To train the model, run the prediction_train.py file.
python prediction_train.py
Several flags control the model that is trained, as exemplified below:
# The flags are collected in a bash array so that each one can carry a comment.
flags=(
  --data_dir=push/push_train   # path to the training set
  --model=CDNA                 # the model type to use: DNA, CDNA, or STP
  --output_dir=./checkpoints   # where to save model checkpoints
  --event_log_dir=./summaries  # where to save training statistics
  --num_iterations=100000      # number of training iterations
  --pretrained_model=model     # path to model to initialize from, random if empty
  --sequence_length=10         # the total number of frames in a sequence
  --context_frames=2           # the number of ground truth frames to pass in at start
  --use_state=1                # whether or not to condition on actions and the initial state
  --num_masks=10               # the number of transformations and corresponding masks
  --schedsamp_k=900.0          # the constant used for scheduled sampling, or -1 for no scheduled sampling
  --train_val_split=0.95       # the fraction of the data used for training; the rest is used for validation
  --batch_size=32              # the training batch size
  --learning_rate=0.001        # the initial learning rate for the Adam optimizer
)
python prediction_train.py "${flags[@]}"
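The scheduled sampling constant is easier to interpret with numbers. The sketch below assumes an inverse-sigmoid decay schedule of the kind commonly used for scheduled sampling, and it assumes that -1 turns the schedule off. Both the formula and the function name `num_ground_truth` are illustrative assumptions, not a statement of what prediction_train.py does internally.

```python
import math

def num_ground_truth(batch_size, iteration, k):
    """Number of batch items that are fed the ground-truth frame (assumed schedule)."""
    if k == -1:
        return 0  # assumed meaning of -1: always feed back the model's own predictions
    # Inverse-sigmoid decay: close to batch_size early on, decaying towards 0.
    return int(round(batch_size * k / (k + math.exp(iteration / k))))

for iteration in (0, 2500, 5000, 10000, 20000):
    print(iteration, num_ground_truth(32, iteration, 900.0))
```

Under this assumption, a larger constant keeps ground-truth inputs in the batch for more iterations before the model is trained mostly on its own predictions.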
If the dynamic neural advection (DNA) model is being used, the --num_masks option should be set to one.
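The constraint above can be checked before training starts. This is a hypothetical helper, not part of the released scripts; the name `check_flags` and the error messages are assumptions.

```python
def check_flags(model, num_masks):
    """Validate the model type and mask count before launching training."""
    if model not in ("CDNA", "DNA", "STP"):
        raise ValueError("Unknown model type: %r" % (model,))
    if model == "DNA" and num_masks != 1:
        raise ValueError("The DNA model expects --num_masks=1, got %d." % num_masks)

check_flags("CDNA", 10)  # passes
check_flags("DNA", 1)    # passes
```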
The --context_frames option defines both the number of initial ground truth frames to pass in and when to start penalizing the model's predictions.
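One way to read this is sketched below, under the assumptions that frames are indexed from 0, that frame t is predicted from frame t - 1, and that scheduled sampling is ignored. The helper `frame_plan` is hypothetical and only tabulates which input each prediction uses and whether that prediction contributes to the loss.

```python
def frame_plan(sequence_length, context_frames):
    """Return (frame index, input source, penalized) for each predicted frame."""
    plan = []
    for t in range(1, sequence_length):
        # Frame t is predicted from frame t - 1; the first `context_frames`
        # inputs are ground truth, later inputs are the model's own outputs.
        source = "ground truth" if t - 1 < context_frames else "prediction"
        # Predictions of the context frames themselves are not penalized.
        penalized = t >= context_frames
        plan.append((t, source, penalized))
    return plan

for row in frame_plan(sequence_length=10, context_frames=2):
    print(row)
```

With --sequence_length=10 and --context_frames=2, this reading gives nine predicted frames, of which the last eight are penalized.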
The data directory --data_dir should contain tfrecord files with the format used in the released push dataset. See here for details. If the --use_state option is not set, then the data only needs to contain image sequences, not states and actions.
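The files in the data directory also have to be divided between training and validation. The sketch below assumes that --train_val_split is the fraction of files assigned to training and that the split is made on the file list in order. The file names are made up and the helper `split_files` is hypothetical.

```python
import math

def split_files(filenames, train_val_split):
    """Split a list of record files into (training, validation) lists."""
    index = int(math.floor(train_val_split * len(filenames)))
    return filenames[:index], filenames[index:]

# Hypothetical file names, for illustration only.
files = ["push_train_%03d.tfrecords" % i for i in range(20)]
train_files, val_files = split_files(files, 0.95)
print(len(train_files), len(val_files))
```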
To ask questions or report issues, please open an issue on the tensorflow/models issue tracker. Please assign issues to @cbfinn.
This code was written by Chelsea Finn.