The high-level description below is for the online setting. In the semi-online setting, the detections are first merged across a small clip. The first frame is always initialized with detection without propagation.
Text-prompted mode:
- DEVA propagates masks from memory to the current frame
- If this is a detection frame, go to the next step. Otherwise, no further processing is needed for this frame.
- Grounding DINO takes the text prompt and generates some bounding boxes
- Segment Anything takes the bounding boxes and generates corresponding segmentation masks
- The propagated masks are compared to and merged with the segmentation from Segment Anything
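The per-frame control flow described above can be sketched as follows. This is a minimal illustration, not DEVA's actual code: the function names are hypothetical, and the rule that a detection frame is any frame whose index is a multiple of `detection_every` is an assumption.

```python
def is_detection_frame(frame_index: int, detection_every: int) -> bool:
    # Assumption: detection runs on every `detection_every`-th frame, starting at frame 0.
    return frame_index % detection_every == 0


def plan_frames(num_frames: int, detection_every: int) -> list:
    """Return, for each frame, which pipeline stages would run (online setting)."""
    plan = []
    for ti in range(num_frames):
        if ti == 0:
            # The first frame is initialized with detection, without propagation.
            plan.append((ti, "detect"))
        elif is_detection_frame(ti, detection_every):
            # Propagate from memory, run the image model, then merge the two results.
            plan.append((ti, "propagate+detect+merge"))
        else:
            # Non-detection frame: propagation only.
            plan.append((ti, "propagate"))
    return plan


for frame, stages in plan_frames(num_frames=5, detection_every=2):
    print(frame, stages)
```

With `detection_every=2`, every other frame after the first pays for the image model; the remaining frames only run propagation.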
Automatic mode:
- DEVA propagates masks from memory to the current frame.
- If this is a detection frame, go to the next step. Otherwise, no further processing is needed for this frame.
- We generate a grid of points on the unsegmented regions.
- Segment Anything takes the points and generates corresponding segmentation masks.
- The propagated masks are compared to and merged with the segmentation from Segment Anything.
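The step "generate a grid of points on the unsegmented regions" can be illustrated with a small pure-Python sketch. The function name, the mask representation (a 2D list where `0` means unsegmented), and the choice of cell centers as grid points are all assumptions made for illustration; the real implementation may differ.

```python
def grid_points_on_unsegmented(mask, points_per_side):
    """Return (x, y) grid points that fall on unsegmented pixels.

    mask: 2D list of ints; 0 marks an unsegmented pixel (assumed convention).
    points_per_side: number of grid points along each image side.
    """
    height = len(mask)
    width = len(mask[0])
    points = []
    for i in range(points_per_side):
        # Place each point at the center of its grid cell.
        y = int((i + 0.5) * height / points_per_side)
        for j in range(points_per_side):
            x = int((j + 0.5) * width / points_per_side)
            if mask[y][x] == 0:
                points.append((x, y))
    return points


# Left half of a 4x4 frame is already covered by a propagated mask (id 1).
propagated = [[1, 1, 0, 0] for _ in range(4)]
print(grid_points_on_unsegmented(propagated, points_per_side=2))
```

Only the points that land on unsegmented pixels would be passed to Segment Anything, so frames that are already well covered by propagated masks need fewer queries.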
General Tips:
- Though innocent-looking, reading frames from disk, visualizing the output, and encoding the output as videos can be slow, especially at high resolutions. The script version runs faster than the Gradio version because it uses threaded I/O.
- Specifying `--amp` (automatic mixed precision) makes things run faster on most modern GPUs.
- In general, text-prompted inference is faster and more robust than "automatic" inference.
- To speed up the actual processing, we need to speed up either the image model or the propagation model.
Speeding up the image model:
- The most efficient way is to use the image model less often. This can be achieved by:
  - Using `online` instead of `semionline`, or
  - Increasing `detection_every`.
- Use a faster image model. For example, Mobile-SAM is faster than SAM. Grounded-Segment-Anything (text-prompt) is faster than automatic SAM. In automatic mode, you can reduce the number of prompting points (`SAM_NUM_POINTS_PER_SIDE`) to reduce the number of queries to SAM.
- In automatic mode, increasing `SAM_NUM_POINTS_PER_BATCH` improves parallelism.
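To get a feel for how these settings interact, here is a rough back-of-the-envelope sketch for automatic mode. It assumes detection runs on every `detection_every`-th frame starting from frame 0, and that the point grid has at most `SAM_NUM_POINTS_PER_SIDE` squared points (fewer in practice, since only unsegmented regions are prompted). The function name and return keys are hypothetical.

```python
import math


def image_model_cost(num_frames, detection_every, points_per_side, points_per_batch):
    """Upper-bound estimate of image-model work in automatic mode (illustrative only)."""
    # Assumption: frames 0, detection_every, 2 * detection_every, ... are detection frames.
    detection_frames = math.ceil(num_frames / detection_every)
    # A full square grid is the worst case; already-segmented regions are skipped.
    max_points = points_per_side ** 2
    batches_per_detection = math.ceil(max_points / points_per_batch)
    return {
        "detection_frames": detection_frames,
        "max_points_per_detection": max_points,
        "batches_per_detection": batches_per_detection,
        "max_total_batches": detection_frames * batches_per_detection,
    }


print(image_model_cost(num_frames=100, detection_every=5,
                       points_per_side=32, points_per_batch=64))
```

Halving `points_per_side` cuts the worst-case number of point prompts by a factor of four, while doubling `detection_every` halves the number of frames on which the image model runs at all.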
Speeding up the propagation model:
- In general, the running time of the propagation model scales linearly with the number of objects (not to be confused with direct proportionality). The best play is thus to reduce the number of objects:
  - Using text-prompt typically generates more relevant objects and fewer objects overall.
  - Increasing the thresholds `SAM_PRED_IOU_THRESHOLD` or `DINO_THRESHOLD` reduces the number of detected objects.
  - Reduce `max_missed_detection_count` to delete objects more readily.
  - In automatic mode, enable `suppress_small_objects` to get larger and fewer segments. Note this option has its own overhead.
- Reduce the internal processing resolution `size`. Note this does not affect the image model.
- Increasing `chunk_size` improves parallelism.
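The distinction between "linear" and "directly proportional" running time, and the effect of `chunk_size`, can be illustrated with a toy cost model. The constants below are made up for illustration and are not measurements; the assumption that each forward pass handles up to `chunk_size` objects follows the parameter description but is not verified against the implementation.

```python
import math


def num_propagation_passes(num_objects, chunk_size):
    # Assumption: objects are processed in groups of at most `chunk_size`.
    return math.ceil(num_objects / chunk_size)


def estimated_frame_time_ms(num_objects, fixed_ms=10.0, per_object_ms=2.0):
    # Linear but not proportional: a fixed per-frame cost plus a per-object cost.
    # Both constants are hypothetical placeholders.
    return fixed_ms + per_object_ms * num_objects


print(num_propagation_passes(num_objects=10, chunk_size=4))
print(estimated_frame_time_ms(10), estimated_frame_time_ms(20))
```

Under this model, doubling the number of objects increases the frame time but does not double it, which is why reducing the object count helps most when there are many objects to begin with.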
General:
- `detection_every`: number of frames between two consecutive detections; a higher number means faster inference but slower responses to new objects
- `amp`: enable mixed precision; this is faster and uses less memory
- `chunk_size`: number of objects to be processed in parallel; a higher number means faster inference but higher memory usage
- `size`: internal processing resolution for the propagation module; defaults to 480
- `max_missed_detection_count`: maximum number of consecutive detections that can be missed before an object is deleted from memory
- `max_num_objects`: maximum number of objects that can be tracked at the same time; new objects are ignored if this is exceeded
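The bookkeeping implied by `max_missed_detection_count` and `max_num_objects` might look roughly like the sketch below. The function names and data structures are hypothetical, and the exact boundary condition (deleting once the missed count exceeds the maximum) is an assumption rather than a description of DEVA's code.

```python
def update_missed_counts(missed, detected_ids, max_missed_detection_count):
    """Update per-object consecutive missed-detection counts on a detection frame.

    missed: dict mapping object id -> consecutive missed detections so far.
    detected_ids: set of object ids matched by the current detection.
    Returns (updated counts for surviving objects, list of deleted ids).
    """
    updated, deleted = {}, []
    for obj_id, count in missed.items():
        # A matched object resets its counter; an unmatched one accumulates a miss.
        new_count = 0 if obj_id in detected_ids else count + 1
        if new_count > max_missed_detection_count:
            deleted.append(obj_id)
        else:
            updated[obj_id] = new_count
    return updated, deleted


def admit_new_objects(current_ids, new_ids, max_num_objects):
    """Admit new objects until the tracking budget is full; the rest are ignored."""
    admitted = []
    for obj_id in new_ids:
        if len(current_ids) + len(admitted) >= max_num_objects:
            break
        admitted.append(obj_id)
    return admitted


print(update_missed_counts({1: 0, 2: 2}, detected_ids={1}, max_missed_detection_count=2))
print(admit_new_objects([1, 2], [3, 4, 5], max_num_objects=4))
```

A lower `max_missed_detection_count` frees memory sooner but makes the tracker more likely to drop objects that are only briefly missed by the detector.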
Text-prompted mode only:
- `DINO_THRESHOLD`: threshold for DINO to consider a detection as valid
- `prompt`: text prompt to use, with classes separated by full stops; e.g. "people.trees". The wording of the prompt and minor details like pluralization might affect the results.
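As an illustration of the prompt format, the snippet below splits a full-stop-separated prompt into individual class names. The helper `parse_prompt` is hypothetical and only demonstrates the format; how the actual scripts parse the prompt internally is not shown here.

```python
def parse_prompt(prompt: str) -> list:
    """Split a prompt such as "people.trees" into its class names."""
    # Drop surrounding whitespace and ignore empty segments (e.g. a trailing full stop).
    return [name.strip() for name in prompt.split(".") if name.strip()]


print(parse_prompt("people.trees"))
```

Because wording and pluralization can change the detections, it can be worth trying a few variants of each class name (for example "person" versus "people") on a short clip first.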
Automatic mode only:
- `SAM_NUM_POINTS_PER_SIDE`: number of points per side to use for automatic grid-based prompting in SAM
- `SAM_NUM_POINTS_PER_BATCH`: number of point prompts to process in parallel in SAM
- `SAM_PRED_IOU_THRESHOLD`: threshold of predicted IoU to be considered as a valid segmentation for SAM
- `suppress_small_objects`: if enabled, small objects that overlap with large objects are suppressed during the automatic mode; does not matter in the text-prompted mode
- `SAM_OVERLAP_THRESHOLD`: if `suppress_small_objects` is enabled, this is the IoU threshold for the suppression. A lower threshold means more segmentation masks (less suppression)