- Prior work repurposes classifiers for Object Detection
- YOLO formulates it as a regression problem of spatially separated bounding boxes and class probabilities
- Single NN predicts bounding boxes and class probabilties in one evaluation
- Resize input to 448x448, runs single CNN, thresholds the resulting decisions
- Base network runs at 45 fps on a Titan X GPU
- Able to process real time video with 25ms of latency
- YOLO reasons about the image globally, instead of regions
- Unify separate components of object detection into single NN
- Divides image into SxS grid.
- If center of object falls in a cell, that cell is responsible for detecting the object
- Train CNN on ImageNet 1000 classes dataset
- Then convert the model to perform detection, add 4 layers that increase resolution from 224 to 448
- Final layer predicts a tensor of class probabilities and bbox coordinates
- Limitations: each grid can only have one class and predict two boxes
- Struggles with flocks of birds or similar smaller objects