Eryck Zhou

A super simple BLOG for Artificial Intelligence.

YOLOv1

26 July 2020

Photo by Jess Barnett on Unsplash

Origin: You Only Look Once: Unified, Real-Time Object Detection

Abstract

  • Frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities
  • A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation
  • Makes more localization errors than other detection systems but is less likely to predict false positives on background

The YOLO Detection System

Figure 1
  1. resizes the input image to 448 × 448
  2. runs a single convolutional network on the image
  3. thresholds the resulting detections by the model’s confidence

The Model

Procedure

Figure 2
  1. It divides the image into an S × S grid [448 × 448 -> 7 × 7]
        If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
  2. Each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities.
        Bounding Box: x, y, w, h, where (x, y) is the box center
        Confidence: $\Pr(\text{Object}) \cdot \text{IoU}^{truth}_{pred}$
  3. Final output tensor: S × S × (B ∗ 5 + C)
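
As a concrete check of the shapes above, the sketch below computes the output tensor shape and finds the responsible cell for one ground-truth box. The function names and the pixel-based center format are assumptions made for illustration; the normalization (center as an offset within the cell, width and height relative to the image) follows the parametrization described in the paper.

```python
def output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)


def encode_box(cx, cy, w, h, img_size=448, S=7):
    """Map a ground-truth box (pixels, center format) to its responsible cell.

    Returns (row, col) of the cell and the target (x, y, w, h):
    x, y are the center offsets inside that cell, in [0, 1);
    w, h are normalized by the image size.
    """
    cell = img_size / S                      # cell side length in pixels
    col = min(int(cx // cell), S - 1)        # clamp centers on the right edge
    row = min(int(cy // cell), S - 1)        # clamp centers on the bottom edge
    x = cx / cell - col
    y = cy / cell - row
    return row, col, (x, y, w / img_size, h / img_size)


shape = output_shape()                       # S = 7, B = 2, C = 20
row, col, target = encode_box(224, 100, 64, 128)
```

With S = 7, B = 2 and C = 20 the shape is 7 × 7 × 30, and a box centered at pixel (224, 100) lands in row 1, column 3 of the grid.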

The Loss Function

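$$
\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$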

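        i: 0 to $S^2 - 1$ [iterates over each grid cell (0 to 48)]
        j: 0 to $B - 1$ [iterates over each bbox (0 to 1)]

Example of $\mathbb{1}^{obj}_{ij}$ and $\mathbb{1}^{noobj}_{ij}$ on a 7 × 7 grid (the two masks are complementary):

$$
\mathbb{1}^{obj}_{ij} =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\qquad
\mathbb{1}^{noobj}_{ij} =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 & 1 & 0 & 1 \\
1 & 1 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
$$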

        For $\mathbb{1}^{obj}_{ij}$: each cell makes B predictions, and only the one with the largest IoU against the ground truth is labeled 1.
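
The selection of the responsible predictor can be sketched in a few lines. This is a minimal illustration, assuming boxes are given as (cx, cy, w, h) tuples in the same units; the helper names `iou` and `responsible_box` are hypothetical and not taken from the paper or any library.

```python
def iou(box_a, box_b):
    """IoU of two boxes in (cx, cy, w, h) format, same units for both."""
    def corners(box):
        cx, cy, w, h = box
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    # Overlap along each axis, clipped at zero when the boxes do not touch.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def responsible_box(pred_boxes, truth_box):
    """Return the 0/1 mask over the B predictions of one cell and the best IoU."""
    ious = [iou(p, truth_box) for p in pred_boxes]
    best_j = max(range(len(ious)), key=lambda j: ious[j])
    mask = [1 if j == best_j else 0 for j in range(len(ious))]
    return mask, ious[best_j]


# Two predictions (B = 2) in a cell that contains one object.
preds = [(0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.4, 0.4)]
truth = (0.5, 0.5, 0.4, 0.4)
mask, best_iou = responsible_box(preds, truth)
```

Here the second prediction matches the ground truth more closely, so it receives the 1 in the mask and the first prediction is treated as a no-object box.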

Coordinate Loss

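$$
\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]
$$
  • $x, y$: predicted bbox center
  • $w, h$: predicted bbox width & height
  • $\hat{x}, \hat{y}$: labeled bbox center
  • $\hat{w}, \hat{h}$: labeled bbox width & height
  • $\sqrt{w}, \sqrt{h}$: the square root suppresses the effect of larger bboxes, so the same absolute error costs more on a small box than on a large one
  • $\lambda_{coord} = 5$: the coordinates take up only 8 dimensions (B = 2 boxes × 4 values) per cell, too few compared with the other loss terms, so this is essentially a weighted loss.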
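
To see the effect of the square root, here is a minimal sketch of the coordinate term for a single responsible box. The function name `coord_loss` and the tuple layout are assumptions for illustration, and widths and heights are assumed to be non-negative.

```python
import math


def coord_loss(pred, target, lambda_coord=5.0):
    """Coordinate loss for one responsible box; pred and target are (x, y, w, h)."""
    px, py, pw, ph = pred
    tx, ty, tw, th = target
    center = (px - tx) ** 2 + (py - ty) ** 2
    # Square roots require w, h >= 0 (assumed here).
    size = (math.sqrt(pw) - math.sqrt(tw)) ** 2 + (math.sqrt(ph) - math.sqrt(th)) ** 2
    return lambda_coord * (center + size)


# Same absolute width error (0.05) on a small box and on a large box.
small = coord_loss((0.5, 0.5, 0.15, 0.1), (0.5, 0.5, 0.10, 0.1))
large = coord_loss((0.5, 0.5, 0.85, 0.1), (0.5, 0.5, 0.80, 0.1))
```

The small box ends up with the larger loss, which is the intended behavior: a 0.05 error matters more when the box itself is only 0.1 wide.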

Confidence Loss

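$$
\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2
$$
  • $\hat{C}_i$: target confidence score [IoU] between the predicted box and the ground truth
  • $C_i$: predicted confidence score [IoU] generated by the network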

Note:

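  • $\hat{C}_i$ is 0 for boxes without an object; for the responsible box it can be simplified to the integer 1 instead of the exact IoU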
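  • $\lambda_{noobj} = 0.5$, because there are so many non-object bboxes
  • Train: confidence = $\Pr(\text{Object}) \cdot \text{IoU}^{truth}_{pred}$
  • Test: class-specific confidence prediction for each individual box:
         confidence = $\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IoU}^{truth}_{pred}$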
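
The two notes above can be sketched together: the confidence loss with its two weights, and the test-time product of class probabilities and box confidence. This is a minimal illustration over flat Python lists; the function names and the list layout are assumptions, not the paper's implementation.

```python
def confidence_loss(pred_conf, target_conf, obj_mask, lambda_noobj=0.5):
    """Sum of squared confidence errors; no-object boxes are down-weighted."""
    loss = 0.0
    for c, c_hat, is_obj in zip(pred_conf, target_conf, obj_mask):
        weight = 1.0 if is_obj else lambda_noobj
        loss += weight * (c - c_hat) ** 2
    return loss


def class_specific_scores(class_probs, box_conf):
    """Test time: Pr(Class_i | Object) multiplied by the box confidence."""
    return [p * box_conf for p in class_probs]


# One responsible box (target 1.0) and one no-object box (target 0.0).
loss = confidence_loss([0.8, 0.2], [1.0, 0.0], [True, False])
scores = class_specific_scores([0.5, 0.25], 0.5)
```

Both boxes have the same squared error here (0.04), but the no-object box contributes only half as much to the loss.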

Classification loss

$$
\sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2
$$

Each cell will only predict 1 object, which is decided by the bbox with the largest IoU.
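
For a single cell that contains an object, the classification term is a plain sum of squared errors over the classes. The sketch below assumes the label is one-hot (1 for the true class, 0 elsewhere); the function name `class_loss` is hypothetical.

```python
def class_loss(pred_probs, true_class):
    """Squared error between predicted class probabilities and a one-hot label."""
    return sum(
        (p - (1.0 if c == true_class else 0.0)) ** 2
        for c, p in enumerate(pred_probs)
    )


# Three classes, the object in this cell belongs to class 0.
loss = class_loss([0.7, 0.2, 0.1], true_class=0)
```

Cells without an object are skipped entirely, which is what the $\mathbb{1}_{i}^{obj}$ factor expresses.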

Don't forget to apply non-maximum suppression (NMS) after generating bboxes, so that duplicate detections of the same object are removed.
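
A minimal sketch of greedy NMS is shown below, assuming boxes in corner format (x1, y1, x2, y2) and an IoU threshold of 0.5. Both the format and the threshold are assumptions chosen for illustration, and in practice NMS is usually run separately for each class.

```python
def iou_xyxy(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat.

    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(iou_xyxy(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The first two boxes overlap heavily (IoU of about 0.68), so only the higher-scoring one survives, while the third box is far away and is kept.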


The YOLOv1 Pros & Cons

Pros:

  • one stage, really fast

Cons:

  • Bad for crowded objects [1 cell, 1 object]
  • Bad for small objects
  • Bad for objects with new or unusual width-height ratios
  • No batch normalization (BN)