This issue was created on 2019-12-10.

First of all, thanks @btekin for your work and for publishing it here. My question is about the relation between a cell and a box. In the paper it reads:

Figure 1. ... The 3D output tensor from our network, which represents for each cell a vector consisting of the 2D corner locations, the class probabilities and a confidence value associated with the prediction.

Overall, our output 3D tensor depicted in Figure 1(e) has dimension S × S × D, where the 2D spatial grid corresponding to the image dimensions has S × S cells and each such cell has a D dimensional vector. Here, D = 9×2+C+1, because we have 9 (x_i, y_i) control points, C class probabilities and one confidence value.

When multiple objects are located close to each other in the 3D scene, they are more likely to appear close together in the images or be occluded by each other. In these cases, certain cells might contain multiple objects. To be able to predict the pose of such multiple objects that lie in the same cell, we allow up to 5 candidates per cell and therefore predict five sets of control points per cell.

Maybe my question is easier to understand when I use an example.

Let's say we have M objects so close together that they all lie in one cell. What does the final vector for this cell look like?

- (9×2+C+1)×M = M full vectors, each having a box, a confidence value and its own set of class probabilities (YOLO-v2-like).
- (9×2+1)×M + C = M boxes with confidence scores but only one set of class probabilities per cell (YOLO-v1-like).
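
To make the difference between the two options concrete, here is a minimal sketch that just evaluates both formulas. The function name `cell_vector_sizes` and the values C = 13 and M = 5 are assumptions chosen for illustration. They are not taken from the repository code, and the sketch does not say which layout the network actually uses.

```python
def cell_vector_sizes(num_classes, num_candidates, num_points=9):
    """Per-cell vector length under the two layouts asked about above.

    num_classes    -- C, number of class probabilities
    num_candidates -- M, number of candidates (objects) per cell
    num_points     -- number of 2D control points (9 in the paper)
    """
    coords = 2 * num_points  # 9 (x, y) control points -> 18 values

    # YOLO-v2-like: every candidate carries its own class probabilities.
    yolo_v2_like = (coords + num_classes + 1) * num_candidates

    # YOLO-v1-like: candidates share one set of class probabilities per cell.
    yolo_v1_like = (coords + 1) * num_candidates + num_classes

    return {"yolo_v2_like": yolo_v2_like, "yolo_v1_like": yolo_v1_like}


# Hypothetical example values: C = 13 classes, M = 5 candidates per cell.
print(cell_vector_sizes(num_classes=13, num_candidates=5))
# {'yolo_v2_like': 160, 'yolo_v1_like': 108}
```

For M = 1 both layouts reduce to the paper's D = 9×2+C+1, so the two options only differ once a cell holds more than one candidate.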

belorenz wrote this answer on 2019-12-10

More Details About Repo

| Field | Value |
| --- | --- |
| Owner Name | microsoft |
| Repo Name | singleshotpose |
| Full Name | microsoft/singleshotpose |
| Language | Python |
| Created Date | 2018-06-30 |
| Updated Date | 2022-07-28 |
| Star Count | 612 |
| Watcher Count | 28 |
| Fork Count | 200 |
| Issue Count | 71 |