Getting pose ground truth data for YOLO
So YOLO requires a ground truth text file that has the class number and the box dimensions: centre point (x, y), width and height.
What do we have? Well…
All 16 joints with their image coordinates and whether or not they are visible, along with a bounding box for the head.
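As a rough illustration of the data I'm working from (the field names below are my own, not the dataset's actual keys), a single annotated person might be represented like this:

```python
# Hypothetical representation of one annotated person from the MPII-style data.
# Field names and values are illustrative only.
person = {
    "image": "image.jpg",
    "scale": 1.8,                       # rough scale of the person in the image
    "head_box": (600.0, 280.0, 660.0, 340.0),  # head bounding box in pixels
    "joints": [
        # (joint_id, x, y, visible)
        (0, 620.0, 394.0, 1),           # e.g. right ankle in MPII's joint ordering
        (1, 615.0, 310.0, 1),           # e.g. right knee
        # ... up to 16 joints in total
    ],
}
```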

Attempt One
Draw rough bounding boxes around each joint, with the joint coordinate as the centre point and a box width and height of, let's say, 50px. It should look something like this:

However, each image might be a different size, and the pose itself may take up a different proportion of the image, so a fixed box size won't scale well.
Creating Ground Truth
Each image, e.g. image.jpg, requires a text file called image.txt describing all the classes and bounding boxes inside the image.
For my first attempt I drew 50 by 50 pixel bounding boxes around each of the joints and labelled each joint with its joint id. For this initial run I used 1000 images with 200 test images. I didn't expect good results, but I wanted to get the network working and make sure all my GPU drivers were working correctly.
As for the ground truth text file itself, I started with a joint's centre point, a box width and height, and its joint id. YOLO expects these values normalised by the image dimensions.
Below I divide each value by the corresponding image dimension, so:
x = x / img_width
y = y / img_height
box_w = (25 * pose_scale) / img_width
box_h = (25 * pose_scale) / img_height
Note on box_w and box_h: these values are just for testing with the MPII Human Pose dataset. They will need to be changed based on the dataset you are using.
This produces numbers between 0 and 1, giving output something like this:
CLASS X Y WIDTH HEIGHT
1 0.546875 0.7861111111111111 0.0390625 0.06944444444444445
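A minimal Python sketch of that conversion might look like the following. The function names, the `joints` tuple layout and `pose_scale` are my own assumptions, not part of the dataset or of YOLO itself; the 25px base size matches the numbers above.

```python
def joint_to_yolo_line(joint_id, x, y, pose_scale, img_w, img_h, base_size=25):
    """Convert one joint annotation into a single YOLO ground-truth line.

    x, y are the joint's pixel coordinates, pose_scale is the rough scale
    of the person, and base_size is the un-scaled box size in pixels.
    """
    cx = x / img_w
    cy = y / img_h
    box_w = (base_size * pose_scale) / img_w
    box_h = (base_size * pose_scale) / img_h
    return f"{joint_id} {cx} {cy} {box_w} {box_h}"


def write_label_file(label_path, joints, pose_scale, img_w, img_h):
    """Write one image.txt file with a line for every visible joint.

    `joints` is assumed to be a list of (joint_id, x, y, visible) tuples.
    """
    lines = [
        joint_to_yolo_line(jid, x, y, pose_scale, img_w, img_h)
        for jid, x, y, visible in joints
        if visible
    ]
    with open(label_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```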
Then you want a folder structure something like this:

As you can see here, each image has a text file describing each class in the image. For YOPO each file would have up to 16 classes; it would have fewer if a joint was not visible in the image.
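For reference, the layout I'm assuming is the usual darknet-style one: each label file sits next to its image with the same name, and a train.txt lists the image paths, one per line. A small sketch for generating that list (the directory names here are illustrative):

```python
from pathlib import Path

# Assumed layout: data/images/image.jpg with a matching data/images/image.txt
# label file alongside it. Darknet then reads a train.txt of image paths.
image_dir = Path("data/images")
image_paths = sorted(
    p for p in image_dir.glob("*.jpg") if p.with_suffix(".txt").exists()
)

with open("train.txt", "w") as f:
    f.write("\n".join(str(p) for p in image_paths) + "\n")
```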
Training on images with upright poses, which map well onto these bounding boxes, worked fairly well for a network trained on only around 150 images; see below:

However, on images where the poses don't map well onto upright-oriented bounding boxes, it yields very poor results, and in some cases no boxes are found at all. The network was trained for two days on an Nvidia GTX 1070, which resulted in 20,000 training iterations, with an average IOU of around 3 that wasn't improving. Around 25,000 training iterations the average loss was bouncing up and down, which isn't ideal, because the average loss should decrease over time towards a target value of 0.6.
So for next week I need to look into why the loss isn't converging, and consider a different way of approaching the problem.