All implementations are in PyTorch. The walkability prediction models in our experiments are based on a generic U-Net-style dense prediction structure with an ImageNet-pretrained ResNet-50 network as the backbone. Note that our method is not limited to any particular network architecture.

We use the ResNet-50 layers up to and including layer3 for feature extraction and add two skip connections to the decoder. The network takes an input image of size 640x480 and produces a 160x120 single-channel map, in which the score at each position represents the likelihood of a person walking there. Features sampled from the layer before the last single-channel output layer are used in the adversarial feature loss computation described in the main paper.

Detailed dense prediction network summary

A three-layer fully connected network is used as the discriminator, which takes a 256-D feature vector and predicts a single score.
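A discriminator of this shape might look like the sketch below. The class name FeatureDiscriminator, the hidden width of 256, and the LeakyReLU activations are assumptions; the source only specifies three fully connected layers, a 256-D input, and a single output score. No normalization layers are used, in line with common practice for critics trained with a gradient penalty.

```python
import torch
import torch.nn as nn


class FeatureDiscriminator(nn.Module):
    """Hypothetical three-layer MLP critic over 256-D feature vectors."""

    def __init__(self, in_dim=256, hidden_dim=256):
        super().__init__()
        # hidden_dim and the activation choice are illustrative assumptions.
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, 1),  # single unbounded score per feature vector
        )

    def forward(self, x):
        return self.net(x)
```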

Detailed discriminator network summary


We train the networks using Adam with betas set to 0.5 and 0.999. Images are resized to 480 pixels in height and then randomly cropped along the width to 640x480 pixels. The initial learning rate is 1e-4 and is reduced by 10% when the validation loss reaches a plateau. For each model, training takes about 60K iterations with a batch size of 10.
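As a rough sketch, this setup could be expressed in PyTorch as follows. The helper resize_and_random_crop, the placeholder model, and the scheduler patience are assumptions; "reduced by 10%" is read here as multiplying the learning rate by 0.9, which is one possible interpretation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def resize_and_random_crop(img, out_h=480, out_w=640):
    """Resize a (C, H, W) float image to height out_h, then crop a random
    out_w-wide window. Images that would end up narrower than out_w are
    stretched to out_w (an assumption; the source does not cover this case)."""
    _, h, w = img.shape
    new_w = max(out_w, round(w * out_h / h))
    img = F.interpolate(img[None], size=(out_h, new_w),
                        mode="bilinear", align_corners=False)[0]
    x0 = torch.randint(0, new_w - out_w + 1, (1,)).item()
    return img[:, :, x0:x0 + out_w]


# Placeholder model standing in for the dense prediction network.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.999))
# factor=0.9 encodes a 10% reduction; patience=2 is an illustrative value.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.9, patience=2
)
```

During training, scheduler.step(val_loss) would be called once per validation pass so that the learning rate drops only when the validation loss stops improving.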

For the rectified classification loss described in Section 4.1 in the main paper, we use per-image positive and negative class weights obtained by taking the reciprocal of each class's frequency in the current image. The ratio of the two weights is on average about 1e5.
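The weight computation can be illustrated with a short sketch. The function name per_image_class_weights and the eps guard against empty classes are assumptions; the rectified loss itself is defined in the main paper and is not reproduced here.

```python
import torch


def per_image_class_weights(label, eps=1e-8):
    """Return (w_pos, w_neg, weight_map) for one binary label map of shape (H, W).

    Each class weight is the reciprocal of that class's pixel frequency in
    this image. eps avoids division by zero when a class is absent
    (an assumption; the source does not say how that case is handled).
    """
    label = label.float()
    pos_freq = label.mean()
    neg_freq = 1.0 - pos_freq
    w_pos = 1.0 / pos_freq.clamp_min(eps)
    w_neg = 1.0 / neg_freq.clamp_min(eps)
    # Per-pixel weights, ready to multiply into a per-pixel loss.
    weight_map = torch.where(label > 0.5, w_pos, w_neg)
    return w_pos, w_neg, weight_map
```

For a label map in which 1 pixel out of 20 is positive, the frequencies are 0.05 and 0.95, giving weights of 20 and roughly 1.05. Sparser positive labels lead to much larger ratios.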

Adversarial training uses a WGAN loss with gradient penalty, following the original paper (Eq. 3). In our problem, the dense prediction network plays the role of the 'Generator' in a standard GAN training setting. The prediction network is optimized against the discriminator so that the discriminator cannot tell whether a feature is extracted from a ground-truth region or from our predicted region.
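A minimal sketch of the WGAN-GP objective on feature vectors is given below, under stated assumptions: features sampled from ground-truth regions are treated as "real", features sampled from predicted regions as "fake", and the penalty weight of 10 is the default from the WGAN-GP paper rather than a value confirmed by this work. The function names are hypothetical, and critic stands for any module mapping (N, D) features to (N, 1) scores.

```python
import torch


def gradient_penalty(critic, real, fake):
    """Penalize deviation of the critic's gradient norm from 1 at random
    interpolates between real and fake feature vectors of shape (N, D)."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    mixed = (eps * real + (1.0 - eps) * fake).detach().requires_grad_(True)
    scores = critic(mixed)
    grads, = torch.autograd.grad(scores.sum(), mixed, create_graph=True)
    return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()


def critic_loss(critic, real, fake, gp_weight=10.0):
    """Critic (discriminator) objective: score real features high, fake low."""
    real, fake = real.detach(), fake.detach()  # no gradient to the predictor
    wasserstein = critic(fake).mean() - critic(real).mean()
    return wasserstein + gp_weight * gradient_penalty(critic, real, fake)


def generator_loss(critic, fake):
    """Adversarial term for the prediction network: raise the critic's
    score on features from predicted regions."""
    return -critic(fake).mean()
```

In a training loop, critic_loss would update only the discriminator, while generator_loss would be added to the prediction network's other loss terms.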