Animal pose estimation from video data is an important step in many biological studies, but current methods struggle in complex environments where occlusions are common and training data is scarce. Recent work has demonstrated improved accuracy with deep neural networks, but these methods often do not incorporate prior distributions that could improve localization. Here we present GIMBAL: a hierarchical von Mises-Fisher-Gaussian model that improves upon deep networks' estimates by leveraging spatiotemporal constraints. The spatial constraints come from the animal's skeleton, which induces a curved manifold of keypoint configurations. The temporal constraints come from the postural dynamics, which govern how angles between keypoints change over time. Importantly, the conditional conjugacy of the model permits simple and efficient Bayesian inference algorithms. We assess the model on a unique experimental dataset with video of a freely behaving rodent from multiple viewpoints and ground-truth motion capture data for 20 keypoints. GIMBAL extends existing techniques and, in doing so, offers more accurate estimates of keypoint positions, especially in challenging contexts.