A grand challenge of computer vision is to understand natural images by parsing them into boundaries, surfaces and objects. Solving this problem inevitably requires working with visual entities and cues of a heterogeneous nature: brightness and texture at the low level, contour and region grouping at the mid level, and shape recognition at the high level. Learning to represent and combine these entities and cues, together with the complexity of the visual world itself, calls for probabilistic models of image parsing. Many previous efforts along this line suffer from the lack of a compact representation, of scale invariance, or of comprehensive experimentation. We describe a scale-invariant image representation that uses piecewise linear approximations of contours and the constrained Delaunay triangulation (CDT) to complete gradientless gaps. On top of the CDT graph we develop conditional random fields (CRFs) for contour completion, figure/ground organization and object segmentation. Large datasets of human-annotated natural images are used for both training and evaluation. Our quantitative results are the first to demonstrate that mid-level visual cues work in general natural scenes. The CDT/CRF framework enables efficient representation of, and inference over, both bottom-up and top-down information, and is hence applicable to a variety of vision problems. We extend our work to joint object recognition and segmentation, in particular finding people, in static images and video.
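
As an illustration of the first step of the representation, the following minimal sketch shows one common way to obtain a piecewise linear approximation of a contour: recursively split the contour at the point farthest from the chord joining its endpoints. The function names and the relative tolerance `rel_tol` are assumptions introduced here for illustration and are not taken from this work. Measuring the deviation relative to the chord length is one simple way to make the breakpoints independent of image scale. The triangulation and CRF stages are not shown.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Parameter of the projection of p onto the line, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def approximate_contour(points, rel_tol=0.05):
    """Return breakpoints of a piecewise linear approximation of a contour.

    `points` is an ordered list of (x, y) tuples. A piece is accepted when
    no point deviates from its chord by more than `rel_tol` times the chord
    length (hypothetical criterion, chosen so that uniformly rescaling the
    contour leaves the breakpoints unchanged).
    """
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    chord = math.hypot(b[0] - a[0], b[1] - a[1])
    dists = [point_segment_distance(p, a, b) for p in points[1:-1]]
    k = max(range(len(dists)), key=dists.__getitem__)
    if dists[k] <= rel_tol * chord:
        return [a, b]
    split = k + 1  # index in `points` of the farthest interior point
    left = approximate_contour(points[: split + 1], rel_tol)
    right = approximate_contour(points[split:], rel_tol)
    return left[:-1] + right  # drop the duplicated split point


if __name__ == "__main__":
    contour = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]  # an L-shaped contour
    print(approximate_contour(contour))  # [(0, 0), (2, 0), (2, 2)]
```

Scaling every coordinate of the example contour by the same factor yields the same breakpoints, scaled accordingly, which is the property the relative tolerance is meant to provide.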