We extend our earlier work on deep-structured conditional random field (DCRF) and develop deep-structured hidden conditional random field (DHCRF). We investigate the use of this new sequential deep-learning model for phonetic recognition. DHCRF is a hierarchical model in which the final layer is a hidden conditional random field (HCRF) and the intermediate layers are zero-th-order conditional random fields (CRFs). Parameter estimation and sequence inference in the DHCRF are developed in this work. They are carried out layer by layer so that the time complexity is linear to the number of layers. In the DHCRF, the training label is available only at the final layer and the state boundary is unknown. This difficulty is addressed by using unsupervised learning for the intermediate layers and lattice-based supervised learning for the final layer. Experiments on the standard TIMIT phone recognition task show small performance improvement of a three-layer DHCRF over a two-layer DHCRF; both are significantly better than the single-layer DHCRF and are superior to the discriminatively trained tri-phone hidden Markov model (HMM) using identical input features.