A new method for localising and recognising hand poses and objects in real time is presented. This problem is important in vision-driven applications where it is natural for a user to combine hand gestures and real objects when interacting with a machine; an example is using a real eraser to remove words from a document displayed on an electronic surface. In this paper the task of simultaneously recognising object classes, recognising hand gestures and detecting touch events is cast as a single classification problem. A random forest algorithm is employed which adaptively selects and combines a minimal set of appearance, shape and stereo features to achieve maximum class discrimination for a given image. This minimal set yields both run-time efficiency and good generalisation. Unlike previous stereo works, which explicitly construct disparity maps, here the stereo matching costs are used directly as a visual cue and are computed only on demand, i.e. only for those pixels where they are needed for recognition, which further improves efficiency. The proposed method is assessed on a database of a variety of objects and hand poses selected for interaction on a flat surface in an office environment.
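The on-demand idea described above can be illustrated with a minimal sketch, not the paper's implementation: each split node of a decision tree names a feature, and that feature is computed for a pixel only if a node on the traversed root-to-leaf path actually requests it, then cached. All names here (`Node`, `classify`, the feature bank) are hypothetical, and the stereo matching cost is a stand-in.

```python
# Minimal sketch (assumptions, not the paper's code) of lazy, on-demand
# feature evaluation in a single decision tree of a random forest.
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Node:
    feature: Optional[str] = None      # split feature id (None at a leaf)
    threshold: float = 0.0
    left: Optional["Node"] = None      # branch taken when value < threshold
    right: Optional["Node"] = None
    label: Optional[str] = None        # class label stored at a leaf

def classify(node: Node, pixel: dict, features: Dict[str, Callable]) -> str:
    """Traverse one tree, computing each requested feature at most once."""
    cache: Dict[str, float] = {}
    while node.label is None:
        if node.feature not in cache:               # compute on demand only
            cache[node.feature] = features[node.feature](pixel)
        node = node.left if cache[node.feature] < node.threshold else node.right
    return node.label

# Hypothetical feature bank: appearance plus a stand-in stereo matching cost.
features = {
    "intensity": lambda p: p["intensity"],
    "stereo_cost": lambda p: abs(p["left"] - p["right"]),
}

# A hand-built toy tree; in the actual method trees are learned from data.
tree = Node(feature="stereo_cost", threshold=5.0,
            left=Node(label="hand"),
            right=Node(feature="intensity", threshold=0.5,
                       left=Node(label="background"),
                       right=Node(label="object")))

pixel = {"intensity": 0.8, "left": 100, "right": 97}
print(classify(tree, pixel, features))  # → hand
```

Because traversal touches one root-to-leaf path per pixel, a feature such as the stereo cost is evaluated only for the (typically small) subset of pixels whose path includes a node that tests it, which is the source of the efficiency gain claimed above.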