MonoFusion allows a user to build dense 3D reconstructions of their environment in real-time, utilizing only a single, off-the-shelf web camera as the input sensor. The camera could be one already available in a tablet, phone, or a standalone device. No additional input hardware is required. This removes the need for power intensive active sensors that do not work robustly in natural outdoor lighting. Using the input stream of the camera we first estimate the 6DoF camera pose using a sparse tracking method. These poses are then used for efficient dense stereo matching between the input frame and a key frame (extracted previously). The resulting dense depth maps are directly fused into a voxel-based implicit model (using a computationally inexpensive method) and surfaces are extracted per frame. The system is able to recover from tracking failures as well as filter out geometrically inconsistent noise from the 3D reconstruction. Our method is both simple to implement and efficient, making such systems even more accessible. This paper details the algorithmic components that make up our system and a GPU implementation of our approach. Qualitative results demonstrate high quality reconstructions even visually comparable to active depth sensor-based systems such as KinectFusion.