Modular multiplication of long integers is an important building block for cryptographic algorithms. Although several FPGA accelerators have been proposed for large modular multiplication, previous systems have been based on $O(N^2)$ algorithms. In this paper, we present a Montgomery multiplier that incorporates the more efficient Karatsuba algorithm, which is $O(N^{\log 3 / \log 2})$. The system is parameterizable to different bitwidths and makes excellent use of both embedded multipliers and fine-grained logic. The design has a significantly lower LUT-delay product and multiplier-delay product than previous designs. Initial testing on a Virtex-6 FPGA showed that it is 60-190 times faster than an optimized multi-threaded software implementation running on an Intel Xeon 2.5 GHz CPU. The proposed multiplier system is also estimated to be 95-189 times more energy efficient than the software-based implementation. This high performance and energy efficiency make it suitable for server-side applications running in a datacenter environment.
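To make the combination of the two algorithms concrete, the following is a minimal software sketch of Montgomery multiplication in which each of the three large products is computed with a Karatsuba recursion. It is an illustration only and does not describe the hardware architecture of the proposed design. The function names, the `threshold_bits` cutoff, and the choice of $R = 2^{r\_bits}$ are assumptions made for this example; the modulus is assumed to be odd and smaller than $R$.

```python
def karatsuba(x: int, y: int, threshold_bits: int = 64) -> int:
    """Multiply two non-negative integers with Karatsuba's three-product recursion."""
    # Small operands: fall back to the native multiply (assumed base case).
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    low = karatsuba(x_lo, y_lo, threshold_bits)
    high = karatsuba(x_hi, y_hi, threshold_bits)
    # One extra product replaces the two cross products x_hi*y_lo and x_lo*y_hi.
    cross = karatsuba(x_lo + x_hi, y_lo + y_hi, threshold_bits) - low - high
    return (high << (2 * half)) + (cross << half) + low


def montgomery_multiply(a: int, b: int, modulus: int, r_bits: int) -> int:
    """Return a * b * R^-1 mod modulus, where R = 2**r_bits.

    Assumes modulus is odd, modulus < R, and 0 <= a, b < modulus.
    """
    r_mask = (1 << r_bits) - 1
    # n_prime = -modulus^-1 mod R (requires Python 3.8+ for pow with exponent -1).
    n_prime = (-pow(modulus, -1, 1 << r_bits)) & r_mask
    t = karatsuba(a, b)
    m = karatsuba(t & r_mask, n_prime) & r_mask
    u = (t + karatsuba(m, modulus)) >> r_bits  # exact division by R
    return u - modulus if u >= modulus else u


def to_montgomery(a: int, modulus: int, r_bits: int) -> int:
    """Map a into the Montgomery domain: a * R mod modulus."""
    return (a << r_bits) % modulus
```

A typical use converts both operands into the Montgomery domain, multiplies there, and then multiplies by 1 to leave the domain:

```python
n, k = (1 << 127) - 1, 128
a, b = 3**70 % n, 5**50 % n
ab_mont = montgomery_multiply(to_montgomery(a, n, k), to_montgomery(b, n, k), n, k)
product = montgomery_multiply(ab_mont, 1, n, k)  # equals (a * b) % n
```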