In this paper, we present Structon, a novel approach that combines Web mining with inference and IP traceroute to geolocate IP addresses with significantly better accuracy than existing automated approaches. Structon rests on three ideas, realized in three corresponding steps. First, we extract the geolocation information of Web server IP addresses from Web pages. Second, we devise heuristic algorithms that take these Web server IP addresses and their geolocations as input and improve both the accuracy and the coverage of the IP geolocation database. Third, for IP segments not covered by the first two steps, we use IP traceroute to identify the access routers of those segments. When the location of an access router is known, we can deduce the location of the associated segment, because the segment is co-located with its access router.
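The three steps can be pictured as a lookup with a fallback. The sketch below is only an illustration under stated assumptions, not Structon's actual algorithm: the /24 segment size, the majority vote standing in for the heuristic algorithms of the second step, and the names `segment_of`, `vote_segment_locations`, `locate`, and `access_router_of` are all hypothetical. The Web-mined (IP, city) pairs and the traceroute results are assumed to be given as plain Python data.

```python
from collections import Counter
from ipaddress import ip_network


def segment_of(ip, prefix_len=24):
    """Return the segment containing ip. The /24 size is an assumption."""
    return ip_network(f"{ip}/{prefix_len}", strict=False)


def vote_segment_locations(web_server_locations, prefix_len=24):
    """Stand-in for step 2: majority vote per segment over the
    (ip, city) pairs mined from Web pages in step 1."""
    votes = {}
    for ip, city in web_server_locations:
        seg = segment_of(ip, prefix_len)
        votes.setdefault(seg, Counter())[city] += 1
    return {seg: counter.most_common(1)[0][0] for seg, counter in votes.items()}


def locate(ip, segment_locations, access_router_of, prefix_len=24):
    """Look up ip in the segment database; if its segment is uncovered,
    fall back to the location of the segment's access router (step 3).

    access_router_of maps a segment to the IP address of its access
    router, as would be obtained from traceroute (assumed input).
    """
    seg = segment_of(ip, prefix_len)
    if seg in segment_locations:
        return segment_locations[seg]
    router = access_router_of.get(seg)
    if router is None:
        return None
    # The segment is taken to be co-located with its access router.
    return segment_locations.get(segment_of(router, prefix_len))


# Example with documentation-range addresses and made-up locations.
mined = [
    ("203.0.113.10", "Beijing"),
    ("203.0.113.20", "Beijing"),
    ("203.0.113.30", "Shanghai"),
    ("198.51.100.1", "Guangzhou"),
]
segment_locations = vote_segment_locations(mined)
routers = {segment_of("192.0.2.7"): "198.51.100.1"}
```

With this data, `locate("203.0.113.99", segment_locations, routers)` resolves directly from the voted segment, while `locate("192.0.2.7", segment_locations, routers)` is answered through the access router's segment.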
By mining 500 million Web pages collected in China in 2006 (11 percent of the total Web pages in China at that time), we are able to identify the geolocations of 103 million IP addresses. This represents nearly 88 percent of the IP addresses allocated to China in March 2008. Structon is 87.4 percent accurate at city granularity and up to 93.5 percent accurate at province granularity. We also used a 10-day Windows Live client log to evaluate our coverage of client IP addresses: Structon identified the geolocations of 98.9 percent of the client IP addresses.