Abstract

A typical assumption in network classification methods is that the full network is available both to learn the model and to apply it for prediction. This assumption is often appropriate (e.g., publicly visible friendship links in social networks); in other domains, however, the underlying relational structure exists but acquiring the edges carries a cost. In this preliminary work we explore the problem domain of active sampling, where the goal is to maximize the number of positive (e.g., fraudulent) nodes identified while simultaneously querying for network structure that is likely to improve estimates. We outline the problem domain formally and discuss five subdomains that are likely to be observed in real-world scenarios. As our key finding, we show that when parameter estimates are learned from the distribution of labeled samples, they are biased with respect to the parameters of the distribution of unlabeled samples, which reduces the number of positive instances recalled. Additionally, we demonstrate that the estimate of the generative distribution obtained from the labeled samples is also biased.