Scanning Electronic Documents for Personally Identifiable Information

  • Tuomas Aura
  • Thomas A. Kuhn
  • Michael Roe

Published by Association for Computing Machinery, Inc.

It is sometimes necessary to remove author names and other personally identifiable information (PII) from documents before publication. We have implemented a novel defensive tool that detects such data automatically. Using the tool, we have learned where PII may be stored in documents and how it gets there. A key observation is that, contrary to common belief, user and machine identifiers and other metadata are embedded in documents not by a single piece of software, such as a word processor, but by various tools used at different stages of the document authoring process.
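
To make the idea of automatic detection concrete, the following is a minimal sketch of one possible approach, not the tool described above: it searches a document's raw bytes for known identifiers (for example a user name or machine name) in a few common text encodings. The function name `find_identifiers`, the encoding list, and the case-sensitive exact matching are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical choice of encodings to try; a real scanner would likely need more.
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be")


def find_identifiers(data: bytes, identifiers: List[str]) -> List[Tuple[str, str, int]]:
    """Return (identifier, encoding, byte offset) for every exact match in data.

    Assumption: identifiers are matched case-sensitively and stored
    uncompressed; compressed or encrypted streams are not handled here.
    """
    hits = []
    for ident in identifiers:
        if not ident:
            continue  # an empty needle would match at every offset
        for enc in ENCODINGS:
            needle = ident.encode(enc)
            start = data.find(needle)
            while start != -1:
                hits.append((ident, enc, start))
                start = data.find(needle, start + 1)
    return hits


# Usage: a fabricated byte string standing in for a document file.
sample = b"%PDF /Author (alice)" + "alice".encode("utf-16-le")
for ident, enc, offset in find_identifiers(sample, ["alice"]):
    print(f"{ident!r} found as {enc} at byte {offset}")
```

Scanning raw bytes rather than parsing one file format reflects the observation that identifiers may be written by different tools at different stages, so they need not sit in a single well-known metadata field.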