Bias reduction in the population size estimation of large data sets

authors

  • Chu, Jeffrey
  • Zhang, Yuanyuan
  • Chan, Stephen
  • Nadarajah, Saralees

publication date

  • May 2020

start page

  • 1

end page

  • 32

issue

  • 106914

volume

  • 145

International Standard Serial Number (ISSN)

  • 0167-9473

Electronic International Standard Serial Number (EISSN)

  • 1872-7352

abstract

  • Estimating the population size of large data sets and hard-to-reach populations can be a significant problem. For example, in the military, manpower is limited and the manual processing of large data sets can be time-consuming. In addition, access to the full population of data may be restricted by factors such as cost, time, and safety. Four new population size estimators are proposed, as extensions of existing methods, and their performance is compared in terms of bias with two existing methods in the big data literature. The proposed estimators would be particularly beneficial in the context of time-critical decisions or actions. The comparison is based on a simulation study and an application to five real network data sets (Twitter, LiveJournal, Pokec, YouTube, Wikipedia Talk). Whilst no single estimator (out of the four proposed) generates the most accurate estimates overall, the proposed estimators are shown to produce more accurate population size estimates for small sample sizes, although in some cases they show more variability than existing estimators in the literature.

subjects

  • Mathematics
  • Statistics

keywords

  • random walk sampling; relative bias; size estimator; twitter; youtube