'A priori' Shapley data value estimation

publication date

  • June 2025

start page

  • 1

end page

  • 19

issue

  • 2

volume

  • 28

International Standard Serial Number (ISSN)

  • 1433-7541

Electronic International Standard Serial Number (EISSN)

  • 1433-755X

abstract

  • Distributed machine learning approaches are required when training data cannot be collected in a central location, due to storage, transmission or privacy/security constraints. An important task in any distributed machine learning context, and Federated Learning is no exception, is data value estimation or credit allocation, where the goal is to reward each participant proportionally to their contribution to the final performance of the machine learning model. However, all existing data value estimation techniques require that training be completed before the data values are obtained, and in this sense they can be considered as 'a posteriori' approaches. Thus, all potential contributors must participate in the training process, regardless of the quality of their data or the final reward they can obtain. Here we present an 'a priori' Shapley data value estimation technique in which, based on some statistical measures provided by the participants, the central counterpart or aggregator can obtain reasonably accurate data value estimates before actually starting the distributed learning process. To the best of our knowledge, this is the first 'a priori' data value estimation approach proposed in the literature, and it can be used for the pre-selection of participants or to implement new pricing schemes. The introduced algorithms have been benchmarked using a variety of datasets and a logistic regression model, and we show that our 'a priori' estimates are very accurate, compared to the centralized Shapley data values.
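For readers unfamiliar with the baseline mentioned in the abstract, the following is a minimal sketch of an exact Shapley value computation over a small set of participants. It is not the 'a priori' method of the article, whose details are not given in this record; it only illustrates the 'a posteriori' kind of quantity the estimates are compared against. The participant names, the `quality` scores, and the toy `utility` function are all hypothetical. In a real setting the utility of a coalition would typically be the performance of a model (for example, a logistic regression) trained on that coalition's pooled data.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values by enumerating every coalition.

    `utility` maps a frozenset of players to a number. The cost grows
    exponentially with the number of players, so this is only practical
    for a handful of participants.
    """
    n = len(players)
    values = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal contribution of p when joining coalition s.
                values[p] += weight * (utility(s | {p}) - utility(s))
    return values


# Hypothetical per-participant data quality scores (illustration only).
quality = {"A": 0.5, "B": 0.3, "C": 0.1}


def utility(coalition):
    # Toy stand-in for model performance with diminishing returns:
    # 1 - product of (1 - quality) over the coalition members.
    remaining = 1.0
    for p in coalition:
        remaining *= 1.0 - quality[p]
    return 1.0 - remaining


players = list(quality)
values = shapley_values(players, utility)
for name, value in values.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions the values sum to the utility of the full coalition (the efficiency property of the Shapley value), and participants with higher quality scores receive larger values.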

subjects

  • Telecommunications

keywords

  • a priori; shapley; data value; distributed learning; privacy preserving; credit allocation