Improving Synthetic Data Generation Through Federated Learning in Scarce and Heterogeneous Data Scenarios

March 13, 2025



Published in: Big Data and Cognitive Computing, 9(2): 18, 2025-02-01. DOI: 10.3390/bdcc9020018

Authors:

Apellaniz, Patricia A; Parras, Juan; Zazo, Santiago


Affiliations

Univ Politecn Madrid, Informat Proc & Telecommun Ctr, ETS Ingn Telecomunicac, Madrid 28040, Spain - Author

Abstract

Synthetic Data Generation (SDG) is a promising solution for healthcare, offering the potential to generate synthetic patient data closely resembling real-world data while preserving privacy. However, data scarcity and heterogeneity, particularly in under-resourced regions, challenge the effective implementation of SDG. This paper addresses these challenges using Federated Learning (FL) for SDG, focusing on sharing synthetic patients across nodes. By leveraging collective knowledge and diverse data distributions, we hypothesize that sharing synthetic data can significantly enhance the quality and representativeness of generated data, particularly for institutions with limited or biased datasets. This approach aligns with meta-learning concepts, like Domain Randomized Search. We compare two FL techniques, FedAvg and Synthetic Data Sharing (SDS), the latter being our proposed contribution. Both approaches are evaluated using variational autoencoders with Bayesian Gaussian mixture models across diverse medical datasets. Our results demonstrate that while both methods improve SDG, SDS consistently outperforms FedAvg, producing higher-quality, more representative synthetic data. Non-IID scenarios reveal that while FedAvg achieves improvements of 13-27% in reducing divergence compared to isolated training, SDS achieves reductions exceeding 50% in the worst-performing nodes. These findings underscore synthetic data sharing potential to reduce disparities between data-rich and data-poor institutions, fostering more equitable healthcare research and innovation.
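To make the contrast between the two federated strategies concrete, the following is a minimal sketch, not the authors' implementation. It assumes that FedAvg aggregates per-node model parameters with a sample-size-weighted mean, and that synthetic data sharing means each node augments its local records with samples drawn from the other nodes' generators. The function names (`fedavg`, `share_synthetic`, `gaussian_generator`), the node labels, and the one-dimensional Gaussian samplers are hypothetical stand-ins for the variational autoencoders with Bayesian Gaussian mixture models mentioned in the abstract.

```python
import random
from typing import Callable, Dict, List, Sequence

# A generator maps (number of samples, RNG) to a list of synthetic records.
# Records are plain floats here purely for illustration.
Generator = Callable[[int, random.Random], List[float]]


def fedavg(node_weights: Sequence[Sequence[float]],
           node_sizes: Sequence[int]) -> List[float]:
    """Sample-size-weighted mean of per-node parameter vectors."""
    total = sum(node_sizes)
    dim = len(node_weights[0])
    return [
        sum(w[i] * n for w, n in zip(node_weights, node_sizes)) / total
        for i in range(dim)
    ]


def share_synthetic(local_data: Dict[str, List[float]],
                    generators: Dict[str, Generator],
                    n_shared: int,
                    rng: random.Random) -> Dict[str, List[float]]:
    """Augment each node's real data with synthetic samples from all other nodes."""
    augmented: Dict[str, List[float]] = {}
    for node, data in local_data.items():
        pooled = list(data)  # real local records come first
        for other, generate in generators.items():
            if other != node:
                pooled.extend(generate(n_shared, rng))
        augmented[node] = pooled
    return augmented


def gaussian_generator(mean: float, std: float) -> Generator:
    """Toy generator (assumption): stands in for a trained generative model."""
    def generate(n: int, rng: random.Random) -> List[float]:
        return [rng.gauss(mean, std) for _ in range(n)]
    return generate


# Hypothetical two-node setup: one data-rich node, one data-poor node
# with a shifted distribution (a simple non-IID case).
rng = random.Random(0)
local_data = {
    "rich": [rng.gauss(0.0, 1.0) for _ in range(200)],
    "poor": [rng.gauss(3.0, 1.0) for _ in range(10)],
}
generators = {
    "rich": gaussian_generator(0.0, 1.0),
    "poor": gaussian_generator(3.0, 1.0),
}

# Synthetic data sharing: each node would retrain on its augmented dataset.
augmented = share_synthetic(local_data, generators, n_shared=50, rng=rng)

# FedAvg: only model parameters are exchanged and averaged.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In this sketch the data-poor node ends up training on its 10 real records plus 50 synthetic records from the data-rich node, whereas under FedAvg it would only receive averaged parameters. How many synthetic records are exchanged and how often models are retrained are free choices here, not details taken from the paper.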


Keywords

Data heterogeneity; Data scarcity; Federated learning; Medical data; Synthetic data generation

Quality index

Bibliometric impact. Analysis of the contribution and dissemination channel

The work was published in the journal Big Data and Cognitive Computing, which, according to WoS (JCR), has become a reference in its field thanks to its progression and the good impact it has achieved in recent years. In the year of publication, 2025, the journal ranked 26/147 in the category Computer Science, Theory & Methods, placing it in Q1 (first quartile).

Regardless of the expected impact determined by the dissemination channel, it is important to highlight the actual observed impact of the contribution itself.

According to the different indexing agencies, the number of citations accumulated by this publication as of 2026-04-24 is:

WoS: 3
Scopus: 8


Impact and social visibility

From the perspective of influence or social adoption, and based on metrics associated with mentions and interactions provided by agencies specializing in calculating the so-called "Alternative or Social Metrics," we can highlight as of 2026-04-24:

Academic use, as evidenced by the Altmetric indicator for additions to the Mendeley personal reference manager, totals 28.
The use of this contribution in bookmarks, code forks, additions to favorite lists for recurrent reading, and general views indicates that someone is using the publication as a basis for their current work. This may be a notable indicator of future, more formal academic citations. This claim is supported by the "Capture" indicator, which yields a total of 28 (PlumX).

With a more dissemination-oriented intent and targeting more general audiences, we can observe other more global scores such as:

The Total Score from Altmetric: 4.
The number of mentions on the social network X (formerly Twitter): 1 (Altmetric).
The number of mentions on Wikipedia: 1 (Altmetric).

It is essential to present evidence supporting full alignment with institutional principles and guidelines on Open Science and the Conservation and Dissemination of Intellectual Heritage. A clear example of this is:

The work has been submitted to a journal whose editorial policy allows Open Access publication.
Assignment of a Handle/URN as an identifier within the deposit in the Institutional Repository: https://oa.upm.es/91970/

As a result of the publication of the work in the institutional repository, statistical usage data has been obtained that reflects its impact. In terms of dissemination, we can state that, as of

Views: 51
Downloads: 40


Leadership analysis of institutional authors

There is a significant leadership presence as some of the institution’s authors appear as the first or last signer, detailed as follows: First Author (ALONSO DE APELLANIZ, PATRICIA) and Last Author (ZAZO BELLO, SANTIAGO).

The corresponding author is ALONSO DE APELLANIZ, PATRICIA.


Awards linked to the item

This work was supported by the GenoMed4All and SYNTHEMA projects from the European Union's Horizon 2020 Research and Innovation Program under Grant 101017549 and Grant 101095530. However, the views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible.

