Will Synthetic Data Make a "Real" Impact in Healthcare?
The digitization of our healthcare system is long overdue, but there are important reasons why this industry has lagged behind others. One of the main obstacles to seamless digital communication between healthcare stakeholders is the highly sensitive nature of the information being shared. Strict regulations such as HIPAA rightfully limit the ease with which patient data can be accessed and exchanged, but this comes at a significant cost. Indeed, there is a tradeoff between healthcare data utility and compliance. With compliance taking first priority, healthcare providers and payers have sacrificed utility by continuing to rely on fax machines to exchange patient information.
This antiquated modus operandi, which works well enough today, needs to change if we want the healthcare industry to be a part of the AI revolution. Building AI models for healthcare, whether for drug development or clinical decision support, requires millions of data points for training. But large, useful datasets are difficult to come by when information is fragmented across disjointed data silos. While efforts are underway to make the healthcare system more interoperable, such as the creation of Health Information Exchanges that connect major hospital systems and the introduction of common exchange standards like FHIR in 2014, there are still stringent (and necessary) regulations on the sharing of health data that hamper progress.
However, a sub-field of AI may have found a clever workaround to this problem: synthetic data. Synthetic data refers to artificially generated data that imitates real datasets without containing any identifiable information from actual individuals. This data is created using algorithms and statistical models that replicate the statistical characteristics of the original dataset, such as the distribution of data points and the relationships between them. In practice, organizations will be able to upload their private datasets to a synthetic data engine and receive an entirely new dataset back. This new dataset is designed to preserve the statistical insights of the original one while protecting patient privacy, so that it can move freely between healthcare stakeholders and be used to train AI algorithms (1).
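To make the idea concrete, here is a deliberately simple sketch of the "fit a statistical model, then sample from it" approach. It is an illustration under stated assumptions, not how any particular synthetic data engine works: the "real" dataset is fabricated toy data (age and systolic blood pressure), the model is a plain two-variable Gaussian, and the function names fit_gaussian and sample_synthetic are hypothetical. Production tools typically rely on far richer generative models.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_synthetic(params, n, rng):
    """Draw n new rows from a bivariate Gaussian with the fitted parameters."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mixing z1 into y reproduces the fitted correlation between columns.
        y = mean_y + sd_y * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Toy stand-in for a private dataset (fabricated, not patient data):
# age in years and a systolic blood pressure that rises with age.
rng = random.Random(0)
real = []
for _ in range(500):
    age = rng.uniform(30, 80)
    real.append((age, 100 + 0.5 * age + rng.gauss(0, 8)))

params = fit_gaussian(real)
synthetic = sample_synthetic(params, 5000, random.Random(1))

print("real      mean age, mean bp, correlation:", params[0], params[1], params[4])
refit = fit_gaussian(synthetic)
print("synthetic mean age, mean bp, correlation:", refit[0], refit[1], refit[4])
```

The synthetic rows are new draws rather than copies of the originals, yet their averages and the age-to-blood-pressure relationship track the source data. Note that a Gaussian this simple discards everything it cannot represent (here, the flat shape of the age distribution), which is one reason accuracy is a concern for more complex healthcare data.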
Other industries have already started using synthetic data to train their AI models, especially for “edge cases” where real data is limited. For example, American Express has used synthetic data to train its AI models on unusual patterns of credit card fraud (2). Meanwhile, synthetic images have been used to train self-driving algorithms in extreme scenarios where real-world data is lacking (3). Fields such as marketing and cybersecurity have also begun to use synthetic datasets to test different advertising strategies and safety protocols.
Still, the adoption of synthetic data in healthcare remains low, and because this technology is so young, skepticism is justified. One of the limitations of synthetic datasets, especially those that are only partially synthetic, is the risk of data leakage, or the unintentional inclusion of sensitive or private information from the original dataset in the synthetic dataset (4). The main concern, however, is that synthetic data may not be accurate, or at least not accurate enough for the high-stakes and high-complexity nature of information in the healthcare system.
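One basic way to screen for the leakage described above is to check whether any synthetic record is a copy, or a near copy, of a real one. The sketch below is a minimal illustration on fabricated toy records (age, systolic blood pressure); the function names exact_copies and distance_to_closest_record are hypothetical, and passing a check like this does not by itself show that a dataset is private.

```python
import math


def exact_copies(real, synthetic, ndigits=1):
    """Return synthetic rows that match some real row after rounding."""
    real_set = {tuple(round(v, ndigits) for v in row) for row in real}
    return [
        row for row in synthetic
        if tuple(round(v, ndigits) for v in row) in real_set
    ]


def distance_to_closest_record(real, synthetic):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


# Fabricated toy records: (age, systolic blood pressure).
real = [(34, 118.0), (51, 131.5), (67, 142.0)]
synthetic = [(35.2, 119.1), (51, 131.5), (70.3, 150.2)]

leaked = exact_copies(real, synthetic)
distances = distance_to_closest_record(real, synthetic)

print("rows copied from the real data:", leaked)
print("distance to closest real record:", distances)
```

A distance of zero flags a synthetic row that reproduces a real patient record outright, while very small distances deserve a closer look. In practice the columns would need to be put on a comparable scale first, since raw Euclidean distance mixes units such as years and mmHg.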
As such, the adoption of synthetic data in healthcare will likely vary significantly by use case (5). For example, a lower-risk scenario that could see quicker adoption might be a hospital sharing synthetic data derived from its patient records with an insurance company to make risk assessments and set up preventive healthcare programs. On the other hand, more validation in the field may be needed before synthetic data can be used to train algorithms that predict the intricate effects of a procedure on disease progression.
In conclusion, synthetic data is a novel solution to the challenges of data privacy and accessibility within the healthcare system and holds the potential to drive impactful advancements in patient care, research, and data-driven decision-making. However, rigorous validation is still needed before it can be deployed for our most complex problems.