Synthetic data could be better than real data

Illustration of a woman reassembling images that she has shredded. Credit: Janelle Barone

When more than 155,000 students from all over the world signed up to take free online classes in electronics in 2012, offered through the fledgling US provider edX, they set in motion an explosion in the popularity of online courses.

The edX platform, created by the Massachusetts Institute of Technology (MIT) and Harvard University, both in Cambridge, Massachusetts, was not the first attempt at teaching classes online — but the number of participants it attracted was unusual. The activity created a massive amount of information on how people interact with online education, and presented researchers with an opportunity to garner answers to questions such as ‘What might encourage people to complete courses?’, and ‘What might give them a reason to drop out?’.

“We had a tonne of data,” says Kalyan Veeramachaneni, a data scientist at MIT’s Laboratory for Information and Decision Systems. Although the university had long dealt with large data sets generated by others, “that was the first time that MIT had big data in its own backyard”, says Veeramachaneni.

Hoping to take advantage, Veeramachaneni assigned 20 MIT students to run analyses of the information. But he soon ran into a roadblock: legally, the data had to be private. This wealth of information was held on a single computer in his laboratory, with no connection to the Internet to prevent hacking. The researchers had to schedule a time to use it. “It was a nightmare,” Veeramachaneni says. “I just couldn’t get the work done because the barrier to the data was very high.”

His solution, eventually, was to create synthetic students — computer-generated versions of edX participants that shared characteristics with real students using the platform, but that did not give away private details. The team then applied machine-learning algorithms to the synthetic students’ activity, and in doing so discovered several factors associated with a person failing to complete a course1. For instance, students who tended to submit assignments right on a deadline were more likely to drop out. Other groups took the findings of this analysis and used them to design interventions that helped real people complete future courses2.

This experience of building and using a synthetic data set led Veeramachaneni and his colleagues to create the Synthetic Data Vault, a set of open-source software that allows users to model their own data and then use those models to generate alternative versions of the data3. In 2020, he co-founded a company called DataCebo, based in Boston, Massachusetts, which helps other companies to do this.

The desire to preserve privacy is one of the driving forces behind synthetic-data research. Because artificial intelligence (AI) and machine learning have expanded rapidly, finding their way into areas as diverse as health care, art and financial analysis, concerns about the data used to train the systems are also growing. To learn, these algorithms must consume vast amounts of information — much of which relates to individuals. A system could reveal private details, or be used to discriminate against people when making decisions on hiring, lending or housing, for example. The data fed to these machines might also be owned by an individual or company that does not want the information to be used to create a tool that might then compete with them — or at least, might not want to give the data away for free.

Some researchers think that the answer to these concerns could lie in synthetic data. Getting computers to manufacture data that is close enough to the real thing without recycling real information could help to address privacy problems. But it could also do much more. “I want to move away from just privacy,” says Mihaela van der Schaar, a machine-learning researcher and director of the UK Cambridge Centre for AI in Medicine. “I hope that synthetic data could help us create better data.”

All data sets come with issues that go beyond privacy considerations. They can be expensive to produce and maintain. In some cases — for example, trying to diagnose a rare medical condition using imaging — there simply might not be enough real-world data available to train a system to do the task reliably. Bias is also a problem — both social biases, which might cause systems to favour one group of people over another, and subtler issues such as a training set of photos that includes only a handful taken at night. Synthetic data, its proponents say, can get around these problems by adding absent information to data sets faster and more cheaply than gathering it from the real world, assuming the real thing could be gathered at all.

“To me, it’s about making data this living, controllable object that you can change towards your application and your goals,” says Phillip Isola, a computer scientist at MIT who specializes in machine vision. “It’s a fundamental new way of working with data.”

The same on the inside

There are several ways to synthesize data, but they all rely on the same concept. A computer, using a machine-learning algorithm or a neural network, analyses a real data set and learns about the statistical relationships within it. It then creates a new data set containing different data points from the original, but retaining the same relationships. A familiar example is ChatGPT, the text-generation engine. ChatGPT is based on a large language model, the Generative Pre-trained Transformer, which pored over billions of examples of text written by humans, analysed the relationships between the words and built a model of how they fit together. When given a prompt — ‘Write me an ode to ducks’ — ChatGPT takes what it has learnt about odes and ducks and produces a string of words, with each word choice informed by the statistical probability of it following the words before it:

“Oh ducks, feathered and free,

Paddling in ponds with such glee,

Your quacks and waddles are a delight,

A joy to behold, day or night.”
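That word-by-word process can be sketched, in radically simplified form, as a bigram model: a table of which words have been observed to follow which. The twelve-word corpus below is invented for illustration, and real language models condition on far more than the single previous word.

```python
import random
from collections import defaultdict

# A toy corpus standing in for the billions of examples a large model learns from.
corpus = "the duck swims in the pond and the duck quacks in the pond".split()

# Learn the statistical relationships: which words have followed which.
followers = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev].append(nxt)

def generate(start, length, seed=0):
    """Emit words one at a time, each drawn from those seen after the previous word."""
    random.seed(seed)
    words = [start]
    for _ in range(length - 1):
        options = followers.get(words[-1])
        if not options:
            break
        words.append(random.choice(options))
    return " ".join(words)

print(generate("the", 8))
```

Swap the toy corpus for billions of documents, and the one-word memory for a deep neural network that weighs all of the preceding text, and the same idea underlies the ode above.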

With the right training, machines can produce not only text but also images, audio or the rows and columns of tabular data. The question is, how accurate is the output? “That’s one of the challenges in synthetic data,” says Thomas Strohmer, a mathematician who directs the Center for Data Science and Artificial Intelligence Research at the University of California, Davis (UC Davis).

Jason Adams, Thomas Strohmer and Rachael Callcut (left to right) are part of the synthetic-data research team at UC Davis Health.

“You first have to figure out what you mean by accuracy,” he says. To be useful, a synthetic data set must retain the aspects of the original that are relevant to the outcome — the all-important statistical relationships. But AI has accomplished many of its impressive feats by identifying patterns in data that are too subtle for humans to notice. If humans could easily identify the relationships in medical data that suggest someone is at risk of a disease, there would be no need for a machine to find those relationships in the first place, Strohmer says.

This catch-22 means that the clearest way to know whether a synthetic data set has captured the important nuances of the original is to see if an AI system trained on the synthetic data makes similarly accurate predictions to a system trained on the original. The more capable the machine, the harder it is for humans to distinguish the real from the fake. AI-generated images and text are already at the point where they seem realistic to most people, and the technology is advancing rapidly. “We’re getting close to the level where, even to the expert, the imagery looks correct, but it still might not be correct,” Isola says. It is therefore important that users treat synthetic data with some caution, and don’t lose sight of the fact that it isn’t real data, he says. “It still might be misleading.”
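That comparison, sometimes called ‘train on synthetic, test on real’, can be sketched with toy data. The two Gaussian groups and the nearest-mean classifier below are illustrative assumptions, not anyone’s actual pipeline.

```python
import random
import statistics

random.seed(7)

# Hypothetical "real" data: two groups of 1-D measurements from different Gaussians.
real_a = [random.gauss(0.0, 1.0) for _ in range(300)]
real_b = [random.gauss(3.0, 1.0) for _ in range(300)]

# Synthesize each group by learning its mean and spread, then sampling afresh.
def synthesize(data, n):
    mu, sigma = statistics.mean(data), statistics.stdev(data)
    return [random.gauss(mu, sigma) for _ in range(n)]

synth_a, synth_b = synthesize(real_a, 300), synthesize(real_b, 300)

# A minimal classifier: label a point by the nearer of the two training means.
def accuracy(train_a, train_b, test_a, test_b):
    ma, mb = statistics.mean(train_a), statistics.mean(train_b)
    predict_b = lambda x: abs(x - ma) > abs(x - mb)
    correct = sum(not predict_b(x) for x in test_a) + sum(predict_b(x) for x in test_b)
    return correct / (len(test_a) + len(test_b))

# Held-out real data is the final test set for both systems.
test_a = [random.gauss(0.0, 1.0) for _ in range(200)]
test_b = [random.gauss(3.0, 1.0) for _ in range(200)]

acc_real = accuracy(real_a, real_b, test_a, test_b)
acc_synth = accuracy(synth_a, synth_b, test_a, test_b)
print(f"trained on real: {acc_real:.2f}, trained on synthetic: {acc_synth:.2f}")
```

If the two accuracies are close, the synthetic set has preserved the relationship that matters for this task; a large gap signals that something important was lost in synthesis.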

Development headaches

Last April, Strohmer and two of his colleagues at UC Davis Health in Sacramento, California, won a four-year, US$1.2-million grant from the US National Institutes of Health to work out ways to generate high-quality synthetic data that could help physicians to predict, diagnose and treat diseases. As part of the project, Strohmer is developing mathematical methods of proving just how accurate synthetic data sets are.

He also wants to include a mathematical guarantee of privacy, especially given the stringent medical-privacy laws in force around the world, such as the Health Insurance Portability and Accountability Act in the United States and the European Union’s General Data Protection Regulation. The difficulty is that the utility and privacy of data are in tension; increasing one means decreasing the other.

To increase privacy in data, scientists add statistical noise to a data set. If, for instance, one of the data points collected is a person’s age, they throw in some random ages to make individuals less identifiable. It’s easier to pinpoint a 45-year-old man with diabetes than a person with diabetes who might be 38, or 51, or 62. But, if the age of diabetes onset is one of the factors being studied, this privacy-protecting measure will lead to less accurate results.
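A minimal sketch of that trade-off, using Laplace noise built from Python’s standard library. The patient ages and noise scales here are made up for illustration.

```python
import random
import statistics

random.seed(1)

# Hypothetical ages at diabetes onset for six patients.
true_ages = [45, 38, 51, 62, 47, 55]

def laplace_noise(scale):
    # A Laplace draw from the stdlib: exponentially distributed magnitude, random sign.
    return random.choice([-1, 1]) * random.expovariate(1 / scale)

def privatise(ages, scale):
    """Perturb each age; a larger scale hides individuals better
    but distorts the statistics more."""
    return [age + laplace_noise(scale) for age in ages]

low_noise = privatise(true_ages, scale=1)    # more utility, less privacy
high_noise = privatise(true_ages, scale=20)  # more privacy, less utility

print("true mean age:", round(statistics.mean(true_ages), 1))
print("low-noise mean:", round(statistics.mean(low_noise), 1))
print("high-noise mean:", round(statistics.mean(high_noise), 1))
```

At the small scale the group statistics survive almost intact; at the large scale individual ages are well hidden, but so is any age-related signal a researcher might be studying.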

Part of the difficulty of guaranteeing privacy is that scientists are not completely sure how synthetic data reveals private information or how to measure how much it reveals, says Florimond Houssiau, a computer scientist at the Alan Turing Institute in London. One way in which secrets could be spilled is if the synthetic data are too similar to the original data. In a data set that contains many pieces of information associated with an individual, it can be hard to grasp the statistical relationships. In this case, the system generating the synthetic version is more likely to replicate what it sees rather than make up something entirely new. “Privacy is not actually that well understood,” Houssiau says. Scientists can assign a numerical value to the privacy level of a data set, but “we don’t exactly know which values should be considered safe or not. And so it’s difficult to do that in a way that everyone would agree on”.

The varied nature of medical data sets also makes generating synthetic versions of them challenging. They might include notes written by physicians, X-rays, temperature measurements, blood-test results and more. A medical professional with years of training and experience might be able to put those factors together and come up with a diagnosis. Machines, so far, cannot. “We just don’t know enough, in terms of machine learning, to extract information from different modalities,” Strohmer says. That’s a problem for analysis tools, but it’s also a problem for machines tasked with creating synthetic data sets that retain the all-important relationships. “We don’t understand yet how to automatically detect these relationships,” he says.

There are also fundamental theoretical limits to how much improvement data can undergo, says Isola. Information theory contains a principle called the data-processing inequality, which states that processing data can only reduce the amount of information available, not add to it4. And all synthetic data must have real data at its root, so all the problems with real data — privacy, bias, expense and more — still exist at the start of the pipeline. “You’re not getting something for free — you’re still ultimately learning from the world, from data. You’re just reformatting that into an easier-to-work-with format that you can control better,” Isola says. With synthetic data, “data comes in and a better version of the data comes out”.
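In formal terms, if synthetic data $Z$ are generated purely from real data $Y$, which in turn were measured from the world $X$, the three form a Markov chain, and the data-processing inequality bounds what the synthetic data can tell us:

```latex
X \rightarrow Y \rightarrow Z
\quad \Longrightarrow \quad
I(X; Z) \le I(X; Y)
```

where $I(\cdot\,;\cdot)$ denotes mutual information: the synthetic data can carry no more information about the world than the real data from which they were derived.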

Into the world

Although synthetic data in medicine haven’t yet made their way into clinical use, there are some areas where such data sets have taken off. They are being widely used in finance, Strohmer says, with many companies springing up to help financial institutions create new data that protect privacy. Part of the reason for this difference might be that the stakes are lower in finance than in medicine. “If in finance you get it wrong, it still hurts, but it doesn’t lead to death, so they can push things a little bit faster than in the medical field,” Strohmer says.

In 2021, the US Census Bureau announced that it was looking at creating synthetic data to enhance the privacy of people who respond to its annual American Community Survey, which provides detailed information about households in subsections of the country. Some researchers have objected, however, on the grounds that the move could undermine the data’s usefulness. In February, Administrative Data Research UK, a partnership that enables the sharing of public-sector data, announced a grant to study the value of synthetic versions of data sets that have been created by the Office for National Statistics and the UK Data Service.

Some people are also using synthetic data to test software that they hope to eventually run on real data that they do not yet have access to, says Andrew Elliott, a statistician at the University of Glasgow, UK. These fake data have to look something like the real data, but they can be meaningless, because they exist only to test the code. A scientist who is granted limited access to a sensitive data set can perfect their code on synthetic data first, rather than wasting precious time once they get hold of the real thing.
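That use case can be sketched in a few lines: generate placeholder records that match the real data’s shape, so an analysis pipeline can be debugged end-to-end before access to the sensitive set is granted. The field names and value ranges below are hypothetical.

```python
import csv
import io
import random

random.seed(0)

# Hypothetical schema mirroring a sensitive data set's shape, with meaningless values.
def fake_record(i):
    return {
        "patient_id": f"P{i:04d}",
        "age": random.randint(18, 90),
        "diagnosis_code": random.choice(["A01", "B22", "C34"]),
    }

records = [fake_record(i) for i in range(100)]

# Write to an in-memory CSV so that file-parsing code can be exercised too.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["patient_id", "age", "diagnosis_code"])
writer.writeheader()
writer.writerows(records)

# The pipeline under test can now run on worthless but correctly shaped data.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(len(rows), "placeholder records ready for testing")
```

Because the values are random, nothing learned from them matters; only the schema does, which is exactly what the code under test depends on.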

For now, synthetic data are a relatively niche pursuit. Van der Schaar thinks that more people should be talking about synthetic data and their potential impact — and not just scientists. “It’s important that not only computer scientists understand, but also the general public,” she says. “People need to wrap their heads around this technology because it could affect everyone.”

The issues around synthetic data raise not only interesting research questions for scientists, but also important questions for society at large, Strohmer says. “Data privacy is so important in the age of surveillance capitalism,” he says. Creating good synthetic data that both preserve privacy and reflect diversity, and that are made widely available, has the potential not just to improve the performance of AI and expand its uses, but also to help democratize AI research. “A lot of data is owned by a few big companies, and that creates an imbalance. Synthetic data could help to re-establish this balance a little bit,” Strohmer says. “I think that’s an important, bigger goal behind synthetic data.”
