Some research institutions in Canada, the US, and Italy are using AI-generated synthetic medical data that mimics the statistical patterns of real patient records without containing any actual patient information. Because the data does not derive from identifiable human subjects, and because it offers potential privacy benefits, these institutions are able to bypass traditional ethics review processes.
Synthetic data generated by AI can aid medical research and improve healthcare, especially in domains where real patient data is scarce, but concerns about privacy, validation, and ethical oversight must be addressed before such data can be used reliably and safely.
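As a toy illustration of the idea, the sketch below fits a simple Gaussian model to a stand-in patient cohort and samples fresh records from it; the column names and numbers are invented for the example, and real research pipelines use far more sophisticated generative models.

```python
# Toy sketch: fit a distribution to "real" patient records, then sample
# synthetic rows that preserve aggregate statistics without copying anyone.
import numpy as np

rng = np.random.default_rng(7)

# Stand-in "real" cohort (invented numbers): columns = age, systolic BP,
# total cholesterol.
real = rng.multivariate_normal(
    mean=[55, 130, 200],
    cov=[[120, 40, 30], [40, 180, 60], [30, 60, 900]],
    size=1000,
)

# Fit: estimate mean and covariance from the real cohort.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample: synthetic patients drawn from the fitted model.
synthetic = rng.multivariate_normal(mu, cov, size=1000)
print("real means:     ", np.round(real.mean(axis=0), 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))
```

The synthetic rows track the cohort's aggregate statistics while no row corresponds to a real individual, which is the property that motivates the lighter ethics treatment.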
Research indicates that AI models can pass hidden "subliminal" signals to one another through training data, transmitting traits such as a preference for violence even when the data appears benign to human reviewers. This phenomenon, called subliminal learning, poses significant risks for AI safety and for training on synthetic data, as it may be impossible to fully prevent harmful patterns from transferring between models.
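The mechanism can be illustrated in miniature. The PyTorch sketch below is a simplified toy in the spirit of the research, not the study's actual setup: a "teacher" network is fine-tuned to exhibit a trait, and a "student" that shares the teacher's initialization is distilled only on the teacher's outputs for random noise, yet it still inherits the trait.

```python
# Toy sketch of subliminal-style transfer: the student never sees the trait
# targets, only the teacher's outputs on noise, and still picks up the trait.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp():
    return nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

base = make_mlp()                         # shared initialization
teacher = make_mlp(); teacher.load_state_dict(base.state_dict())
student = make_mlp(); student.load_state_dict(base.state_dict())

# "Trait": fine-tune the teacher to compute a fixed linear target.
w_trait = torch.randn(10, 1)
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for _ in range(2000):
    x = torch.randn(256, 10)
    loss = ((teacher(x) - x @ w_trait) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Distillation: the student trains only on the teacher's outputs for
# uniform noise that looks meaningless to an observer.
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(2000):
    x = (torch.rand(256, 10) - 0.5) * 6.0   # noise in [-3, 3]^10
    with torch.no_grad():
        target = teacher(x)
    loss = ((student(x) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# The student's error on the trait task drops far below the untouched
# base model's, even though the trait never appeared in its data.
with torch.no_grad():
    x = torch.randn(4096, 10)
    y = x @ w_trait
    print("base error:   ", ((base(x) - y) ** 2).mean().item())
    print("student error:", ((student(x) - y) ** 2).mean().item())
```

The point of the toy is that the distillation data itself looks like noise, yet it carries the teacher's behavioral signature into the student.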
Microsoft has introduced Phi-4, the latest in its Phi series of generative AI models, available for limited research use on the Azure AI Foundry platform. The 14-billion-parameter model excels at math problem-solving, which Microsoft attributes to improved training data quality, including high-quality synthetic datasets. Phi-4 competes with other small models such as GPT-4o mini and Claude 3.5 Haiku, which run faster and at lower cost than their full-size counterparts. The launch follows the departure of key developer Sébastien Bubeck to OpenAI.
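For readers who want to experiment, here is a minimal sketch of prompting the model with Hugging Face transformers; it assumes the weights are published under the microsoft/phi-4 repository id, which may not match the Azure-only availability described at launch.

```python
# Minimal sketch of prompting Phi-4 via Hugging Face transformers.
# Assumes the checkpoint is available as "microsoft/phi-4"; at launch,
# access was limited to Azure AI Foundry, so the repo id may differ.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/phi-4",
    device_map="auto",   # requires the `accelerate` package
)
messages = [{"role": "user", "content": "If 3x + 5 = 20, what is x?"}]
out = generator(messages, max_new_tokens=128)
# For chat-style input, generated_text holds the full conversation,
# with the assistant's reply as the final message.
print(out[0]["generated_text"][-1]["content"])
```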
OpenAI is reportedly facing a slowdown in the improvement of its AI models, with its upcoming model, codenamed Orion, showing a smaller gain over GPT-4 than earlier generational leaps delivered. To address this, OpenAI has formed a foundations team to explore new strategies, including using synthetic data for training and enhancing models after training. Despite these efforts, Orion may not outperform existing models in certain areas, such as coding. OpenAI has not confirmed plans to release Orion this year.
As earnings season approaches, skepticism about the returns on AI technologies is growing, driven by the immense costs involved and the limitations of relying on synthetic data to train AI models. Tech companies are investing heavily in hardware and infrastructure to reduce their dependence on outside suppliers of AI chips, but that spending, combined with warnings about data and resource constraints, brings them closer to having to prove that their investments in an AI-led future can turn a profit.
Tech companies like OpenAI and Google are exploring the use of synthetic data, generated by artificial intelligence, to train their AI models as they confront copyright disputes and a potential scarcity of training data. The approach remains experimental, however: models trained on their own output can introduce biases and inaccuracies, creating a feedback loop that amplifies flaws in the training process.
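The flaw-amplification risk is easy to demonstrate in miniature. The simulation below is an assumed toy setup, not any company's pipeline: each "generation" fits a Gaussian to its training data, and the next generation trains only on samples from that fit, so estimation errors compound and the learned distribution drifts away from the original data.

```python
# Toy simulation of recursive training on model-generated data:
# each generation fits a Gaussian to samples from the previous fit.
import numpy as np

rng = np.random.default_rng(42)

mu, sigma = 0.0, 1.0                     # the true data distribution
data = rng.normal(mu, sigma, size=500)   # real data, generation 0

for gen in range(10):
    # Fit a model to whatever data this generation sees.
    mu_hat, sigma_hat = data.mean(), data.std(ddof=1)
    print(f"gen {gen}: mu={mu_hat:+.3f}, sigma={sigma_hat:.3f}")
    # The next generation trains only on synthetic samples from this fit.
    data = rng.normal(mu_hat, sigma_hat, size=500)
```

Running this, the fitted mean wanders and the fitted spread tends to degrade over generations, a small-scale version of the bias amplification the article describes.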
AI companies are facing a shortage of training data as they build ever-larger models, prompting them to explore alternative sources such as publicly available video transcripts and synthetic data. Some companies are pursuing controversial approaches, such as training on transcriptions of public YouTube videos, while others are working to produce higher-quality synthetic data. Concerns have been raised that AI could run out of data, though researchers believe breakthroughs could address the issue; the solution may also involve reevaluating the pursuit of ever-larger models in light of environmental and resource concerns.
Researchers have developed AlphaGeometry, a neuro-symbolic theorem prover trained on synthetic data to solve olympiad-level geometry problems. By generating 100 million synthetic theorems and their proofs, AlphaGeometry outperforms the previous state of the art in geometry theorem proving and approaches the performance of an average International Mathematical Olympiad (IMO) gold medallist. The method combines a language model with specialized symbolic engines to produce human-readable proofs, solving 25 of 30 problems on a test set of classical geometry problems. The synthetic data generation process also rediscovers known theorems and lemmas, demonstrating the potential of this approach in theorem proving.
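The system's core loop alternates symbolic deduction with neural suggestion. The sketch below is schematic: the object names (`symbolic_engine`, `language_model`, and their methods) are illustrative stand-ins, not the released code's API.

```python
# Schematic sketch of a neuro-symbolic proving loop in the style of
# AlphaGeometry: exhaust symbolic deduction, and when it stalls, let a
# language model trained on synthetic proofs propose an auxiliary
# construction (e.g., a new point) that unlocks further deductions.

def solve(premises, goal, language_model, symbolic_engine, max_steps=16):
    state = set(premises)
    for _ in range(max_steps):
        # 1. Symbolic step: derive every fact reachable by deduction rules.
        state |= symbolic_engine.deduce_closure(state)
        if goal in state:
            # Trace the deduction chain back into a human-readable proof.
            return symbolic_engine.extract_proof(state, goal)
        # 2. Neural step: propose an auxiliary construction conditioned
        #    on the current problem state and the goal.
        construction = language_model.propose_construction(state, goal)
        if construction is None:
            break  # no promising construction; give up
        state.add(construction)
    return None  # unsolved within the step budget
```

The division of labor is the key design choice: the symbolic engine guarantees soundness, while the language model supplies the creative constructions that pure deduction cannot find.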
Researchers are increasingly turning to synthetic data to supplement, or even replace, natural data for training neural networks. Synthetic data is proving especially useful for facial recognition, where systems are typically trained on huge libraries of images of real faces, raising concerns about privacy and bias. Microsoft has released a collection of 100,000 synthetic faces for training AI systems, generated from a base set of 500 people who gave permission for their faces to be scanned. Because the faces are computer-generated, every part of every face can be labeled exactly, which helps the neural network learn faster.
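The labeling advantage is concrete: because each face is generated, dense ground truth comes for free. The sketch below uses a toy segmentation head and stand-in tensors, not Microsoft's pipeline, to show a training step that supervises every pixel of a rendered face.

```python
# Toy sketch of training on synthetic faces with exact per-pixel labels.
import torch
import torch.nn as nn

NUM_CLASSES = 19  # e.g., face-parsing classes: skin, brows, eyes, hair, ...

# A toy segmentation head; real systems use a full encoder-decoder.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, NUM_CLASSES, 1),
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in batch: in practice the rendering pipeline emits both the image
# and an exact label for every pixel, with no human annotation needed.
images = torch.rand(4, 3, 64, 64)                    # rendered faces
labels = torch.randint(0, NUM_CLASSES, (4, 64, 64))  # per-pixel classes

logits = model(images)          # (N, C, H, W)
loss = loss_fn(logits, labels)  # dense supervision on every pixel
opt.zero_grad(); loss.backward(); opt.step()
print(f"training loss: {loss.item():.3f}")
```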