(Press-News.org) What can exploding stars teach us about how blood flows through an artery? Or swimming bacteria about how the ocean’s layers mix? A collaboration of researchers from universities, science philanthropies and national laboratories has reached an important milestone toward training artificial intelligence models to find and exploit transferable knowledge between seemingly disparate fields to drive scientific discovery.
This initiative, called Polymathic AI, uses technology similar to that powering large language models such as OpenAI’s ChatGPT or Google’s Gemini. But instead of ingesting text, the project’s models learn using scientific datasets from across astrophysics, biology, acoustics, chemistry, fluid dynamics and more, essentially giving the models cross-disciplinary scientific knowledge.
“These groundbreaking datasets are by far the most diverse large-scale collections of high-quality data for machine learning training ever assembled for these fields,” says Polymathic AI member Michael McCabe, a research engineer at the Flatiron Institute in New York City. “Curating these datasets is a critical step in creating multidisciplinary AI models that will enable new discoveries about our universe.”
Today, the Polymathic AI team released two of its open-source training dataset collections to the public — a colossal 115 terabytes in total from dozens of sources — for the scientific community to use to train AI models and enable new scientific discoveries. (For comparison, GPT-3 used 45 terabytes of uncompressed, unformatted text for training, which ended up being around 0.5 terabytes after filtering.)
“The freely available datasets are an unprecedented resource for developing sophisticated machine learning models that can then tackle a wide range of scientific problems,” says Polymathic AI member Ruben Ohana, a research fellow at the Flatiron Institute’s Center for Computational Mathematics (CCM). “The machine learning community has always been open-sourced; that’s why it’s been so fast-paced compared to other fields. We feel that sharing this data open source will benefit the machine learning and scientific communities. It’s a win-win situation — you have machine learning that can develop new models, and at the same time, scientific communities can see what machine learning can do for them.”
The full datasets are available to download for free from the Flatiron Institute and are accessible on Hugging Face, a platform hosting AI models and datasets. The Polymathic AI team provides further information about the datasets in two papers accepted for presentation at the leading machine learning conference, NeurIPS, to be held in December in Vancouver, Canada.
“We’ve seen again and again that the most effective way to advance machine learning is to take difficult challenges and make them accessible to the wider research community,” says McCabe. “Each time a new benchmark is released, it initially seems like an insurmountable problem, but once a challenge is made accessible to the broader community, we see more and more people digging in and accelerating progress faster than any individual group could alone.”
The Polymathic AI project is run by researchers from the Simons Foundation and its Flatiron Institute, New York University, the University of Cambridge, Princeton University, the French Centre National de la Recherche Scientifique and the Lawrence Berkeley National Laboratory.
AI tools such as machine learning are increasingly common in scientific research, and work involving them was recognized in two of this year’s Nobel Prizes. Still, such tools are typically purpose-built for a specific application and trained using data from that field. The Polymathic AI project instead aims to develop models that are truly polymathic, like people whose expert knowledge spans multiple areas. The project’s team itself reflects intellectual diversity, with physicists, astrophysicists, mathematicians, computer scientists and neuroscientists.
The first of the two new training dataset collections focuses on astrophysics. Dubbed the Multimodal Universe, the dataset contains hundreds of millions of astronomical observations and measurements, such as portraits of galaxies taken by NASA’s James Webb Space Telescope and measurements of our galaxy’s stars made by the European Space Agency’s Gaia spacecraft.
“Machine learning has been happening for around 10 years in astrophysics, but it’s still very hard to use across instruments, across missions and across scientific disciplines,” says Polymathic AI research scientist Francois Lanusse. “Datasets like the Multimodal Universe are what will allow us to build models that natively understand all of these data and can be used as a Swiss Army knife for astrophysics.”
In total, the dataset clocks in at 100 terabytes and was a major undertaking. “Our work, from around a dozen institutes and two dozen researchers, paves a path for machine learning to become a core component of modern astronomy,” says Polymathic AI member Micah Bowles, a Schmidt AI in Science Fellow at the University of Oxford. “Assembling this dataset was only possible through a broad collaboration of not only the Polymathic AI team but many expert astronomers from around the world.”
The other collection — called the Well — comprises over 15 terabytes of data from 16 diverse datasets. These datasets contain numerical simulations of biological systems, fluid dynamics, acoustic scattering, supernova explosions and other complicated processes. While these diverse datasets may seem disconnected at first, they all require the modeling of mathematical equations called partial differential equations. Such equations pop up in problems related to everything from quantum mechanics to embryo development and can be incredibly difficult to solve, even for supercomputers. One of the goals of the Well is to enable AI models to churn out approximate solutions to these equations quickly and accurately.
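To make the idea of numerically simulating a partial differential equation concrete, here is a minimal sketch in Python. It is an illustration only, not code from the Well or from the Polymathic AI team: it solves the one-dimensional heat equation, u_t = alpha * u_xx, with a textbook explicit finite-difference scheme. The function names (heat_step, simulate), the grid size and the parameter values are all assumptions chosen for this example. Simulations in the Well are far larger and more complex, but they rest on the same principle of stepping a discretized equation forward in time.

```python
# Illustrative only: explicit finite-difference solver for the 1D heat
# equation u_t = alpha * u_xx, with both ends of the rod held at zero.
# All names and parameter values are assumptions made for this sketch.


def heat_step(u, alpha, dx, dt):
    """Advance the temperature profile u by one time step."""
    # The explicit scheme is stable only when r <= 0.5.
    r = alpha * dt / dx**2
    new_u = u[:]  # copy; the two endpoints are left unchanged
    for i in range(1, len(u) - 1):
        new_u[i] = u[i] + r * (u[i + 1] - 2 * u[i] + u[i - 1])
    return new_u


def simulate(u0, alpha=1.0, dx=0.1, dt=0.004, steps=100):
    """Run the solver for a number of steps and return the final profile."""
    u = list(u0)
    for _ in range(steps):
        u = heat_step(u, alpha, dx, dt)
    return u


if __name__ == "__main__":
    # A single hot spot in the middle of an otherwise cold rod.
    initial = [0.0] * 21
    initial[10] = 1.0
    final = simulate(initial)
    print(["%.3f" % v for v in final])
```

Even this toy problem needs many small time steps to remain stable, which hints at why full three-dimensional simulations are costly and why fast, approximate AI surrogates are attractive.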
“This dataset encompasses a diverse range of physics simulations designed to address key limitations of current machine [learning] models,” says Polymathic AI member Rudy Morel, a CCM research fellow. “We are eager to see models that perform well across all these scenarios, as it would be a significant step forward.”
Gathering the data for those datasets posed a challenge, says Ohana. The team collaborated with scientists to gather and create data for the project. “The creators of numerical simulations are sometimes skeptical of machine learning because of all the hype, but they’re curious about it and how it can benefit their research and accelerate scientific discovery,” he says.
The Polymathic AI team itself is now using the datasets to train AI models. In the coming months, they will deploy these models on various tasks to see how successful these well-rounded, well-trained AIs are at tackling complex scientific problems.
“Understanding how machine learning models generalize and interpolate across datasets from different physical systems is an exciting research challenge,” says Polymathic AI member Bruno Régaldo-Saint Blancard, a CCM research fellow.
The Polymathic AI team has begun training machine learning models using the datasets, and “the early results have been very exciting,” says Polymathic AI project lead Shirley Ho, a group leader at the Flatiron Institute’s Center for Computational Astrophysics. “I’m also looking forward to seeing what other AI scientists will do with these datasets. Just like the Protein Data Bank spawned AlphaFold, I’m excited to see what the Well and the Multimodal Universe will help create.” Ho will give a talk at the NeurIPS meeting highlighting the use and incredible potential of the work.
About the Flatiron Institute
The Flatiron Institute is the research division of the Simons Foundation. The institute's mission is to advance scientific research through computational methods, including data analysis, theory, modeling and simulation. The institute comprises centers devoted to problems in astrophysics, biology, mathematics, neuroscience and quantum physics.
END
New datasets will train AI models to think like scientists
The Polymathic AI team has released two massive datasets for training artificial intelligence models to tackle problems across scientific disciplines. The datasets include data from dozens of sources.
2024-12-02