(Press-News.org) CAMBRIDGE, MA – The ability to generate high-quality images quickly is crucial for producing realistic simulated environments that can be used to train self-driving cars to avoid unpredictable hazards, making them safer on real streets.
But the generative AI techniques increasingly being used to produce such images have drawbacks. One popular type of model, called a diffusion model, can create stunningly realistic images but is too slow and computationally intensive for many applications. On the other hand, the autoregressive models that power LLMs like ChatGPT are much faster, but they produce poorer-quality images that are often riddled with errors.
Researchers from MIT and NVIDIA developed a new approach that brings together the best of both methods. Their hybrid image-generation tool uses an autoregressive model to quickly capture the big picture and then a small diffusion model to refine the details of the image.
Their tool, known as HART (short for Hybrid Autoregressive Transformer), can generate images that match or exceed the quality of state-of-the-art diffusion models, and it does so about nine times faster.
The generation process consumes fewer computational resources than typical diffusion models, enabling HART to run locally on a commercial laptop or smartphone. A user only needs to enter one natural language prompt into the HART interface to generate an image.
HART could have a wide range of applications, such as helping researchers train robots to complete complex real-world tasks and aiding designers in producing striking scenes for video games.
“If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better. That is the basic idea with HART,” says Haotian Tang PhD ’25, co-lead author of a new paper on HART.
He is joined by co-lead author Yecheng Wu, an undergraduate student at Tsinghua University; senior author Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and a distinguished scientist of NVIDIA; as well as others at MIT, Tsinghua University, and NVIDIA. The research will be presented at the International Conference on Learning Representations.
The best of both worlds
Popular diffusion models, such as Stable Diffusion and DALL-E, are known to produce highly detailed images. These models generate images through an iterative process where they predict some amount of random noise on each pixel, subtract the noise, then repeat the process of predicting and “de-noising” multiple times until they generate a new image that is completely free of noise.
Because the diffusion model de-noises all pixels in an image at each step, and there may be 30 or more steps, the process is slow and computationally expensive. But because the model has multiple chances to correct details it got wrong, the images are high-quality.
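To make the cost argument concrete, here is a minimal sketch of an iterative de-noising loop. It is a toy, not HART or any real diffusion model: the function name `toy_denoise` is hypothetical, and the "noise predictor" is an assumption that simply reads the noise off a known target instead of running a trained network. It only illustrates that every step touches every pixel, so the work grows with the number of steps.

```python
import random

def toy_denoise(target, steps=30, seed=0):
    """Toy stand-in for diffusion sampling: every step updates every pixel."""
    rng = random.Random(seed)
    # Start from pure random noise, one value per pixel.
    image = [rng.gauss(0.0, 1.0) for _ in target]
    pixel_updates = 0
    for step in range(steps):
        remaining = steps - step
        # Hypothetical noise predictor. A real model is a trained network;
        # here we cheat and compute the noise from the known target.
        predicted_noise = [x - t for x, t in zip(image, target)]
        # Subtract a fraction of the predicted noise from every pixel.
        image = [x - n / remaining for x, n in zip(image, predicted_noise)]
        pixel_updates += len(image)
    return image, pixel_updates

target = [0.1 * i for i in range(16)]          # a 16-"pixel" image
_, updates_30 = toy_denoise(target, steps=30)
_, updates_8 = toy_denoise(target, steps=8)
print(updates_30, updates_8)                   # pixel updates at 30 vs. 8 steps
```

In this toy, cutting the step count from 30 to 8 cuts the per-pixel updates in the same proportion, which is the intuition behind running fewer de-noising steps.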
Autoregressive models, commonly used for predicting text, can generate images by predicting patches of an image sequentially, a few pixels at a time. They can’t go back and correct their mistakes, but the sequential prediction process is much faster than diffusion.
These models use representations known as tokens to make predictions. An autoregressive model utilizes an autoencoder to compress raw image pixels into discrete tokens as well as reconstruct the image from predicted tokens. While this boosts the model’s speed, the information loss that occurs during compression causes errors when the model generates a new image.
With HART, the researchers developed a hybrid approach that uses an autoregressive model to predict compressed, discrete image tokens, then a small diffusion model to predict residual tokens. Residual tokens compensate for the model’s information loss by capturing details left out by discrete tokens.
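The relationship between discrete tokens and residual tokens can be sketched in a few lines. This is an illustration under stated assumptions, not the researchers' implementation: the five-entry `CODEBOOK` and the helper names are hypothetical, the "image" is a short list of numbers, and the residual is computed directly from the original values, whereas HART uses a small diffusion model to predict it.

```python
CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical 5-entry codebook

def quantize(values):
    """Map each continuous value to the index of its nearest codebook entry."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - v))
        for v in values
    ]

def decode(tokens):
    """Turn discrete tokens back into (lossy) continuous values."""
    return [CODEBOOK[t] for t in tokens]

def residual(values, tokens):
    """Detail lost by quantization: original minus decoded value."""
    return [v - q for v, q in zip(values, decode(tokens))]

def reconstruct(tokens, residuals=None):
    """Decode tokens, optionally adding residuals to restore lost detail."""
    coarse = decode(tokens)
    if residuals is None:
        return coarse
    return [q + r for q, r in zip(coarse, residuals)]

values = [0.1, 0.3, 0.62, 0.9]
tokens = quantize(values)                      # discrete tokens
coarse = reconstruct(tokens)                   # lossy reconstruction
full = reconstruct(tokens, residual(values, tokens))
print(tokens, coarse, full)
```

The coarse reconstruction snaps each value to the codebook, while adding the residual recovers the original values up to floating-point rounding. That gap is the information the residual tokens are meant to capture.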
“We can achieve a huge boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, like edges of an object, or a person’s hair, eyes, or mouth. These are places where discrete tokens can make mistakes,” says Tang.
Because the diffusion model only predicts the remaining details after the autoregressive model has done its job, it can accomplish the task in eight steps, instead of the usual 30 or more a standard diffusion model requires to generate an entire image. This minimal overhead of the additional diffusion model allows HART to retain the speed advantage of the autoregressive model while significantly enhancing its ability to generate intricate image details.
“The diffusion model has an easier job to do, which leads to more efficiency,” he adds.
Outperforming larger models
During the development of HART, the researchers encountered challenges in effectively integrating the diffusion model to enhance the autoregressive model. They found that incorporating the diffusion model in the early stages of the autoregressive process resulted in an accumulation of errors. Their final design instead applies the diffusion model only as the last step, to predict residual tokens, which significantly improved generation quality.
Their method, which uses a combination of an autoregressive transformer model with 700 million parameters and a lightweight diffusion model with 37 million parameters, can generate images of the same quality as those created by a diffusion model with 2 billion parameters, but it does so about nine times faster. It uses about 31 percent less computation than state-of-the-art models.
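The size comparison can be checked with simple arithmetic. The sketch below uses only the three parameter counts quoted above; the constant and function names are hypothetical, and parameter count is only a rough proxy for size, not a measure of speed or computation.

```python
AR_PARAMS = 700_000_000           # autoregressive transformer (from the text)
DIFFUSION_PARAMS = 37_000_000     # lightweight diffusion model (from the text)
BASELINE_PARAMS = 2_000_000_000   # diffusion model of comparable quality

def hart_total_params():
    """Combined parameter count of the two HART components."""
    return AR_PARAMS + DIFFUSION_PARAMS

def fraction_of_baseline():
    """HART's combined size relative to the 2-billion-parameter model."""
    return hart_total_params() / BASELINE_PARAMS

def diffusion_share():
    """Share of HART's parameters that belong to the diffusion model."""
    return DIFFUSION_PARAMS / hart_total_params()

print(hart_total_params())               # 737000000
print(round(fraction_of_baseline(), 4))  # 0.3685
print(round(diffusion_share(), 4))       # 0.0502
```

By these figures, the two components together hold roughly 37 percent as many parameters as the larger diffusion model, and the diffusion component accounts for about 5 percent of HART's total.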
Moreover, because HART uses an autoregressive model, the same type of model that powers LLMs, to do the bulk of the work, it is better suited for integration with the new class of unified vision-language generative models. In the future, one could interact with a unified vision-language generative model, perhaps by asking it to show the intermediate steps required to assemble a piece of furniture.
“LLMs are a good interface for all sorts of models, like multimodal models and models that can reason. This is a way to push the intelligence to a new frontier. An efficient image-generation model would unlock a lot of possibilities,” he says.
In the future, the researchers want to go down this path and build vision-language models on top of the HART architecture. Since HART is scalable and generalizable to multiple modalities, they also want to apply it to video generation and audio prediction tasks.
###
This research was funded, in part, by the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program, and the National Science Foundation. The GPU infrastructure for training this model was donated by NVIDIA.
END
New AI tool generates high-quality images faster than state-of-the-art approaches
Researchers fuse the best of two popular methods to create an image generator that uses less energy and can run locally on a laptop or smartphone.
2025-03-20