
Harnessing large vision-language models

By Alistair Jones

SMU Office of Research – The terminology of artificial intelligence (AI) and its many acronyms can be confusing for a lay person, particularly as AI develops in sophistication.

Among the developments is deep learning – a machine learning technique that teaches computers to learn by example.

“Deep learning has brought many major changes to AI, especially in natural language processing (NLP) and computer vision, two sub areas of AI,” says Jing Jiang, a Professor of Computer Science at Singapore Management University (SMU).

“In my field, which is NLP, the solution approaches to many tasks have fundamentally changed due to the recent success of ChatGPT type of technologies, and deep learning is one of the key enabling factors to these technologies.”

ChatGPT is a prominent AI-powered chatbot that can generate human-like responses to text inputs in a conversational manner. Its outputs can include articles, reports and even song lyrics, though its attempt to 'write' a Nick Cave song was met with derision from the artist.

But ChatGPT continues to be improved. It is built on large-scale pre-trained language models (LLMs), which tech corporations and governments have been training on massive data sets. It applies what those models have learned through generative AI, which produces the aforementioned articles and reports.

“ChatGPT was not intentionally trained to perform all these tasks. Its ability to transfer its knowledge learned from other tasks to a new task is an example of zero-shot transfer,” Professor Jiang says.

ChatGPT and its ilk have paved the way for a new research project led by Professor Jiang, which was recently awarded a MOE Academic Research Fund Tier 2 grant. 

Professor Jiang is focusing on visual question answering (VisualQA), a technology that enables machines to answer questions based on visual data. The project aims to develop a new methodological framework to harness the power of large-scale pre-trained vision-language models (PT-VLMs).

Discovering skillsets

Will this project be another case of generative AI?

“For certain types of questions, the answers do not need to be generated but are rather selected from a set of candidate answers,” Professor Jiang says.

“For example, if a question is asking about the colour of the flowers in a picture, the answers can be chosen from a set of known colours.

“On the other hand, there are also some questions, especially those 'why' and 'how' questions, which may need answer generation because the answers to these questions are long sentences that cannot be directly chosen from any set of known answers. Therefore, in my project I will explore the use of existing generative language models for answer generation.”
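The selection case Professor Jiang describes can be pictured as a simple classification step. The following is a minimal sketch, not the project's method: the candidate list, the scores and the function names are all hypothetical, and in a real system the scores would come from a vision-language model that has seen both the image and the question.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_answer(candidates, scores):
    """Pick the candidate answer with the highest probability."""
    probs = softmax(scores)
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best], probs[best]


# Hypothetical closed set of answers for a "what colour" question.
COLOURS = ["red", "yellow", "blue", "white"]

# Hypothetical scores; a real model would compute these from the image and question.
model_scores = [0.3, 2.1, -0.5, 1.0]

answer, confidence = select_answer(COLOURS, model_scores)
print(answer, round(confidence, 3))
```

A 'why' or 'how' question has no such closed list to score, which is why those answers have to be generated word by word instead.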

The research team will identify the basic skills required by VisualQA and use a 'probing approach' to discover the 'skillsets' within the various pre-trained vision-language models. They will then design methods based on adapter modules, which are additional lightweight neural network layers, to augment pre-trained models with additional skills.
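To give a sense of what a lightweight adapter layer can look like, here is a minimal sketch of one common design, the bottleneck adapter with a residual connection. The article does not say which adapter design the team will use, so the class name, the layer sizes and the stand-in `frozen_layer` function are all assumptions made for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Small trainable layer added after a frozen pre-trained layer (illustrative)."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        # Down-projection squeezes the features into a small bottleneck.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                     for _ in range(bottleneck_size)]
        # Up-projection starts at zero, so the adapter initially changes nothing.
        self.up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

    def __call__(self, hidden):
        squeezed = [max(0.0, v) for v in matvec(self.down, hidden)]  # ReLU
        delta = matvec(self.up, squeezed)
        # Residual connection: original features plus the learned correction.
        return [h + d for h, d in zip(hidden, delta)]

    def num_parameters(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


def frozen_layer(hidden):
    """Hypothetical stand-in for one frozen layer of a pre-trained model."""
    return [2.0 * h for h in hidden]


adapter = BottleneckAdapter(hidden_size=8, bottleneck_size=2)
features = [0.5, -1.0, 0.25, 0.0, 1.5, -0.5, 2.0, 1.0]
output = adapter(frozen_layer(features))
print(adapter.num_parameters())
```

In this sketch only the adapter's weights would be trained, while the pre-trained layer stays untouched. That is what keeps the approach lightweight: the new skill is learned by a small number of extra parameters.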

Necessary skills would include object recognition and spatial reasoning. Another more elusive skill is ‘common sense’. Can an algorithm replicate how humans think and behave in a reasonable way?

“This may sound like a daunting task, but researchers have been looking into this direction for quite a while,” Professor Jiang says.

“There are already a few resources available that try to capture commonsense knowledge, such as ConceptNet. Increasingly, people also find that LLMs themselves can capture commonsense knowledge, which they probably learned from the tremendous amount of data they are trained on.”

Commercial interest

Research and investment in PT-VLMs have lagged behind language models. Professor Jiang sees a number of reasons.

“First of all, language data is much more dense than visual data in terms of the amount of information or knowledge captured. This means that when trained on the same size of data, a language model could learn more human knowledge from the data than a vision model,” she says.

“Second, most human knowledge is still captured in textual format rather than in visual format, giving much more available training data to language models than to vision models or vision-language models.

“Third, verbal communication (including typing) is probably the most convenient and efficient way for humans to interact with machines, which means industry players will also focus more on developing powerful language models as foundations for their end-user products such as search engines and chatbots.”

Commercial interest in vision-language models has been growing because of the many potential use cases.

“One example is multimodal chatbots, which can receive inputs from humans not only in the format of speech and text but also in visual representations such as images and videos. Microsoft’s new Bing is a multimodal chatbot,” Professor Jiang says.

“Another important use case is embodied AI, where AI models sit on robots that can move around to sense their surroundings and perform tasks for humans. Joint vision-language AI models would enable an embodied AI agent (the robot) to understand a human’s verbal requests in the context of its surroundings.”

Practical impacts

Existing large-scale PT-VLMs are still not powerful enough on their own to handle many VisualQA questions and provide correct or relevant answers.

“The approach we see that is promising currently is to combine the powers of different pre-trained models, such as a framework called Visual ChatGPT developed by Microsoft,” Professor Jiang says. “The Visual ChatGPT framework does not attempt to further enhance the abilities of a single AI model. Rather, it leverages the different abilities of different pre-trained AI models to jointly perform a task such as modifying an interior design image based on a user’s verbal requests (e.g., ‘Replace the glass side table beside the sofa chair with a wooden one of similar size’).

“Here we can use one AI model for visual object detection, another for spatial reasoning, and a third for image generation, for example. The challenge lies in the dynamic decomposition of the original complex task into several simpler subtasks and the selection of suitable pre-trained AI models for each subtask. Visual ChatGPT uses the ChatGPT model to perform the task decomposition and model selection, which I think is very smart.”
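The decomposition-and-selection idea can be sketched as a planner that breaks a request into subtasks and a registry that maps each subtask to a specialised model. This is only an illustration under stated assumptions: it is not the Visual ChatGPT code, every function and name below is hypothetical, the three 'models' are stubs that return fixed values, and the planner returns a fixed plan where the real framework would ask a language model to produce one.

```python
def detect_objects(state):
    """Stub for an object-detection model (returns fixed, made-up detections)."""
    state["objects"] = ["sofa chair", "glass side table"]
    return state


def locate_target(state):
    """Stub for a spatial-reasoning model: find the object the request refers to."""
    wanted = "glass side table"
    state["target"] = wanted if wanted in state["objects"] else None
    return state


def generate_image(state):
    """Stub for an image-generation model (describes the edit instead of drawing it)."""
    state["edited_image"] = f"image with {state['target']} replaced by wooden side table"
    return state


# Registry mapping each skill to the model that provides it.
MODEL_REGISTRY = {
    "object_detection": detect_objects,
    "spatial_reasoning": locate_target,
    "image_generation": generate_image,
}


def plan(request):
    """Stub planner: a real system would derive this list from the request."""
    return ["object_detection", "spatial_reasoning", "image_generation"]


def run(request):
    """Run each planned subtask in order, passing shared state along."""
    state = {"request": request}
    for skill in plan(request):
        state = MODEL_REGISTRY[skill](state)
    return state


result = run("Replace the glass side table beside the sofa chair with a wooden one")
print(result["edited_image"])
```

The hard part Professor Jiang points to sits inside `plan`: working out, for an arbitrary request, which subtasks are needed and which model should handle each one.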

And then there is the issue of where to source new training data to augment PT-VLMs.

“It could be through repurposing existing datasets or through annotating new datasets by crowdsourcing. Because the field is evolving very rapidly, we will need to stay agile and be open to new ideas,” Professor Jiang says.

A well-known thorny issue is social biases within data sets.

“It is not easy to mitigate these biases. The problem is also complicated [because] social biases are different in different societies and cultures. Nevertheless, companies developing large pre-trained models are proactively removing or reducing these biases through human interventions.”

Professor Jiang envisages practical impacts from her research project.

“I believe the research output from my project can be used to improve multimodal chatbots and embodied AI agents. These bots can be particularly useful for increasing productivity in sectors such as retail, hospitality, education and healthcare,” she says.

“I currently have another ongoing project that aims to build a virtual avatar to interact with people living with dementia. VisualQA technologies are an important component of such virtual avatars. For societies such as Singapore that are facing imminent ageing problems, these AI-powered social bots have many potential applications.”






