
Forget the needle, consider the haystack

Uncovering hidden structures in massive data collections

2013-12-02
(Press-News.org) Contact information: John Sullivan
js29@princeton.edu
609-258-4597
Princeton University, Engineering School

Advances in computer storage have created collections of data so huge that researchers often have trouble uncovering critical patterns in connections among individual items, making it difficult for them to realize fully the power of computing as a research tool.

Now, computer scientists at Princeton University have developed a method that offers a solution to this data overload. Using a mathematical technique that calculates the likelihood of a pattern repeating throughout a subset of data, the researchers have dramatically cut the time needed to find patterns in large collections of information such as social networks. The tool allows researchers to quickly identify the connections between seemingly disparate groups, such as theoretical physicists who study intermolecular forces and astrophysicists researching black holes.

"The data we are interested in are graphs of networks like friends on Facebook or lists of academic citations," said David Blei, an associate professor of computer science and co-author on the research, which was published Sept. 3 in the Proceedings of the National Academy of Sciences. "These are vast data sets and we want to apply sophisticated statistical models to them in order to understand various patterns."

Finding patterns in the connections among points of data can be critical for many applications. For example, checking citations to scientific papers can provide insights into the development of new fields of study or show overlap between different academic disciplines. Links between patents can map out groups that indicate new technological developments. And analysis of social networks can provide information about communities and allow predictions of future interests.

"The goal is to detect overlapping communities," Blei said. "The problem is that these data collections have gotten so big that the algorithms cannot solve the problem in a reasonable amount of time."

Currently, Blei said, many algorithms uncover hidden patterns by analyzing potential interactions between every pair of nodes (either connected or unconnected) in the entire data set; that becomes impractical for large amounts of data such as the collected citations of the U.S. Patent Office. Many are also limited to sorting data into single groups.
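A quick count shows why examining every pair of nodes becomes impractical. The sketch below only counts unordered pairs; the function name is illustrative, and the 3.7 million figure is the size of the patent collection described later in this release.

```python
def pair_count(num_nodes: int) -> int:
    """Number of unordered node pairs, connected or not: n * (n - 1) / 2."""
    return num_nodes * (num_nodes - 1) // 2


# 3.7 million nodes is the patent collection size cited in this release.
patent_pairs = pair_count(3_700_000)
print(f"{patent_pairs:,} pairs")  # 6,844,998,150,000 pairs
```

That is nearly 7 trillion pairs to consider in a single pass over the data, before any statistical model is fitted.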

"In most cases, nodes belong to multiple groups," said Prem Gopalan, a doctoral student in Blei's research group and lead author of the paper. "We want to be able to reflect that."

The research was supported by the Office of Naval Research, the National Science Foundation and the Alfred P. Sloan Foundation.

In very basic terms, the researchers approached the problem by dividing the analysis into two broad tasks. In one, they created an algorithm that quickly analyzes a subset of a large database. The algorithm calculates the likelihood that nodes belong to various groups in the database. In the second broad task, the researchers created an adjustable matrix that accepts the analysis of the subset and assigns "weights" to each data point reflecting the likelihood that it belongs to different groups.

Blei and Gopalan designed the sampling algorithm to refine its accuracy as it samples more subsets. At the same time, the continual input from the sampling to the weighted matrix refines the accuracy of the overall analysis.
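The interplay between subset analysis and the weighted matrix can be illustrated with a toy sketch. This is not the algorithm from the PNAS paper: the hidden `true_weights` table stands in for whatever a real model would infer from link data, and the function names, the sample size per step and the shrinking step size `(t + 1) ** -0.7` are all assumptions chosen for illustration.

```python
import random


def noisy_estimate(true_row, draws, rng):
    """Estimate one node's group proportions from a small random sample."""
    groups = range(len(true_row))
    sample = rng.choices(groups, weights=true_row, k=draws)
    return [sample.count(g) / draws for g in groups]


def refine_weights(true_weights, steps=2000, draws=20, seed=0):
    """Blend noisy subset estimates into a weight matrix, step by step."""
    rng = random.Random(seed)
    num_groups = len(true_weights[0])
    # Start with uniform weights: nothing is known about any node yet.
    weights = [[1.0 / num_groups] * num_groups for _ in true_weights]
    for t in range(1, steps + 1):
        rho = (t + 1) ** -0.7  # assumed schedule: later samples move weights less
        node = rng.randrange(len(true_weights))  # the "subset" is one node here
        estimate = noisy_estimate(true_weights[node], draws, rng)
        # Convex blend of the old weights and the new subset estimate.
        weights[node] = [
            (1 - rho) * w + rho * e for w, e in zip(weights[node], estimate)
        ]
    return weights


# Hypothetical hidden memberships for three nodes across three groups.
hidden = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.5, 0.5, 0.0]]
learned = refine_weights(hidden)
for row in learned:
    print([round(w, 2) for w in row])
```

Each row of the matrix stays a valid set of proportions because every update is a blend of two rows that each sum to one, and the rows drift toward the hidden memberships as more subsets are sampled.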

The math behind the work is complex. Essentially, the researchers used a technique called stochastic optimization, which is a method to determine a central pattern from a group of data that seem chaotic or, as mathematicians call it, "noisy." Blei likens it to finding your way from New York to Los Angeles by stopping random people and asking for directions — if you ask enough people, you will eventually find your way. The key is to know what question to ask and how to interpret the answers.

"With noisy measurements, you can still make good progress by doing it many times as long as the average gives you the correct result," he said.

In their PNAS article, the researchers describe how they used their method to discover patterns in the connections between patents. Using public data from the U.S. National Bureau of Economic Research, Gopalan and Blei analyzed connections to the 1976 patent "Process for producing porous products."

The patent, filed by Robert W. Gore (who several years earlier discovered the process that led to the creation of the waterproof fabric Gore-Tex), described a method for producing porous material from tetrafluoroethylene polymers. The researchers analyzed a data collection of 3.7 million nodes and found that connections between Gore's 1976 filing and other patents formed 39 distinct communities in the database.

The patent "has influenced the design of many everyday materials such as waterproof laminate, adhesives, printed circuit boards, insulated conductors, dental floss and strings of musical instruments," the researchers wrote.

In the past, researchers struggled to find nuggets of critical information in data. The new challenge is not finding the needle in the data haystack, but finding the hidden patterns in the hay.

"Take the data from the world, from what you observe, and then untangle it," Blei said. "What generated it? What are the hidden structures?"







