(Press-News.org) Big Data presents scientists with unfolding opportunities, including, for instance, the possibility of discovering heterogeneous characteristics in the population leading to the development of personalized treatments and highly individualized services. But ever-expanding data sets introduce new challenges in terms of statistical analysis, sampling bias, computational costs, noise accumulation, spurious correlations, and measurement errors.
The era of Big Data – marked by a Big Bang-like explosion of information about everything from patterns of use of the World Wide Web to individual genomes – is being propelled by massive amounts of very high-dimensional or unstructured data, continuously produced and stored at a decreasing cost.
"In genomics we have seen a dramatic drop in price for whole genome sequencing," state Jianqing Fan and Han Liu, scientists at Princeton University, and Fang Han at Johns Hopkins. "This is also true in other areas such as social media analysis, biomedical imaging, high-frequency finance, analysis of surveillance videos and retail sales," they point out in a paper titled "Challenges of Big Data analysis" published in the Beijing-based journal National Science Review.
With the quickening pace of data collection and analysis, they add, "scientific advances are becoming more and more data-driven and researchers will more and more think of themselves as consumers of data."
Increasingly complex data sets are emerging across the sciences. In the field of genomics, more than 500 000 microarrays are now publicly available, with each array containing tens of thousands of expression values of molecules; in biomedical engineering, tens of thousands of terabytes of functional magnetic resonance images have been produced, with each image containing more than 50 000 voxel values. Massive and high-dimensional data is also being gathered from social media, e-commerce, and surveillance videos.
Expanding streams of social network data are being channeled and collected by Twitter, Facebook, LinkedIn and YouTube. This data, in turn, is being used to predict influenza epidemics, stock market trends, and box-office revenues for particular movies.
Social media and the Internet contain burgeoning information on consumer preferences, leading economic indicators, business cycles, and the economic and social states of a society.
"It is anticipated that social network data will continue to explode and be exploited for many new applications," predict the co-authors of the study. New applications include ultra-individualized services.
And in the area of Internet security, they add, "When a network-based attack takes place, historical data on network traffic may allow us to efficiently identify the source and targets of the attack."
With Big Data emerging from many frontiers of scientific research and technological advances, researchers have focused on the development of new computational infrastructure and data-storage methods, and of fast algorithms that scale to massive data with high dimensionality.
"This forges cross-fertilization among different fields including statistics, optimization and applied mathematics," the scientists add.
The massive sample sizes giving rise to Big Data fundamentally challenge the traditional computing infrastructure.
"In many applications, we need to analyze Internet-scale data containing billions or even trillions of data points, which makes even a linear pass of the whole dataset unaffordable," the researchers point out.
The basic approach to storing and processing such data is to divide and conquer. The idea is to partition a large problem into more tractable and independent sub-problems. Each sub-problem is tackled in parallel by a different processing unit. On a small scale, this divide-and-conquer strategy can be implemented either by multi-core computing or grid computing.
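As a rough illustration of the divide-and-conquer idea, the sketch below partitions a list into independent chunks, processes each chunk with a separate worker, and combines the partial results. It is not taken from the paper: the function names and the sum-of-squares task are assumptions chosen for brevity, and a thread pool stands in for the processing units (for CPU-bound work in Python, a process pool would usually be the better fit).

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into n_parts contiguous, nearly equal chunks."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks receive one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """Solve one independent sub-problem."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    """Divide the data, process the chunks concurrently, combine the results."""
    chunks = partition(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


data = list(range(1_000))
total = parallel_sum_of_squares(data)
print(total)
```

The approach works here because the sub-problems share no state, so the partial results can be combined in any order.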
On a larger scale, handling enormous arrays of data requires a new computing infrastructure that supports massively parallel data storage and processing.
The researchers present Hadoop as an example of a basic software and programming infrastructure for Big Data processing. Alongside Hadoop's distributed file system, they review MapReduce, a programming model for processing large datasets in a parallel fashion, cloud computing, convex optimization, and random projection algorithms, which are specifically designed to meet Big Data's computational challenges.
Hadoop is a Java-based software framework for distributed data management and processing. It contains a set of open source libraries for distributed computing using the MapReduce programming model and its own distributed file system called HDFS. Hadoop automatically facilitates scalability and takes care of detecting and handling failures.
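The MapReduce programming model can be illustrated with a small in-memory word count. The sketch below is not Hadoop's API and runs on a single machine; the function names are hypothetical and only mirror the three conceptual phases (map, shuffle, reduce) that a framework such as Hadoop would distribute across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    mapped = []
    for doc in documents:
        # In a real cluster, each document could be mapped on a different node.
        mapped.extend(map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big challenges", "data analysis"])
print(counts)
```

Because each map call looks at one document and each reduce call looks at one key, both phases can in principle be spread over many workers without coordination beyond the shuffle.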
HDFS is designed to host and provide high-throughput access to large datasets that are redundantly stored across multiple machines. It ensures Big Data's survivability and high availability for parallel applications.
In terms of statistical analysis, Big Data presents another set of new challenges. Researchers tend to collect as many features of the samples as possible; as a result, these samples are commonly heterogeneous and high dimensional.
High dimensionality brings new problems, including noise accumulation, spurious correlation, and incidental endogeneity. Spurious correlation is a case in point. In a study of the association between cancers and certain genomic and clinical factors, prostate cancer might appear highly correlated with an unrelated gene. Such a high correlation can be explained by high dimensionality alone: in studies that include so many features, ranging from genomic information to height, weight and gender to favorite foods and sports, some high correlations emerge merely by chance.
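A small simulation can make the point about chance correlations concrete. The sketch below is an illustration under stated assumptions, not the authors' experiment: it draws a response and a growing number of features, all independent standard normal values, and tracks the largest absolute sample correlation seen so far. The sample size of 50, the feature counts and the random seed are arbitrary choices.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def max_spurious_correlation(n_samples, n_features, rng):
    """Return the running maximum of |correlation| as features are added.

    Every feature is generated independently of the response, so any
    correlation that shows up is spurious by construction.
    """
    response = [rng.gauss(0, 1) for _ in range(n_samples)]
    running_max, history = 0.0, []
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        running_max = max(running_max, abs(pearson(feature, response)))
        history.append(running_max)
    return history


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
history = max_spurious_correlation(n_samples=50, n_features=1000, rng=rng)
for p in (5, 50, 1000):
    print(f"features={p:5d}  max |correlation| = {history[p - 1]:.2f}")
```

With the sample size held fixed, the largest chance correlation can only grow as more unrelated features are examined, which is why a strong association found among thousands of candidates deserves extra scrutiny.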
INFORMATION:
This research was supported by the National Science Foundation [DMS-1206464 to Jianqing Fan, III-1116730 and III-1332109 to Han Liu] and the National Institutes of Health [R01-GM100474 and R01-GM072611 to Jianqing Fan].
See the article: Jianqing Fan, Fang Han, and Han Liu. "Challenges of Big Data analysis." Natl Sci Rev (June 2014) 1 (2): 293-314
http://nsr.oxfordjournals.org/content/1/2/293.full
The National Science Review is the first comprehensive scholarly journal published in English in China that is aimed at linking the country's rapidly advancing community of scientists with the global frontiers of science and technology. The journal also aims to shine a worldwide spotlight on scientific research advances across China.
Bombarded by explosive waves of information, scientists review new ways to process and analyze Big Data
2014-08-26