Modern research tools like supercomputers, particle colliders, and telescopes are generating so much data, so quickly, that many scientists fear they will soon be unable to keep up with the deluge.
"These instruments are capable of answering some of our most fundamental scientific questions, but it is all for nothing if we can't get a handle on the data and make sense of it," says Surendra Byna of the Lawrence Berkeley National Laboratory's (Berkeley Lab's) Scientific Data Management Group.
That's why Byna and several of his colleagues from the Berkeley Lab's Computational Research Division teamed up with researchers from the University of California, San Diego (UCSD), Los Alamos National Laboratory, Tsinghua University, and Brown University to develop novel software strategies for storing, mining, and analyzing massive datasets—more specifically, for data generated by a state-of-the-art plasma physics code called VPIC.
When the team ran VPIC on the Department of Energy's National Energy Research Scientific Computing Center's (NERSC's) Cray XE6 "Hopper" supercomputer, they generated a three-dimensional (3D) magnetic reconnection dataset of a trillion particles. VPIC simulated the process in thousands of time-steps, periodically writing a massive 32 terabyte (TB) file to disk at specified times.
Using their tools, the researchers wrote each 32 TB file to disk in about 20 minutes, at a sustained rate of 27 gigabytes per second (GB/s). By applying an enhanced version of the FastQuery tool, the team indexed this massive dataset in about 10 minutes, then queried the dataset in three seconds for interesting features to visualize.
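As a rough check on these figures, the short Python sketch below works through the arithmetic. It assumes decimal units (1 TB = 1,000 GB), a convention this release does not state, and the function name is illustrative only.

```python
# Back-of-the-envelope check of the reported write time.
# Assumption: decimal units (1 TB = 1000 GB); the release does not say
# which convention was used.
FILE_SIZE_TB = 32        # size of one time-step file, from the release
SUSTAINED_GB_PER_S = 27  # sustained write rate, from the release


def write_time_minutes(size_tb, rate_gb_per_s):
    """Time to write `size_tb` terabytes at `rate_gb_per_s`, in minutes."""
    seconds = size_tb * 1000 / rate_gb_per_s
    return seconds / 60


minutes = write_time_minutes(FILE_SIZE_TB, SUSTAINED_GB_PER_S)
print(f"{minutes:.1f} minutes per time-step")  # 19.8 minutes per time-step
```

Under that assumption the result agrees with the "about 20 minutes" quoted above.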
"This is the first time anyone has ever queried and visualized 3D particle datasets of this size," says Homa Karimabadi, who leads the space physics group at UCSD.
The Problem with Trillion Particle Datasets
Magnetic reconnection is a process in which the magnetic topology in a plasma (a gas made up of charged particles) is rearranged, leading to an explosive release of energy in the form of plasma jets, heated plasma, and energetic particles. Reconnection is the mechanism behind the aurora borealis (a.k.a. northern lights) and solar flares, as well as fractures in Earth's protective magnetic field. These fractures allow energetic solar particles to seep into our planet's magnetosphere and wreak havoc on electronics, power grids, and space satellites.
According to Karimabadi, one of the major unsolved mysteries in magnetic reconnection is how, and under what conditions, energetic particles are generated. Until recently, the closest any researcher had come to studying this was through 2D simulations. Although these datasets are much more manageable, containing at most only billions of particles, Karimabadi notes that lingering magnetic reconnection questions cannot be answered with 2D particle simulations alone. In fact, these datasets leave out a lot of critical information.
"To answer these questions we need to take into full account additional effects such as flux rope interactions and resulting turbulence that occur in 3D simulations," says Karimabadi. "But as we add another dimension, the number of particles in our simulations grows from billions to trillions. And it is impossible to pull up a trillion-particle dataset on your computer screen; it would just fill up your screen with black dots."
To address the challenges of analyzing 3D particle data, Karimabadi and a team of astrophysicists joined forces with the ExaHDF5 team, a Department of Energy funded collaboration to develop high performance I/O and analysis strategies for future exascale computers. Prabhat, of Berkeley Lab's Visualization Group, leads the ExaHDF5 team.
A Scalable Storage Approach Sets Foundation for a Successful Search
According to Byna, VPIC accurately models the complexities of magnetic reconnection at this scale by breaking down the "big picture" into distinct pieces, each of which is assigned, using Message Passing Interface (MPI), to a group of processors to compute. These groups, or MPI domains, work independently to solve their piece of the problem. By subdividing the work, researchers can simultaneously employ hundreds of thousands of processors to simulate a massive and complex phenomenon like magnetic reconnection.
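The idea of breaking a simulation into independently computed pieces can be sketched as follows. This is not VPIC's code, and it uses no MPI: it is a minimal pure-Python illustration of one common scheme, a regular 3D block decomposition. The grid size, the domain layout, and the function names `split_axis` and `domain_box` are all hypothetical.

```python
def split_axis(n_cells, n_parts, part):
    """Return the [start, stop) cell range owned by `part` along one axis.

    Any remainder cells are given to the lowest-numbered parts.
    """
    base, extra = divmod(n_cells, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def domain_box(rank, grid_shape, domain_shape):
    """Map a domain number to the 3D block of cells it owns.

    Domains are numbered in row-major order over `domain_shape`.
    """
    _, py, pz = domain_shape
    ix, rem = divmod(rank, py * pz)
    iy, iz = divmod(rem, pz)
    return tuple(
        split_axis(n, p, i)
        for n, p, i in zip(grid_shape, domain_shape, (ix, iy, iz))
    )


grid = (64, 32, 32)   # hypothetical number of cells along x, y, z
domains = (4, 2, 2)   # hypothetical layout: 4 * 2 * 2 = 16 domains
boxes = [domain_box(rank, grid, domains) for rank in range(16)]
print(boxes[0])   # ((0, 16), (0, 16), (0, 16))
print(boxes[15])  # ((48, 64), (16, 32), (16, 32))
```

Each domain can then work on its own block, which is what lets the work spread across many processors at once.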
In the original implementation of VPIC, each MPI domain generates a binary file once it finishes processing its assigned piece of the problem. This ensures that the data is written efficiently.
But according to Byna, this approach, called file-per-process, has a number of limitations. One major limitation is that the number of files generated for large-scale scientific simulations, like magnetic reconnection, can become unwieldy. In fact, his team's largest VPIC run on Hopper contained about 20,000 MPI domains—that's 20,000 binary files per time-step. And because most analysis tools cannot easily read binary files, another post-processing step would have been required to re-factor the data into a format that these tools can open.
"It takes a really long time to perform a simple Linux search of a 20,000-file directory; and since the data is not stored in standard data formats, such as HDF5, existing data management and visualization tools cannot directly work with the binary file," says Byna. "Ultimately, these limitations become a bottleneck to scientific analysis and discovery."
But by incorporating H5Part code into the VPIC codebase, Byna and his colleagues managed to overcome all of these challenges. H5Part is an easy-to-use veneer layer on top of HDF5 that allows for the management and analysis of extremely large particle and block-structured datasets.
According to Prabhat, this easy modification to the code-base creates one shared HDF5 file per time-step, instead of 20,000 independent binary files. Because most visualization and analysis tools can use HDF5 files, this approach eliminates the need to re-format the data. With the latest performance enhancements implemented by the ExaHDF5 team, VPIC was able to write each 32 TB time-step to disk at a sustained rate of 27 GB/s.
"This is quite an achievement when you consider that the theoretical peak I/O for the machine is about 35 GB/s," says Prabhat. "Very few production I/O frameworks and scientific applications can achieve that level of performance."
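The release does not describe how the shared file is laid out internally. One common way to let many writers share a single file is to give each domain a disjoint slice, with the starting offset computed as an exclusive prefix sum of the particle counts (in an MPI program this is typically done with a call such as MPI_Exscan). The Python sketch below illustrates that idea only: a plain list stands in for the shared dataset, and the names `exclusive_offsets` and `write_shared` are hypothetical, not H5Part or HDF5 calls.

```python
from itertools import accumulate


def exclusive_offsets(counts):
    """Offset of each domain's slice: the total count of all lower-numbered domains."""
    return [0] + list(accumulate(counts))[:-1]


def write_shared(per_domain_data):
    """Place every domain's particles into one shared array at disjoint offsets."""
    counts = [len(chunk) for chunk in per_domain_data]
    offsets = exclusive_offsets(counts)
    shared = [None] * sum(counts)  # stands in for one dataset in a shared file
    for chunk, start in zip(per_domain_data, offsets):
        # Slices never overlap, so writers do not interfere with each other.
        shared[start:start + len(chunk)] = chunk
    return shared, offsets


# Hypothetical data: four domains holding different numbers of particles.
domain_data = [[1.0, 2.0], [3.0], [], [4.0, 5.0, 6.0]]
shared, offsets = write_shared(domain_data)
print(offsets)  # [0, 2, 3, 3]
print(shared)   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

The result is one contiguous dataset per time-step rather than one file per domain.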
Mining a Trillion Particle Dataset with FastQuery
Once this torrent of information has been generated and stored, the next challenge that researchers face is how to make sense of it. On this front, ExaHDF5 team members Jerry Chou and John Wu implemented an enhanced version of FastQuery, an indexing and querying tool. Using this technique, they indexed the trillion-particle, 32 TB dataset in about 10 minutes, and queried the massive dataset for particles of interest in approximately three seconds. This is the first time anybody has successfully queried a trillion-particle dataset this quickly.
The team was able to accelerate FastQuery's indexing and query capabilities by implementing a hierarchical load-balancing strategy that involves a hybrid of MPI and Pthreads. At the MPI level, FastQuery breaks up the large dataset into multiple fixed-size sub-arrays. Each sub-array is then assigned to a set of compute nodes, or MPI domains, which is where the indexing and querying occur.
The load-balancing flexibility happens within these MPI domains, where the work is dynamically pooled among threads—which are the smallest unit of processing that can be scheduled by an operating system. When constructing the indexes, the threads build bitmaps on the sub-arrays and store them into the same HDF5 file. When evaluating a query, the processors apply the query to each sub-array and return results.
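The two-level scheme described above can be sketched in Python. This is an illustration under stated assumptions, not FastQuery's implementation: Python threads stand in for Pthreads, a plain loop stands in for the MPI level, and the round-robin assignment of sub-arrays to domains is an assumed policy. All function names are hypothetical.

```python
import queue
import threading


def fixed_size_subarrays(n_items, size):
    """Cut the index range [0, n_items) into [start, stop) pieces of at most `size`."""
    return [(start, min(start + size, n_items)) for start in range(0, n_items, size)]


def assign_round_robin(subarrays, n_domains):
    """Static first level: deal sub-arrays out to domains (assumed policy)."""
    return [subarrays[d::n_domains] for d in range(n_domains)]


def run_domain(subarrays, n_threads, work):
    """Dynamic second level: idle threads pull the next sub-array from a shared pool."""
    pool = queue.Queue()
    for sub in subarrays:
        pool.put(sub)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                start, stop = pool.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            out = work(start, stop)
            with lock:
                results.append(((start, stop), out))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


# Hypothetical workload: count the multiples of 7 in each sub-array.
data = list(range(100))


def count_hits(start, stop):
    return sum(1 for x in data[start:stop] if x % 7 == 0)


subs = fixed_size_subarrays(len(data), 16)
per_domain = assign_round_robin(subs, n_domains=2)
counts = [
    sum(hits for _, hits in run_domain(d, n_threads=3, work=count_hits))
    for d in per_domain
]
print(sum(counts))  # 15
```

Because threads take the next sub-array only when they finish the previous one, a slow sub-array does not hold up the others in the same domain. In CPython the threads here show the scheduling pattern rather than a real speed-up.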
Because FastQuery is built on the FastBit bitmap indexing technology, Byna notes that researchers can search their data based on an arbitrary range of conditions that is defined by available data values. This essentially means that a researcher can now feasibly search a trillion particle dataset and sift out electrons by their energy values.
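To make the idea of a bitmap index concrete, here is a minimal sketch of a binned bitmap index answering a range query such as "energy greater than a threshold". It is not FastBit's implementation: it uses uncompressed Python integers as bitmaps, and the bin edges, sample energies, and function names are hypothetical.

```python
import bisect


def build_index(values, edges):
    """Build one bitmap per bin; bit i of bitmap b is set if values[i] is in bin b.

    Bin b covers [edges[b], edges[b + 1]). Values outside the edges are rejected.
    """
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        b = bisect.bisect_right(edges, v) - 1
        if not 0 <= b < len(bitmaps):
            raise ValueError(f"value {v} lies outside the bin edges")
        bitmaps[b] |= 1 << i
    return bitmaps


def query_greater(values, edges, bitmaps, threshold):
    """Return the row ids whose value is strictly greater than `threshold`."""
    b = bisect.bisect_right(edges, threshold) - 1  # bin containing the threshold
    hits = 0
    # Bins entirely above the threshold match without touching the raw data.
    for bitmap in bitmaps[max(b + 1, 0):]:
        hits |= bitmap
    # Only the bin that straddles the threshold needs a check against raw values.
    if 0 <= b < len(bitmaps):
        candidates, i = bitmaps[b], 0
        while candidates:
            if candidates & 1 and values[i] > threshold:
                hits |= 1 << i
            candidates >>= 1
            i += 1
    return [i for i in range(len(values)) if (hits >> i) & 1]


# Hypothetical particle energies and bin edges.
energies = [0.2, 1.7, 0.9, 3.4, 2.5, 1.1]
edges = [0, 1, 2, 3, 4]
index = build_index(energies, edges)
print(query_greater(energies, edges, index, 1.5))  # [1, 3, 4]
```

The point of the structure is that most of the query is answered by combining bitmaps, and only rows in the boundary bin are compared against the stored values.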
According to Prabhat, this unique querying capability also serves as the basis for successfully visualizing the data. Because a typical computer display contains on the order of a few million pixels, it is simply impossible to render a dataset with trillions of particles. So to analyze their data, researchers must reduce the number of particles in their dataset before rendering. The scientists can now achieve this by using the FastQuery tool to identify the particles of interest to render.
"Although our VPIC runs typically generate two types of data—grid and particle—we never did a whole lot with the particle data because it was really hard to extract information from a trillion particle dataset, and there was no way to sift out the useful information," says Karimabadi.
But with the new query-based visualization techniques, Karimabadi and his team were finally able to verify the localization behavior of energetic particles, gain insights into the relationship between the structure of the magnetic field and energetic particles, and investigate the agyrotropic distribution of particles near the reconnection hot-spot in a 3D trillion particle dataset.
"We have hypothesized about these phenomena in the past, but it was only the development and application of these new analysis tools that enabled us to unlock the scientific discoveries and insights," says Karimabadi. "With these new tools, we can now go back to our archive of particle datasets and look at the physics questions that we couldn't get at before."
"Most of today's simulation codes generate datasets on the order of tens of millions to a few billion particles, so a trillion-particle dataset (that is, a million-million particles) poses unprecedented data management challenges," says Prabhat. "In this work, we have demonstrated that the HDF5 I/O middleware and the FastBit indexing technology can handle these truly massive datasets and operate at scale on current petascale platforms."
But according to Prabhat, exascale platforms will produce even larger datasets in the near future, and researchers need to come up with novel techniques and usable software that can facilitate scientific discovery going forward. He notes that one of the primary goals of the ExaHDF5 team is to scale the widely used HDF5 I/O middleware to operate on modern petascale and future exascale platforms.
INFORMATION:
In addition to Byna, Chou, Karimabadi, Prabhat and Wu, other members of this collaboration include Oliver Rubel, William Daughton, Vadim Roytershteyn, Wes Bethel, Mark Howison, Ke-Jou Hsu, Kuan-Wu Lin, Arie Shoshani and Andrew Uselton. The team also acknowledges critical support provided by NERSC consultants, NERSC system staff, and Cray engineers in assisting with their large-scale runs.
Sifting through a trillion electrons
Berkeley researchers design strategies for extracting interesting data from massive scientific datasets
2012-06-27