(Press-News.org) PROVIDENCE, R.I. [Brown University] -- Researchers from Brown University and MIT have developed a new data science framework that allows users to process data with the programming language Python -- without paying the "performance tax" normally associated with a user-friendly language.
The new framework, called Tuplex, can process data queries written in Python up to 90 times faster than industry-standard data systems like Apache Spark or Dask. The research team presented the system at SIGMOD 2021, a premier data processing conference, and has made the software freely available to all.
"Python is the primary programming language used by people doing data science," said Malte Schwarzkopf, an assistant professor of computer science at Brown and one of the developers of Tuplex. "That makes a lot of sense. Python is widely taught in universities, and it's an easy language to get started with. But when it comes to data science, there's a huge performance tax associated with Python because platforms can't process Python efficiently on the back end."
Platforms like Spark perform data analytics by distributing tasks across multiple processor cores or machines in a data center. That parallel processing allows users to deal with giant data sets that would choke a single computer to death. Users interact with these platforms by submitting their own queries, which contain custom logic written as "user-defined functions," or UDFs. A UDF might, for example, extract the number of bedrooms from the text of a real estate listing, as part of a query that searches all of the real estate listings in the U.S. and selects the ones with three bedrooms.
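To make the idea of a UDF concrete, here is a minimal sketch in plain Python of the bedroom example described above. It does not use Spark, Dask, or Tuplex; the sample listings, the "description" field, and the function names are all assumptions made up for illustration.

```python
import re

# Hypothetical listing records; the "description" field name is an assumption.
listings = [
    {"id": 1, "description": "Sunny 3 bedroom colonial near downtown"},
    {"id": 2, "description": "Cozy 2 bedroom condo with parking"},
    {"id": 3, "description": "Renovated 3 bedrooms, 2 baths, large yard"},
]

# Matches a number followed by "bedroom" or "bedrooms", ignoring case.
BEDROOM_PATTERN = re.compile(r"(\d+)\s*bedrooms?", re.IGNORECASE)


def extract_bedrooms(listing):
    """UDF: pull the bedroom count out of a free-text listing description."""
    match = BEDROOM_PATTERN.search(listing["description"])
    return int(match.group(1))


def has_three_bedrooms(listing):
    """UDF: keep only listings with exactly three bedrooms."""
    return extract_bedrooms(listing) == 3


# The "query": apply the UDFs to every record and keep the matches.
three_bedroom_ids = [l["id"] for l in listings if has_three_bedrooms(l)]
print(three_bedroom_ids)  # prints [1, 3]
```

On a data platform, the same two functions would be handed to the system, which would apply them to millions of records in parallel rather than in a simple loop.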
Because of its simplicity, Python is the language of choice for creating UDFs in the data science community. In fact, the Tuplex team cites a recent poll showing that 66% of data platform users utilize Python as their primary language. The problem is that analytics platforms have trouble dealing with those bits of Python code efficiently.
Data platforms are written in high-level computer languages that are compiled before running. Compilers are programs that take source code written in a programming language and turn it into machine code -- sets of instructions that a computer processor can quickly execute. Python, however, is not compiled beforehand. Instead, computers interpret Python code line by line while the program runs, which can mean far slower performance.
"These frameworks have to break out of their efficient execution of compiled code and jump into a Python interpreter to execute Python UDFs," Schwarzkopf said. "That process can be a factor of 100 less efficient than executing compiled code."
If Python code could be compiled, it would speed things up greatly. But researchers have tried for years to develop a general-purpose Python compiler, Schwarzkopf says, with little success. So instead of trying to make a general Python compiler, the researchers designed Tuplex to compile a highly specialized program for the specific query and common-case input data. Uncommon input data, which account for only a small percentage of instances, are separated out and referred to an interpreter.
"We refer to this process as dual-case processing, as it splits that data into two cases," said Leonhard Spiegelberg, co-author of the research describing Tuplex. "This allows us to simplify the compilation problem as we only need to care about a single set of data types and common-case assumptions. This way, you get the best of two worlds: high productivity and fast execution speed."
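The control flow behind dual-case processing can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Tuplex's implementation: in Tuplex the common-case path is compiled code, whereas here both paths are ordinary interpreted Python, so the sketch shows only how rows are split between the two cases, not the speedup. The row layout and function names are hypothetical.

```python
def fast_path(row):
    """Common case: assumes price and sqft are already integers."""
    if type(row["price"]) is not int or type(row["sqft"]) is not int:
        # Signal that this row violates the common-case assumptions.
        raise TypeError("row violates common-case assumptions")
    return row["price"] // row["sqft"]


def general_path(row):
    """Uncommon case: tolerates strings such as "$450,000" or "1,800"."""
    price = int(str(row["price"]).replace("$", "").replace(",", ""))
    sqft = int(str(row["sqft"]).replace(",", ""))
    return price // sqft


def process(rows):
    """Try the specialized path first; fall back only for rows that do not fit."""
    results = []
    slow_count = 0
    for row in rows:
        try:
            results.append(fast_path(row))
        except TypeError:
            slow_count += 1
            results.append(general_path(row))
    return results, slow_count


# Hypothetical data: two rows fit the common case, one does not.
rows = [
    {"price": 300000, "sqft": 1500},
    {"price": "$450,000", "sqft": "1,800"},
    {"price": 200000, "sqft": 1000},
]
results, slow_count = process(rows)
print(results, slow_count)  # prints [200, 250, 200] 1
```

Because the fast path only has to handle one set of types, it is the kind of narrow, predictable code that is far easier to compile than arbitrary Python.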
And the runtime benefit can be substantial.
"We show in our research that a wait time of 10 minutes for an output can be reduced to a second," Schwarzkopf said. "So it really is a substantial improvement in performance."
In addition to speeding things up, Tuplex also has an innovative way of dealing with anomalous data, the researchers say. Large datasets are often messy, full of corrupted records or data fields that don't follow convention. In real estate data, for example, the number of bedrooms could either be a numeral or a spelled-out number. Inconsistencies like that can be enough to crash some data platforms. But Tuplex extracts those anomalies and sets them aside to avoid a crash. Once the program has run, the user then has the option of repairing those anomalies.
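The "set aside, then repair" behavior can also be sketched in plain Python. Again, this is a hypothetical illustration rather than Tuplex's API: the helper names `run` and `repair`, the sample values, and the word-to-number table are all assumptions.

```python
# Hypothetical lookup table for spelled-out bedroom counts.
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def parse_bedrooms(value):
    """UDF that only handles numerals; raises ValueError on anything else."""
    return int(value)


def run(values, udf):
    """Apply udf to every value, setting failures aside instead of crashing."""
    results, failures = {}, []
    for index, value in enumerate(values):
        try:
            results[index] = udf(value)
        except Exception as exc:
            failures.append((index, value, type(exc).__name__))
    return results, failures


def repair(failures, resolver):
    """After the run, retry the set-aside records with a user-supplied fix."""
    fixed, unresolved = {}, []
    for index, value, _ in failures:
        try:
            fixed[index] = resolver(value)
        except Exception as exc:
            unresolved.append((index, value, type(exc).__name__))
    return fixed, unresolved


values = ["3", "2", "three", "4", "n/a"]
results, failures = run(values, parse_bedrooms)
print(results)   # prints {0: 3, 1: 2, 3: 4}
print(failures)  # prints [(2, 'three', 'ValueError'), (4, 'n/a', 'ValueError')]

# The user decides how to repair spelled-out numbers once the run is done.
fixed, unresolved = repair(failures, lambda v: WORDS[v.strip().lower()])
print(fixed)       # prints {2: 3}
print(unresolved)  # prints [(4, 'n/a', 'KeyError')]
```

The job finishes on the clean records, and the two problem records are reported afterward rather than bringing the whole run down.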
"We think this could have a major productivity impact for data scientists," Schwarzkopf said. "To not have to run out to get a cup of coffee while waiting for an output, and to not have a program run for an hour only to crash before it's done would be a really big deal."
INFORMATION:
The research was supported by the National Science Foundation (DGE-2039354, IIS-1453171) and the U.S. Air Force (FA8750-19-2-1000).