Big Data
David Bader
Research Areas: Data science, high-performance computing

A real-world challenge in data science is to develop interactive methods for quickly analyzing new and novel data sets of potentially massive scale. This project will design and implement fundamental algorithms for high-performance computing solutions that enable interactive analysis of massive data sets. Building on the widely used data types and structures of strings, sets, matrices and graphs, this methodology will produce efficient and scalable software for three classes of fundamental algorithms that drastically improve the performance of a wide range of real-world queries or directly realize frequent queries. These innovations will allow the broad community to move massive-scale data exploration from time-consuming batch processing to interactive analyses that give a data analyst the ability to comprehensively, deeply and efficiently explore the insights and science in real-world data sets. By enabling the growing number of developers to easily manipulate large data sets, this work will greatly enlarge the data science community and find much broader use in new communities.

Big data analysis is used for problems involving massive data sets. Today, these data sets are loaded from storage into memory, manipulated and analyzed using high-performance computing (HPC) algorithms, and then returned in a useful format. This end-to-end workflow provides an excellent platform for forensic analysis; however, there is a critical need for systems that support decision-making with a continuous workflow. HPC systems must focus on ingesting data streams; incorporating new microprocessors and custom data science accelerators that assist with loading and transforming data; and accelerating performance by moving key data science tasks and solutions from software to hardware. These workflows must be energy-efficient and easy to program, while reducing transaction times by orders of magnitude. Analysts and data scientists must be able to pose queries in their subject domain and receive rapid, efficiently executed solutions, rather than requiring sophisticated programming expertise.

Scalable Graph Algorithms

Our research is supported in part by an NVIDIA AI Lab (NVAIL) award. NVIDIA makes graphics processing unit (GPU) accelerators such as the DGX deep learning server. We contribute to RAPIDS.ai, an open GPU data science framework for accelerating end-to-end data science and analytics pipelines entirely on GPUs. These new analytics pipelines are more energy-efficient and run significantly faster, which is critical for making swift, data-driven decisions.
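As a concrete illustration of the kind of GPU-resident pipeline RAPIDS.ai enables, the minimal sketch below loads an edge list with cuDF and runs PageRank with cuGraph without the data ever leaving GPU memory. The file name edges.csv and the column names are placeholders, not part of the project.

```python
import cudf
import cugraph

# Load an edge list directly into GPU memory (file and column names are
# placeholders for illustration only).
edges = cudf.read_csv("edges.csv", names=["src", "dst"], dtype=["int32", "int32"])

# Build a graph from the GPU-resident edge list.
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Run an end-to-end analytic (PageRank) entirely on the GPU; the result is a
# cuDF DataFrame with one score per vertex.
scores = cugraph.pagerank(G)
print(scores.sort_values("pagerank", ascending=False).head(10))
```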
Vincent Oria
Research Areas: Multimedia databases, spatio-temporal databases, recommender systems

In fundamental operations in areas such as search and retrieval, data mining, machine learning, multimedia, recommender systems and bioinformatics, the efficiency and effectiveness of implementations depend crucially on the interplay between measures of data similarity and the features by which data objects are represented. When the number of features, known as the data dimensionality, is high, the discriminative ability of similarity measures diminishes to the point where methods that depend on them lose their effectiveness. Our research examines the interplay between local features and intrinsic dimensionality, and their application to search, indexing and machine learning (a small sketch follows this section).

Multi-Instrument Database of Solar Flares

Solar flares are the most prominent manifestation of the sun's magnetic activity. They emit radiation that could damage power systems, interfere with civilian and military radio frequencies and disrupt spacecraft operations. To improve analysis, in collaboration with the Department of Physics, we aim to integrate, clean and enrich solar data captured by various solar flare observing instruments around the world and use it for predictive analysis tasks.
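To make the intrinsic-dimensionality theme concrete, here is a minimal sketch of the standard maximum-likelihood (Hill-type) estimator of local intrinsic dimensionality, computed from a point's k-nearest-neighbor distances. It is a generic illustration using NumPy and scikit-learn, not the group's specific estimator, and the choice k=20 is arbitrary.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_intrinsic_dimensionality(data, query, k=20):
    """MLE (Hill-type) estimate of the local intrinsic dimensionality at
    `query`, based on the distances to its k nearest neighbors in `data`."""
    nbrs = NearestNeighbors(n_neighbors=k).fit(data)
    dists, _ = nbrs.kneighbors(query.reshape(1, -1))
    r = dists[0]
    r = r[r > 0]            # drop zero distances (e.g., the query itself)
    # LID ~= -1 / mean(log(r_i / r_max)), where r_max is the largest distance.
    return -1.0 / np.mean(np.log(r / r[-1]))

# Example: points drawn from a 5-dimensional subspace embedded in 50 ambient
# dimensions should yield an estimate near 5 (up to estimator noise), even
# though each feature vector has 50 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5)) @ rng.normal(size=(5, 50))
print(local_intrinsic_dimensionality(X, X[0]))
```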
Senjuti Basu Roy
Research Areas: Human-in-the-loop large-scale data analytics, optimization algorithms

Big Data Analytics Laboratory - Data Analytics with Humans-in-the-Loop

The Big Data Analytics Lab (BDaL) is an interdisciplinary research laboratory that focuses on large-scale data analytics problems arising in different application domains and disciplines. One focus of our lab is an alternative computational paradigm that involves humans-in-the-loop for big data. These problems arise at different stages of a traditional data science pipeline, such as data cleaning, query answering, ad-hoc data exploration and predictive modeling, as well as from emerging applications. We study the optimization opportunities that arise from this unique human-machine collaboration and address the resulting data management and computational challenges. Our focus application domains are crowdsourcing, social networks, health care, climate science, retail and business, naval applications and spatial data.
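As a toy illustration of the kind of optimization that human-machine collaboration induces (not the lab's actual formulation), the sketch below greedily assigns crowdsourcing tasks to human workers to maximize expected answer quality under per-worker capacity limits. All worker names, tasks and quality scores are hypothetical.

```python
# Hypothetical inputs: per-(worker, task) expected answer quality and a
# per-worker capacity limiting how many tasks a human can be asked to handle.
quality = {
    ("w1", "t1"): 0.9, ("w1", "t2"): 0.4, ("w1", "t3"): 0.7,
    ("w2", "t1"): 0.6, ("w2", "t2"): 0.8, ("w2", "t3"): 0.5,
}
capacity = {"w1": 2, "w2": 1}
tasks = {"t1", "t2", "t3"}

def greedy_assign(quality, capacity, tasks):
    """Greedily assign each task to the available worker with the highest
    expected quality, respecting per-worker capacities."""
    assignment = {}
    load = {w: 0 for w in capacity}
    # Consider (worker, task) pairs from best to worst expected quality.
    for (w, t), q in sorted(quality.items(), key=lambda kv: -kv[1]):
        if t in tasks and t not in assignment and load[w] < capacity[w]:
            assignment[t] = (w, q)
            load[w] += 1
    return assignment

print(greedy_assign(quality, capacity, tasks))
# -> {'t1': ('w1', 0.9), 't2': ('w2', 0.8), 't3': ('w1', 0.7)}
```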
Chase Wu
Research Areas: Big data, machine learning, green computing and networking, parallel and distributed computing

Revolutionizing Big Data Scientific Computations

Next-generation scientific applications are undergoing a rapid transition from traditional experiment-based methodologies to large-scale simulations featuring complex numerical modeling with a large number of tunable parameters. Such model-based simulations generate colossal amounts of data, which are then processed and analyzed against experimental or observational data for parameter calibration and model validation. The sheer volume and complexity of the data, the large model-parameter space and the intensive computation make it practically infeasible for domain experts to manually configure and tune hyperparameters for accurate modeling in complex and distributed computing environments. We develop visualization algorithms for 3D volume data generated by scientific computations on supercomputers and apply machine learning techniques to automate, expedite and optimize the parameter tuning process in model development.

Modeling and Optimizing Big Data Ecosystems

The execution of big data workflows is now commonly supported on reliable and scalable data storage and computing platforms such as Hadoop. A variety of factors affect workflow performance across multiple layers in the technology stack of big data ecosystems, and modeling and optimizing the performance of big data workflows is challenging because of the complex interplay among these factors.

Optimizing Distributed Training and Inference of Deep Neural Networks (DNNs)

Deep neural networks have grown rapidly in size and complexity, requiring various data/model/tensor parallelization techniques to make training and inference practically feasible. For example, BLOOM 176B and Megatron-Turing 530B require terabytes of memory and zettaflops of compute. We represent parallelized DNNs as workflows and develop new approaches to workflow partitioning, mapping and scheduling, alongside memory-saving techniques such as activation recomputation (sketched after this section), to optimize the training and inference of DNNs in heterogeneous multi-node, multi-GPU/CPU systems.

Reducing Energy Consumption in Big Data Computation

The transfer of big data across high-performance networks consumes a significant amount of energy. Employing two widely adopted power models, power-down and speed scaling, we have made inroads into green computing and networking in big data environments. Our approach allows network providers to reduce operational costs and carbon dioxide emissions.

Uncovering Low-Level, Hazardous Radiation

Radioactive substances and biological agents present a serious threat to public health and safety, particularly in densely populated areas. Through the collection and analysis of large amounts of sensor measurements, we develop reliable tools to detect and contain radioactive materials to protect the populace and reduce the risk of radiological dispersal devices, such as so-called dirty bombs.
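To illustrate the activation recomputation technique mentioned above, here is a generic sketch using PyTorch's gradient checkpointing utility (not this group's specific system). The deep sequential model is split into segments so that only segment-boundary activations are stored and the interior activations are recomputed during the backward pass, trading extra forward compute for a smaller memory footprint. The layer sizes and segment count are arbitrary.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of layers; storing every intermediate activation for backprop
# can dominate GPU memory as depth grows.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(16)]
)
x = torch.randn(8, 1024, requires_grad=True)

# Split the stack into 4 checkpointed segments: only the activations at segment
# boundaries are kept; the rest are recomputed on the fly in the backward pass.
# (use_reentrant=False selects the non-reentrant variant on recent PyTorch.)
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
```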