Programmers can tell you what machine learning does and how it works, but they can't really prove why it works. Enter the mathematicians.

The what and how of machine learning are well documented: it is software that sifts through big data for patterns, using statistics to extract meaning and, in some cases, suggest actions. Now NJIT mathematics professor Zuofeng Shang is among a small group of researchers worldwide who want to understand and document its underlying mathematical principles.

Shang has received $250,000 in National Science Foundation grants over the past three years. He began the research at Indiana University and moved to Newark this semester because, he said, the work could be better supported in a major metropolitan region.

Machine learning is an effective tool for discerning meaning from mountains of data in fields as diverse as computer science, engineering, finance, healthcare, law, pharmaceuticals and even professional sports.

"Of course the computer science people or engineering people, they focus more on the algorithm/optimization part of machine learning," Shang noted, "however as a mathematician or statistician we find that there are more fundamental or principle aspects hidden in this area, and we want to focus on those things."

"The computer science people, maybe their research is bringing benefits sooner than us, however investigating the principle parts is also important because [it] can lead us or guide us to propose other new and interesting things, just like in history or physics," Shang continued. "We want to see why it works. We want to provide the mathematical answer to this."

Shang explained the technical details in the grant summary. "This project consists of three major components," he wrote. "First, the principal investigator will establish a Gaussian approximation of general nonparametric posterior distributions which serves as a theoretical foundation for general distributed Bayesian algorithms. Second, the principal investigator will develop a nonparametric Bayesian aggregation procedure with theoretical guarantees that is particularly useful to handle massive data in a parallel fashion. Third, the principal investigator will develop an efficient parallel Markov Chain Monte Carlo algorithm for nonparametric Bayesian models which will perform as well as traditional MCMC with substantially less computational costs."
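The distributed idea behind the second and third components can be illustrated in a toy conjugate setting, where each subset posterior is exactly Gaussian rather than merely Gaussian-approximated. The sketch below is not Shang's actual procedure; it is a hypothetical illustration in which the prior is split ("tempered") across shards and the Gaussian subset posteriors are recombined by precision weighting, which in this simple normal-mean model recovers the full-data posterior exactly:

```python
import random

def subset_posterior(data, sigma2=1.0, tau2=10.0, num_shards=1):
    # Gaussian posterior for a normal mean with known variance sigma2
    # and prior N(0, tau2). The prior precision is divided across the
    # shards so that combining shard posteriors recovers the full one.
    prior_prec = (1.0 / tau2) / num_shards
    prec = prior_prec + len(data) / sigma2
    mean = (sum(data) / sigma2) / prec
    return mean, prec

def combine(posteriors):
    # Precision-weighted aggregation of Gaussian subset posteriors.
    total_prec = sum(p for _, p in posteriors)
    mean = sum(m * p for m, p in posteriors) / total_prec
    return mean, total_prec

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(1000)]

# Oracle: the posterior computed from the entire dataset at once.
oracle_mean, oracle_prec = subset_posterior(data)

# Distributed: split into 10 shards, analyze each separately, aggregate.
shards = [data[i::10] for i in range(10)]
local_posteriors = [subset_posterior(s, num_shards=10) for s in shards]
agg_mean, agg_prec = combine(local_posteriors)

print(abs(agg_mean - oracle_mean) < 1e-9)  # → True
```

In nonparametric models the subset posteriors are not exactly Gaussian, which is why the Gaussian approximation in the first component matters: it justifies combining local results as if they were.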

"This research will lead to an emergence of 'Splitotics (Split+Asymptotics) Theory' providing theoretical guidelines for Bayesian practices. The smoothing spline inference results recently obtained by the principal investigator will be used as a promising tool for achieving the above goals," Shang stated.

He elaborated in a recent paper for the Journal of Machine Learning Research. "We develop a set of scalable Bayesian inference procedures for a general class of nonparametric regression models. Specifically, nonparametric Bayesian inferences are separately performed on each subset randomly split from a massive dataset, and then the obtained local results are aggregated into global counterparts. This aggregation step is explicit without involving any additional computation cost. By a careful partition, we show that our aggregated inference results obtain an oracle rule in the sense that they are equivalent to those obtained directly from the entire data (which are computationally prohibitive). For example, an aggregated credible ball achieves desirable credibility level and also frequentist coverage while possessing the same radius as the oracle ball," the paper's abstract states.

In plain terms, Shang said, that means he is working on ways to analyze massive datasets in smaller pieces, at far lower computational cost, while guaranteeing results as good as those obtained by processing all of the data at once.

Shang said a side benefit is that he realized the research could also be applied to deep learning, a form of artificial intelligence. He and his graduate students plan to explore that in a later stage of the grant.

Most of the research coding is done in Python and some in R, a specialty language for statisticians. The data occupies about 1 terabyte of storage when it's not being processed. Results will most likely be published in the Journal of Machine Learning Research and presented at events such as the Conference on Learning Theory or the International Conference on Machine Learning.