Yet people are not used to thinking in many dimensions, and without reducing the dataset to a neat two- or three-dimensional representation, it might be difficult to come up with meaningful hypotheses and recognize important patterns.
“Visualization renders the data intuitive, but it does not necessarily reveal its ‘shape.’ A dataset might have a large-scale structure (complete with clusters, voids, loops, and so on), and we want all of that to be in the reduced-dimensionality representation, too. Physicists need it to recognize distinct particles among a myriad of detector blips; market researchers need it to identify consumer groups; climate scientists need it to tell where a certain process begins and where it ends. Unlike other techniques, ours achieves dimensionality reduction without compromising global data structure,” commented a co-author of the study, Skoltech alumnus and AIRI researcher Daniil Cherniavskii.
There are a number of approaches to reducing data dimensionality, some of which rely on so-called autoencoders: neural networks that learn to create lower-dimensional representations of the data. “The problem is that most of the techniques used, including those involving autoencoders, operate locally. They care about the position of a data point relative to the neighboring points, but the large-scale structure is lost,” Cherniavskii said. “What we did was supplement the autoencoder with an additional loss function. Its sole purpose is to minimize the topological discrepancy between the initial dataset and its low-dimensional representation. When the loss equals zero, the ‘shape’ of the visualization is guaranteed to match that of the original.”
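To make the idea of a topological loss term more concrete, here is a minimal Python sketch. It is not the loss function used in the study. As a deliberately simplified stand-in, it compares only the scales at which clusters merge (the 0-dimensional persistence of a point cloud, which coincides with the edge lengths of a minimum spanning tree) and ignores loops and voids. All function names and the toy data are hypothetical.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length coordinate tuples."""
    return [[math.dist(p, q) for q in points] for p in points]

def h0_merge_scales(dist):
    """Sorted edge lengths of a minimum spanning tree (Prim's algorithm).

    These are the distance scales at which separate clusters of points
    become connected, i.e. a summary of cluster structure only.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known link from each vertex to the tree
    best[0] = 0.0
    edges = []
    for step in range(n):
        # Pick the vertex outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:  # the starting vertex contributes no edge
            edges.append(best[u])
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return sorted(edges)

def topological_discrepancy(high_dim, low_dim):
    """Simplified proxy (an assumption, not the study's loss): total absolute
    difference between the cluster merge scales of the two point clouds."""
    a = h0_merge_scales(pairwise_distances(high_dim))
    b = h0_merge_scales(pairwise_distances(low_dim))
    return sum(abs(x - y) for x, y in zip(a, b))

# Toy data: two well-separated clusters in 3D.
original = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (11, 0, 0)]
faithful = [(0,), (1,), (10,), (11,)]   # 1D embedding that keeps the gap
collapsed = [(0,), (1,), (2,), (3,)]    # 1D embedding that erases the gap

print(topological_discrepancy(original, faithful))   # 0.0
print(topological_discrepancy(original, collapsed))  # 8.0
```

In this toy case, the embedding that keeps the two clusters apart scores zero, while the one that squeezes them together is penalized. In training, a term of this kind would be added to the autoencoder's usual reconstruction error, and a practical version would also need to be differentiable.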
Using multiple metrics that capture how well the relative positions of all data points (not just those in the immediate neighborhood) are retained, the team tested to what extent dataset topology is preserved. The tests, which covered datasets of varying nature, confirmed that the team’s solution outperformed the most popular dimensionality reduction methods (see image above).
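The specific metrics are not named here, so the following Python sketch shows just one plausible example of a score that looks at all pairs of points rather than only near neighbors: the Pearson correlation between pairwise distances before and after reduction. The choice of metric, the function name, and the toy data are assumptions made for illustration.

```python
import math
from itertools import combinations

def pairwise_distance_correlation(high_dim, low_dim):
    """Pearson correlation between all pairwise distances in the original
    data and in its low-dimensional embedding.

    Values near 1 mean that far-apart points stay far apart and close points
    stay close. Undefined if all distances in either set are identical.
    """
    pairs = list(combinations(range(len(high_dim)), 2))
    d_high = [math.dist(high_dim[i], high_dim[j]) for i, j in pairs]
    d_low = [math.dist(low_dim[i], low_dim[j]) for i, j in pairs]
    mean_h = sum(d_high) / len(d_high)
    mean_l = sum(d_low) / len(d_low)
    cov = sum((a - mean_h) * (b - mean_l) for a, b in zip(d_high, d_low))
    var_h = sum((a - mean_h) ** 2 for a in d_high)
    var_l = sum((b - mean_l) ** 2 for b in d_low)
    return cov / math.sqrt(var_h * var_l)

# Toy data: two well-separated clusters in 3D and two candidate 1D embeddings.
points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (11, 0, 0)]
keeps_gap = [(0,), (1,), (10,), (11,)]
erases_gap = [(0,), (1,), (2,), (3,)]

score_good = pairwise_distance_correlation(points, keeps_gap)
score_bad = pairwise_distance_correlation(points, erases_gap)
print(score_good > score_bad)  # True
```

A score like this rewards embeddings that preserve the overall layout of the data, which a purely neighborhood-based measure could miss.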
“Topological data analysis is becoming an increasingly popular tool for investigating the properties of multidimensional data. We expect that the method we have developed and other similar approaches will become the standard in the near future,” said study co-author Professor Evgeny Burnaev of Skoltech Applied AI and AIRI.