Shrinking the Data Mountain: A TEADAL Project Towards Sustainable AI

Authored by i2CAT

In today’s world, sensors are everywhere – from your fitness tracker to manufacturing plants and medical monitoring devices. They collect an astonishing amount of data, constantly streaming information about our health, environments, and machines.

While this data holds immense potential for insights, a significant portion of it often lacks useful content, acting as “noise” rather than “signal”. This becomes a problem when storing and processing such colossal volumes of data, a step we usually take for granted. Doing so in a timely manner requires equally huge data centres and server farms, which leads to a substantial environmental footprint and higher computational costs. As part of the TEADAL project, we’ve been exploring alternative ways to tackle this issue, aiming to make data science and machine learning more energy-efficient and sustainable.

Imagine trying to find a tiny needle in an enormous haystack. That’s often what machine learning algorithms have to do with raw sensor data.

For instance, a clinical trial for an epilepsy detection wearable collected over 4,000 hours of data, yet less than three minutes of that data actually contained information about an epilepsy episode – a mere 0.001% of the total. This means that millions of observations were transmitted and stored without providing any useful insights. Feeding such vast, “information-diluted” datasets into machine learning models not only consumes immense storage space but also drains computational resources unnecessarily, as the algorithms expend effort sifting through mostly irrelevant data.
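As a back-of-the-envelope check of that proportion, the short calculation below takes the two figures quoted above and treats "less than three minutes" as an upper bound of exactly three minutes. The variable names are illustrative only.

```python
# Figures quoted in the text; three minutes is used as an upper bound.
total_hours = 4000
episode_minutes = 3

total_minutes = total_hours * 60
episode_fraction = episode_minutes / total_minutes
episode_percent = episode_fraction * 100

# Roughly one thousandth of one percent of the recording.
print(f"Share of the recording with an episode: {episode_percent:.5f}%")
```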

Our approach, developed at Fundació i2CAT, focuses on transforming raw sensor data into a much smaller, yet equally informative, representation.

Instead of storing every single data point, our method intelligently divides a signal into smaller, meaningful chunks. We identify key points where the signal reaches a peak or valley (extrema) or where its overall trend significantly changes (change points). Each of these segments is then represented by just a handful of “linear regression coefficients” – essentially, a simple mathematical description of that segment’s starting value, slope, and length. This clever “encoding” process drastically reduces the volume of data that needs to be stored and transferred, directly contributing to lower energy consumption.
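To make the idea more concrete, here is a minimal sketch in Python with NumPy. It is not the i2CAT implementation: it splits the signal only at local extrema (change-point detection is left out), fits each segment with an ordinary least-squares line, and stores a (start value, slope, length) triple per segment. The function names `find_breakpoints`, `encode` and `decode` are hypothetical, chosen for illustration.

```python
import numpy as np


def find_breakpoints(signal):
    """Return indices of local peaks and valleys, plus both endpoints.

    Assumption: only extrema are used; trend change points are not detected.
    """
    signs = np.sign(np.diff(signal))
    # A local extremum sits where the slope sign flips between two steps.
    flips = np.where(signs[:-1] * signs[1:] < 0)[0] + 1
    return np.concatenate(([0], flips, [len(signal) - 1]))


def encode(signal):
    """Encode a 1-D signal (at least two samples) as a list of
    (start_value, slope, length) triples, one per segment."""
    signal = np.asarray(signal, dtype=float)
    breakpoints = find_breakpoints(signal)
    segments = []
    for start, end in zip(breakpoints[:-1], breakpoints[1:]):
        length = int(end - start)
        x = np.arange(length + 1)
        # Least-squares line through the segment, endpoints included.
        slope, intercept = np.polyfit(x, signal[start:end + 1], 1)
        segments.append((float(intercept), float(slope), length))
    return segments


def decode(segments):
    """Rebuild an approximate signal from the encoded segments."""
    parts = [start + slope * np.arange(length)
             for start, slope, length in segments]
    # Append the final sample of the last segment.
    start, slope, length = segments[-1]
    parts.append(np.array([start + slope * length]))
    return np.concatenate(parts)


# Toy example: a piecewise-linear signal with one peak and one valley.
toy_signal = np.array([0, 1, 2, 3, 2, 1, 0, 1, 2], dtype=float)
toy_segments = encode(toy_signal)
toy_reconstruction = decode(toy_segments)
```

On this toy signal the reconstruction matches the input, because every segment is already a straight line. On noisy real-world signals a pure extrema rule would split far too often, which is presumably where the change-point criterion and the tolerance for a small information loss come in.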

When tested with real-world physiological data collected by medical wearables, our method achieved an average data volume reduction of 7.5-fold.

For some signals, like heart rate, the compression was even more dramatic, reaching a 26-fold reduction. Crucially, this significant compression came with minimal information loss, averaging just 4.3%. Even better, this slight loss of information had no significant negative impact on the performance of a recurrent neural network trained to forecast outcomes. The encoding process took, on average, just 31 seconds, and reconstructing the data (decoding) was even faster, at around 1 second, highlighting the efficiency and practicality of our solution.

This outcome of the TEADAL project represents a vital step towards sustainable data science. By curating datasets before they reach computationally intensive machine learning models, we can significantly reduce the energy and resource demands of AI pipelines.

Looking ahead, we plan to further optimise our encoding algorithms for greater speed and compression, and to explore their application across a wider range of datasets and machine learning models. Ultimately, our goal is to quantify the energy savings offered by this data volume reduction technique, demonstrating its real-world impact on building a greener, more efficient future for artificial intelligence.