Expert Insights: How to Protect Sensitive Machine Learning Training Data Without Borking It

Previous columns in this series introduced the problem of data protection in machine learning (ML), emphasizing the real challenge posed by operational query data. That is, when you use an ML system, you likely face more data exposure risk than when you train one in the first place.

By my rough estimate, data accounts for at least 60% of the known machine learning security risks identified by the Berryville Institute of Machine Learning (BIML). That share of risk (the 60%) splits roughly nine-to-one between exposure of operational data and exposure of training data. Training data thus represents a minority of ML data risk, but a significant minority. The upshot is that we need to devote real energy to mitigating the operational data risk problem posed by ML, discussed earlier in this series, and we also need to consider training data exposure.
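To put rough numbers on that split: nine-to-one within the 60% works out to something like 54% of all identified risk tied to operational data and about 6% tied to training data.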

Interestingly, though, everyone in the field seems to talk only about protecting training data. So why all the fuss? Remember, the essential fact about ML is that the algorithm that does all the learning is really just an instantiation of the data in machine-executable form!

So if your training set includes sensitive data, then by definition the machine you build from that data (using ML) includes sensitive information. And if your training set includes biased or regulated data, then by definition the machine you build from those data elements (using ML) includes biased or regulated information. And if your training set includes confidential corporate data, then by definition the machine you build from those pieces of data (using ML) includes confidential corporate information. Etc.

The algorithm is the data, and it becomes the data through training.

Clearly, the ML field's focus on protecting training data has merit. Unsurprisingly, one of the main ideas for addressing the training data problem is to adjust the training set so that it no longer directly includes sensitive, biased, regulated, or confidential data. At one extreme, you can simply remove those data items from your training set. Slightly less drastic, but no less problematic, is the idea of adjusting the training data to hide or obscure the sensitive, biased, regulated, or confidential items.

Let’s spend some time looking at this.
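Before digging in, here is a minimal sketch of what those two blunt options, outright removal versus masking, might look like in practice. It assumes a hypothetical pandas table whose column names (ssn, race, income, label) are made up for illustration; real de-identification is much harder than dropping or hashing a column.

```python
# Minimal sketch: removing vs. obscuring sensitive fields in a training table.
# The column names and values are hypothetical.
import hashlib

import pandas as pd

df = pd.DataFrame({
    "ssn":    ["123-45-6789", "987-65-4321"],
    "race":   ["A", "B"],
    "income": [52000, 61000],
    "label":  [0, 1],
})

# Option 1: drop the sensitive columns entirely before training.
removed = df.drop(columns=["ssn", "race"])

# Option 2: obscure instead of remove, e.g., replace the identifier with a
# one-way hash and bucket the quasi-identifier. Naive hashing and bucketing
# do NOT guarantee privacy; they only hide the raw values.
obscured = df.copy()
obscured["ssn"] = obscured["ssn"].apply(
    lambda s: hashlib.sha256(s.encode()).hexdigest()[:10]
)
obscured["income"] = pd.cut(
    obscured["income"], bins=[0, 40_000, 80_000, 10**9], labels=["low", "mid", "high"]
)

print(removed)
print(obscured)
```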

Data Owner vs. Data Scientist

One of the hardest things to understand about this new machine learning paradigm is who is taking on which risks. That makes the question of where to place and enforce trust boundaries a bit tricky. For example, we need to separate and understand not only operational data and training data, as described above, but also who has (and who should have) access to the training data.

Even worse, determining whether any of the training data elements are biased, reveal protected class membership, are protected by law, are regulated, or are otherwise confidential is a thornier question still.

First things first: someone generated the potentially worrisome data in the first place, and that party owns those data elements. So the data owner may end up holding a pile of data it is responsible for protecting, such as race information, Social Security numbers, or photos of faces.

In most cases, the data owner is not the same entity as the data scientist, who wants to use the data to train a machine to do something interesting. That means security people must recognize a meaningful trust boundary between the data owner and the data scientist who trains the ML system.

In many cases, the data scientist needs to be kept away from the “radioactive” training data that the data owner controls. So how would that work?

Differential Privacy

Let's start with the worst approach to protecting sensitive training data: doing nothing at all. Or, perhaps even worse, intentionally doing nothing while pretending to do something. To illustrate the problem, consider how Facebook (now Meta) has handled the facial recognition data it harvested from its users over the years. Facebook built a facial recognition system using many photos of its users' faces. Many people consider this a serious privacy problem. (There are also real concerns about racial bias in facial recognition systems, but that's a topic for another article.)

After facing privacy pressure over its facial recognition system, Facebook built a data transformation system that converts raw facial data (images) into vectors. The system is called Face2Vec, and each face has a unique Face2Vec representation. Facebook then said it deleted all of the face images, even though it kept the huge Face2Vec data set. Note that, mathematically speaking, Facebook did nothing to protect user privacy; rather, it retained a uniquely identifying representation of the data.
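To see why that falls short, here is a toy sketch (NumPy only) of how a stored embedding can still re-identify a person even after every raw image is deleted. The vectors and names below are made up and merely stand in for face embeddings such as the Face2Vec representations described above.

```python
# Toy illustration: a stored embedding can re-identify a person without
# any raw images. The vectors and names are made up for this sketch.
import numpy as np

# Pretend these are face-embedding vectors the platform kept, one per user.
gallery = {
    "alice":   np.array([0.9, 0.1, 0.3]),
    "bob":     np.array([0.2, 0.8, 0.5]),
    "charlie": np.array([0.4, 0.4, 0.9]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe: np.ndarray) -> str:
    """Return the gallery identity whose embedding is most similar to the probe."""
    return max(gallery, key=lambda name: cosine(gallery[name], probe))

# A new photo of the same person, run through the same embedding model,
# lands near that person's stored vector and is matched immediately.
new_photo_embedding = np.array([0.88, 0.12, 0.28])
print(identify(new_photo_embedding))  # -> "alice"
```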

One of the most common approaches to actually doing something about privacy is differential privacy. Simply put, differential privacy aims to protect particular data points by statistically "munging" the data set so that individually sensitive points are no longer present, yet the ML system built from it still works. The trick is to maintain the power of the resulting ML system even though the training data have been munged through a process of aggregation and "fuzzification." If the data are munged too much, the ML system can no longer do its job.
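For intuition, here is a minimal sketch of the textbook building block behind this idea, the Laplace mechanism applied to a simple counting query (NumPy only; the records and epsilon values are made up). Noise calibrated to the query's sensitivity and a privacy budget epsilon hides any single individual's contribution while keeping the aggregate useful.

```python
# Minimal sketch of the Laplace mechanism, the textbook building block of
# differential privacy. The records and epsilon values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(values, predicate, epsilon: float) -> float:
    """Differentially private count of records satisfying `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 71, 29, 65, 48, 80, 55]  # toy "sensitive" records

# Smaller epsilon means more noise: stronger privacy, less accuracy.
print(dp_count(ages, lambda a: a >= 65, epsilon=0.5))
print(dp_count(ages, lambda a: a >= 65, epsilon=5.0))
```

In a real ML pipeline, the same calibrated-noise principle is applied to aggregate statistics or to training gradients rather than to a single count, but the privacy-versus-accuracy trade-off works the same way.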

If a user of the ML system can determine whether a particular individual's data were in the original training set (an attack known as membership inference), the data were not munged enough. Note that differential privacy works by modifying the sensitive data set itself before training.
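To make membership inference concrete, here is a simplified loss-threshold attack sketch (scikit-learn and NumPy; the synthetic data, model choice, and threshold are all illustrative assumptions). Real attacks are more sophisticated, but the principle is the same: an overfit model is noticeably more confident on records it was trained on.

```python
# Simplified membership-inference sketch: an overfit model tends to assign
# lower loss to its own training records than to unseen records.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

# A deliberately overfit-prone model stands in for the target ML system.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_in, y_in)

def per_record_loss(model, X, y):
    probs = model.predict_proba(X)
    return np.array([log_loss([yi], [pi], labels=[0, 1]) for yi, pi in zip(y, probs)])

loss_in = per_record_loss(model, X_in, y_in)     # records the model trained on
loss_out = per_record_loss(model, X_out, y_out)  # records it never saw

# Attack: guess "member" whenever the per-record loss falls below a threshold.
threshold = np.median(np.concatenate([loss_in, loss_out]))
guesses = np.concatenate([loss_in, loss_out]) < threshold
truth = np.concatenate([np.ones(len(loss_in)), np.zeros(len(loss_out))]).astype(bool)
print("attack accuracy:", (guesses == truth).mean())  # well above 0.5 means leakage
```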

Another approach being researched (and marketed) is to adjust the training process itself to mask sensitive information in a training data set. The essence of the approach is to use the same kind of mathematical transformation at training time and at inference time to protect against exposure of sensitive data (including through membership inference).

Relying on the mathematical idea of mutual information, this approach involves adding Gaussian noise only to non-conducive features, so that the data set is obscured but its inferential power remains intact. The core of the idea is to build an obfuscated internal representation at the sensitive feature layer.
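As a rough sketch of the flavor of this idea (not the specific published method), the snippet below scores each feature's relevance with mutual information and adds Gaussian noise only to the low-relevance features; the threshold, noise scale, and toy data are all assumptions made for illustration.

```python
# Rough sketch: add Gaussian noise only to features that contribute little to
# the prediction task (low mutual information with the label), leaving the
# informative features intact. Threshold, noise scale, and data are made up.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Toy data: feature 0 drives the label; features 1 and 2 do not.
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

mi = mutual_info_classif(X, y, random_state=0)  # one relevance score per feature

def obfuscate(X, mi_scores, threshold=0.05, noise_scale=1.0):
    """Return a copy of X with Gaussian noise added to low-MI features."""
    X_noisy = X.copy()
    low_info = np.where(mi_scores < threshold)[0]
    X_noisy[:, low_info] += rng.normal(scale=noise_scale,
                                       size=(X.shape[0], len(low_info)))
    return X_noisy

X_obfuscated = obfuscate(X, mi)
print("mutual information per feature:", np.round(mi, 3))
# The informative feature (index 0) is untouched, so a model trained on
# X_obfuscated retains most of its predictive power while the rest is noised.
```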

One interesting thing about targeted feature obfuscation is that it can help protect a data owner from data scientists by preserving the trust boundary that often exists between them.

Integrating Security In

Does all this mean the problem of sensitive training data is solved? Not at all. The challenge facing any new field remains: the people who build and use ML systems must integrate security into their work. In this case, that means recognizing and mitigating the risks of training data sensitivity as they design and build their systems.

The time has come to do so. If we build a slew of ML systems with huge built-in data exposure risks, well, we’ll get what we asked for: another security disaster.
