Here’s how machine learning can invade your privacy

By Jordan Awan | May 23, 2024

Machine learning has pushed the boundaries in many areas, including personalized medicine, driverless cars, and customized advertising. However, research has shown that these systems memorize parts of the data they are trained on as they learn patterns, raising privacy concerns.

In statistics and machine learning, the goal is to learn from past data to make new predictions or inferences about future data. To achieve this goal, the statistician or machine learning expert chooses a model that will capture the suspected patterns in the data. A model applies a simplifying structure to data, making it possible to learn patterns and make predictions.

Complex machine learning models have some inherent advantages and disadvantages. On the positive side, they can learn much more complex patterns and work with richer data sets for tasks such as image recognition and predicting how a particular person will respond to treatment.

However, complex models also risk overfitting the data. This means that they make accurate predictions about the data they were trained on, but they begin to learn aspects of those data that are not directly relevant to the task at hand. The result is a model that does not generalize, meaning it performs poorly on new data that is of the same type as the training data but not exactly the same.
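To make the idea concrete, here is a toy sketch (not from the original article) that fits a simple straight line and a very flexible ninth-degree polynomial to the same small, noisy data set. The flexible model nearly interpolates the training points but typically does worse on new points drawn the same way; the data and degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy linear relationship.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + 1 + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + 1 + rng.normal(scale=0.2, size=x_test.size)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)   # fit the model
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # The degree-9 model drives training error toward zero but
    # typically has a larger error on the new (test) points.
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```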

While techniques exist to address the prediction error associated with overfitting, there are also privacy concerns that come with being able to learn so much from data.

How do machine learning algorithms make inferences?

Each model has a certain number of parameters. A parameter is an element of a model that can be changed. Each parameter has a value or setting that the model derives from the training data. Parameters can be thought of as different knobs that can be turned to affect the performance of the algorithm. While a straight-line model has only two knobs, its slope and intercept, machine learning models have a very large number of parameters. For example, the GPT-3 language model has 175 billion.
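As a rough illustration, assuming nothing beyond standard Python and NumPy, the snippet below contrasts the two knobs of a straight-line model with the parameter count of a small, hypothetical neural network. The architecture here is arbitrary and only meant to show how quickly the number of knobs grows.

```python
import numpy as np

# A straight-line model has exactly two "knobs": slope and intercept.
def line(x, slope, intercept):
    return slope * x + intercept

# Even a tiny fully connected network has far more parameters:
# every weight and bias is one knob the training procedure can turn.
layer_sizes = [10, 64, 64, 1]                  # hypothetical architecture
n_params = sum(m * n + n for m, n in zip(layer_sizes[:-1], layer_sizes[1:]))
print(n_params)  # 10*64+64 + 64*64+64 + 64*1+1 = 4929 parameters
```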

To select parameters, machine learning methods use training data and aim to minimize the prediction error on that data. For example, if the goal is to predict whether a person will respond well to a particular medical treatment based on their medical history, the machine learning model makes predictions on data for which the model's developers know whether someone responded well or poorly. The model is rewarded for correct predictions and penalized for incorrect predictions, which causes the algorithm to adjust its parameters, that is, to turn some of the "knobs," and try again.
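A minimal sketch of this loop, using made-up data and a simple logistic model standing in for "responds well to treatment," might look like the following. The data, learning rate, and number of iterations are illustrative assumptions, not details from the article.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: one feature summarizing a patient's history,
# and a 0/1 label for whether they responded well to treatment.
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(float)

slope, intercept = 0.0, 0.0        # the two "knobs", initially arbitrary
lr = 0.1

for _ in range(500):
    # Predict a probability of responding well (logistic model).
    p = 1 / (1 + np.exp(-(slope * x + intercept)))
    # Prediction error drives the update: small when right, large when wrong.
    grad_slope = np.mean((p - y) * x)
    grad_intercept = np.mean(p - y)
    # Turn the knobs a little in the direction that reduces the error, then repeat.
    slope -= lr * grad_slope
    intercept -= lr * grad_intercept

print(f"learned slope={slope:.2f}, intercept={intercept:.2f}")
```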

Machine learning models are also checked against a validation dataset to avoid overfitting the training data. The validation dataset is a separate dataset that is not used in the training process. By checking the performance of the machine learning model on this validation dataset, developers can ensure that the model is able to generalize its learning beyond the training data and avoid overfitting.
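In code, this check usually amounts to holding back a slice of the data before training begins, along the lines of this sketch; the dataset sizes and the 80/20 split are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Hold out a validation set that the training procedure never sees.
idx = rng.permutation(len(y))
train_idx, val_idx = idx[:800], idx[800:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# ... fit the model on (X_train, y_train) only ...
# Then compare accuracy on the training set with accuracy on the validation
# set; a large gap between the two is the usual symptom of overfitting.
```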

Although this process succeeds in ensuring that the machine learning model performs well, it does not directly prevent the machine learning model from memorizing the information in the training data.

Privacy concerns

Due to the large number of parameters in machine learning models, there is the potential for the machine learning method to memorize some of the data it is trained on. In fact, this is a common phenomenon, and users can extract the memorized data from the machine learning model by using queries tailored to retrieve it.
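As a toy illustration of memorization, consider a nearest-neighbour "model," which effectively stores its training set: a query tailored to one person's known feature value returns that person's private outcome. This sketch uses invented data and is illustrative only; it is not a description of any particular system.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sensitive training data: each row is one person's feature
# value and a private outcome (say, a lab result).
features = rng.normal(size=50)
outcomes = rng.normal(size=50)

# A 1-nearest-neighbour "model" effectively memorizes its training set.
def predict(x):
    return outcomes[np.argmin(np.abs(features - x))]

# A tailored query at a person's known feature value returns exactly that
# person's private outcome -- the model leaks its training data.
target_person = 7
print(predict(features[target_person]) == outcomes[target_person])  # True
```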

If the training data contains sensitive information, such as medical or genomic data, the privacy of the individuals whose data is used to train the model may be compromised. Recent research has shown that it is actually necessary for machine learning models to memorize some aspects of the training data to achieve the best performance when solving certain problems. This shows that there may be a fundamental trade-off between the performance of a machine learning method and privacy.

Machine learning models also make it possible to predict sensitive information using seemingly non-sensitive data. For example, Target was able to predict which customers were likely to be pregnant by analyzing the purchasing habits of customers enrolled in its baby registry. Once the model was trained on this data set, Target was able to send pregnancy-related ads to customers it suspected were pregnant because they had purchased products such as supplements or unscented lotion.

Is it possible to protect privacy?

While there are many methods proposed to reduce memorization in machine learning methods, most have been largely ineffective. Currently, the most promising solution to this problem is to impose a mathematical limit on privacy risk.

The state-of-the-art method for formal privacy protection is differential privacy. Differential privacy requires that a machine learning model does not change much if one individual's data is changed in the training dataset. Differential privacy methods achieve this guarantee by introducing additional randomness into the learning algorithm that "obscures" the contribution of any particular individual. When a method is protected by differential privacy, no possible attack can violate that privacy guarantee.
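One common way to introduce that randomness during training is to clip each individual's gradient contribution and add noise before updating the parameters, in the style of differentially private stochastic gradient descent (DP-SGD). The sketch below reuses the toy logistic model from earlier; the clipping bound and noise scale are illustrative assumptions, and a real implementation would also need careful privacy accounting, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(5)

# Same toy setup as before: one feature per person, a 0/1 outcome.
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(float)

slope, intercept = 0.0, 0.0
lr, clip_norm, noise_scale = 0.1, 1.0, 1.0   # illustrative settings

for _ in range(500):
    p = 1 / (1 + np.exp(-(slope * x + intercept)))
    # Per-example gradients for (slope, intercept), one row per person.
    g = np.stack([(p - y) * x, (p - y)], axis=1)          # shape (n, 2)
    # Clip each person's gradient to bound their individual contribution.
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    g = g / np.maximum(1.0, norms / clip_norm)
    # Average, then add noise calibrated to the clipping bound.
    noisy_grad = g.mean(axis=0) + rng.normal(
        scale=noise_scale * clip_norm / len(y), size=2)
    slope -= lr * noisy_grad[0]
    intercept -= lr * noisy_grad[1]

print(f"DP-trained slope={slope:.2f}, intercept={intercept:.2f}")
```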

Even if a machine learning model is trained using differential privacy, that does not prevent it from making sensitive inferences such as those in the Target example. To prevent these privacy violations, all data transmitted to the organization must be protected. This approach is called local differential privacy, and Apple and Google have implemented it.
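A classic local differential privacy mechanism is randomized response: each person flips their own answer with some probability before sending it, so the collector never sees the raw value but can still estimate population-level statistics. The sketch below is a minimal illustration with an assumed privacy parameter (epsilon) of 1; it is not the specific mechanism Apple or Google use.

```python
import numpy as np

rng = np.random.default_rng(6)

def randomized_response(true_answer: bool, epsilon: float) -> bool:
    """Each person randomizes their own answer before sending it, so the
    collector never receives the raw value (local differential privacy)."""
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1)
    return true_answer if rng.random() < p_truth else not true_answer

# Hypothetical sensitive bit per person (e.g., "has condition X").
true_bits = rng.random(10_000) < 0.3
reported = np.array([randomized_response(b, epsilon=1.0) for b in true_bits])

# The collector can still estimate the population rate by inverting the noise.
p = np.exp(1.0) / (np.exp(1.0) + 1)
estimate = (reported.mean() - (1 - p)) / (2 * p - 1)
print(f"true rate {true_bits.mean():.3f}, estimated {estimate:.3f}")
```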

Differential privacy inhibits memorization because it limits how much a machine learning model can depend on any one individual's data. Unfortunately, it also limits the performance of the machine learning method. Because of this trade-off, there are criticisms of the usefulness of differential privacy, since it often leads to a significant decrease in performance.

Going forward

The tension between inferential learning and privacy concerns ultimately raises a societal question of which matters more, and in which contexts. When the data does not contain sensitive information, it is easy to recommend using the most powerful machine learning methods available.

But it’s important to weigh the consequences of privacy leaks when working with sensitive data, and it may be necessary to sacrifice some machine learning performance to protect the privacy of the people whose data trains the model.

This article is republished from The Conversation, an independent, nonprofit news organization providing facts and analysis to help you understand our complex world.

Written by Jordan Awan, Purdue University.


Jordan Awan receives funding from the National Science Foundation and the National Institutes of Health. He also serves as privacy counsel for the federal nonprofit MITRE.
