In this article from the December 2022 edition of the Privacy Law Bulletin, expert author Dr Jonathon Cohen discusses Australia’s rapidly changing privacy law framework. He also argues that lawyers and their clients would benefit from more clarity around the extent to which amendments to the Privacy Act might impact corporate governance requirements.
It covers three main questions. What are the key privacy risks associated with artificial intelligence and machine learning models? What is privacy leakage, and what are its implications? And what does the future of consumer control over data hold, and what needs to change to shape it?
Artificial intelligence and machine learning models introduce novel privacy risks, including leakage of personal information, and challenges in removing customers’ data from a model. We provide an overview for lawyers of how artificial intelligence and machine learning models are constructed and used. We also set out what privacy issues arise and how these characteristics may relate to proposed changes to the Privacy Act 1988 (Cth).
Introduction and context
The Australian Government’s Discussion Paper, released as part of the government’s review of the Privacy Act, puts forward detailed reform options in pursuit of its aim to “ensure privacy settings empower consumers, protect their data and best serve the Australian economy”.
This article discusses how and why the proposed reforms may impact organisations that use artificial intelligence and machine learning algorithms and models. We provide an overview of how these algorithms are constructed and used, where privacy issues arise and how these characteristics may relate to proposed changes in the Discussion Paper. It is written from the perspective of a data science expert, with a focus on subtle technical issues that have a particular bearing on privacy and that may not be widely recognised outside of the artificial intelligence and machine learning research community.
Artificial intelligence and machine learning models
“Artificial Intelligence” refers to the ability of computers to perform tasks that we would normally associate with human intelligence. The current main approach to achieving this is to create systems based on a class of algorithms known as “machine learning”. These infer patterns of behaviour from data and differ from the previous generation of systems, which relied on painstaking construction of deductive rules and hand encoding of human knowledge. The acceleration of computational power in recent years has allowed machine learning approaches to make rapid strides in domains such as understanding language and recognising images.
In practical terms within commercial and government contexts, “artificial intelligence” and “machine learning” are often used interchangeably. Both terms are used to refer to a group of related algorithmic techniques that identify patterns in historical data that can be generalised to current and future populations and systems.
Most often, the objective is to create a model that uses known information about an individual to infer a quantity of interest. For example, a model may seek to predict the recidivism risk of a defendant using data from historical cases where a defendant did or did not re-offend, and the characteristics of the defendant in each case, such as the type of crime and the defendant’s demographics and criminal history. The data used in constructing the model is known as the training data.
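To make this concrete, the sketch below trains a simple classifier of this kind on a small, entirely hypothetical table of historical cases. The column names, the data and the choice of scikit-learn are assumptions made for illustration only; they do not describe any real recidivism model.

```python
# Illustrative sketch only: fitting a predictive model to hypothetical
# historical case data. All values, column names and the library choice
# are invented for the example.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: characteristics of past defendants and
# whether each one re-offended (the observed outcome).
training_data = pd.DataFrame({
    "age":              [19, 34, 45, 23, 52, 31],
    "prior_offences":   [2, 0, 5, 1, 0, 3],
    "offence_severity": [3, 1, 4, 2, 1, 3],
    "reoffended":       [1, 0, 1, 0, 0, 1],
})

features = training_data[["age", "prior_offences", "offence_severity"]]
outcome = training_data["reoffended"]

# The model infers a general pattern from the historical cases ...
model = LogisticRegression().fit(features, outcome)

# ... which can then be applied to a new individual to infer the quantity
# of interest, here a predicted probability of re-offending.
new_defendant = pd.DataFrame(
    [{"age": 28, "prior_offences": 1, "offence_severity": 2}]
)
print(model.predict_proba(new_defendant)[0][1])
```

The point to note for privacy purposes is that the model is derived entirely from the training data: everything it “knows” has been learned from records about identifiable individuals.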
Models such as these are pervasive across modern business and government. They predict how likely you are to purchase an item at a particular price, assess the likelihood of an insurance claim being fraudulent, and estimate the quantum of future social welfare benefits that an individual may receive over their lifetime.
Models are used for many purposes. They may be used directly in the operations of an organisation, such as recommending grocery items during online checkout, triaging injured people to the most appropriate model of care, or providing input to sentencing decisions in a courtroom. Or they may be used to support internal decision-making, such as informing government policy or organisational strategy.
Privacy questions relating to models
Privacy questions arise at several stages in the construction, storage and use of models: when personal information is collected and used as training data, when the model itself retains traces of that information, and when the model is applied to produce inferences about individuals.
Of the proposed changes in the Discussion Paper, two areas have particular relevance for models and associated algorithms: amending the definition of personal information, and providing consumers with more control over how their data is used. We consider each in turn.
Amending the definition of personal information
The Discussion Paper proposes to amend the definition of personal information to make clear that it includes inferred personal information, which it defines as “information collected from a number of sources which reveals something new about an individual”. This cuts to the core of a model’s primary purpose: inferring new information from available data.
If adopted, it is plausible that most inferences produced by models would fall under this revised definition of personal information. This would place additional governance requirements on organisations that currently apply different levels of risk control to different types of data, depending on the associated privacy risks.
One question that arises is whether models themselves (rather than the inferences they produce) would also attract additional governance requirements under a definition that includes inferred information, given that a model provides the functionality for producing inferred information.
The concept of privacy leakage
More subtly, the models themselves may continue to carry personal information from the training datasets, a phenomenon known as “privacy leakage”. In this scenario, a user might be able to recover information from the training data given only access to the model and limited information about an individual of interest.
Representative examples of privacy leakage include the following:
• a drug dosage prediction model from which researchers were able to recover sensitive genetic information about patients in the training data, given only access to the model and basic demographic details about those patients
• large language models that memorise personal information contained in their training data and can reproduce it in response to suitably crafted queries
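To illustrate how such leakage can manifest, the sketch below shows one simple signal that leakage attacks exploit: a flexible model that has effectively memorised its training records tends to be measurably more confident on those records than on records it has never seen. The data, model and library used here are hypothetical and purely illustrative, and are not drawn from the research described above.

```python
# Illustrative sketch only: measuring the confidence gap between records a
# model was trained on and records it has never seen. A persistent gap is
# the signal that so-called membership inference attacks exploit.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical training data: 200 individuals, 5 attributes, noisy binary outcome.
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# A flexible model fitted to noisy data will memorise much of its training set.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

def top_confidence(record):
    """The model's confidence in its most likely prediction for one individual."""
    return model.predict_proba(record.reshape(1, -1)).max()

# Average confidence on individuals who WERE in the training data ...
member_conf = np.mean([top_confidence(x) for x in X_train[:50]])
# ... versus individuals who were NOT.
non_member_conf = np.mean([top_confidence(rng.normal(size=5)) for _ in range(50)])

print(f"average confidence on training-set members: {member_conf:.2f}")
print(f"average confidence on unseen individuals:   {non_member_conf:.2f}")
```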
Protecting against the sorts of privacy leakage described above is surprisingly challenging. For instance, research indicates that memorisation by a model of its training data can occur, and that sophisticated approaches to reduce this direct but unintended risk are often ineffective.
One alternative is a “differential privacy” approach that adds noise to the training data to provide strong privacy guarantees. However, this approach requires a trade-off: “high-noise” models typically increase privacy protection but may reduce model utility to the point of futility, while “low-noise” models preserve more utility but fail to substantially reduce the risk of privacy leakage. The drug dosage prediction model research discussed above also demonstrated how differential privacy techniques applied to protect genetic privacy substantially interfered with the main purpose of the model, increasing the risk of negative patient outcomes such as strokes, bleeding events and mortality beyond acceptable levels.
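The trade-off can be seen even in the simplest differentially private computation. The sketch below applies the standard Laplace mechanism to a single count at several noise levels; it is a hypothetical illustration of the general principle, not the training-data perturbation used in the research discussed above, and the dataset and epsilon values are invented for the example.

```python
# Illustrative sketch only: the privacy/utility trade-off in differential
# privacy, shown with the Laplace mechanism on a simple count. A smaller
# epsilon means more noise and stronger privacy, but a less useful answer.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sensitive dataset: 1 = the individual has the attribute of interest.
data = rng.integers(0, 2, size=1000)
true_count = int(data.sum())

def dp_count(values, epsilon):
    """Release a count with Laplace noise calibrated to a sensitivity of 1."""
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return values.sum() + noise

print(f"true count: {true_count}")
for epsilon in (10.0, 1.0, 0.1, 0.01):
    print(f"epsilon={epsilon:>5}: released count = {dp_count(data, epsilon):.1f}")
# "Low-noise" releases (large epsilon) stay close to the true value but offer
# weaker protection; "high-noise" releases (small epsilon) protect individuals
# strongly but can drift far from the truth. The same tension arises when
# noise is injected into model training data.
```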
These observations indicate that privacy leakage is likely to be a risk in many models, particularly as the use of complex machine learning methods such as deep learning models with up to billions of internal parameters continues to grow.
Providing more control to consumers in how their data is used
The Discussion Paper lists several proposed changes that would provide consumers with increased control over their personal information, including a right to request the erasure of their personal information and a right to object to, or withdraw consent for, the collection and use of that information.
It is likely that these changes would require organisations to develop procedures for removing customers’ information from models, or risk penalties. For example, in March 2022 the United States Federal Trade Commission ordered WW International (formerly known as Weight Watchers) to destroy models and algorithms built using personal information from children as young as eight, which had been collected without parental consent.
Removing a customer’s or a group of customers’ data from a model would typically require rebuilding the model on a new set of training data that excludes the relevant customers’ data. There are many practical challenges with this, including the following:
• identifying which models an individual’s data contributed to, which requires detailed records linking training datasets to deployed models
• the computational cost and time involved in retraining, particularly for large or frequently refreshed models
• the possibility that the retrained model behaves differently from the original, affecting downstream systems and decisions that relied on its outputs
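As a rough illustration of what honouring an erasure request can involve under current techniques, the sketch below filters the withdrawn individuals out of a hypothetical training table and refits the model from scratch. The customer identifiers, column names and use of scikit-learn are assumptions for the example only.

```python
# Illustrative sketch only: removing customers' data from a model by
# retraining it without their records. All data and identifiers are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

training_data = pd.DataFrame({
    "customer_id":  [101, 102, 103, 104, 105, 106],
    "age":          [25, 41, 37, 29, 55, 33],
    "tenure_years": [1, 8, 5, 2, 12, 4],
    "churned":      [1, 0, 0, 1, 0, 1],
})

FEATURES = ["age", "tenure_years"]

def fit_model(df):
    return LogisticRegression().fit(df[FEATURES], df["churned"])

original_model = fit_model(training_data)

# Customers who have withdrawn consent or requested erasure (hypothetical IDs).
# Honouring the request means (a) knowing exactly which individuals' records
# contributed to the model and (b) refitting on the remaining records.
erasure_requests = {102, 105}
remaining = training_data[~training_data["customer_id"].isin(erasure_requests)]
retrained_model = fit_model(remaining)

# The retrained model may behave differently from the original, so downstream
# systems and decisions that relied on the original model may also need review.
```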
These overheads and considerations mean that organisations will likely require time to adjust to any new privacy-related requirements. They may need to rebuild their relevant infrastructure, including modelling processes and data systems, so that they appropriately capture, categorise and apply individual customer permissions with sufficient efficiency.
Conclusion
Artificial intelligence and machine learning models introduce novel privacy risks, including leakage of personal information, and challenges in removing customers’ data from a model.
Lawyers and their clients alike would benefit from clarity around the extent to which amendments to the Privacy Act will impact governance requirements, and from sufficient time to review and amend their processes to meet those requirements.