AI, Large Language Models and Data Protection
18th July 2024
In recent years, Artificial Intelligence (AI) and, in particular, chatbots based on Generative AI (Gen-AI) have become popular products on the internet and are often freely available to the public. Generally, these chatbot systems are able to respond accurately to input questions or tasks because they use an underlying model that has been through a process of training, often based on many large datasets, sometimes including data that is publicly accessible on the internet.
These systems, known as Large Language Models (LLMs), often use a process known as Natural Language Processing (NLP) to learn and mimic how human speech is naturally used [1]. Such LLMs can be packaged and tuned for many different use cases, including chat [2], internet search, creative writing, assistance with writing software, creating multimedia, assisting with essay writing, providing possible answers to maths or science problems, and a range of other purposes.
In addition to Generative AI language model systems, there are many other AI systems and products available for use or purchase that perform other kinds of tasks. These include summarising documents or extracting keywords; assisting with industrial, financial, legal, educational, retail, media, advertising, medical or scientific analysis; and performing repetitive, simple or high-volume tasks.
Some of these AI systems can also be further specialised by “re-training”, “augmenting” or “fine-tuning” them with datasets related to a particular task or with particular information that you have. Often this is done by using cloud-based AI products or by licensing an existing model. Sometimes these specialised models can then be offered to others as a new product, one which now includes your specialised information embedded in it. Due to their versatility, these types of technologies can potentially be more attractive and useful to some individuals at home or in the workplace.
In using AI systems like this, there is potential for personal data processing both in the development of an AI model and in its further use. Additionally, depending on the context and scope of the personal data processing, there may be associated risks that people and organisations need to be aware of from the start, for example:
- There may be a tendency to use large amounts of personal data during any training phases, sometimes unnecessarily and without your knowledge, agreement or permission
- There may be issues for you or others arising from the accuracy or retention of personal data used (or generated) – for example, in situations where the outputs of AI systems are used as part of a process to make decisions
- There may be issues for you if models based on your personal data are shared with others for purposes you are not aware of or do not agree with, or with parties who do not properly secure the data
- Inaccurate or incomplete training data may cause biases in AI systems that lead to decisions affecting your rights, or those of others, in some way
- Where new personal data becomes part of the training data for new fine-tuned versions of a model, this may expose you or others to these kinds of risks where you were not exposed before
So, whether you are an individual or an organisation, from the very beginning you may need to be aware of what you are using, how you interact with it and what the consequences are, regardless of the useful or creative results you may get from doing so. Where personal data is involved, the GDPR and data protection regulations come into play for individuals and organisations using AI systems, as well as for the model/product providers. Some of these considerations are set out below.
Risks to the organisation using AI systems
In order for an organisation to assess whether using AI products or systems is appropriate for them, the risks associated with any processing of personal data by a chosen AI system must first be recognised, so that the organisation can ensure the processing is compliant with the GDPR. Because of their recent increase in availability and accessibility, AI products trained on personal data, or into which personal data is input by your staff, may pose new risks to organisations and to data subjects that were not previously recognised or considered. These risks can also vary depending on whether you are using a standalone product or a cloud or internet-connected service.
As a user of an AI product relying on personal data, your organisation could be a data controller and, if so, a formal risk assessment should be considered. Before you start using an AI system, you should first understand what personal data it uses, how it uses it, where the personal data goes in situations where a third party is involved in the processing, whether it is retained by the provider of the AI product or re-used in any way, and how the product allows you to meet your GDPR obligations. The provider's documentation should tell you this clearly, in an understandable and accessible form.
Some of the risks in using AI products, whether supplied by a third party or developed and used by your own organisation, include:
- Risks can arise from unwanted, unneeded or unanticipated processing of personal data input to or used to train or fine tune an AI model. This may impact or involve several principles of the GDPR, including but not limited to the lawfulness, fairness and transparency principle and the purpose limitation principle.
- You should put processes in place to facilitate the exercise of data subject rights related to your engagement with AI products. This is especially the case if you are inputting your own or others’ personal data and need to know where it is going or how it is being processed. If someone asks you for access to their personal data, or to delete the personal data you hold about them within the AI system – can you do it?
- Where your organisation uses an AI product supplied by a third party, additional security or other data protection risks may arise from the personal data that your employees (or your data subjects) input to the AI tool. You need to fully understand how personal data is protected both when your organisation processes it and when you instruct another organisation to do so on your behalf.
- Some AI models have inherent risks relating to the way in which they respond to inputs or “prompts”, such as memorisation, which can cause passages of (personal) training data to be unintentionally regurgitated by the product. If so, your organisation may be unexpectedly or unnecessarily further processing data related to identifiable people and, as with AI training, you need to consider your GDPR obligations and how you support GDPR user rights.
- Sometimes, AI products rely on a process of “filtering” to prevent certain types of data (such as personal data, inappropriate content and copyrighted material) from being provided to a user in response to a query or prompt. In some cases, those filters can be attacked and circumvented to cause such data to be made available or processed in unintended, unauthorised, insecure or risky ways. Like other security risks associated with personal data handling under the GDPR, you need to understand whether this is possible, to what extent you and your AI supplier are obligated to mitigate it, and how that is effectively accomplished.
- AI products – such as LLMs – can be prone to producing inaccurate or biased information. If the outputs of these products are relied upon without critical human analysis or intervention, then you may be introducing “automated decision making” risks. Purely automated decisions have the potential to cause harm, significant consequences or limitation of rights for the individuals involved.
- Without a retention schedule and associated processes, your organisation risks non-compliance with the principle of ‘storage limitation’.
- Consider whether you publish personal data on your own website, whether from your staff or from your website users. You may need to protect that personal data from being collected and used for AI training or other processing where you have not already agreed that purpose with your staff or users, or where they do not have a reasonable expectation that it will be used for AI training.
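On the last point above, one common way to signal that website content should not be collected by AI crawlers is a robots.txt file. This is a request rather than an enforcement mechanism – crawlers can choose to ignore it – and the user-agent tokens shown below (OpenAI's GPTBot, Google's Google-Extended control and Common Crawl's CCBot) are examples published by those operators at the time of writing, not an exhaustive list:

```text
# robots.txt — asks well-behaved AI crawlers not to collect this site's content.
# Compliance is voluntary; this is a signal, not a technical control.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

Because honouring robots.txt is voluntary, it should be treated as one layer alongside, not a substitute for, contractual or technical protections.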
AI Product Designers, Developers, Providers
If your organisation intends to create an AI product that uses personal data, to offer such a product to others, or to repurpose an existing AI product for a new purpose, you need to assess your GDPR compliance obligations.
As an AI product provider, you may be considered a controller or processor under the GDPR. If so, you must ensure that you meet all relevant GDPR obligations during collection, processing and operational use of the product that uses personal data. Additionally, where you intend to license it or offer it as a standalone product that interacts with personal data in any way, you should still ensure that your product facilitates data protection in its design and operation [3].
Below are some considerations you should take account of where personal data is involved in your AI product:
- Consider the purpose and goals of your processing and whether there are other, non-AI technologies or means of reaching them. These alternatives may be less risky or more appropriate for you.
- If you are creating an AI model and using publicly accessible data, remember firstly that publicly accessible personal data still falls within the scope of the GDPR.
- If you are creating an AI model with existing personal data you have collected, then you also need to consider if any proposed processing purposes are within the scope of the existing legal basis you have.
- In particular, when assessing the necessity and proportionality of the processing (including any collection of personal data, publicly accessible or not), you must account for the purposes for which people have made their personal data publicly accessible. This includes their reasonable expectations about how it can be used, beyond, for instance, others viewing or reading it.
- When using personal data, consider all the risks involved in your AI model or product design, creation and its onward usage. This may require performing a data protection impact assessment [4], especially where the technology or processing is new to you, where you are combining data sets, or where you are using or intend to use personal data related to minors or vulnerable members of society. You may also consider that it would be best practice to perform such an impact assessment anyway.
- If you have data sharing agreements with other organisations allowing their personal data to be used for training your model and sharing or licensing that model, you must ensure that you have a legal basis to process that personal data in that way, as well as ensuring that the processing is fair and transparent.
- As well as data protection obligations, you may need to consider other legal obligations, such as those relating to copyright, safety and security.
- Consider the requirement to be transparent with the data subjects whose personal data you are processing or have already processed and tell them what processing you are doing, how you are doing it, and how they can exercise their data protection rights. These include the rights of access, rectification or erasure. The ability to uphold these rights effectively must be considered at the earliest stage possible in development.
- Consider how you meet the principle of ‘storage limitation’ in respect of any personal data you process to produce your AI system, product or model.
- Evaluate the consequences and impacts if you choose to share the models you make, or to make them available as a product for others to use. Consider how this affects your GDPR obligations, others’ obligations (as per Recital 78 GDPR) and how the rights of data subjects are respected and achieved.
- Consider how you secure and protect your AI products/models and any associated personal data. Do so with regard to how they are used, both authorised and unauthorised; any possible unintended consequences; any malicious use or interaction; any impacts wider than the initial design intended; and any inherent limitations of your product.
- As a controller or processor, ensure you have appropriate personal data governance, design, policy and decision-making controls in place, particularly where the outcomes of AI processing might affect the rights and freedoms of data subjects, in accordance with GDPR accountability requirements.
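As one illustration of how the ‘storage limitation’ point above might be supported in practice, the sketch below checks records against a documented retention schedule. It is a minimal, illustrative example only – every name, category and period in it is an assumption for the example, not guidance on actual retention periods:

```python
from datetime import datetime, timedelta

# Illustrative sketch only: the categories and periods below are assumptions
# for this example, not prescribed retention periods.
RETENTION_SCHEDULE = {
    "chat_logs": timedelta(days=90),
    "support_tickets": timedelta(days=365),
}

def records_due_for_deletion(records, now=None):
    """Return the records whose retention period has expired.

    Each record is a dict with a 'category' key and a 'created' datetime.
    Records whose category has no schedule entry are kept: they need a
    documented decision, not silent deletion.
    """
    now = now or datetime.now()
    return [
        r for r in records
        if r["category"] in RETENTION_SCHEDULE
        and now - r["created"] > RETENTION_SCHEDULE[r["category"]]
    ]
```

In practice the retention schedule would be agreed and documented as policy first, with automated checks like this supporting the associated deletion processes.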
[1] As noted by ISO “By employing statistical models, machine learning and linguistic rules, NLP enables computers to perform tasks such as sentiment analysis, text classification, machine translation, chatbot development, and more.” ISO - Unravelling the secrets of natural language processing.
[2] Often known as a “prompt” interface.
[3] As per Recital 78 GDPR.
[4] Further details on data protection impact assessments, including when such assessments are mandatory, are available on the DPC website.