Protecting Sensitive Data in the Age of Large Language Models (LLMs)

Vinay Roy
Jun 19, 2024


A common concern we hear from senior leaders of early- and mid-stage companies is how to avoid leaking sensitive data, such as personally identifiable information (PII), while still allowing employees to use LLMs and other third-party AI tools. According to a survey, 71% of senior IT leaders hesitate to adopt generative AI due to security and privacy risks.

What is the risk?

Many business leaders are unaware of the risks associated with the unregulated use of AI tools. So let us first understand the risk.

In March 2023, the Economist reported three separate data leakage incidents by Samsung employees shortly after the company allowed them to use ChatGPT. OpenAI states in its FAQ that information shared by users can be used for retraining future models. OpenAI also states in its ChatGPT usage guide, ‘Do not enter sensitive information.’

This is not the only incident. Concerns and incidents related to data security and privacy with large language models (LLMs) like ChatGPT are growing. Here are a couple of cases that illustrate similar issues:

1. Doctor-Patient Data Breach with Medical LLM: A news report (source might be difficult to find due to privacy concerns) highlighted a potential data breach involving a medical LLM used in a healthcare setting. A doctor reportedly used the LLM to analyze patient data and generate reports. There are concerns that the doctor might have inadvertently included identifiable patient information in the queries submitted to the LLM.

2. AI Bias in Hiring Decisions: This isn’t a data leak, but it demonstrates a potential risk associated with using LLMs in tasks involving sensitive information. There have been reports of AI recruiting tools using biased language models, leading to discriminatory hiring practices. The language models might pick up on subtle biases present in the training data, leading to unfair evaluation of candidates. While not directly a data leak, it showcases how LLMs can perpetuate biases or make discriminatory decisions based on the data they are trained on. This is a concern when using LLMs for tasks involving sensitive information like job applications or loan approvals.

Overall:

These incidents highlight the evolving landscape of data security and privacy in the age of LLMs. Here are some key takeaways:

  • LLMs are powerful tools, but require caution: They can be incredibly useful, but it’s crucial to be mindful of the data they are exposed to, especially when dealing with confidential or sensitive information.
  • User awareness is critical: Educating users about the potential risks and best practices for using LLMs is essential. Users should be aware of what data they are sharing and how it might be used.
  • Need for robust security protocols: Developers and organizations using LLMs need to implement robust security protocols to minimize the risk of data leaks or misuse.

As LLMs become more integrated into our lives, addressing these concerns will be crucial for ensuring responsible and ethical use of this powerful technology.

How to manage the risk?

To avoid this risk, some organizations have restricted their employees from using AI tools. This, in my opinion, is more harmful than helpful. Instead, we need to evaluate the specific needs of the organization and the kinds of information and tasks these tools are used for, and then explore ways to create a secure, controlled environment for the safe use of AI tools. This approach ensures that the benefits of AI are harnessed without compromising security or confidentiality. It involves implementing robust security protocols, continuous monitoring, and regular employee training on best practices and the potential risks of AI tool usage. Below we discuss some common methodologies:

Create a robust AI policy for your organization: This is where you should start. The AI policy can sit alongside any third-party or open-source policy the organization already has. The process of creating an AI policy can be broken down into the following tasks:

  1. Assess Organizational Needs: Start by evaluating the specific needs and goals of your organization. Identify the tasks and processes where Gen AI or other AI tools can provide business, professional, or personal value. Understand the type of data you will be working with and the potential risks involved.
  2. Define Acceptable Use: Clearly outline what constitutes acceptable use of Gen AI tools within your organization. This includes:
    Permitted Applications: Specify which tasks Gen AI can be used for, such as content creation, data analysis, etc. This will require creating a whitelist of allowed AI tools and a blacklist of disallowed AI tools.
    Prohibited Usage: Identify uses that are not allowed, such as generating misleading information or content that violates company policies or laws.
  3. Compliance with Regulations: Ensure that the use of Gen AI tools complies with relevant regulations. This may include:
    GDPR: Protecting the personal data of individuals in the EU.
    CCPA: Ensuring the privacy rights of California residents.
    Other Local Laws: Adhering to local and industry-specific regulations. A major concern here is PII and other company-confidential data; how to safeguard it is discussed later in the article.
  4. Ethical Guidelines: Establish ethical guidelines for the use of Gen AI and other AI tools. This includes:
    Transparency: Being transparent about when and how Gen AI is used.
    Bias Mitigation: Implementing measures to detect and mitigate biases in AI-generated content.
    Accountability: Holding individuals accountable for the misuse of AI tools.
  5. Monitoring and Auditing: Implement continuous monitoring and auditing processes to ensure compliance with the Gen AI policy. This involves:
    Regular Audits: Conducting regular audits of AI tool usage and data handling.
    Incident Response: Having a clear incident response plan in case of data breaches or misuse of AI tools.
    Policy Review and Updates: Regularly review and update the AI policy to keep up with technological advancements and regulatory changes. This ensures that the policy remains relevant and effective.
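One way to make the whitelist and blacklist from the policy enforceable is to keep them as data and check every requested tool against them. The following is a minimal sketch under stated assumptions: the hostnames, the `check_ai_tool` function, and the three-way "allow / block / review" outcome are all hypothetical and not part of any particular product.

```python
from urllib.parse import urlparse

# Hypothetical policy data; the hostnames are placeholders, not recommendations.
ALLOWED_AI_HOSTS = {"approved-llm.example.com"}
BLOCKED_AI_HOSTS = {"unvetted-chatbot.example.net"}


def check_ai_tool(url: str) -> str:
    """Return 'allow', 'block', or 'review' for a requested AI tool URL."""
    host = (urlparse(url).hostname or "").lower()
    if host in BLOCKED_AI_HOSTS:
        return "block"
    if host in ALLOWED_AI_HOSTS:
        return "allow"
    # Tools that are not yet classified go to a human reviewer instead of
    # being silently allowed.
    return "review"
```

Sending unknown tools to review rather than allowing them by default keeps the policy conservative while still giving employees a path to request new tools.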

While these policies are a good start, one may also have to look at third-party tools and techniques that can provide additional guardrails.

These techniques may involve:

Data Minimization and Sandboxing:

  • Provide only the minimum data necessary: Don’t share more data with the LLM than is absolutely required for it to complete the task. This reduces the risk of exposing sensitive information inadvertently.
  • Sandbox environments: Consider using isolated environments, like sandboxes, to test and interact with LLMs. This can help prevent leaks from accidentally spreading to other parts of a system.
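Data minimization can be as simple as building the prompt from an explicit list of required fields instead of the full record. Here is a small sketch of that idea; the record layout, the `REQUIRED_FIELDS` set, and the helper names are assumptions made up for illustration.

```python
# Hypothetical field names for a support-ticket summarization task.
REQUIRED_FIELDS = frozenset({"product", "ticket_text"})


def minimize(record: dict, required: frozenset = REQUIRED_FIELDS) -> dict:
    """Keep only the fields the task needs; drop everything else."""
    return {key: value for key, value in record.items() if key in required}


def build_prompt(record: dict) -> str:
    """Build the prompt from the minimized record, never from the raw one."""
    safe = minimize(record)
    return (
        f"Summarize this support ticket about {safe['product']}:\n"
        f"{safe['ticket_text']}"
    )


# Illustrative record with made-up values.
record = {
    "customer_name": "Jane Doe",
    "email": "jane@example.com",
    "ssn": "123-45-6789",
    "product": "router",
    "ticket_text": "The device reboots every hour.",
}
prompt = build_prompt(record)
```

Note that field-level filtering does not inspect free text: if the ticket text itself contains PII, it still needs redaction before it is sent.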

Data Anonymization, Data Redaction, and Data Pseudonymization:

  • Anonymize sensitive data: If you must share sensitive data, explore techniques like anonymization or tokenization. Anonymization removes personally identifiable information (PII) such as names or addresses. Tokenization replaces sensitive values with random tokens that the LLM can still process as placeholders but that carry no inherent meaning on their own.
  • Automated redaction: Use or build automated tools that identify and redact sensitive information from inputs before they are processed.
  • Pseudonymization: Replace sensitive data with non-identifiable placeholders that maintain referential integrity, so the same original value always maps to the same placeholder.
  • Beware of re-identification risks: Even with anonymization, there might still be a risk of re-identification if other datasets can be used to link back to the original data.
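To make the difference between redaction and pseudonymization concrete, here is a rough sketch using regular expressions. The two patterns (email addresses and US-style SSNs), the placeholder formats, and the `Pseudonymizer` class are assumptions for illustration only. Simple patterns like these miss many kinds of PII, such as names and addresses, which is why dedicated tools exist.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Irreversibly replace matches with a generic label."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)


class Pseudonymizer:
    """Replace emails with stable placeholders that can be restored later."""

    def __init__(self) -> None:
        self._forward = {}  # original value -> placeholder
        self._reverse = {}  # placeholder -> original value

    def _placeholder(self, value: str) -> str:
        # The same value always gets the same placeholder, which keeps
        # references within the text consistent.
        if value not in self._forward:
            token = f"<EMAIL_{len(self._forward) + 1}>"
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def apply(self, text: str) -> str:
        return EMAIL_RE.sub(lambda m: self._placeholder(m.group(0)), text)

    def restore(self, text: str) -> str:
        # Map placeholders in the LLM's response back to the originals.
        for token, value in self._reverse.items():
            text = text.replace(token, value)
        return text
```

With redaction the original value is gone for good. With pseudonymization the mapping stays inside your environment, so the LLM's answer can be mapped back after it returns. That mapping is itself sensitive and should be protected accordingly.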

User Training and Awareness:

  • Educate users: Train users who interact with LLMs in data security best practices, including what information is safe to share and the potential risks involved. Regular workshops on AI ethics and best practices help here.
  • Clear guidelines and protocols: Establish clear guidelines and protocols for using LLMs, especially when dealing with sensitive data.

Model Training and Development:

  • Data cleaning and filtering: During LLM training, ensure the training data is cleaned and filtered to remove any sensitive information that could be leaked through the model.
  • Security audits and penetration testing: Regularly conduct security audits and penetration testing on LLMs to identify and address potential vulnerabilities.
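For teams that train or fine-tune their own models, the cleaning step can start with a filter that drops any record matching a known PII pattern. The sketch below assumes training records are plain strings and uses two illustrative patterns; the function names are hypothetical, and a real pipeline would likely need broader detection than regular expressions alone.

```python
import re

# Illustrative patterns only: email addresses and US-style SSNs.
PII_PATTERNS = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
]


def contains_pii(text: str) -> bool:
    """Return True if any known PII pattern appears in the text."""
    return any(pattern.search(text) for pattern in PII_PATTERNS)


def filter_training_records(records: list) -> tuple:
    """Split records into those safe to keep and a count of dropped ones."""
    kept = []
    dropped = 0
    for text in records:
        if contains_pii(text):
            dropped += 1
        else:
            kept.append(text)
    return kept, dropped
```

Returning the number of dropped records makes it easy to log how much data the filter removes and to spot patterns that are too aggressive.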

Additional Techniques:

  • Access controls and monitoring: Implement strong access controls to restrict who can use LLMs and monitor their activity to detect suspicious behavior.
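  • Encryption: Encrypt sensitive data in transit and at rest for additional security. Keep in mind that an LLM cannot meaningfully process encrypted text, so sensitive values that would otherwise appear in a prompt are better handled with redaction or tokenization.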
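Access control and monitoring can be combined in a thin wrapper that every LLM call passes through. The following is a minimal sketch, not a production design: the role names, the `guarded_call` function, and the `llm` callable (a stand-in for whatever client your organization uses) are all assumptions.

```python
import hashlib
import logging
from typing import Callable

logger = logging.getLogger("llm_audit")

# Hypothetical roles that are permitted to call the LLM.
ALLOWED_ROLES = {"analyst", "engineer"}


class AccessDenied(Exception):
    """Raised when a user's role is not permitted to use the LLM."""


def guarded_call(user: str, role: str, prompt: str,
                 llm: Callable[[str], str]) -> str:
    """Check the caller's role, write an audit entry, then call the LLM."""
    if role not in ALLOWED_ROLES:
        logger.warning("denied user=%s role=%s", user, role)
        raise AccessDenied(f"role {role!r} may not use the LLM")
    # Log a short hash and the length instead of the prompt itself, so the
    # audit log does not become another store of sensitive data.
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
    logger.info("allowed user=%s role=%s prompt_sha256=%s chars=%d",
                user, role, digest, len(prompt))
    return llm(prompt)
```

Passing the model in as a plain callable keeps the wrapper independent of any vendor SDK and makes it easy to test with a fake model.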

Some third-party tools worth mentioning: Strac, OpaquePrompts (from Opaque Systems, with a LangChain integration), Presidio by Microsoft, and LLM Guard, among others.

If you are not sure how to approach this at your organization, feel free to reach out to us at Growthclap.

Written by Vinay Roy

https://growthclap.com https://www.linkedin.com/in/royvinay — Fractional AI / ML Strategist | ex-CPO | ex-Nvidia | ex-Apple | UC Berkeley
