Expert in the Loop: Ensuring Safe and Effective LLM Integration in Code Development

By
Sajjad Abdoli, Founding AI Scientist
3.19.2025

Large language models (LLMs) have emerged as powerful tools that can assist developers in various tasks, such as code generation, refactoring, and debugging. With the ability to understand context, generate syntactically correct code across multiple programming languages, and explain complex concepts, these AI systems are rapidly transforming software development workflows worldwide. Major models like GPT-4, Claude, Llama, and Gemini can now produce functional code from natural language descriptions, suggest optimizations for existing implementations, and even identify potential bugs or performance issues.

The adoption of LLMs for coding assistance has accelerated dramatically: GitHub reports that, for developers using GitHub Copilot, its LLM-powered coding assistant, an average of 46% of their code is now written by the tool [3]. This widespread integration into development environments has increased productivity for many developers, allowing them to focus on higher-level design decisions while automating routine coding tasks. From completing function implementations to generating entire classes and modules, these models now handle work that previously required significant manual effort.

However, as demonstrated through comprehensive benchmarking studies and security evaluations like CyberSecEval [4], these models can also introduce significant errors and vulnerabilities into the code they generate. Security researchers have documented numerous cases where LLM-generated code contains subtle flaws that could lead to injection attacks, memory leaks, authorization bypasses, and other serious security issues. The same helpful capabilities that make LLMs effective coding assistants can be exploited through carefully crafted prompts to generate harmful or vulnerable code, creating new attack vectors in the software development lifecycle.
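
To make the failure mode concrete, here is a minimal sketch of the kind of subtle flaw such evaluations flag, using a hypothetical query helper (the function and table names are invented for illustration): the first version interpolates user input directly into SQL and is open to injection, while the second uses a parameterized query.

```python
import sqlite3

# Pattern an assistant might plausibly suggest: user input is interpolated
# directly into the SQL string, so input like "'; DROP TABLE users; --"
# is treated as SQL rather than as data.
def find_user_unsafe(conn: sqlite3.Connection, username: str):
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

# Safer equivalent: a parameterized query keeps the input as data.
def find_user_safe(conn: sqlite3.Connection, username: str):
    query = "SELECT id, email FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()
```

Both versions return the same rows for benign input, which is exactly why such flaws can slip through a quick review.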

This dual nature of LLMs as both powerful assistants and potential security risks necessitates careful implementation, rigorous evaluation frameworks, and continuous human oversight throughout the development process. As organizations increasingly rely on these tools for code generation, establishing robust safeguards becomes essential to prevent vulnerabilities from propagating into production systems, especially where LLMs are used by junior developers.

Expanding the LLM Benchmarking Horizon

Current LLM benchmarking practices face several critical limitations that must be addressed to ensure safe and effective deployment. Closing these gaps will require more diverse, comprehensive benchmarks that test security behavior as well as functional correctness.

Key Security Risks in LLM-Assisted Coding

The CyberSecEval study revealed significant vulnerabilities across major LLMs, including Llama3 models, GPT-4 Turbo, Mixtral 8x22B, Qwen2 72B, and Gemini Pro, as the figure below illustrates [4]:

Even top-tier LLMs are not immune to adversarial prompts! [4]

The findings raise serious security concerns [4]:

  1. Prompt Injection: LLMs can misinterpret user input, potentially leading to system intrusions or security breaches when malicious instructions are embedded within seemingly innocent requests.

  2. Malicious Code Execution: Models may enable execution of harmful code, especially when connected to code interpreters. Llama3 models in particular proved vulnerable to explicit malicious prompts, complying with requests to execute harmful code (a minimal pre-execution check is sketched after this list).

  3. Vulnerability to Cyber Attacks: Advanced LLMs can be tricked into using their knowledge to assist with cyber attacks through carefully crafted prompts that exploit the model's helpful nature.

  4. Insecure Code Generation: LLMs frequently suggest unsafe code patterns that can be exploited. All LLMs tested as coding assistants, including Llama3 405B, generated security vulnerabilities in code across multiple programming languages.
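
The interpreter-abuse risk in particular lends itself to a simple illustrative safeguard: statically inspect generated code and refuse to run it if it touches obviously dangerous modules or builtins. The sketch below is only a heuristic filter, not a real sandbox, and the blocklists and function name are assumptions made for illustration.

```python
import ast

# Illustrative blocklists; a production setup would need a proper sandbox.
BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}
BLOCKED_CALLS = {"eval", "exec", "__import__"}

def flag_risky_code(source: str) -> list[str]:
    """Return warnings for obviously dangerous constructs in generated code."""
    warnings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    warnings.append(f"imports blocked module: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                warnings.append(f"imports from blocked module: {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                warnings.append(f"calls blocked builtin: {node.func.id}")
    return warnings

# Refuse to hand model output to an interpreter if anything is flagged.
generated = "import os\nos.system('rm -rf /tmp/data')"
issues = flag_risky_code(generated)
if issues:
    print("Refusing to execute generated code:", issues)
```

A static check like this can be evaded, which is precisely why the expert review described later remains necessary.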

Areas for Improvement in LLM-Assisted Coding

To enhance the effectiveness and safety of LLM-assisted coding, several key improvement areas have been identified:

  1. Adaptive Complexity Alignment:


    • LLMs should align code complexity with the developer's proficiency level
    • Provide multiple solution pathways (beginner, intermediate, advanced) for the same problem, as in the sketch following this list
    • Tailor explanations and code style to match the user's expertise level
    • Adjust terminology and depth based on detected user capabilities
    • Ensure the LLM adds clear, helpful inline comments to generated code, explaining what it does and guiding users through it

  2. Multilingual Coding Resources:


    • Current coding LLMs often lack robust non-English support, limiting their global accessibility
    • Expanding language coverage would benefit developer communities worldwide
    • Models should support programming explanations in multiple natural languages
    • Documentation generation should adapt to the user's preferred language

  3. Security-First Development Patterns:


    • Implement proactive detection of security vulnerabilities in generated code
    • Develop specialized training for common security pitfalls in different languages
    • Include automatic warning systems for potentially risky code patterns
    • Provide secure alternatives alongside any potentially vulnerable code suggestions
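
Referring back to item 1, a minimal sketch of what "multiple solution pathways" could look like in practice is shown below: the same task, counting word frequencies, offered at a beginner and an advanced level, with commentary matched to the reader. The function names and task are illustrative, not drawn from any particular assistant.

```python
from collections import Counter

# Beginner pathway: explicit steps with heavy inline commentary.
def word_counts_beginner(text: str) -> dict[str, int]:
    counts: dict[str, int] = {}
    for word in text.lower().split():   # break the text into words
        if word in counts:              # seen before: add one to its count
            counts[word] += 1
        else:                           # first occurrence: start counting at one
            counts[word] = 1
    return counts

# Advanced pathway: the same behavior using the standard-library idiom.
def word_counts_advanced(text: str) -> dict[str, int]:
    return dict(Counter(text.lower().split()))
```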

The Expert-in-the-Loop Framework

Given these issues and shortfalls, the most promising approach to mitigating risks while maximizing benefits is an expert-in-the-loop framework that follows a three-stage cycle:

  1. Alignment with Expert Feedback (Stage 01):


    • Models are refined according to expert feedback
    • Continuous learning from domain specialists improves model output quality
    • Experts help identify edge cases and security concerns that automated testing might miss
    • This stage establishes guardrails and best practices for the model to follow

  2. AI-Generated Code (Stage 02):


    • LLMs create code according to input prompts
    • Multiple solutions can be generated for different skill levels
    • Code generation follows established patterns and security guidelines
    • Built-in safety measures prevent generating obviously harmful code

  3. Expert Validation & Correction (Stage 03):


    • Human experts review and correct the code according to established guidelines
    • Security audits identify potential vulnerabilities
    • Performance optimization ensures efficient implementation
    • Final quality assurance before deployment or integration

This cyclical approach combines AI efficiency with human expertise to ensure reliable and secure code generation, leveraging LLMs as powerful assistants while maintaining human oversight of the development process. The expert in the loop serves as both a safety mechanism and a training resource, continuously improving the system's capabilities.
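
One way to picture the cycle is as a review loop wrapped around the model. The sketch below is purely illustrative: generate_code, expert_review, and record_feedback are hypothetical stand-ins for whatever model API, review tooling, and feedback store an organization actually uses.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Review:
    approved: bool
    corrected_code: str
    notes: str

def expert_in_the_loop(
    prompt: str,
    generate_code: Callable[[str], str],     # Stage 02: AI-generated code
    expert_review: Callable[[str], Review],  # Stage 03: expert validation and correction
    record_feedback: Callable[..., None],    # Stage 01: alignment data for future refinement
    max_rounds: int = 3,
) -> str:
    """Illustrative three-stage cycle: generate, validate, feed back."""
    code = generate_code(prompt)
    for _ in range(max_rounds):
        review = expert_review(code)
        record_feedback(prompt, code, review)
        if review.approved:
            return review.corrected_code or code
        # Fold reviewer notes back into the prompt and regenerate.
        code = generate_code(f"{prompt}\n# Reviewer notes: {review.notes}")
    raise RuntimeError("Generated code did not pass expert review within the allowed rounds")
```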

The Importance of Benchmark Datasets

Comprehensive evaluation of LLM coding capabilities requires robust benchmarking on specialized datasets, which underscores the importance of expert involvement in creating them. Below are two examples of what such a dataset looks like when used for training, fine-tuning, or benchmarking coding models:

A data sample in the APPS benchmark dataset [1]

A data sample in the CodeXGLUE dataset [2]
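
To make the structure of such samples concrete, the snippet below approximates the shape of a single APPS-style record, pairing a natural-language problem with reference solutions and input/output tests. The field names loosely follow the APPS paper [1] and are illustrative rather than the dataset's exact schema.

```python
# Approximate shape of one coding-benchmark sample (APPS-style [1]);
# field names are illustrative, not the dataset's exact schema.
sample = {
    "problem_id": 42,
    "difficulty": "introductory",
    "question": "Given a list of integers, return the sum of the even numbers.",
    "solutions": [
        "def solve(nums):\n    return sum(n for n in nums if n % 2 == 0)",
    ],
    "input_output": {
        "inputs": [[1, 2, 3, 4]],
        "outputs": [6],
    },
}

# A benchmark harness would run each candidate solution against the tests
# and score the model on functional correctness (for example, pass@k).
```

Expert reviewers add value here by writing unambiguous problem statements, verifying the reference solutions, and ensuring the tests cover edge cases.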

Challenges and Considerations

Implementing the expert-in-the-loop approach involves several practical challenges.

Future Directions

The evolution of LLM-assisted coding will likely focus on:

  1. Improved Security Guarantees: Developing formal verification methods for LLM-generated code
  2. Specialized Domain Adaptation: Creating LLMs optimized for specific programming languages or development contexts
  3. Personalized Learning Curves: Adapting to individual developers' skills and preferences over time
  4. Transparent Reasoning: Providing clear explanations for code design decisions and security considerations

Conclusion

The expert-in-the-loop approach offers a promising framework for ensuring the safe and effective integration of LLMs in code development. By leveraging the combined strengths of human experts and LLMs, this approach helps developers produce high-quality code while minimizing the risk of errors and vulnerabilities.

As LLM technology continues to advance, this human-AI collaboration will likely play a vital role in the future of code development, balancing the efficiency gains of automation with the critical judgment and domain expertise that human developers provide. The ongoing research in benchmark datasets, security evaluation techniques, and collaborative development patterns will be essential in shaping how we safely harness the power of LLMs for coding assistance.

References:

  1. Hendrycks, Dan, et al. "Measuring Coding Challenge Competence with APPS." arXiv preprint arXiv:2105.09938 (2021).
  2. Lu, Shuai, et al. "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation." arXiv preprint arXiv:2102.04664 (2021).
  3. GitHub. "GitHub Copilot for Business is now available." GitHub Blog, February 14, 2023.
  4. Wan, Shengye, et al. "CyberSecEval 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models." arXiv preprint arXiv:2408.01605 (2024).
