Expert in the Loop: Ensuring Safe and Effective LLM Integration in Code Development
By
Sajjad Abdoli, Founding AI Scientist
3.19.2025
Large language models (LLMs) have emerged as powerful tools that can assist developers in various tasks, such as code generation, refactoring, and debugging. With the ability to understand context, generate syntactically correct code across multiple programming languages, and explain complex concepts, these AI systems are rapidly transforming software development workflows worldwide. Major models like GPT-4, Claude, Llama, and Gemini can now produce functional code from natural language descriptions, suggest optimizations for existing implementations, and even identify potential bugs or performance issues.
The adoption of LLMs for coding assistance has accelerated dramatically, with GitHub reporting that 46% of code on their platform now comes from GitHub Copilot [3], their LLM-powered coding assistant. This widespread integration into development environments has increased productivity for many developers, allowing them to focus on higher-level design decisions while automating routine coding tasks. From completing function implementations to generating entire classes and modules, these models now handle tasks that previously required significant manual effort.
However, as demonstrated through comprehensive benchmarking studies and security evaluations like CYBERSECEVAL [4], these models can also introduce significant errors and vulnerabilities into the code they generate. Security researchers have documented numerous cases where LLM-generated code contains subtle flaws that could lead to injection attacks, memory leaks, authorization bypasses, and other serious security issues. The same helpful capabilities that make LLMs effective coding assistants can be exploited through carefully crafted prompts to generate harmful or vulnerable code, creating new attack vectors in the software development lifecycle.
This dual nature of LLMs as both powerful assistants and potential security risks necessitates careful implementation, rigorous evaluation frameworks, and continuous human oversight throughout the development process. As organizations increasingly rely on these tools for code generation, establishing robust safeguards becomes essential to prevent the propagation of vulnerabilities into production systems specially where the LLMs are used by junior developers.
Expanding the LLM Benchmarking Horizon
Current LLM benchmarking practices face several critical limitations that require addressing to ensure safe and effective deployment:
Predominantly English-focused: Most benchmark datasets are in English, creating a substantial barrier to evaluating multilingual capabilities. Current benchmarks datasets are predominantly in English, limiting their usefulness for global implementation.
Text-prompt constraints: Current security benchmarks primarily rely on text prompts, not accounting for multimodal security challenges. This approach fails to address the emerging threat landscape where attacks may come through various input channels [4].
To address these gaps, the field needs to develop more diverse, comprehensive benchmarks that:
Support global developer communities through multilingual datasets that enable evaluation across different programming languages and natural languages
Establish robust guardrails against harmful code generation by testing models against a wide range of adversarial inputs
Implement multimodal benchmarking to defend against novel attack vectors that might exploit visual or other non-textual inputs
Key Security Risks in LLM-Assisted Coding
The CYBERSECEVAL study revealed significant vulnerabilities across major LLMs, including Llama3 models, GPT-4 Turbo, Mixtral 8x22B, Qwen2 72B, and Gemini Pro. The Figure below provides an insight even top-tier LLMs are not immune to adversarial prompts [4]:
Even top-tier LLMs are not immune to adversarial prompts! [4]
The findings raise serious concerns about security implications [4]:
Prompt Injection: LLMs can misinterpret user input, potentially leading to system intrusions or security breaches when malicious instructions are embedded within seemingly innocent requests.
Malicious Code Execution: Models may enable execution of harmful code, especially when connected to code interpreters. Llama3 models specifically showed vulnerability to explicit malicious prompts, complying with requests to execute harmful code.
Vulnerability to Cyber Attacks: Advanced LLMs can be tricked into using their knowledge to assist with cyber attacks through carefully crafted prompts that exploit the model's helpful nature.
Insecure Code Generation: LLMs frequently suggest unsafe code patterns that can be exploited. All LLMs tested as coding assistants, including Llama3 405B, generated security vulnerabilities in code across multiple programming languages.
Areas for Improvement in LLM-Assisted Coding
To enhance the effectiveness and safety of LLM-assisted coding, several key improvement areas have been identified:
Adaptive Complexity Alignment:
LLMs should align code complexity with the developer's proficiency level
Provide multiple solution pathways (beginner, intermediate, advanced) for the same problem
Tailor explanations and code style to match the user's expertise level
Adjust terminology and depth based on detected user capabilities
Making sure that the LLM adds clear, helpful inline comments when generating code to explain what the code does and guide users through it.
Multilingual Coding Resources:
Current coding LLMs often lack robust non-English support, limiting their global accessibility
Expanding language coverage would benefit developer communities worldwide
Models should support programming explanations in multiple natural languages
Documentation generation should adapt to the user's preferred language
Security-First Development Patterns:
Implement proactive detection of security vulnerabilities in generated code
Develop specialized training for common security pitfalls in different languages
Include automatic warning systems for potentially risky code patterns
Provide secure alternatives alongside any potentially vulnerable code suggestions
The Expert-in-the-Loop Framework
Considering the mentioned issues and shortfalls, the most promising approach to mitigate risks while maximizing benefits is implementing an expert-in-the-loop framework that follows a three-stage cycle:
Alignment with Expert Feedback (Stage 01):
Models are refined according to expert feedback
Continuous learning from domain specialists improves model output quality
Experts help identify edge cases and security concerns that automated testing might miss
This stage establishes guardrails and best practices for the model to follow
AI-Generated Code (Stage 02):
LLMs create code according to input prompts
Multiple solutions can be generated for different skill levels
Code generation follows established patterns and security guidelines
Final quality assurance before deployment or integration
This cyclical approach combines AI efficiency with human expertise to ensure reliable and secure code generation, leveraging LLMs as powerful assistants while maintaining human oversight of the development process. The expert in the loop serves as both a safety mechanism and a training resource, continuously improving the system's capabilities.
The Importance of Benchmark Datasets
Comprehensive evaluation of LLM coding capabilities requires robust benchmarking through specialized datasets that magnifies the importance of the experts in generating them. Below are two examples of how a dataset looks when used for training, fine-tuning or benchmarking the AI models for coding purposes:
APPs benchmark dataset [1]:
Evaluates coding competence in Python using challenge-based tasks
Categorized by complexity (Introductory, Interview, and Competition levels)
Provides standardized problems with clear evaluation metrics
A data sample in the APPs benchmark dataset [1]
CodeXGLUE [2]:
A machine learning benchmark for code understanding and generation
Includes tasks like code summarization, translation, and defect detection
A data sample in the CodeXGLUE dataset [2]
Challenges and Considerations
Implementing the expert-in-the-loop approach faces several practical challenges:
Expertise Requirements: Finding qualified experts with deep domain knowledge and security expertise can be challenging and costly
Scalability Concerns: As code generation volume increases, maintaining consistent expert oversight becomes more difficult
Integration Complexity: Seamlessly incorporating human feedback into automated systems requires sophisticated workflows
Global Language Support: Providing equal quality across different programming and natural languages remains technically challenging
Balancing Assistance and Autonomy: Finding the right level of LLM independence versus expert control requires careful calibration
Future Directions
The evolution of LLM-assisted coding will likely focus on:
Improved Security Guarantees: Developing formal verification methods for LLM-generated code
Specialized Domain Adaptation: Creating LLMs optimized for specific programming languages or development contexts
Personalized Learning Curves: Adapting to individual developers' skills and preferences over time
Transparent Reasoning: Providing clear explanations for code design decisions and security considerations
Conclusion
The expert-in-the-loop approach offers a promising framework for ensuring the safe and effective integration of LLMs in code development. By leveraging the combined strengths of human experts and LLMs, this approach helps developers produce high-quality code while minimizing the risk of errors and vulnerabilities.
As LLM technology continues to advance, this human-AI collaboration will likely play a vital role in the future of code development, balancing the efficiency gains of automation with the critical judgment and domain expertise that human developers provide. The ongoing research in benchmark datasets, security evaluation techniques, and collaborative development patterns will be essential in shaping how we safely harness the power of LLMs for coding assistance.
References:
Hendrycks, Dan, et al. "Measuring coding challenge competence with apps." arXiv preprint arXiv:2105.09938 (2021).
Lu, Shuai, et al. "Codexglue: A machine learning benchmark dataset for code understanding and generation." arXiv preprint arXiv:2102.04664 (2021).
GitHub. (2023, February 14). GitHub Copilot for Business is now available. GitHub Blog.
Wan, Shengye, et al. "Cyberseceval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models." arXiv preprint arXiv:2408.01605 (2024).
Get in touch
Learn how Perle can help
No matter how specific your needs, or how complex your inputs, we’re here to show you how our innovative approach to data labelling, preprocessing, and governance can unlock Perles of wisdom for companies of all shapes and sizes.