What Nobody Tells You About Deploying GenAI
Large Language Models (LLMs) entered the mainstream with the release of ChatGPT two years ago. Chatting with an AI through a browser dramatically lowered the technical barrier, making ChatGPT one of the fastest-growing consumer applications to date. Since then, ChatGPT-like applications have proliferated, driven by their ease of use and rapid pace of innovation.
Although deploying Generative AI (GenAI) apps may seem straightforward when using APIs from OpenAI, Google, or Anthropic, it’s more complex in practice. Before integrating these models into workflows, it’s crucial to understand their capabilities and limitations thoroughly.
As with any AI system, building LLM-native applications demands a research-driven approach to ensure safety and reliability. At Precisely’s Innovation Labs, continuous learning and innovating with cutting-edge technologies are part of our DNA. Here, we share key lessons from building GenAI applications for production.
1. Create your own LLM evaluations
Leaderboards rank LLMs by their performance on standardized, static test sets, which gives a general picture of model capability. However, these evaluations rarely offer insight into your specific use case, and some models may have been exposed to the benchmark data during training, inflating their scores. Relying solely on leaderboard results is therefore not sufficient.
The most reliable way to assess an LLM’s performance for a specific use case is to develop a tailored evaluation process with sufficient coverage of edge cases and failures. Even with just a few hundred samples relevant to your problem, creating a custom dataset helps ensure meaningful evaluations. This allows you to assess performance accurately, set a baseline, experiment with new models, and debug issues more effectively.
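As a minimal sketch, the snippet below runs a model over a small custom dataset and reports exact-match accuracy. The JSONL dataset format, the call_llm helper, and the exact-match metric are placeholder assumptions; substitute your own client and a scoring method that fits your task.

import json

def call_llm(prompt: str) -> str:
    # Placeholder: replace with your provider's API call (OpenAI, Bedrock, etc.)
    raise NotImplementedError

def evaluate(dataset_path: str) -> float:
    """Score the model on a custom dataset and return exact-match accuracy."""
    with open(dataset_path) as f:
        # Each line: {"prompt": "...", "expected": "..."}
        samples = [json.loads(line) for line in f]
    correct = 0
    for sample in samples:
        prediction = call_llm(sample["prompt"]).strip().lower()
        correct += prediction == sample["expected"].strip().lower()
    return correct / len(samples)

A baseline score from a loop like this makes it much easier to compare new models or prompt changes against your current setup.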
2. Enforce input and output guardrails
With the rise of GenAI applications, safeguarding users and systems is essential. Recent incidents, like a car dealership chatbot that mistakenly offered a car for $1 or a transportation company’s bot promising discounts it was never authorized to give, highlight the risks of unchecked inputs and outputs. Every user-facing interaction should pass through guardrails to prevent such errors. For example, a customer support chatbot should decline to answer political or competitor-specific queries.
AWS Bedrock, a Precisely technology partner, supports guardrails like PII detection, denied topics, custom words, and unsafe/NSFW filters, helping make GenAI applications safer.
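As a rough sketch of how this looks in code, the example below attaches an existing guardrail to a request through the Bedrock Converse API; the model ID, guardrail ID, and version are placeholders, and the guardrail itself is assumed to have been configured beforehand in Bedrock.

import boto3

bedrock = boto3.client("bedrock-runtime")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Can I buy a car for $1?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",  # placeholder
        "guardrailVersion": "1",                     # placeholder
    },
)

# If the guardrail intervenes, the response carries the configured blocked message instead
print(response["output"]["message"]["content"][0]["text"])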
3. Experiment and choose the correct response format
LLMs are useful only if their outputs can be effectively parsed and utilized downstream. Features like OpenAI’s JSON mode help enforce structured, consistent outputs. However, research has shown that imposing format restrictions can degrade output quality and reasoning ability.
Experimenting with different response formats like JSON, YAML, or CSV helps identify the best fit. In our experience, YAML often works well with LLMs, saving tokens and improving efficiency.
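As a small illustration of the YAML approach, the sketch below shows a prompt that requests YAML and the parsing step with PyYAML; the prompt wording and the example response are assumptions, not outputs from a specific model.

import yaml  # pip install pyyaml

PROMPT = (
    "Extract the customer's name and sentiment from the message below. "
    "Respond only with YAML containing the keys `name` and `sentiment`."
)

# Example of what a raw model response might look like given the instruction above
raw_response = """\
name: Jane Doe
sentiment: positive
"""

parsed = yaml.safe_load(raw_response)
print(parsed["name"], parsed["sentiment"])

YAML’s lack of braces and quoting is part of why it tends to consume fewer tokens than the equivalent JSON.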
4. Proactively detect and repair failures
Despite prompt engineering and fine-tuning, LLMs will produce errors. It’s crucial to have a system in place to detect and handle these errors. You’ll likely observe recurring failure patterns like improperly formatted JSON or YAML when experimenting.
Simply retrying API calls isn’t ideal, as it doubles the number of calls and the latency. Instead, implement mechanisms that automatically repair common failures and reduce retries. Below is sample code that strips XML tags and stray YAML code-fence markers from raw LLM responses:
import re

def _clean_llm_response(text: str) -> str:
    """Remove XML tags and YAML code-fence markers from a raw LLM response.

    Args:
        text (str): raw text response from the LLM

    Returns:
        str: cleaned text, ready to parse
    """
    # Strip leading ```yaml and trailing ``` code-fence markers
    removed_yaml_tags = re.sub(r'^```yaml\n|```$', '', text)
    # Drop any XML-style tags, e.g. <result> ... </result>
    removed_xml_tags = re.sub(r'<[^>]+>\n?', '', removed_yaml_tags)
    # Normalize double quotes to single quotes to avoid parsing issues
    cleaned_string = removed_xml_tags.replace('"', "'")
    return cleaned_string
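A hypothetical usage pattern is to attempt the repair before falling back to a single retry; the call_llm helper below is a placeholder for your LLM API call.

import yaml

def get_structured_response(prompt: str) -> dict:
    """Parse the LLM response as YAML, repairing it before retrying the call once."""
    raw = call_llm(prompt)  # placeholder for your LLM API call
    try:
        return yaml.safe_load(_clean_llm_response(raw))
    except yaml.YAMLError:
        # Repair wasn't enough; retry the call once as a last resort
        return yaml.safe_load(_clean_llm_response(call_llm(prompt)))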
As LLMs advance, we can expect better conformance to user instructions and complex output schemas.
5. Optimize LLM inference with batch mode
Synchronous LLM API calls can quickly exhaust rate limits and quotas, leading to delays. Moving non-urgent workloads to batch processing frees up synchronous capacity for the requests that genuinely need immediate responses.
Batch inference is a natural fit for jobs like generating embeddings, running evaluations, or labeling data. Both AWS Bedrock and OpenAI offer batch modes at up to 50% lower cost, making GenAI use cases more affordable.
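For illustration, submitting a batch job with the OpenAI Python SDK looks roughly like the sketch below; the requests.jsonl file (one JSON-encoded /v1/chat/completions request per line) is assumed to exist already.

from openai import OpenAI

client = OpenAI()

# Upload a JSONL file where each line is a chat completions request
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

# Submit the batch; results are typically ready within 24 hours at reduced cost
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)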
Key takeaways for successful GenAI deployment
Deploying GenAI in production involves much more than simply calling APIs. Creating tailored evaluations, enforcing guardrails, experimenting with response formats, handling failures, and optimizing with batch inference are crucial steps. These lessons from our experience can help you build safer, more efficient, and reliable GenAI applications. Understanding the complexities upfront will save time and resources, ensuring your AI deployments are effective and trustworthy.
For more actionable insights and tips, read our eBook: Trusted AI 101: Tips for Getting Your Data AI-Ready.