Exploring Open Source LLMs for Text Generation, Code Generation & More

Vishal Pallerla


This article provides an overview of open-source large language models (LLMs) organized by their primary categories. Explore their applications in various domains such as text generation, code generation, transcription, and image generation. With a wide range of models available, users can harness the power of LLMs to create chatbots, develop virtual assistants, perform sentiment analysis, and much more. In this article, we delve into some of the top open-source LLMs and their unique features, providing insights into their capabilities and potential uses.

Want to leverage the power of LLMs without compromising data privacy? Don't miss our detailed guide on Building Your Private ChatGPT with DevZero, where we walk you through each step of the process.

General Text Generation and Instruction Following #

General text generation and instruction following are two core tasks for large language models (LLMs). General text generation means producing coherent, meaningful text from an input such as a prompt or a topic, and it underpins applications including chatbots, summarization, and creative writing. Instruction following, by contrast, means understanding and carrying out instructions given in natural language, for example answering questions, filling out forms, or executing specific actions.

Falcon-40B-Instruct #

Falcon-40B-Instruct is a ready-to-use chat/instruct model built on Falcon-40B and fine-tuned on a chat dataset. At the time of writing, Falcon-40B-Instruct is the top-performing model on the OpenLLM Leaderboard. Because it is already instruction-tuned, it may not be ideal for further fine-tuning. If you are interested in building your own instruct/chat model, we recommend starting from Falcon-40B.

Falcon-40B vs Falcon-40B-Instruct #

Falcon-40B is a foundational LLM with 40B parameters, trained on one trillion tokens. Both it and Falcon-40B-Instruct are causal (autoregressive) decoder-only models, and Falcon-40B-Instruct is made available under the Apache 2.0 license. The main difference is that Falcon-40B-Instruct is ready to use for chat and instruction following, while Falcon-40B is a base model intended to be fine-tuned for tasks such as generating creative content, solving complex problems, customer service operations, virtual assistants, language translation, and sentiment analysis.
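To make the "autoregressive" label concrete, here is a minimal sketch of greedy autoregressive decoding. It is not Falcon's implementation: the score table is hypothetical toy data standing in for a trained model, and it conditions only on the last token, whereas a real decoder-only transformer conditions on the whole preceding sequence.

```python
# Hypothetical next-token scores standing in for a trained model.
# A real LLM computes these from the entire prefix, not just the last token.
NEXT_TOKEN_SCORES = {
    "<s>": {"open": 0.9, "models": 0.1},
    "open": {"source": 0.8, "models": 0.2},
    "source": {"models": 0.7, "</s>": 0.3},
    "models": {"</s>": 0.6, "open": 0.4},
}


def generate(prompt_tokens, max_new_tokens=10):
    """Greedy autoregressive decoding: append one token at a time."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES.get(tokens[-1])
        if scores is None:
            break  # the toy table has no continuation for this token
        next_token = max(scores, key=scores.get)  # pick the highest score
        if next_token == "</s>":
            break  # end-of-sequence marker stops generation
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens


print(generate(["<s>"]))  # ['<s>', 'open', 'source', 'models']
```

Each generated token is fed back in as input for the next step, which is what distinguishes autoregressive generation from producing the whole output at once.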

Dolly 2.0 #

Dolly 2.0 is a 12B parameter LLM from Databricks that is specifically designed to follow instructions. It is based on the Pythia model family and was fine-tuned on a human-generated instruction dataset (databricks-dolly-15k) so that it can follow natural-language instructions.

MPT-30B #

MPT-30B is an open-source foundation model released under the Apache 2.0 license, which permits commercial use. According to its developer, MosaicML, it exceeds the quality of the original GPT-3. MPT-30B is a GPT-style decoder-only transformer with several improvements, including higher speed, greater stability, and longer context lengths.

Text embeddings #

Text embeddings are low-dimensional vector representations of arbitrary-length texts and play key roles in many NLP tasks such as large-scale retrieval. Text embeddings have the potential to overcome the lexical mismatch issue and facilitate efficient retrieval and matching between texts. They also offer a versatile interface that downstream applications can easily consume.

While pre-trained language models such as BERT and GPT can produce transferable text representations, they are not ideal for tasks such as retrieval and text matching, where a single vector embedding per text is preferable for its efficiency and versatility.
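As a rough illustration of embedding-based retrieval, the sketch below ranks documents by cosine similarity to a query vector. The three-dimensional vectors are made-up values chosen for the example, not output from any real embedding model, and the document titles are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: one fixed-size vector per document.
DOCUMENT_EMBEDDINGS = {
    "How to reset a password": [0.9, 0.1, 0.0],
    "Quarterly sales report": [0.1, 0.9, 0.2],
    "Recovering account access": [0.8, 0.2, 0.1],
}


def retrieve(query_embedding, top_k=2):
    """Return the top_k document titles most similar to the query."""
    ranked = sorted(
        DOCUMENT_EMBEDDINGS,
        key=lambda title: cosine_similarity(
            query_embedding, DOCUMENT_EMBEDDINGS[title]
        ),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical embedding of a query such as "I forgot my login".
query = [0.85, 0.15, 0.05]
print(retrieve(query))
```

The two account-related titles share almost no words with each other, yet their vectors sit close together. That is the sense in which embeddings can sidestep lexical mismatch.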

Read more about Text embeddings here

PostgresML #

PostgresML makes it easy to generate embeddings from text stored in your database with one simple call to pgml.embed(model_name, text). It offers a large selection of state-of-the-art models to choose from, including HuggingFace models.
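A minimal sketch of calling pgml.embed from Python might look like the following. It assumes a database with the PostgresML extension installed and a DB-API connection object (for example from psycopg). The helper names and the model name "intfloat/e5-small" are illustrative assumptions, not part of the original article.

```python
def build_embed_query(model_name, text):
    """Build a parameterized query for pgml.embed(model_name, text)."""
    # Placeholders keep the text out of the SQL string itself.
    sql = "SELECT pgml.embed(%s, %s) AS embedding"
    return sql, (model_name, text)


def fetch_embedding(conn, model_name, text):
    """Run the query on an open DB-API connection (hypothetical helper)."""
    sql, params = build_embed_query(model_name, text)
    with conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchone()[0]  # first column of the single result row


# "intfloat/e5-small" is an assumed example model name.
sql, params = build_embed_query("intfloat/e5-small", "PostgresML is fun")
print(sql)
print(params)
```

Passing the model name and text as query parameters rather than formatting them into the string avoids SQL injection when the text comes from users.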

Instructor👨‍🏫 #

Instructor is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task and domain simply by being given a task instruction, with no further fine-tuning.
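The sketch below shows how instruction and text might be paired for Instructor. It assumes the InstructorEmbedding Python package and the "hkunlp/instructor-large" checkpoint. The instruction wording and helper names are illustrative assumptions, and the model-loading function is not executed here because it downloads model weights.

```python
def build_instructor_inputs(instruction, texts):
    """Pair one task instruction with each text, as [instruction, text]."""
    return [[instruction, text] for text in texts]


def embed_with_instructor(pairs, model_name="hkunlp/instructor-large"):
    """Encode pairs with Instructor (assumes InstructorEmbedding is installed)."""
    from InstructorEmbedding import INSTRUCTOR  # imported lazily on purpose

    model = INSTRUCTOR(model_name)
    return model.encode(pairs)  # one embedding vector per pair


# Hypothetical instruction describing the task and domain.
pairs = build_instructor_inputs(
    "Represent the support question for retrieval:",
    ["How do I reset my password?", "Where can I download my invoice?"],
)
print(pairs)
```

Changing only the instruction string, for example to describe a classification task instead of retrieval, is how the same model is steered toward a different task without fine-tuning.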

Transcription (Speech to Text) #

In the context of this article, transcription (speech to text) refers to converting spoken language into written text with machine learning models. These models analyze the audio signal of spoken words and map it to text characters. Some of the top open-source options for transcription include Whisper, DeepSpeech, Kaldi, Vosk, and Coqui. They can be used to transcribe recorded audio, either in real time or asynchronously. All of them are open source and free to use commercially.

Whisper #

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language.

Kaldi #

Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers.

Wav2Vec 2.0 #

Wav2vec 2.0 is a framework for self-supervised learning of speech representations. It learns these representations from unlabeled audio and has been applied to multiple languages. Wav2vec 2.0 was developed by Facebook Research and is available through its fairseq toolkit.

Image Generation #

Image generation models produce images from text descriptions. Many of them rely on a process called diffusion modeling, a machine learning technique that starts from random noise and gradually refines it until the image matches the text description.
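The idea of starting from randomness and refining step by step can be sketched with a toy denoising loop. This is not a trained diffusion model: the noise predictor below is an oracle that already knows the target values, standing in for the neural network that a real system conditions on the text prompt. The linear noise schedule and all function names are assumptions made for illustration.

```python
import math
import random


def make_alpha_bars(num_steps):
    """Assumed linear schedule: 1.0 means clean signal, near 0 means noise."""
    return [1.0 - 0.999 * t / num_steps for t in range(num_steps + 1)]


def oracle_noise(x_t, target, alpha_bar):
    """Stand-in for a trained network: the noise that explains x_t."""
    scale = math.sqrt(1.0 - alpha_bar)
    return [(x - math.sqrt(alpha_bar) * y) / scale for x, y in zip(x_t, target)]


def denoise(target, num_steps=50, seed=0):
    """Deterministic reverse process from random noise toward the target."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    for t in range(num_steps, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = oracle_noise(x, target, ab_t)
        # Estimate the clean signal implied by the predicted noise.
        x0_pred = [
            (xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
            for xi, e in zip(x, eps)
        ]
        # Re-noise the estimate to the next, slightly less noisy level.
        x = [
            math.sqrt(ab_prev) * p + math.sqrt(1.0 - ab_prev) * e
            for p, e in zip(x0_pred, eps)
        ]
    return x


pixels = [0.2, -0.5, 1.0]  # hypothetical three-pixel "image"
print([round(v, 6) for v in denoise(pixels)])
```

In a real text-to-image system, the oracle is replaced by a network that predicts the noise from the current noisy image and the text prompt, so the loop converges to an image that fits the description rather than to a known target.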

Image generation with LLMs is still a developing field, but it has the potential to revolutionize the way we create and interact with images. In the future, LLMs could be used to generate images for a variety of purposes, such as creating realistic product images for e-commerce websites, generating personalized images for social media, and creating educational images for students.

Stable Diffusion #

Code Generation #

Code generation refers to the ability to generate programming code based on various inputs, such as natural language descriptions, incomplete code, or execution examples. Code generation tools can assist developers in creating and maintaining code, improving programming productivity, and automating programming tasks.

Here are some of the open-source LLMs that you can use for code generation:

StarCoder #

StarCoder is an open-source large language model (LLM) for code generation, developed jointly by Hugging Face and ServiceNow as part of the BigCode project. It is trained on permissively licensed data from GitHub, covering over 80 programming languages as well as Git commits, GitHub issues, and Jupyter notebooks.

CodeT5+ #

CodeT5+ is a family of open code LLMs trained with a flexible model architecture and diverse learning objectives. It is designed for both code understanding and code generation tasks.

Replit-code-v1-3b #

Replit-code-v1-3b is a 2.7 billion parameter causal language model focused on code completion. Developed by Replit, Inc., the model has been trained on a subset of the Stack Dedup v1.2 dataset covering 20 languages: Markdown, Java, JavaScript, Python, TypeScript, PHP, SQL, JSX, reStructuredText, Rust, C, CSS, Go, C++, HTML, Vue, Ruby, Jupyter Notebook, R, and Shell.

Learn more about pros and cons of various open source CodeGen tools here

Conclusion #

In conclusion, open-source large language models have advanced significantly in recent years, offering a variety of applications in text generation, instruction following, retrieval embeddings, transcription, image generation, and code generation. These models provide valuable resources for researchers, developers, and businesses, enabling them to leverage cutting-edge AI technology to improve productivity, enhance user experiences, and solve complex problems. As the field continues to progress, we can expect even more sophisticated and versatile LLMs to emerge, further revolutionizing the way we interact with and utilize language and data.



Vishal Pallerla

Developer Advocate, DevZero
