Glossary and Abbreviations of AI Words

A

AWQ: Activation-aware Weight Quantisation

It is a quantisation method that considers the actual data distribution in the activations produced by the model during inference.

It is a technique used in neural network compression to reduce the memory footprint and improve the efficiency of deep learning models while maintaining their performance. The main idea behind AWQ is that not all weights matter equally: the weights feeding the channels with the largest activation magnitudes contribute most to the output, so they should be protected during quantisation instead of all weights being treated uniformly.

AWQ works by choosing per-channel scaling factors from activation statistics gathered on a small calibration set, so that the most salient weights lose as little precision as possible. This allows the quantised model to preserve the essential features and relationships within the data while minimising memory usage. Notably, AWQ is a post-training method; it requires no backpropagation or retraining.

The process of AWQ involves the following steps (a minimal code sketch follows the list):

  1. Calibration pass: Run a small set of representative inputs through the original floating-point model and record the activations each layer produces.
  2. Identify salient channels: Determine which weight channels matter most to the output, typically those aligned with the largest average activation magnitudes.
  3. Choose per-channel scales: Search for scaling factors that enlarge the salient weights before quantisation (and are folded back out afterwards), so that those weights are represented more precisely.
  4. Quantise weights: Apply low-bit quantisation (commonly 4-bit) to the scaled weights, with scales and zero points chosen per group or per channel.
  5. Evaluate and adjust: Measure the quantised model's accuracy or perplexity and, if needed, refine the scale search and repeat.
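
Below is a minimal sketch of the activation-aware idea in NumPy. It is illustrative only, with hypothetical shapes, a single layer, and a plain round-to-nearest quantiser, not the reference AWQ implementation:

    import numpy as np

    def awq_quantise(W, X_calib, n_bits=4, alpha=0.5):
        # 1. Measure how strongly each input channel is activated on average.
        act_scale = np.abs(X_calib).mean(axis=0)            # shape: (in,)
        # 2. Scale salient channels up before quantisation so their weights
        #    lose less precision (alpha controls the strength of protection).
        s = act_scale ** alpha
        s = s / s.mean()                                    # keep scales well-conditioned
        W_scaled = W * s                                    # per-input-channel scaling
        # 3. Plain round-to-nearest quantisation of the scaled weights.
        qmax = 2 ** (n_bits - 1) - 1
        step = np.abs(W_scaled).max(axis=1, keepdims=True) / qmax
        W_q = np.clip(np.round(W_scaled / step), -qmax - 1, qmax)
        # 4. Fold the channel scaling back out so the layer's output is preserved.
        return (W_q * step) / s

    # Usage: compare the dequantised layer against the float baseline.
    rng = np.random.default_rng(0)
    W, X = rng.normal(size=(16, 64)), rng.normal(size=(128, 64))
    err = np.abs(X @ awq_quantise(W, X).T - X @ W.T).mean()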

AWQ has been shown to preserve model accuracy well at low bit widths while sharply reducing memory usage. It is a promising technique for deploying neural networks on resource-constrained devices like smartphones, edge servers, and IoT devices.


B

BLOOM: BigScience Large Open-science Open-access Multilingual Language Model


C

CNN: Convolutional Neural Network

It is a specialised artificial neural network commonly used for image recognition, object detection, and various computer vision tasks. CNNs are inspired by the organisation of the human visual system, where simple features are extracted from the input images and processed at multiple levels to extract complex features.

The primary components of a CNN include convolutional layers, pooling layers, and fully connected layers. Here's an overview of how they work, with a minimal code sketch after the list:

  1. Convolutional layer: This layer applies a set of learnable filters (kernels) to the input image. Each filter captures specific features in the image, such as edges or textures. The output of this layer is called the feature map, which represents the presence and position of the learned features within the input image.
  2. Activation function: After the convolution operation, an activation function (such as ReLU) is applied to introduce non-linearity into the network and produce more complex features. This helps the network learn more abstract representations from the raw pixel data.
  3. Pooling layer: The pooling layer reduces the spatial dimensions of the feature maps by applying downsampling operations, such as average or max pooling. This operation helps to decrease the computational cost and prevent overfitting while retaining essential features.
  4. Fully connected layers (if needed): Depending on the task, a CNN may have one or more fully connected layers that process the extracted features and make predictions or classifications. For example, in image classification tasks, the final fully connected layer usually consists of as many units as there are classes to be classified.
  5. Output layer: The output layer produces the final prediction for the input image. This could be a single continuous value for regression tasks or a vector of class probabilities for classification tasks.
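
A minimal CNN in PyTorch following the layer order above, with hypothetical sizes for 28x28 greyscale images and 10 classes:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolution: learnable filters
        nn.ReLU(),                                   # activation: non-linearity
        nn.MaxPool2d(2),                             # pooling: 28x28 -> 14x14
        nn.Conv2d(16, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),                             # 14x14 -> 7x7
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, 10),                   # fully connected output: 10 classes
    )

    logits = model(torch.randn(8, 1, 28, 28))        # a batch of 8 images -> (8, 10)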

CNNs have been widely used and proven effective in various applications, including face recognition, object detection, image segmentation, etc. They have become an essential tool in computer vision due to their ability to learn hierarchical representations and extract relevant features from complex visual data.


CoT: Chain of Thought (Reasoning)

Chain of Thought reasoning refers to generating explanations or justifications for an answer by breaking down the thought process into smaller, more manageable steps. CoT reasoning aims to provide a clear and logical flow of ideas that leads to a conclusion, making it easier for humans and artificial intelligence systems to understand and follow the rationale behind a decision or response.

In natural language processing (NLP) and AI, CoT reasoning has gained importance as researchers strive to develop models capable of providing coherent explanations and reasoning processes alongside their answers. This enables users to have more trust in the generated responses and better understand how the model arrived at its conclusion.

The process of generating a CoT explanation typically involves the following steps:

  1. Identify the question or problem: Understand what is being asked or what needs to be solved.
  2. Gather relevant information: Collect necessary data, facts, or knowledge to address the question or solve the problem.
  3. Formulate a plan: Determine the logical steps needed to answer the question or solve the problem, considering any constraints or considerations.
  4. Execute the plan: Carry out each step of the plan and document the reasoning behind each decision made along the way.
  5. Evaluate the solution: Assess the results of the solution and determine if it is appropriate, accurate, and complete. You can adjust the plan and repeat steps 3-5 if necessary until an acceptable outcome is reached.
  6. Communicate the CoT explanation: Present the step-by-step reasoning process, highlighting key insights and decisions to provide a clear and understandable justification for the final answer or solution.

CoT reasoning is still an area of active research in AI, with ongoing efforts to develop models that can generate accurate responses and explain their thought processes clearly and logically.
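
In practice, CoT behaviour is often elicited with a few-shot prompt whose worked example spells out its intermediate steps. A minimal illustration (the questions and wording are hypothetical):

    prompt = (
        "Q: A shop sells pens at 3 for $2. How much do 12 pens cost?\n"
        "A: 12 pens is 12 / 3 = 4 groups of 3. Each group costs $2, "
        "so 4 * 2 = $8. The answer is $8.\n"
        "Q: A train travels 60 km in 40 minutes. How far does it travel in 2 hours?\n"
        "A:"  # the model is expected to continue with its own step-by-step reasoning
    )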


D

DPO: Direct Preference Optimisation

In the context of large language models, Direct Preference Optimisation is a method for aligning a model with human preferences by training directly on preference data. In many real-world settings it is far easier for people to say which of two responses they prefer than to specify an explicit objective or reward function, and DPO turns exactly those pairwise preferences into a training signal.

The basic idea behind DPO is to skip the separate reward model used in RLHF. Starting from the RLHF objective, one can derive a closed-form, classification-style loss over preference pairs: the model is trained to raise the likelihood of preferred responses and lower that of rejected ones, relative to a frozen reference model, with a coefficient (usually written β) controlling how far the model may drift from that reference.

The DPO process typically involves the following steps (a minimal sketch of the loss follows the list):

  1. Collect preference data: Gather prompts, each with a pair of candidate responses in which one has been labelled as preferred (chosen) and the other as rejected.
  2. Prepare the models: Start from a supervised fine-tuned model and keep a frozen copy of it as the reference.
  3. Score the responses: For each pair, compute the log-probabilities of both responses under the current model and under the reference model.
  4. Optimise the DPO loss: Update the model so the chosen response becomes more likely than the rejected one, relative to the reference, scaled by β.
  5. Evaluate the results: Check the tuned model on held-out preferences and downstream tasks.
  6. Refine if necessary: Adjust β or the preference data and repeat the optimisation until the model's behaviour matches the intended preferences.
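
A minimal sketch of the DPO loss in PyTorch, assuming the per-response log-probabilities have already been summed over tokens; the function and argument names are illustrative:

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        # Inputs are torch tensors of shape (batch,).
        # Log-ratios of the current policy against the frozen reference.
        chosen_ratio = policy_chosen_logp - ref_chosen_logp
        rejected_ratio = policy_rejected_logp - ref_rejected_logp
        # Push the chosen response's ratio above the rejected one's.
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()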

DPO has practical advantages over the classic RLHF pipeline: it needs no separate reward model and no reinforcement-learning loop, which makes training simpler and more stable, while still steering the model toward the outputs that human labellers prefer. This makes it attractive when preference data is available but standing up a full RLHF system would be costly.


F

FFL: Feed Forward Layer

A Feed Forward Layer (FFL) is a neural network component that transforms input data into an intermediate representation before passing it on to the next layer. It consists of many neurons, each computing a weighted sum of its inputs, adding a bias, and applying an activation function. The output of the FFL is then fed into the next layer of the network.

The primary purpose of a Feed Forward Layer is to perform non-linear transformations on the input data, enabling the neural network to learn complex patterns and relationships between the input and output variables. This allows the network to make accurate predictions and classifications based on the input data.
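
A minimal feed-forward layer in PyTorch, using the two-linear-layer shape found inside transformer blocks (the sizes are hypothetical):

    import torch.nn as nn

    ffl = nn.Sequential(
        nn.Linear(512, 2048),  # weighted sums plus biases for each neuron
        nn.GELU(),             # non-linear activation
        nn.Linear(2048, 512),  # project back to the model dimension
    )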


FLARE: Forward-Looking Active Retrieval Augmented Generation

Forward-Looking Active Retrieval Augmented Generation (FLARE) is a retrieval-augmented generation technique in which the model actively decides when and what to retrieve while it is generating. Instead of retrieving once up front, the model looks ahead by drafting the sentence it is about to produce; if that draft contains low-confidence tokens, the draft is used as a search query, relevant documents are retrieved, and the sentence is regenerated with this evidence in context. The "forward-looking" part refers to anticipating the upcoming sentence, not to predicting future events.

FLARE aims to keep long-form generation grounded and current: a single retrieval at the start is often insufficient when an answer spans many facts, so retrieving throughout generation reduces hallucination. This is particularly useful for long-form question answering, multi-step summarisation, and other knowledge-intensive tasks.

The FLARE process typically involves the following steps (a pseudo-code sketch follows the list):

  1. Draft the next sentence: Use the language model to generate a tentative next sentence of the answer.
  2. Check confidence: Inspect the token probabilities of the draft; if every token is confident, accept the sentence as it stands.
  3. Form a query: Otherwise, build a retrieval query from the draft, for example by masking out its low-confidence spans.
  4. Retrieve: Fetch relevant documents from a corpus or search engine using that query.
  5. Regenerate: Produce the sentence again, conditioned on the retrieved documents.
  6. Repeat: Continue sentence by sentence until the answer is complete.
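
A pseudo-code sketch of the loop described above; generate_sentence, retrieve, and regenerate are hypothetical helpers standing in for an LM API and a search index:

    def flare_generate(question, max_sentences=10, threshold=0.6):
        answer = ""
        for _ in range(max_sentences):
            draft, token_probs = generate_sentence(question, answer)  # look ahead
            if min(token_probs) < threshold:        # any low-confidence tokens?
                docs = retrieve(query=draft)        # the draft itself is the query
                draft = regenerate(question, answer, docs)
            answer += draft
        return answer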

By interleaving retrieval with generation, FLARE helps keep every part of a long answer grounded in evidence, rather than relying only on whatever was retrieved before generation began. This makes it a useful building block for knowledge-intensive applications where accuracy across a long output matters.


FM: Foundation Model

A Foundation Model (FM) refers to a pre-trained large-scale machine learning model that can serve as a basis for solving a wide range of problems or tasks in various domains. Foundation models are designed to be versatile, adaptable, and capable of generating useful results with minimal fine-tuning on specific datasets or problem instances.

The key characteristics of foundation models include:

  1. Large scale: These models typically consist of millions or even billions of parameters, allowing them to learn complex data representations from various sources.
  2. Pre-training: Foundation models are trained on massive amounts of diverse data using self-supervised learning techniques, such as predicting masked words in sentences (e.g., BERT) or reconstructing image patches (e.g., Vision Transformer). This pre-training process helps the model learn generalised representations that can be applied to various tasks.
  3. Transfer learning: One of the primary advantages of foundation models is their ability to transfer learned knowledge from one task or domain to another with minimal additional training, often outperforming models trained from scratch on specific datasets.
  4. Multi-modal and multi-task capabilities: Many modern foundation models are designed to process data in multiple formats (e.g., text, images, audio) and can handle various tasks without requiring specialised architectures or training procedures.
  5. Modular architecture: Foundation models often consist of interchangeable components that can be fine-tuned or replaced depending on the specific task, allowing for easy customisation and adaptation.
  6. Open-source availability: Many foundation models are openly shared with the research community, enabling researchers and practitioners to build upon and extend their capabilities.

Foundation models have proven effective in various applications, including natural language processing (e.g., text generation, question answering), computer vision (e.g., image classification, object detection), and speech recognition. By providing a powerful starting point for solving a wide range of problems, foundation models have the potential to accelerate progress in AI research and development while reducing the need for custom model training in many cases.


FP16: Floating point 16 (Model precision)

FP32: Floating point 32 is often termed “full precision.”


G


GGML: Georgi Gerganov Machine Learning (file format)

The Georgi Gerganov Machine Learning (GGML) file format is a binary format for storing machine learning models, developed by Georgi Gerganov, the Bulgarian software developer behind the llama.cpp and whisper.cpp projects, together with the tensor library of the same name.

The GGML file format packs a model's hyperparameters, vocabulary, and full-precision or quantised tensors into a single file, making models easy to share and load across different platforms.

The GGML file format became popular among machine learning practitioners for running large models on commodity hardware: models trained in frameworks such as PyTorch are converted into GGML files for efficient CPU inference. It has since been superseded by its successor, GGUF.


GGUF: GGML Universal File (file format)

GGUF is a file format that stores GGML-based executors and models for inference. The binary GGUF format is designed so that models load and save quickly while remaining extensible: the weights and all metadata live in a single self-describing file. Typically, models are constructed in PyTorch or another framework and then converted to GGUF for use with GGML-based runtimes such as llama.cpp.


GLM: General Language Model

A General Language Model (GLM) is a machine learning model focusing on understanding and generating natural language text across various domains and tasks. It is a pre-trained, large-scale neural network that captures the statistical patterns and relationships in natural language data.

The primary characteristics of GLMs include:

  1. Large scale: These models typically consist of millions or even billions of parameters, allowing them to learn complex representations of text data.
  2. Pre-training and transfer learning: A GLM is pre-trained as a foundation model and then adapted to new domains and tasks through lightweight fine-tuning.
  3. Versatility: Like other foundation models, GLMs are designed to produce useful results with minimal fine-tuning on specific datasets or problem instances.
  4. Multi-modal and multi-task capabilities: Some variants handle text, images, and audio within a single model.
  5. Modular architecture: GLMs often consist of interchangeable components that can be fine-tuned or replaced depending on the specific task, allowing for easy customisation and adaptation.
  6. Open-source availability: Many GLMs are openly shared with the research community, enabling researchers and practitioners to build upon and extend their capabilities.

GLMs are effective across natural language applications such as text generation, question answering, translation, and summarisation. By providing a powerful starting point for language tasks, GLMs have the potential to accelerate progress in AI research and development while reducing the need for custom model training in many cases.


GoT: Graph of Thought

Graph of Thought (GoT) is a representation of cognition based on the idea that thoughts can be modelled as a network of interconnected concepts, an idea with a long history in cognitive science.

In GoT, concepts are represented as nodes in the graph, and connections between concepts are represented as edges. The strength of the connection between two concepts indicates the degree of association or similarity between them. The graph is dynamic, changing as new information is learned and existing knowledge is revised.

These concepts are used in LLM inference, too. GoT's main feature is its ability to express the information an LLM generates as an arbitrary graph, with vertices representing "LLM thoughts" and edges representing dependencies between them. This technique combines LLM thoughts to create synergistic effects, distils the essence of whole networks of thoughts, and enhances thoughts through feedback loops.


GPT: Generative Pre-trained Transformer

Generative Pre-trained Transformer (GPT) is a type of neural network architecture based on transformer technology. It is designed for language generation tasks like text completion or translation. The key aspect of GPT is its pre-training method, which allows it to learn the patterns and structures of language from large amounts of data. This pre-training enables the model to generate new, coherent text when provided with a prompt or context. GPT models perform impressively across language generation tasks and are considered state-of-the-art in many applications.

In simple terms, GPT is an AI system trained on vast amounts of text data to understand how language works. This training allows it to generate new text resembling human-written content based on a given prompt or context. It uses the transformer architecture to process and generate long text sequences efficiently. GPT represents a significant advancement in natural language processing and generation capabilities.


GPTQ: GPT Quantised

GPTQ is a one-shot, post-training weight quantisation method for transformer models. It quantises the weights layer by layer, using approximate second-order (Hessian-based) information from a small calibration set to compensate for rounding error, and can compress large language models to 4 or even 3 bits per weight with minimal loss in accuracy.


H

HQQ: Half-Quadratic Quantisation

Half-Quadratic Quantisation (HQQ) is a weight quantisation method for large models that needs no calibration data at all.

In HQQ, finding the quantisation parameters, the zero point and scale, is posed as an optimisation problem with an outlier-robust loss. The problem is solved with half-quadratic splitting, which alternates between sub-problems that each have cheap, closed-form updates. Because no calibration data or gradient steps are involved, quantising even very large models is dramatically faster than with data-dependent methods.

HQQ is particularly useful for quickly producing low-bit versions of large language models, and it has been reported to be competitive with calibration-based methods such as GPTQ and AWQ in output quality.


I


ICL: In-context learning

In-context learning (ICL) refers to a large language model's ability to pick up a new task from examples or instructions supplied in its prompt at inference time, without any updates to the model's weights. The model "learns" purely from the context window: given a few demonstrations of an input-output pattern, it continues that pattern on new inputs.

The key features of ICL include:

  1. No weight updates: The task is conveyed entirely through the prompt, and the model's parameters remain untouched, which distinguishes ICL from fine-tuning.
  2. Zero- and few-shot learning: The model can perform new tasks from an instruction alone (zero-shot) or from a handful of in-context demonstrations (few-shot), without additional fine-tuning or adaptation.
  3. Task-agnostic pre-training: ICL emerges from training on large, diverse datasets covering many tasks and topics; larger models are generally stronger in-context learners.
  4. Demonstrations as a teaching signal: Worked examples placed in the prompt act much like an expert's demonstrations, guiding the model toward the intended behaviour.
  5. Adaptive prompting: The choice, order, and format of in-context examples can significantly affect accuracy, so adaptive prompting techniques are often used to select good demonstrations.
  6. Open-ended reasoning: ICL can enable models to engage in open-ended reasoning and problem-solving by leveraging contextual information supplied at inference time, allowing them to tackle new tasks with little or no additional guidance.

In-context learning rose to prominence with large language models like GPT-3, which demonstrated zero- and few-shot learning across a wide range of natural language processing tasks without any task-specific fine-tuning.
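
A small few-shot example: the task below is defined entirely inside the prompt, and a capable model continues the pattern with no weight updates (the reviews are made up):

    prompt = (
        "Classify the sentiment as positive or negative.\n"
        "Review: The battery died within a week. Sentiment: negative\n"
        "Review: Gorgeous screen and very fast. Sentiment: positive\n"
        "Review: The hinge broke on day two. Sentiment:"
    )
    # An in-context learner should continue with " negative".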


IFT: Instruction fine-tuning

Instruction fine-tuning (IFT) is a technique used in natural language processing (NLP) to improve the performance of large pre-trained language models like GPT-3 on specific tasks by using task-specific instructions as guidance during fine-tuning.

The key features of IFT include:

  1. Task-specific instructions: Instead of directly optimising for a specific metric or objective, IFT provides the model with clear and concise instructions on performing a given task. These instructions can be natural language prompts or structured templates that guide the model's behaviour (see the example after this list).
  2. Fine-tuning: IFT builds upon pre-trained language models by fine-tuning them on specific tasks using a combination of task-specific instructions and labelled data. This process allows the model to adapt its parameters to perform the target task better while retaining its general language understanding capabilities.
  3. Zero or few-shot learning: By providing clear instructions, IFT can enable models to learn new tasks with little or no additional training data, improving zero or few-shot learning capabilities.
  4. Adaptive prompting: To facilitate task-specific learning, IFT often relies on adaptive prompting techniques that help guide the model toward generating the correct output for a given task without explicit supervision.
  5. Robustness and generalisation: By leveraging pre-trained language models and task-specific instructions, IFT can improve the robustness and generalisation capabilities of NLP models, allowing them to perform well on various tasks and variations within those tasks.
  6. Human alignment: IFT can help ensure that AI models follow human intentions and values by providing clear, unambiguous instructions during fine-tuning. This approach can also make interpreting and understanding the model's behaviour in different contexts easier.
  7. Transfer learning: Since IFT builds upon pre-trained language models, it can enable knowledge transfer between related tasks or domains, further improving the model's generalisation capabilities.
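
A hypothetical training example in the instruction/input/output format common to IFT datasets:

    example = {
        "instruction": "Summarise the following text in one sentence.",
        "input": "Large language models are pre-trained on broad text corpora "
                 "and can be adapted to many downstream tasks.",
        "output": "LLMs learn general language skills from broad pre-training "
                  "and can then be adapted to specific tasks.",
    }
    # Fine-tuning minimises the loss on the `output` tokens, given the
    # instruction and input as the prompt.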

K

KG: Knowledge Graph

A Knowledge Graph (KG) is a data structure representing facts or knowledge about entities and their relationships. It typically consists of nodes representing entities and edges representing relationships between them.

Knowledge Graphs are used in various applications such as question answering, information retrieval, and recommendation systems. They can store information about people, places, organisations, events, and other entities. The relationships between entities can be represented using various edges, such as "is a" for hierarchical relationships or "knows" for social relationships. One common use case for Knowledge Graphs is in search engines, where they help to understand the meaning behind user queries and return more relevant results.
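
A toy knowledge graph as (subject, relation, object) triples, with a lookup showing how edges encode relationships (the facts are illustrative):

    triples = [
        ("Ada Lovelace", "is_a", "person"),
        ("Ada Lovelace", "knows", "Charles Babbage"),
        ("London", "is_a", "city"),
    ]
    # Who does Ada Lovelace know?
    neighbours = [o for s, r, o in triples if s == "Ada Lovelace" and r == "knows"]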


L

LLM: Large Language Model

A Large Language Model (LLM) is an artificial intelligence model designed to generate human-like text or perform natural language processing tasks with high accuracy and fluency. LLMs are typically built using neural networks, such as Transformers, and trained on vast amounts of text data from the web, books, and other sources to learn the patterns and structures of language.

Some well-known examples of Large Language Models include GPT (Generative Pre-trained Transformer), BERT (Bidirectional Encoder Representations from Transformers), and T5 (Text-to-Text Transfer Transformer). These models have significantly advanced the field of natural language processing and opened up new possibilities for AI-powered applications in various domains.


LRU (Cache): Least Recently Used Cache 

A Least Recently Used (LRU) cache is a caching strategy used in memory management systems to decide which data to keep in the cache and which to discard. The basic idea behind LRU is that data accessed recently is likely to be accessed again soon, so it should be kept in the cache for quick access.

In an LRU cache, the entry that has gone unused for the longest time is considered the oldest. When a full cache must admit a new item, it evicts that least recently used entry to make room.

This strategy improves performance by keeping frequently accessed data close at hand. It can, however, behave poorly under some access patterns, for example a long sequential scan, where entries are evicted just before they would have been reused.
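
A minimal LRU cache using the standard library's OrderedDict, which keeps entries in access order:

    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = OrderedDict()

        def get(self, key):
            if key not in self.data:
                return None
            self.data.move_to_end(key)            # mark as most recently used
            return self.data[key]

        def put(self, key, value):
            if key in self.data:
                self.data.move_to_end(key)
            self.data[key] = value
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)     # evict the least recently used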


LSTMs: Long Short-Term Memory Networks

Long Short-Term Memory Networks (LSTMs) are a type of recurrent neural network that is particularly well-suited for processing sequential data. Sepp Hochreiter and Jürgen Schmidhuber first introduced them in 1997.

LSTMs are built from memory cells, each maintaining an internal cell state regulated by input, forget, and output gates. These gates control what information is written to, kept in, and read from the cell state, allowing the network to remember long-term dependencies while adapting to short-term changes. This unique ability makes LSTMs particularly useful for language modelling, speech recognition, and time-series forecasting tasks.

In contrast to traditional recurrent neural networks, which suffer from vanishing and exploding gradients, LSTMs use these gates to control the flow of information across time steps. This enables them to balance short-term and long-term memory, leading to more accurate predictions over extended sequences.
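
A minimal LSTM in PyTorch processing a batch of sequences (the sizes are hypothetical):

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
    x = torch.randn(4, 32, 10)        # 4 sequences of 32 steps, 10 features each
    output, (h_n, c_n) = lstm(x)      # output: (4, 32, 20); h_n, c_n: final states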


LoRA: Low-rank adaptation

Low-rank adaptation (LoRA) is a parameter-efficient technique for fine-tuning large pre-trained models. Rather than updating all of a model's weights, it freezes them and learns a small, low-rank update for each adapted weight matrix.

Concretely, the update to a weight matrix is factored as the product of two much smaller matrices, with a rank far lower than the matrix's dimensions. Only these two factors are trained, which cuts the number of trainable parameters by orders of magnitude, while the frozen base weights preserve the model's general knowledge.

LoRA has been applied widely to fine-tuning large language models and image-generation models. The resulting adapters are small enough to store and swap cheaply, and they can be merged into the base weights so that inference incurs no extra latency.
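
A minimal sketch of a LoRA-augmented linear layer in PyTorch: the pre-trained weight is frozen, and only the low-rank factors A and B are trained (the rank and scaling follow common defaults, not any particular library):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank=8, alpha=16):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)   # freeze pre-trained weights
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # update starts at zero
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)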


M

MoE: Mixture of Experts

Mixture of Experts (MoE) is a neural network architecture that combines multiple experts to generate predictions. Each expert is a separate neural network specialising in a specific task or domain. Combining these experts allows the overall network to make more accurate predictions across a broader range of inputs.

In an MoE network, a gating mechanism examines each input and assigns it a weight over the experts, routing it to the most appropriate expert or a weighted combination of a few. The experts' outputs are then combined according to those gating weights. This allows the network to adjust its computation dynamically based on the input, ensuring that the right experts are involved in generating the final prediction.

MoE has been applied to various tasks, such as image classification, natural language processing, and speech recognition. It has shown promising results in these domains by leveraging individual experts' strengths to produce more accurate and robust predictions.
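
A minimal dense mixture-of-experts layer in PyTorch; production systems usually route sparsely to the top-k experts, but the gating idea is the same (the sizes are hypothetical):

    import torch
    import torch.nn as nn

    class MoE(nn.Module):
        def __init__(self, dim=64, n_experts=4):
            super().__init__()
            self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
            self.gate = nn.Linear(dim, n_experts)

        def forward(self, x):                              # x: (batch, dim)
            weights = self.gate(x).softmax(dim=-1)         # (batch, n_experts)
            outputs = torch.stack([e(x) for e in self.experts], dim=1)
            return (weights.unsqueeze(-1) * outputs).sum(dim=1)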


MMLU: Massive Multitask Language Understanding

Massive Multitask Language Understanding (MMLU) is a benchmark for measuring the breadth of knowledge and reasoning of language models. It consists of multiple-choice questions spanning 57 subjects, from elementary mathematics and US history to law, medicine, and computer science.

The key advantage of MMLU is its coverage: because the subjects are so varied, a high score requires broad world knowledge and problem-solving ability rather than skill at any single task. Models are typically evaluated in a zero-shot or few-shot setting, and results are reported as average accuracy across subjects.

MMLU has become one of the standard benchmarks for comparing large language models and is widely quoted in model releases and leaderboards as a summary measure of general capability.


O

OIG: Open Instruction Generalist

Open Instruction Generalist (OIG) is a large open-source dataset of instruction-following examples assembled by LAION and collaborators for instruction-tuning chat-style assistants. It contains tens of millions of instruction-response pairs drawn from a mix of curated, converted, and synthetically generated sources.

The dataset is intended to be used in stages: a model is first fine-tuned on the large, noisier bulk of the data and can then be refined on smaller, higher-quality subsets. This staged approach lets a base language model acquire general instruction-following behaviour without extensive task-specific engineering.

OIG has been used to instruction-tune a range of open models, and it illustrates how open datasets make capable dialogue assistants reproducible outside large industrial labs.


ONNX: Open Neural Network Exchange

Open Neural Network Exchange (ONNX) is an open standard for representing machine learning models and algorithms. It allows developers to create and share models across different frameworks and platforms, enabling collaboration and innovation in artificial intelligence.

ONNX defines a standard format for describing neural network models, including their architecture, parameters, and operations. This standardised format makes it easier to compare, combine, and reproduce different models and port them between frameworks and hardware platforms.

ONNX was created by Facebook and Microsoft and is now an open, community-governed standard backed by many AI companies and research institutions, including Microsoft, Meta, and IBM. It continues to gain popularity due to its ability to facilitate interoperability and collaboration in developing and deploying machine learning models.
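
Exporting a small PyTorch model to ONNX; torch.onnx.export writes a portable graph that other runtimes can load (the model and file name here are arbitrary):

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)
    dummy_input = torch.randn(1, 4)                  # example input fixes the graph's shapes
    torch.onnx.export(model, dummy_input, "linear.onnx")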


OPT: Open Pre-trained Transformer

Open Pre-trained Transformer (OPT) is a family of pre-trained language models developed and released by Meta AI. These models are based on the transformer architecture, a state-of-the-art approach in natural language processing (NLP).

OPT models are trained on large corpora of text data, allowing them to learn the patterns and structure of human language. This pre-training enables them to perform various NLP tasks accurately, such as text classification, question answering, and summarisation.

The OPT family includes several models with different sizes and capabilities, ranging from 125 million to 175 billion parameters. The weights were released for research use, enabling researchers and developers to build upon them, study large-model behaviour, and contribute to the advancement of AI in NLP.


P

PaLM: Pathways Language Model

The Pathways Language Model (PaLM) is a transformer-based language model developed by Google AI. It is one of the largest language models of its generation, with 540 billion parameters in its largest version.

PaLM is pre-trained with self-supervised learning on a diverse corpus of text data, including web pages, books, and articles, and can then be fine-tuned on specific tasks. This allows it to generate high-quality text across various domains and languages, and helps it produce more coherent and diverse responses than smaller models.

PaLM is designed for various natural language processing (NLP) tasks such as language translation, text summarisation, and question answering. It has achieved state-of-the-art performance on several benchmarks, including the GLUE (General Language Understanding Evaluation) and SuperGLUE datasets.


PEFT: Parameter Efficient Fine Tuning

Parameter Efficient Fine Tuning (PEFT) is a technique used to fine-tune large pre-trained language models on a specific task with fewer parameters.

PEFT methods freeze most or all of the pre-trained model's parameters and train only a small number of additional or selected ones, for example adapter layers, low-rank updates (as in LoRA), or soft prompts. The model is thereby adapted to the new task while training only a tiny fraction of the parameters of the original pre-trained model.

PEFT has been shown to achieve competitive performance while significantly reducing the computational cost and memory requirements compared to traditional fine-tuning methods. It has been applied to various natural language processing tasks such as question-answering, sentiment analysis, and machine translation.


PTQ: Post-Training Quantisation

Post-training quantisation (PTQ) is a technique used in machine learning to reduce the size of trained neural networks by quantising their weights and activations.

PTQ works by choosing quantisation parameters (scales and zero points) for the network's weights, and often its activations, after training is complete, typically with the help of a small calibration dataset. No further gradient updates are involved, which makes PTQ far cheaper to apply than quantisation-aware training.

PTQ has achieved competitive performance while significantly reducing the model's size and improving its efficiency on resource-constrained devices. It has been applied to machine learning tasks such as image classification, object detection, and natural language processing.


Q

QAT: Quantization-Aware Training

Quantization-aware training (QAT) is a technique used in machine learning to optimise the performance of neural networks on low-precision hardware.

QAT simulates quantisation during training: the forward pass uses quantised ("fake-quantised") weights and activations, while full-precision copies are kept for the weight updates, with gradients typically passed through the rounding operation via the straight-through estimator. This allows the network to adapt to the quantised representation and improve its accuracy on low-precision hardware.

QAT has been shown to achieve better performance than traditional quantisation methods while maintaining the same level of accuracy. It has been applied to machine-learning tasks such as image classification, object detection, and natural language processing.
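
A minimal fake-quantisation step of the kind used inside QAT: the forward pass sees quantised values, while the straight-through estimator lets gradients flow as if quantisation were the identity (an illustrative sketch, not a full training setup):

    import torch

    def fake_quant(x, n_bits=8):
        qmax = 2 ** (n_bits - 1) - 1
        scale = x.abs().max() / qmax
        x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        # Forward value is quantised; the gradient passes straight through to x.
        return x + (x_q - x).detach()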


QLoRA: Quantized LoRA

QLoRA fine-tunes LoRA adapters on top of a base model whose frozen weights have been quantised to 4 bits. Because gradients flow only into the small adapters while the quantised base stays fixed, very large models can be fine-tuned on a single GPU with little loss in quality.


QA LoRA: Quantization-Aware LoRA

QA-LoRA makes the LoRA fine-tuning itself quantisation-aware, so that the learned adapters can be merged into a quantised model directly, avoiding the accuracy drop that post-hoc quantisation of a merged model can cause.


R

RAG: Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) is a method used in natural language generation to improve the factual quality and relevance of generated text. RAG works by retrieving relevant passages from an external knowledge source, such as a document collection or vector database, and inserting them into the model's context. The retrieved context guides text generation, grounding the output in source material and reducing hallucination, and it allows a system's knowledge to be updated by changing the document store rather than retraining the model.

RAG has been shown to produce more diverse and high-quality text compared to traditional generative models. It has been applied to various natural language generation tasks such as text summarisation, question answering, and machine translation.
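
A pseudo-code sketch of a basic RAG pipeline; embed, index.search, and llm_generate are hypothetical stand-ins for an embedding model, a vector store, and an LLM API:

    def rag_answer(question, index, k=3):
        docs = index.search(embed(question), k=k)        # retrieve top-k passages
        context = "\n\n".join(d.text for d in docs)
        prompt = (f"Answer using only the context below.\n\n"
                  f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
        return llm_generate(prompt)                      # grounded generation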


RLAIF: Reinforcement learning from AI feedback

Reinforcement Learning from AI Feedback (RLAIF) is a variant of RLHF in which the preference labels used to improve a model come from another AI system rather than from human annotators.

RLAIF works by having a judge model evaluate and compare the outputs of the model being trained, often against a written set of principles. The resulting preferences are used to train the model exactly as human preference labels would be in RLHF. Because AI feedback is far cheaper and faster to collect than human feedback, RLAIF can scale preference data collection dramatically, and it has been reported to perform comparably to RLHF on tasks such as summarisation and dialogue.


RLHF: Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is an approach to fine-tuning large language models, which involves providing human feedback to guide the model's improvement. In RLHF, humans interact with the model and provide feedback on the quality of its responses using a rating system. This feedback adjusts the model's parameters and enhances its performance. RLHF aims to create more coherent and accurate LLMs that can better understand human language and intent, improving their ability to generate relevant and helpful responses.


RM: Reward Model

A Reward Model (RM) is a model used in the context of large language models (LLMs) to score generated text by how well it matches human preferences. An RM is typically trained on human comparisons of candidate responses, learning to assign higher scores to the outputs people prefer, whether the criterion is relevance, factual accuracy, grammaticality, or adherence to a particular style. In reinforcement learning from human feedback, the RM's score serves as the reward signal that the language model is optimised against, steering it toward text that better suits the needs and preferences of users.


RNN: Recurrent Neural Network

A Recurrent Neural Network (RNN) is an artificial neural network that processes sequential data one step at a time, carrying a hidden state forward from step to step. Because information from previous time steps is incorporated into the processing of the current step, the network maintains a "memory" of the input sequence and can capture dependencies between elements within it. RNNs have been highly effective in tasks such as language modelling, machine translation, and speech recognition, though transformers have largely displaced them in modern LLMs.


S

S4: Structured State Space for Sequence Modelling

Structured State Space for Sequence modelling (S4) is a sequence model built on linear state space models, in which an input signal drives a hidden state vector through linear dynamics and the output is read from that state. S4's key contribution is a structured parameterisation of the state matrix that keeps the recurrence stable over very long contexts and allows the whole sequence computation to be carried out efficiently as a global convolution. This lets S4 capture very long-range dependencies at a cost that grows nearly linearly with sequence length, and it is the basis of later state space architectures such as Mamba.


SFT: Supervised Fine-Tuning

Supervised fine-tuning (SFT) is a technique used in large language models (LLMs) to adapt the model's performance to a specific task or domain. The LLM is first pre-trained on a large corpus of text data using unsupervised methods, such as auto-encoding or language modelling. Then, it is fine-tuned using a smaller dataset of labelled examples from the target task or domain. The fine-tuning process is supervised in that the labels in the training data guide the LLM to improve its performance on the target task.

This method has been shown to significantly improve the accuracy and effectiveness of LLMs for specific applications, such as chatbots, question-answering systems, or sentiment analysis. However, it requires access to large amounts of labelled data to achieve good results.


SMoE: Sparse MoE

A Sparse Mixture of Experts (SMoE) activates only a small subset of experts, often just the top one or two chosen by the gating network, for each input token. This lets the total parameter count grow far faster than the compute spent per token, which is how several recent large models scale capacity cheaply.


SSM: State Space Sequence Models

State Space Sequence Models (SSMs) are a class of machine learning models that represent a sequence through a latent state vector evolving over time according to learned, typically linear, dynamics: at each step the state is updated from the previous state and the current input, and the output is read from the state. Deep variants such as S4 and Mamba stack such layers and learn the dynamics from data, which lets them model very long sequences at near-linear cost.

SSMs are often used in natural language processing, where they model text data and support tasks like language translation, text summarisation, and sentiment analysis. They can also be applied in other domains, such as time series analysis and bioinformatics.

The main advantage of SSMs is their ability to capture the complexity and context-dependence of sequences, which can be challenging for other models to represent accurately. However, they typically require more computational resources and data than simpler sequence models like Markov chains.
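
At the core of an SSM is a simple recurrence. A minimal NumPy sketch of the discrete linear form, x[t+1] = A x[t] + B u[t] and y[t] = C x[t], with hypothetical matrices:

    import numpy as np

    def ssm_scan(A, B, C, u):
        x = np.zeros(A.shape[0])
        ys = []
        for u_t in u:                 # sequential form; S4 computes this as a convolution
            x = A @ x + B * u_t       # state update from previous state and input
            ys.append(C @ x)          # read the output from the state
        return np.array(ys)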


SWA: Sliding Window Attention

Sliding Window Attention (SWA) is a method used to make transformer attention affordable on long sequences of data, such as long documents in natural language understanding.

In SWA, each position attends only to a fixed-size "sliding window" of nearby tokens rather than to the entire input sequence, reducing the cost of attention from quadratic to roughly linear in sequence length. Although a single layer sees only local context, stacking layers lets information propagate much further: after k layers, a token can be influenced by tokens up to k times the window size away. This makes SWA particularly well suited to tasks with long inputs, and it is used in models such as Longformer and Mistral.


T

TICL: Textual In-Context Learning

Textual in-context learning is a machine learning approach that trains a model on a text corpus with associated contextual information. The model learns to recognise patterns and relationships between words and phrases in the text data and their contextual relevance. This enables the model to understand the meaning of words or phrases in a sentence more accurately, even if it has not seen those exact words or phrases before.

In other words, textual in-context learning allows the model to better understand a sentence's context and meaning rather than relying on individual word frequencies or pre-trained word embeddings alone. This makes it more versatile and capable of understanding complex language structures and nuances.


ToT: Tree of Thought (prompting)

Tree of Thought (ToT) prompting is a technique that generalises chain-of-thought prompting by letting a language model explore many reasoning paths rather than committing to a single one. At each step, the model proposes several candidate intermediate "thoughts", evaluates how promising each one is (often using the model itself as the judge), and expands only the best candidates, searching the resulting tree with lookahead and backtracking.

For example, on a puzzle such as the Game of 24, a single chain of thought that makes an early arithmetic mistake cannot recover. With ToT, the model generates several partial solutions, discards branches judged to be dead ends, and continues only along the viable ones.

In practice, ToT is implemented as a loop around the model: a generation prompt asks for candidate next steps, an evaluation prompt scores the partial solutions, and a search procedure such as breadth-first or depth-first search decides which branches to pursue.

ToT prompting is a valuable technique for improving the performance of language models on tasks that require planning, exploration, or search, where greedy left-to-right reasoning tends to fail. It trades extra computation, since many branches must be generated and scored, for considerably better problem-solving accuracy.
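
A pseudo-code sketch of the search loop; propose_thoughts and score_thought are hypothetical LLM-backed helpers for generating and evaluating candidate reasoning steps:

    def tot_search(problem, depth=3, breadth=3, beam=2):
        frontier = [""]                               # partial reasoning paths
        for _ in range(depth):
            candidates = [
                path + thought
                for path in frontier
                for thought in propose_thoughts(problem, path, n=breadth)
            ]
            # keep only the most promising branches (a simple beam search)
            candidates.sort(key=lambda p: score_thought(problem, p), reverse=True)
            frontier = candidates[:beam]
        return frontier[0]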