What is an embedding for AI?

When a question is presented to an artificial intelligence (AI) algorithm, it must be converted into a format that the algorithm can understand. This is often called “embedding a problem,” to use the verb form of the word. Scientists also use the word as a noun and talk about an “embedding.”

In most cases, the embeddings are collections of numbers. They are often arranged in a vector to simplify their representation. Sometimes they’re presented as a square or rectangular matrix to enable some mathematical work.

Embeddings are constructed from raw data that may be numerical, audio, video or textual information. Pretty much any data from an experiment or a sensor can be converted into an embedding in some form.

In some cases, it’s an obvious process. Numbers like temperatures or times can be copied pretty much verbatim. They may also be rounded off, converted into a different set of units (say to Celsius from Fahrenheit), normalized or cleaned of simple errors.
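The simple case described above can be sketched in a few lines of Python. This is only an illustration: the function names, the plausible-range limits and the sample readings are assumptions made up for the example, not part of any particular library.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def clean_readings(readings_f, low_f=-100.0, high_f=150.0):
    """Drop missing or implausible readings, convert to Celsius, round to 0.1.

    The low_f/high_f limits are hypothetical thresholds for spotting
    simple sensor errors.
    """
    cleaned = []
    for value in readings_f:
        if value is None:
            continue  # missing reading
        if not (low_f <= value <= high_f):
            continue  # out-of-range value, treated as a simple error
        cleaned.append(round(fahrenheit_to_celsius(value), 1))
    return cleaned


# Made-up sensor data: one missing value and one obvious glitch.
raw = [32.0, 68.0, None, 98.6, 9999.0, 50.0]
celsius = clean_readings(raw)
print(celsius)
```

The missing reading and the glitch are dropped, and the four remaining values are copied into the result almost verbatim, just in different units.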

In other cases, it’s a mixture of art and knowledge. The algorithms take the raw information and look for salient features and patterns that might help answer the question at hand for the AI. For instance, an autonomous car may look for octagonal patterns to identify stop signs. Similarly, a text algorithm may look for words that generally have an angry connotation so it can gauge the sentiment of a statement.

What is the structure of an AI embedding?

The embedding algorithm transforms these raw files into simpler collections of numbers. This numerical format for the problem is usually a deliberate simplification of the different elements from the problem. It’s designed so that the details can be described with a much smaller set of numbers. Some scientists say that the embedding process goes from an information-sparse raw format into an information-dense format of the embedding.

This shorter vector shouldn’t be confused with the larger raw data files, which are also ultimately just collections of numbers. All data is numerical in some form because computers are built from logic gates that operate only on numeric values.

The embeddings are often a few important numbers — a succinct encapsulation of the important components in the data. An analysis of a sports problem, for example, may reduce each entry for a player to height, weight, sprinting speed and vertical leap. A study of food may reduce each potential menu item to its composition of protein, fats and carbohydrates.

The decision of what to include and leave out in an embedding is both an art and a science. In many cases, this structure is a way for humans to add their knowledge of the problem area and leave out extraneous information while guiding the AI to the heart of the matter. For example, an embedding can be structured so that a study of athletes could exclude the color of their eyes or the number of tattoos.
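A hand-designed embedding of this kind might look like the following sketch. The field names, units and the sample record are hypothetical; the point is only that the designer lists the fields that matter and everything else is left out.

```python
# Fields the designer decided are relevant (hypothetical names and units).
FEATURES = ("height_cm", "weight_kg", "sprint_speed_mps", "vertical_leap_cm")


def embed_player(record):
    """Reduce a player record to a short vector of the chosen features."""
    return [float(record[name]) for name in FEATURES]


# A made-up record that also carries details the embedding ignores.
player = {
    "name": "Example Player",
    "height_cm": 198,
    "weight_kg": 95,
    "sprint_speed_mps": 8.9,
    "vertical_leap_cm": 76,
    "eye_color": "brown",
    "tattoos": 3,
}

vector = embed_player(player)
print(vector)
```

Eye color and tattoo count never reach the vector, so the AI cannot use them, for better or worse.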

In some cases, scientists deliberately begin with as much information as possible and then let the algorithm search out the most salient details. Sometimes the human guidance ends up excluding useful details without recognizing the implicit bias that doing so causes.

How are embeddings biased?

Artificial intelligence algorithms are only as good as the embeddings in their training set, and those embeddings are only as good as the data inside them. If there is bias in the raw data collected, the embeddings built from it will, at the very least, reflect that bias.

For example, if a dataset is collected from one town, it will only contain information about the people in that town and carry with it all the idiosyncrasies of the population. If the embeddings built from this data are used on this town alone, the biases will fit the people. But if the data is used to fit a model that is applied to many other towns, its biases may be a poor match for those populations.

Sometimes biases can creep into the model through the process of creating an embedding. The algorithms reduce the amount of information and simplify it. If this eliminates some crucial element, the bias will grow.

There are some algorithms designed to reduce known biases. For example, a dataset may be gathered imperfectly and may overrepresent, say, women or men relative to the general population. Perhaps only some responded to a request for information or perhaps the data was only gathered in a biased location. The embedded version can randomly exclude some of the overrepresented set to restore some balance overall.
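Randomly excluding part of the overrepresented set is often called random undersampling. The sketch below is one minimal way to do it, assuming each record is a dictionary with a field that names its group; the function name, the field name and the sample data are all invented for illustration.

```python
import random
from collections import defaultdict


def undersample(records, key, seed=0):
    """Randomly drop records so every group is as small as the smallest one."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    target = min(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        # Sample without replacement from each group.
        balanced.extend(rng.sample(members, target))
    return balanced


# Made-up survey: group "A" answered far more often than group "B".
records = [{"id": i, "group": "A"} for i in range(8)]
records += [{"id": i, "group": "B"} for i in range(8, 11)]

balanced = undersample(records, "group")
counts = {g: sum(1 for r in balanced if r["group"] == g) for g in ("A", "B")}
print(counts)
```

The trade-off is that balance is bought by throwing data away, which may not be acceptable when the smaller group is very small.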

Is there anything that can be done about bias?

In addition to this, there are some algorithms designed to add balance to a dataset. These algorithms use statistical techniques and AI to identify dangerous or biased correlations in the dataset. The algorithms can then either delete or rescale the data to remove some of the bias.

A skilled scientist can also design the embeddings to target the best answer. The humans creating the embedding algorithms can pick and choose approaches that can minimize the potential for bias. They can either leave off some data elements or minimize their effects.

Still, there are limits to what they can do about imperfect datasets. In some cases, the bias is a dominant signal in the data stream.

What are the most common structures for embeddings?

Embeddings are designed to be information-dense representations of the dataset being studied. The most common format is a vector of floating-point numbers. The values are scaled, sometimes logarithmically, so that each element of the vector has a similar range of values. Some choose values between zero and one.
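As a rough illustration of the scaling step, the sketch below applies min-max scaling, with and without a logarithm first, to a made-up column of values that spans several orders of magnitude. The function names and data are assumptions for the example.

```python
import math


def min_max_scale(values):
    """Linearly rescale values so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in values]


def log_scale(values):
    """Take the base-10 logarithm of each (strictly positive) value."""
    return [math.log10(v) for v in values]


# Hypothetical column spanning several orders of magnitude.
incomes = [1_000, 10_000, 100_000, 1_000_000]

linear = min_max_scale(incomes)
logarithmic = min_max_scale(log_scale(incomes))
print(linear)
print(logarithmic)
```

With linear scaling alone, the three smallest values are squeezed close to zero. Taking the logarithm first spreads them evenly across the zero-to-one range, which is why logarithmic scaling is sometimes preferred for skewed data.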

One goal is to ensure that the distances between the vectors represent the differences between the underlying elements. This can require some artful decision-making. Some data elements may be pruned. Others may be scaled or combined.
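One common way to measure the distance between two vectors is the Euclidean distance. The sketch below uses it on three food items described by protein, fat and carbohydrate fractions; the numbers are invented for illustration and are not real nutritional data.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical (protein, fat, carbohydrate) fractions per item.
chicken = [0.31, 0.04, 0.00]
tofu = [0.08, 0.05, 0.02]
bread = [0.09, 0.03, 0.49]

d_chicken_tofu = euclidean_distance(chicken, tofu)
d_chicken_bread = euclidean_distance(chicken, bread)
print(d_chicken_tofu, d_chicken_bread)
```

In this made-up data the carbohydrate value dominates, so bread ends up far from both of the other items. Whether that matches the real differences between the foods is exactly the kind of judgment the scaling and pruning decisions have to make.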

While there are some data elements like temperatures or weights that are naturally floating-point numbers on an absolute scale, many data elements don’t fit this directly. Some parameters are boolean values, for example, whether a person owns a car. Others are drawn from a set of standard values, say, the model, make and model year of a car.
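A widely used way to fit such values into a numeric vector is to map a boolean to 0 or 1 and to "one-hot" encode a categorical value, giving each allowed category its own slot. The sketch below assumes a tiny, made-up list of car makes.

```python
# Hypothetical set of standard values for one categorical field.
MAKES = ["Ford", "Honda", "Toyota"]


def one_hot(value, categories):
    """Return a vector with 1.0 in the slot matching value, 0.0 elsewhere."""
    return [1.0 if value == category else 0.0 for category in categories]


def embed_person(owns_car, car_make):
    """Combine a boolean flag and a one-hot encoded make into one vector."""
    return [1.0 if owns_car else 0.0] + one_hot(car_make, MAKES)


owner = embed_person(True, "Honda")
non_owner = embed_person(False, None)
print(owner)      # [1.0, 0.0, 1.0, 0.0]
print(non_owner)  # [0.0, 0.0, 0.0, 0.0]
```

One-hot encoding avoids implying an order between categories, at the cost of a longer vector when the set of standard values is large.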

A real challenge is converting unstructured text into embedded vectors. One common algorithm is to search for the presence or absence of uncommon words, that is, words that aren’t basic verbs, pronouns or other glue words used in every sentence. Some of the more complex algorithms include Word2vec, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA) and Biterm Topic Model (BTM).
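The presence-or-absence approach can be sketched as follows. This is a toy version: the list of glue words, the tokenizer and the sample sentences are all assumptions for the example, and it is far simpler than Word2vec or the topic models named above.

```python
import re

# A tiny, hypothetical list of glue words to ignore.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to", "was"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Collect every non-glue word seen in the documents, in sorted order."""
    words = set()
    for document in documents:
        words.update(w for w in tokenize(document) if w not in STOP_WORDS)
    return sorted(words)


def embed_text(text, vocabulary):
    """Mark each vocabulary word as present (1) or absent (0) in the text."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


docs = ["The service was terrible", "This is a wonderful service"]
vocabulary = build_vocabulary(docs)
vectors = [embed_text(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['service', 'terrible', 'wonderful']
print(vectors)     # [[1, 1, 0], [1, 0, 1]]
```

The two sentences share the slot for "service" and differ in the slots that carry the sentiment, which is the signal a downstream model would learn from.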

Are there standards for embeddings?

As AI has grown more common and popular, scientists have created and shared some standard embedding algorithms. These versions, usually released under open-source licenses, are often developed by university researchers who share them to increase knowledge.

Other algorithms come directly from companies. They’re effectively selling not just their AI learning algorithms, but also the embedding algorithms for pre-processing the data.

Some better-known standards are:

Object2Vec – From Amazon’s SageMaker. This algorithm finds the most salient parts of any data object and keeps them. It’s designed to be highly customizable, so the scientist can focus on the important data fields.
Word2vec – Google created Word2vec, an algorithm that converts words into vector embeddings by analyzing the contexts in which they appear, capturing semantic and syntactic patterns. It is trained so that words with similar meanings will end up with similar vector embeddings.
GloVe – Stanford researchers built this algorithm, which derives word embeddings by analyzing aggregate statistics about how often words appear together in a large body of text. The name is short for Global Vectors.
Inception – This model uses a convolutional neural network to analyze images directly and then produce embeddings based upon the content. Its principal authors came from Google and several major universities.

How are the market leaders creating embeddings for their AI algorithms?

All of the major computing companies have strong investments in artificial intelligence and also the tools needed to support the algorithms. Pre-processing any data and creating customized embeddings is a key step.

Amazon’s SageMaker, for instance, offers a powerful routine, Object2Vec, that converts data files into embeddings in a customizable way. The algorithm also learns as it progresses, adapting itself to the dataset in order to produce a consistent set of embedding vectors. SageMaker also supports several algorithms focused on unstructured data, such as BlazingText, for extracting useful embedding vectors from large text files.

Google’s TensorFlow project supports a Universal Sentence Encoder to provide a standard mechanism for converting text into embeddings. Their image models are also pre-trained to handle some standard objects and features found in images. Some users take these as a foundation for custom training on the particular objects in their own image sets.

Microsoft’s AI research team offers broad support for a number of universal embedding models for text. Their Multi-Task Deep Neural Network (MT-DNN) model, for example, aims to create strong models that are consistent even when working with language used in different domains. Their DeBERTa model uses more than 1.5 billion parameters to capture many of the intricacies of natural language. Earlier versions are also integrated with the AutomatedML tool for easier use.

IBM supports a variety of embedding algorithms, including many of the standards. Their Quantum Embedding algorithm was inspired by portions of the theory used to describe subatomic particles. It is designed to preserve logical concepts and structure during the process. Their MAX-Word approach uses the Swivel algorithm to preprocess text as part of the training for their Watson project.

How are startups targeting AI embeddings?

The startups tend to focus on narrow areas of the process so they can make a difference. Some work on optimizing the embedding algorithms themselves and others focus on particular domains or applied areas.

One area of great interest is building good search engines and databases for storing embeddings so it’s easy to find the closest matches. Companies like Pinecone.io, Milvus, Zilliz and Elastic are creating search engines that specialize in vector search so they can be applied to the vectors produced by embedding algorithms. They also simplify the embedding process, often using common open-source libraries and embedding algorithms for natural language processing.
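At its core, vector search means ranking stored embeddings by how close they are to a query vector. The brute-force sketch below uses cosine similarity over a tiny in-memory dictionary. It is not how any of the products named above work internally; they rely on specialized indexes to avoid comparing against every vector. The names and vectors here are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, index, k=1):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical stored embeddings, keyed by document name.
index = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [0.7, 0.7],
}

matches = nearest([0.9, 0.1], index, k=2)
print(matches)  # ['doc_a', 'doc_c']
```

Comparing the query against every stored vector is fine for a handful of items but grows linearly with the collection, which is the problem dedicated vector search engines set out to solve.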

Intent AI wants to unlock the power of network connections discovered in first-party marketing data. Their embedding algorithms help marketers apply AI to optimize the process of matching buyers to sellers.

H2O.ai builds an automated tool for helping businesses apply AI to their products. The tool contains a model creation pipeline with prebuilt embedding algorithms as a start. Scientists can also buy and sell model features used in embedding creation through their feature store.

The Rosette platform from Basis Technology offers a pre-trained statistical model for identifying and tagging entities in natural language. It integrates this model with an indexer and translation software to provide a pan-language solution.

Is there anything that cannot be embedded?

The process of converting data into the numerical inputs for an AI algorithm is generally reductive. That is, it reduces the amount of complexity and detail. When this destroys some of the necessary value in the data, the entire training process can fail or at least fail to capture all the rich variations.

In some cases, the embedding process may carry all the bias with it. The classic example of AI training failure is when the algorithm is asked to make a distinction between photos of two different types of objects. If one set of photos is taken on a sunny day and the other is taken on a cloudy day, the subtle differences in shading and coloration may be picked up by the AI training algorithm. If the embedding process passes along these differences, the entire experiment will produce an AI model that’s learned to focus on the lighting instead of the object.

There will also be some truly complex datasets that can’t be reduced to a simpler, more manageable form. In these cases, different algorithms that don’t use embeddings should be deployed.