What is artificial intelligence (AI) clustering? How it identifies patterns
AI clustering is the machine learning (ML) process of organizing data into subgroups with similar characteristics or elements. Clustering algorithms tend to work well in environments where the answer doesn’t have to be perfect; it only has to be similar or close enough to be an acceptable match. AI clustering can be particularly effective at identifying patterns in unsupervised learning. Some common applications are in human resources, data analytics, recommender systems, and social science.
Data scientists, statisticians, and AI scientists use clustering algorithms to search for answers that are close to other answers. They first use a training data set to define the problem and then search for potential solutions that are similar to those generated with the training data.
One challenge is defining “closeness” because the desired answer is usually generated with the training data. When the data has multiple dimensions, data scientists can also guide the algorithm by assigning weights to the different data columns in the equation used to define proximity. It is not uncommon to work with several different functions that define proximity.
When defining the proximity function, also called the similarity metric or distance measure, much of the work is storing the data in a way that it can be searched quickly. Some database designers create special layers to simplify that search. An important part of many algorithms is the distance metric which defines how far apart two data points may be.
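As a rough illustration of assigning weights to data columns, the sketch below uses a weighted Euclidean distance. The function name `weighted_distance`, the sample records, and the weights are assumptions made for this example; real systems may use very different proximity functions.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two equal-length numeric records."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("a, b and weights must have the same length")
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical records: (age in decades, purchases per month)
alice = (3.0, 4.0)
bob = (6.0, 0.0)

equal = weighted_distance(alice, bob, (1.0, 1.0))      # both columns count equally
age_muted = weighted_distance(alice, bob, (0.0, 1.0))  # the age column is ignored
```

Setting a weight to zero removes a column from the comparison entirely, while larger weights make differences in that column count for more.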
Another approach involves turning the problem on its head and deliberately looking for the worst possible match. It is suitable for problems such as anomaly detection in security applications, where the goal is to identify data elements that do not fit in with the others.
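One simple way to look for the worst match is to score each point by how far it sits from its nearest neighbors. The sketch below is only one of many possible scoring rules; the function name `knn_outlier_scores`, the choice of `k`, and the sample readings are assumptions for illustration.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors.

    Assumes k is smaller than the number of points.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical two-dimensional readings; the last one sits far from the rest.
readings = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (9.0, 9.0)]
scores = knn_outlier_scores(readings)
worst_fit = readings[scores.index(max(scores))]  # the point that fits least well
```

A security application would then flag points whose score exceeds some threshold, which has to be tuned for the data at hand.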
What are some examples of clustering algorithms?
Scientists and mathematicians have created different algorithms to detect different types of clusters. Choosing the right solution for a specific problem is a common challenge.
The algorithms are not always definitive. Scientists can use methods that fall into only one classification, or they can use hybrid algorithms that use techniques from several categories.
Categories of clustering algorithms include the following:
- Bottom-up: These algorithms, also known as agglomerative or hierarchical, start by matching each data element with its nearest neighbor. Then the pairs themselves are merged. The clusters grow, and the algorithm continues until a threshold for the number of clusters or the distance between them is reached.
- Divisive: These algorithms are the reverse of agglomerative ones. They start with all points in one cluster and then look for a way to split them into two smaller clusters. This often means looking for a plane or other feature that will cleanly divide the group into separate parts.
- K-means: This popular approach looks for k different clusters by first randomly assigning the points to k different groups. The mean of each cluster is calculated, and then each point is examined to see if it is closest to the mean of its own cluster. If not, it is moved to the cluster whose mean is closest. The means are recalculated, and the results typically converge after several iterations.
- K-medoids: This is similar to k-means, but the center of each cluster is a medoid, an actual data point chosen to minimize the total distance to the other members, rather than a calculated mean.
- Fuzzy: Each point can be a member of multiple clusters computed with any type of algorithm. This can be useful when some points are equidistant from each center.
- Grid: The algorithms start with a grid defined in advance by the scientists to cut the data space into parts. Points are assigned to clusters based on which grid block they fall into.
- Wavelet: The points are first compressed or transformed with a function called a wavelet. The clustering algorithm is then applied to the compressed or transformed version of the data, not the original one.
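The k-means loop described in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: the helper names, the `seed` and `max_iter` parameters, the fallback centers for groups that start empty, and the sample data are all choices made for this example.

```python
import math
import random

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    """Return (labels, centers). Assumes k is no larger than len(points)."""
    rng = random.Random(seed)
    # Step 1: randomly assign every point to one of k groups.
    labels = [rng.randrange(k) for _ in points]
    # Fallback centers, used only for groups that happen to be empty.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: recompute the mean of each non-empty group.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = mean(members)
        # Step 3: move each point to the group whose mean is closest.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # no point moved, so the loop has converged
            break
        labels = new_labels
    return labels, centers

# Hypothetical data with two visibly separate groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(data, k=2)
```

Because the starting assignment is random, different seeds can produce different label numbering and, on less cleanly separated data, different clusters. Practical implementations often rerun the loop several times and keep the best result.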
Note: Database companies often use the word “cluster” in a different way. The word can also describe a group of machines that work together to store data and answer queries. In that context, the clustering algorithms make decisions about which machines will handle the workload. To make matters more confusing, these data systems will sometimes also apply AI clustering algorithms to classify data elements.
How are clustering algorithms used in specific applications?
Clustering algorithms are deployed as part of a wide variety of technologies. Data scientists rely on algorithms to help with classification and sorting.
For example, many applications that work with people can be more successful with better clustering algorithms. Schools may want to place students in class divisions based on their talents and abilities. Clustering algorithms will group students with similar interests and needs together.
Some businesses want to separate their potential customers into different categories so that they can give the customers more appropriate service. Neophyte buyers can be offered extensive assistance so they can understand the products and the options. Experienced customers can be taken to the offers immediately, and perhaps get special prices that have worked for similar buyers.
There are many other examples from a diverse range of industries, such as manufacturing, banking and shipping. All rely on the algorithms to separate the workload into smaller subgroups that can receive similar treatment. All these options depend heavily on data collection.
How do distance metrics define the clustering algorithms?
If a group is defined by the distances between data elements, measuring the distance is an essential part of the process. Many algorithms rely on standard ways to calculate the distance, but some rely on different formulas with different advantages.
Many find the idea of a “distance” itself confusing. We use the term so often to measure how far we have to travel in a room or around the world that it can feel strange to think of two data points—such as describing a user’s preferences for ice cream or paint color—as separated by any distance. But the word is a natural way of describing a number that measures how close the elements can be to each other.
Scientists and mathematicians generally rely on formulas that satisfy what they call the “triangle inequality.” That is, the distance between points A and B plus the distance between B and C is greater than or equal to the distance between A and C. When the formula guarantees this, the process gains more consistency. Some also rely on stricter definitions, such as “ultrametrics,” that offer stronger guarantees. Strictly speaking, clustering algorithms don’t need to insist on this rule, because any formula that returns a number will work, but the results are generally better when the rule holds.
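The triangle inequality can be checked directly on sample points. The sketch below compares ordinary Euclidean distance with squared Euclidean distance, a formula that returns a number but does not always satisfy the rule. The function names and test points are assumptions for illustration, and passing on a sample shows only that no violation was found there, not that the rule holds in general.

```python
import math
from itertools import permutations

def euclidean(a, b):
    return math.dist(a, b)

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def satisfies_triangle_inequality(points, distance, tol=1e-12):
    """True if d(a, c) <= d(a, b) + d(b, c) for every ordered triple of points."""
    return all(
        distance(a, c) <= distance(a, b) + distance(b, c) + tol
        for a, b, c in permutations(points, 3)
    )

# Three hypothetical points on a line, one unit apart.
line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
euclid_ok = satisfies_triangle_inequality(line, euclidean)           # True
squared_ok = satisfies_triangle_inequality(line, squared_euclidean)  # False
```

With squared distances, going from the first point to the last costs 4, while the two hops through the middle point cost only 1 + 1 = 2, so the rule fails.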
How do big companies approach AI clustering?
The statistics, data science and AI services offered by leading technology providers include many of the most common clustering algorithms. The algorithms are implemented in the languages that form the foundation of many of these platforms, which is often Python. Vendors include:
- SageMaker: Amazon’s turnkey solution for building AI models supports a number of approaches, such as K-means clustering. It can be tested in notebooks and deployed after the software builds the model.
- Google includes a variety of clustering algorithms that can be deployed, including density-based, centroid-based, and hierarchical algorithms. Its Colaboratory notebooks provide a good opportunity to explore the potential before deploying an algorithm.
- Microsoft’s Azure tools, like its Machine Learning designer, present all the major clustering algorithms in a form open to experimentation. Its systems aim to handle many of the configuration details for building a pipeline that transforms data into models.
- Oracle also offers clustering technology in all its AI and data science applications. It has also built algorithms into its flagship database so that the clusters can be built within the datastore without exporting the data.
How do challengers and startups deal with AI clustering?
Established data specialists and a range of startups are challenging the big vendors by offering clustering algorithms as part of broader data analytics packages and AI tools.
Teradata, Snowflake and Databricks are leading niche companies focused on helping enterprises manage the often relentless flow of data by building data lakes or data warehouses. Their machine learning tools support some of the standard clustering algorithms so data analysts can start classification work as soon as the data enters the system.
Startups like Chinese firm Zilliz, with its Milvus open-source vector database, and Pinecone, with its SaaS vector database, are gaining traction by offering efficient ways to search for matches, which can be very useful in clustering applications.
Some also bundle algorithms with tools focused on specific vertical segments. They preset the models and algorithms to work well with the types of problems common in that segment. Zest.ai and Affirm are two examples of startups building models to guide loans. They don’t sell algorithms directly, but rely on algorithms’ decisions to guide their product.
A number of companies use clustering algorithms to segment their customers and provide more direct and personalized solutions. You.com is a search engine company that relies on custom algorithms to provide users with personalized recommendations and search results. Observe AI aims to improve call centers by helping companies recognize the opportunities to offer more personalized options.
Is there anything AI clustering can’t do?
As with all AI, the success of clustering algorithms often depends on the quality and suitability of the data used. If the numbers yield tight clusters with large gaps in between, the clustering algorithm will find them and use them to classify new data with relative success.
The problems occur when there are not tight clusters, or the data elements end up in some gap where they are relatively equidistant between clusters. The solutions are often unsatisfactory because there is no easy way to choose one group over another. One might be slightly closer according to the distance metric, but that might not be the answer people want.
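One way to see how ambiguous an assignment is involves comparing the distances to the two nearest cluster centers. The sketch below is an illustrative assumption rather than a standard API: the function name `assignment_margin` and the sample centers are made up for this example.

```python
import math

def assignment_margin(point, centers):
    """Return (index of nearest center, gap between the two smallest distances).

    Assumes at least two centers.
    """
    dists = sorted((math.dist(point, c), i) for i, c in enumerate(centers))
    (d1, nearest), (d2, _) = dists[0], dists[1]
    return nearest, d2 - d1

centers = [(0.0, 0.0), (10.0, 0.0)]  # hypothetical cluster centers
clear = assignment_margin((1.0, 0.0), centers)       # (0, 8.0): a confident match
borderline = assignment_margin((5.0, 0.0), centers)  # (0, 0.0): a coin flip
```

A margin near zero means the point could just as reasonably belong to either cluster, so the label returned is close to arbitrary.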
In many cases, the algorithms are not smart enough or flexible enough to accept a partial answer or one that chooses multiple classifications. Although there are many real-world examples of people or things that cannot be easily classified, computer algorithms often have one field that can only accept one answer.
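Where a single-answer field is too rigid, one alternative is to store a membership weight for every cluster. The sketch below uses the inverse-distance membership formula from fuzzy c-means; the function name `soft_memberships`, the default `fuzziness` of 2.0, and the sample centers are assumptions for illustration.

```python
import math

def soft_memberships(point, centers, fuzziness=2.0):
    """Membership weights in [0, 1] that sum to 1. Requires fuzziness > 1."""
    dists = [math.dist(point, c) for c in centers]
    if 0.0 in dists:  # the point sits exactly on one or more centers
        hits = dists.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (fuzziness - 1.0)
    inverse = [d ** -power for d in dists]  # closer centers get larger weights
    total = sum(inverse)
    return [w / total for w in inverse]

centers = [(0.0, 0.0), (10.0, 0.0)]  # hypothetical cluster centers
midway = soft_memberships((5.0, 0.0), centers)   # equal claim from both clusters
leaning = soft_memberships((2.0, 0.0), centers)  # mostly the first cluster
```

A point halfway between the centers receives equal weights, which records the ambiguity instead of hiding it behind a single label.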
However, the biggest problems arise when the data is too scattered and there are no clearly defined groups. The algorithms may still run and generate results, but the answers will appear random and the findings will not be coherent.
Sometimes it is possible to improve the clusters or make them more distinct by adjusting the distance metric. Adding different weights for some fields or using a different formula can emphasize some parts of the data enough to make the clusters more clearly defined. But if these distinctions are artificial, the users may not be satisfied with the results.
VentureBeat’s mission is to be a digital town square for technical decision makers to learn and transact about transformative enterprise technology.