Automatic similarity detection and clustering of data


An algorithm was created which identifies the number of unique clusters in a dataset and assigns the data to the clusters. A cluster is defined as a group of data which share similar characteristics. Similarity is measured using the dot product between two vectors where the data are input as vectors. Unlike other clustering algorithms such as K-means, no knowledge of the number of clusters is required. This allows for an unbiased analysis of the data. The automatic cluster detection algorithm (ACD), is executed in two phases: an averaging phase and a clustering phase. In the averaging phase, the number of unique clusters is detected. In the clustering phase, data are matched to the cluster to which they are most similar. The ACD algorithm takes a matrix of vectors as an input and outputs a 2D array of the clustered data. The indices of the output correspond to a cluster, and the elements in each cluster correspond to the position of the datum in the dataset. Clusters are vectors in N-dimensional space, where N is the length of the input vectors which make up the matrix. The algorithm is distributed, increasing computational efficiency

Cyber Sensing 2017