Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content[citation needed] as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.[1][2] The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations.[3] For example, "car" is similar to "bus", but is also related to "road" and "driving".
Computationally, semantic similarity can be estimated by defining a topological similarity, by using ontologies to define the distance between terms/concepts. For example, a naive metric for the comparison of concepts ordered in a partially ordered set and represented as nodes of a directed acyclic graph (e.g., a taxonomy), would be the shortest-path linking the two concept nodes. Based on text analyses, semantic relatedness between units of language (e.g., words, sentences) can also be estimated using statistical means such as a vector space model to correlate words and textual contexts from a suitable text corpus. The evaluation of the proposed semantic similarity / relatedness measures are evaluated through two main ways. The former is based on the use of datasets designed by experts and composed of word pairs with semantic similarity / relatedness degree estimation. The second way is based on the integration of the measures inside specific applications such as information retrieval, recommender systems, natural language processing, etc.