Semantic similarity

Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content, as opposed to lexicographical similarity. Semantic similarity measures are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts, or instances, through a numerical description obtained by comparing the information that supports their meaning or describes their nature.^[1]^[2] The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations.^[3] For example, "car" is similar to "bus", but is also related to "road" and "driving".

Computationally, semantic similarity can be estimated by defining a topological similarity, using ontologies to define the distance between terms or concepts. For example, a naive metric for comparing concepts ordered in a partially ordered set and represented as nodes of a directed acyclic graph (e.g., a taxonomy) would be the length of the shortest path linking the two concept nodes. Based on text analyses, semantic relatedness between units of language (e.g., words, sentences) can also be estimated using statistical means such as a vector space model to correlate words and textual contexts from a suitable text corpus. Proposed semantic similarity and relatedness measures are evaluated in two main ways. The first is based on datasets designed by experts and composed of word pairs with estimated degrees of semantic similarity or relatedness. The second is based on integrating the measures into specific applications such as information retrieval, recommender systems, and natural language processing.
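The two estimation approaches described above can be illustrated with a minimal sketch. The taxonomy, the context labels, and the co-occurrence counts below are hypothetical toy data, not drawn from any real ontology or corpus. The function names are illustrative, and converting a path length d into a score as 1 / (1 + d) is one possible choice assumed here, not a formula given by the sources.

```python
from collections import deque
import math

# Hypothetical toy taxonomy: each concept maps to its "is a" parents.
IS_A = {
    "car": ["motor_vehicle"],
    "bus": ["motor_vehicle"],
    "motor_vehicle": ["vehicle"],
    "bicycle": ["vehicle"],
    "vehicle": [],
}


def shortest_path_length(taxonomy, a, b):
    """Return the number of edges on the shortest path between a and b,
    treating "is a" edges as undirected, or None if no path exists."""
    # Build an undirected adjacency map from the child -> parents mapping.
    neighbors = {}
    for child, parents in taxonomy.items():
        neighbors.setdefault(child, set())
        for parent in parents:
            neighbors[child].add(parent)
            neighbors.setdefault(parent, set()).add(child)
    # Breadth-first search finds the shortest path in an unweighted graph.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(taxonomy, a, b):
    """Map a path length d to a score in (0, 1] as 1 / (1 + d) (assumed form)."""
    d = shortest_path_length(taxonomy, a, b)
    return None if d is None else 1.0 / (1.0 + d)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length context vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical co-occurrence counts over the contexts
# ("drive", "passenger", "asphalt"); a real model would derive these from a corpus.
VECTORS = {
    "car": [4, 3, 1],
    "bus": [3, 4, 1],
    "road": [2, 0, 5],
}

if __name__ == "__main__":
    print(shortest_path_length(IS_A, "car", "bus"))      # 2 edges via motor_vehicle
    print(path_similarity(IS_A, "car", "bicycle"))
    print(cosine_similarity(VECTORS["car"], VECTORS["road"]))
```

The taxonomy contains only "is a" edges, so the path-based score reflects similarity in the narrow sense. The vector comparison can also give "car" and "road" a non-zero score, which corresponds to the broader notion of relatedness.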

[1] Harispe S.; Ranwez S.; Janaqi S.; Montmain J. (2015). "Semantic Similarity from Natural Language and Ontology Analysis". Synthesis Lectures on Human Language Technologies. 8 (1): 1–254. arXiv:1704.05295. doi:10.2200/S00639ED1V01Y201504HLT027. S2CID 17428739.
[2] Feng Y.; Bagheri E.; Ensan F.; Jovanovic J. (2017). "The state of the art in semantic relatedness: a framework for comparison". Knowledge Engineering Review. 32: 1–30. doi:10.1017/S0269888917000029. S2CID 52172371.
[3] Ballatore A.; Bertolotto M.; Wilson D.C. (2014). "An evaluative baseline for geo-semantic relatedness and similarity". GeoInformatica. 18 (4): 747–767. arXiv:1402.3371. Bibcode:2014GInfo..18..747B. doi:10.1007/s10707-013-0197-8. S2CID 17474023.
