Introduction
In the field of Natural Language Processing (NLP), text similarity plays a crucial role in applications such as information retrieval, recommendation systems, and document clustering. spaCy, a popular open-source NLP library, provides powerful tools for measuring text similarity. In this blog post, we will dive into spaCy's similarity functionality and explore how it can be used to solve real-world problems.
Word Vectors: Word vectors, sometimes also called word embeddings, are numerical representations of words in a multi-dimensional space. Their purpose is to give a computer system a way to understand a word, and they let us perform similarity matches quickly and reliably. To see this in action, let's walk through the following examples.
Before we start, we install the spaCy library and download the pre-trained "en_core_web_md" language model:
!pip install spacy
!python -m spacy download en_core_web_md
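Once the model is installed, every token processed by the pipeline carries a dense vector. As a quick sketch (assuming the "en_core_web_md" model installed above), we can inspect a token's vector like this:

import spacy

# Load the medium English pipeline, which ships with static word vectors
nlp = spacy.load("en_core_web_md")

token = nlp("dog")[0]
print(token.has_vector)     # True when the model has a vector for this word
print(token.vector.shape)   # (300,) for the md and lg English models
print(token.vector[:5])     # first few components of the embedding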
The code below shows how to find the words most closely related to "dog":
import spacy
import numpy as np

# Load the medium English pipeline, which includes word vectors
nlp = spacy.load("en_core_web_md")

your_word = "dog"
# Query the vector table for the 10 entries closest to the vector for "dog"
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
print(words)
Output:
['puppies', 'PUPPIES', 'CHINCHILLA', 'BREED', 'Breed', 'cattery', 'Poodles', 'CHINCHILLAS', 'POODLES', 'BREEDS']
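The most_similar call also returns the similarity scores (assigned to distances above but never printed). As a small sketch, we can pair each suggested word with its score:

# distances has shape (1, 10): one row of scores for our single query word
for word, score in zip(words, distances[0]):
    print(word, round(float(score), 3))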
If "en_core_web_lg" is used instead, the results are generally more accurate, since the large model ships with many more word vectors. Let's repeat the example with the "lg" model. First, we download "en_core_web_lg":
!python -m spacy download en_core_web_lg
import spacy
import numpy as np

# Reload the pipeline with the large model and run the same query
nlp = spacy.load("en_core_web_lg")

your_word = "dog"
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
print(words)
Output:
To demonstrate spaCy's similarity functionality, let's consider a simple example: we use the pre-trained word vectors provided by spaCy to calculate the similarity between two sentences. After installing spaCy and downloading the language model, we load the model and compute the similarity score:
import spacy

# Load the medium English pipeline with word vectors
nlp = spacy.load("en_core_web_md")

sentence1 = "I love pizza"
sentence2 = "I adore hamburgers"

doc1 = nlp(sentence1)
doc2 = nlp(sentence2)

# Compare the two documents using their averaged word vectors
similarity_score = doc1.similarity(doc2)
print(f"Similarity score: {similarity_score}")
In the code above, we load the pre-trained word vectors from the "en_core_web_md" model and process the two input sentences. We then calculate the similarity score between the two documents using the similarity() method. The score is typically a value between 0 and 1, where values close to 1 indicate high similarity and values close to 0 indicate low similarity.
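Under the hood, similarity() computes the cosine similarity between the two document vectors, each of which is the average of its token vectors. As a rough sketch (reusing doc1 and doc2 from above), we could reproduce the score manually:

import numpy as np

# Cosine similarity between the averaged word vectors of the two documents
v1, v2 = doc1.vector, doc2.vector
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine)  # should be close to the value returned by doc1.similarity(doc2)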
We get the output like this:
Here, we get a similarity score of 0.7720511834712694, which is a reasonably good score. One can also use the "en_core_web_lg" model for a more accurate result. For the same code, the "lg" model gives the following result:
Let's check another example:
# Two documents with exactly the same text
doc3 = nlp("I enjoy Apples.")
doc4 = nlp("I enjoy Apples.")
print(doc3, "<->", doc4, doc3.similarity(doc4))
In the code above, we check the similarity between two documents by giving the same input for both doc3 and doc4. Here we get the highest possible similarity score, 1.0, so we can conclude that the two documents are identical.
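For contrast, comparing two sentences about unrelated topics (a hypothetical example, not from the original output) should give a noticeably lower score:

doc5 = nlp("I enjoy Apples.")
doc6 = nlp("The stock market fell sharply today.")
# Unrelated topics, so the score should be well below 1.0
print(doc5, "<->", doc6, doc5.similarity(doc6))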
We can also find the similarity between individual words and spans. Since doc1 ("I love pizza") is too short for the slices below, we first create a new document to take a span and a token from:

# A longer sentence so we can slice out a two-word span and a single token
doc7 = nlp("I like salty fries and hamburgers.")
french_fries = doc7[2:4]   # the span "salty fries"
burgers = doc7[5]          # the token "hamburgers"
print(french_fries, "<->", burgers, french_fries.similarity(burgers))
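One caveat: a token that is not covered by the model's vector table gets an all-zero vector, and spaCy will warn that similarity scores involving it are not meaningful. A small sketch (the garbled word here is just a hypothetical out-of-vocabulary example) shows how to guard against this with has_vector:

word1 = nlp("hamburgers")[0]
word2 = nlp("qwrtzzz")[0]   # hypothetical out-of-vocabulary token

# Only compute similarity when both tokens actually have vectors
if word1.has_vector and word2.has_vector:
    print(word1, "<->", word2, word1.similarity(word2))
else:
    print("At least one token has no word vector; the score would not be meaningful.")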
Conclusion
spaCy provides a powerful and intuitive way to measure text similarity in NLP applications. Whether you need to compare individual tokens or entire documents, spaCy's similarity functionality can be a valuable tool in your NLP toolkit. By leveraging text similarity, you can unlock a wide range of applications, including information retrieval, recommendation systems, and text clustering.