
Exploring Text Similarity with spaCy in NLP

Introduction

In the field of Natural Language Processing (NLP), text similarity plays a crucial role in applications such as information retrieval, recommendation systems, and document clustering. spaCy, a popular open-source NLP library, provides powerful tools for measuring text similarity. In this blog post, we will dive into spaCy’s similarity functionality and explore how it can be used to solve real-world problems.

  1. Understanding spaCy: spaCy is a Python library designed to handle a wide range of NLP tasks efficiently. It offers pre-trained models and pipelines for tasks like part-of-speech tagging, named entity recognition, and dependency parsing. One of its standout features is the ability to compute document and token similarities.
  2. Computing Token Similarity: Token similarity means measuring how similar individual words or tokens in a text are. spaCy’s similarity functionality allows us to compare tokens based on their linguistic characteristics. Using pre-trained word vectors, spaCy calculates the similarity score between two tokens with cosine similarity (a short token-level example appears right after the setup below).
  3. Document Similarity: Document similarity involves measuring the similarity between entire documents. spaCy provides a straightforward approach: it builds on the token vectors, comparing documents by the vectors of their constituent tokens to obtain a similarity score for the entire document.
  4. Practical Use Cases:
     1. Information Retrieval: Text similarity can be used to build efficient search systems. By comparing user queries against indexed documents, we can retrieve the most relevant information (a small ranking sketch appears later in this post).
     2. Text Clustering: Document similarity can aid in clustering similar documents together. By measuring the similarity between documents, we can group them into meaningful clusters, enabling efficient organization and analysis.

  5. Word Vectors: Word vectors are numerical representations of words in a multidimensional space; they are sometimes also called word embeddings. The purpose of a word vector is to give a computer system a way to understand a word, which lets us perform similarity matches quickly and reliably. To understand this, we will look at the following examples.

Before we start, we install the spaCy library and download the pre-trained “en_core_web_md” language model:

!pip install spacy
!python -m spacy download en_core_web_md
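
Once the download finishes, a quick sanity check confirms the vectors are available and shows token-level similarity in action (a minimal sketch; the 300-dimension width assumes the md/lg models):

import spacy

nlp = spacy.load("en_core_web_md")

# In the md/lg models, each known token carries a 300-dimensional vector.
dog = nlp("dog")[0]
print(dog.has_vector)    # True
print(dog.vector.shape)  # (300,)

# Token-level similarity is the cosine similarity of the two word vectors.
cat = nlp("cat")[0]
print(dog.similarity(cat))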

The code below shows how to find the words most closely related to the word “dog”:

import spacy
import numpy as np

nlp = spacy.load("en_core_web_md")

your_word = "dog"

# Look up the vector for "dog" and find the 10 most similar entries
# in the model's vector table.
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]  # similarity score for each match
print(words)

Output:

['puppies', 'PUPPIES', 'CHINCHILLA', 'BREED', 'Breed', 'cattery', 'Poodles', 'CHINCHILLAS', 'POODLES', 'BREEDS']
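
most_similar also returns the similarity score for each neighbour, which the code above stores in distances. A small sketch continuing from that code pairs each word with its score:

# Pair each neighbouring word with its similarity score
# (continues from the variables defined in the code above).
for word, score in zip(words, distances[0]):
    print(word, round(float(score), 3))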

For the above code, using “en_core_web_lg” gives a more accurate result, since the lg model ships with a larger vector table.
Let’s run the same example with the “lg” model. To proceed, we first download “en_core_web_lg”:

!python -m spacy download en_core_web_lg

import spacy
import numpy as np

nlp = spacy.load("en_core_web_lg")

your_word = "dog"
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]  # similarity score for each match
print(words)

Output:

['doG', 'DOGS', 'PUPPY', 'PET', 'CAT', 'PUPPIES', 'CANINE', 'PuP', 'CATs', 'TERRIER']

To demonstrate spaCy’s similarity functionality, let’s consider a simple code example that uses the pre-trained word vectors to calculate the similarity between two sentences.

After installing spaCy and downloading the language model, we load the model and compute the similarity score:

import spacy

nlp = spacy.load("en_core_web_md")

# Note: "hambrugers" is misspelled; watch how this affects the
# token-level example later in this post.
doc1 = nlp("I like salty fries and hambrugers.")
doc2 = nlp("Fast food tastes very good.")

similarity_score = doc1.similarity(doc2)
print(doc1, "<->", doc2, similarity_score)

In the above code, we load the pre-trained word vectors from the “en_core_web_md” model and process the two input sentences.

We then calculate the similarity score between the two sentences using the similarity() method. The score typically falls between 0 and 1, where values close to 1 indicate high similarity and values close to 0 indicate low similarity.

We get output like this:

Output:
I like salty fries and hambrugers. <-> Fast food tastes very good. 0.7720511834712694

Here, we get a similarity score of 0.7720511834712694, which indicates fairly strong similarity. For an even more accurate result, one can use the “en_core_web_lg” model.

For the same code, the “lg” model gives the result below:

Output:
I like salty fries and hambrugers. <-> Fast food tastes very good. 0.7971436757973674
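
Under the hood, Doc.similarity compares whole-document vectors. For the md and lg models, a document’s vector is by default the average of its token vectors, which this small sketch verifies (it assumes the nlp pipeline loaded earlier):

import numpy as np

doc = nlp("I like salty fries and hambrugers.")

# Doc.vector defaults to the average of the token vectors;
# Doc.similarity is the cosine similarity of these averaged vectors.
avg = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, avg))  # True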

Let’s check another example:

doc3 = nlp("I enjoy Apples.")
doc4 = nlp("I enjoy Apples.")
print(doc3, "<->", doc4, doc3.similarity(doc4))

In the above code, we check the similarity between two documents, doc3 and doc4, that contain the same text.

Output:
I enjoy Apples. <-> I enjoy Apples. 1.0

Here we get the highest possible similarity score, 1.0, from which we can conclude that the two documents are identical.
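
This kind of document comparison is exactly what the information-retrieval use case mentioned earlier relies on. Here is a minimal ranking sketch (the query and candidate sentences are illustrative, not from the examples above):

# Rank candidate documents by their similarity to a query (illustrative data).
query = nlp("best pizza in town")
candidates = [
    nlp("Top-rated pizza restaurants nearby"),
    nlp("How to train a neural network"),
    nlp("Cheap hamburger places downtown"),
]
for doc in sorted(candidates, key=query.similarity, reverse=True):
    print(round(query.similarity(doc), 3), doc.text)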

We can also find the similarity between individual words and spans. Continuing with doc1 (“I like salty fries and hambrugers.”) from the earlier example:

french_fries = doc1[2:4]  # the span "salty fries"
burgers = doc1[5]         # the token "hambrugers"
print(french_fries, "<->", burgers, french_fries.similarity(burgers))

Output:
salty fries <-> hambrugers 0.0

The score here is 0.0 because “hambrugers” is a misspelling that is not in the model’s vocabulary: out-of-vocabulary tokens get an all-zero vector, and the cosine similarity with a zero vector is 0.
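
A quick way to diagnose such cases is to check each token’s vector status (a small sketch; assumes the nlp pipeline loaded above):

# Misspelled or otherwise unknown words have no usable vector,
# which silently drags similarity scores toward 0.
for token in nlp("I like salty fries and hambrugers."):
    print(token.text, token.has_vector, token.is_oov)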

Conclusion

spaCy provides a powerful and intuitive way to measure text similarity in NLP applications. Whether you need to compare individual tokens or entire documents, spaCy’s similarity functionality can be a valuable tool in your NLP toolkit. By leveraging text similarity, you can unlock a wide range of applications, including information retrieval, recommendation systems, and text clustering.
