Document Extraction Using Llama-Parse and Llama-Index


In this blog, we will walk through a practical example of document extraction using Llama-Parse, a tool built for parsing different document types, and Llama-Index, a framework for indexing and querying those documents. Specifically, we’ll use these tools to extract and query data from a PDF file.

Whether you’re working with structured or unstructured text, Llama-Parse helps extract content and convert it into usable formats, while Llama-Index enables you to build an efficient index for querying the extracted content.

What is Llama-Parse?

Llama-Parse is a Python library designed to extract text from various document formats. It supports multiple file types such as PDF, DOCX, TXT, and more. The parsed text can be returned in different formats, such as plain text or markdown, depending on your needs. This makes it very flexible and suitable for a wide range of document processing tasks.

What is Llama-Index?

Llama-Index (formerly known as GPT Index) is a data framework that helps in indexing and querying large documents. It allows you to load documents, process them, and then use advanced querying methods to retrieve information efficiently. This framework is commonly used in projects where users need to build a search engine or FAQ system from unstructured text data.

Supported File Formats

When using Llama-Parse and Llama-Index, you can extract text from a variety of file formats. Some of the most common formats include:

  • PDF: Extracts text from PDF documents.
  • TXT: Handles plain text files with simple text extraction.
  • DOCX: Handles Word documents for structured text extraction.
  • CSV: Parses tabular data from CSV files.

In this example, we will focus on extracting content from a PDF file.

Use Case: Extract and Query Content from a PDF Document

Let’s break down the steps needed to extract and query text from a PDF document using Llama-Parse and Llama-Index.

Step 1: Install the Required Libraries

Before we begin, ensure that the necessary Python libraries are installed. If you haven’t done so already, you can install the required packages with pip:

Shell command:

pip install llama-index llama-parse python-dotenv

Step 2: Set Up Your Environment

You’ll need to load environment variables, which are often used to store configuration details such as API keys. In this example, we’ll use the python-dotenv library to load these variables from a .env file. Create a .env file in your project root containing your Llama Cloud API key:

LLAMA_CLOUD_API_KEY=your_api_key_here

Python code:

from dotenv import load_dotenv
load_dotenv()
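
As an optional sanity check (not part of the library’s documented setup), you can confirm the key was actually picked up before going any further:

Python code:

```python
import os

# Assumes load_dotenv() has already run (see the snippet above). Checking the
# key up front surfaces a misconfigured .env file immediately, instead of a
# confusing failure deep inside a later parse call.
key_is_set = bool(os.getenv("LLAMA_CLOUD_API_KEY"))
if not key_is_set:
    print("Warning: LLAMA_CLOUD_API_KEY is not set; check your .env file")
```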

Step 3: Initialize the Llama-Parse Parser

Now, we initialize the LlamaParse object, specifying the desired output format. In this case, we’ll extract content as markdown (you can also choose plain text depending on your needs).

Python code:

from llama_parse import LlamaParse
parser = LlamaParse(result_type="markdown")  # "markdown" or "text"

Step 4: Load the PDF Document

We use the SimpleDirectoryReader from Llama-Index to load documents from a specified directory or file. We create a dictionary file_extractor that maps file extensions to the corresponding parser (in our case, .pdf files will be parsed using the LlamaParse parser).

Python code:

from llama_index.core import SimpleDirectoryReader
file_extractor = {".pdf": parser}  # map .pdf files to LlamaParse
documents = SimpleDirectoryReader(input_files=['/path/to/your/pdf_file.pdf'], file_extractor=file_extractor).load_data()

This will read the PDF file and parse its content using the LlamaParse parser.

Step 5: Create an Index from the Extracted Documents

Next, we use Llama-Index to create an index from the documents we just extracted. The VectorStoreIndex is a suitable choice for this task as it converts the parsed documents into a vectorized format that can be efficiently queried.

Python code:

from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)

Step 6: Query the Extracted Document

Now that we have indexed the content of the document, we can use the query_engine to search for specific content. For example, we might want to extract all the contents from the PDF file.

Python code:

query_engine = index.as_query_engine()
query = "get the contents from the pdf file"
response = query_engine.query(query)
print(response)

The query_engine.query(query) call runs the query against the indexed content, retrieves the most relevant chunks, and synthesizes a response from them.

Step 7: Result

When you run the code, you will get a response that contains the extracted content from the PDF file, based on the query. The results might include the entire text content or specific sections, depending on how you structure the query.
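
Beyond printing the response as a string, llama-index response objects also carry the source passages the answer was built from. The sketch below uses a minimal stand-in class to illustrate the two attributes you would typically inspect; the attribute names (response and source_nodes) follow llama-index’s Response API, but the data here is invented for illustration:

Python code:

```python
from dataclasses import dataclass, field

# Minimal stand-ins for llama-index's Response and source-node objects,
# used only to illustrate the shape of the result.
@dataclass
class SourceNode:
    score: float
    text: str

@dataclass
class QueryResponse:
    response: str
    source_nodes: list = field(default_factory=list)

# Invented example data, shaped like a query result.
resp = QueryResponse(
    response="The PDF covers quarterly revenue and growth figures.",
    source_nodes=[SourceNode(score=0.87, text="Quarterly revenue grew 12%.")],
)

print(resp.response)            # the synthesized answer text
for node in resp.source_nodes:  # the chunks the answer was drawn from
    print(f"{node.score:.2f}  {node.text}")
```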

What Can Be Extracted?

When using Llama-Parse, various types of data can be extracted from documents, such as:

  • Text: The primary content of the document (body text).
  • Tables: Data represented in table formats (from CSV, DOCX, or PDFs).
  • Metadata: Information such as author, creation date, and modification date (available for formats like PDF).
  • Images: While Llama-Parse doesn’t directly extract images, some advanced parsing techniques can be used to retrieve images or their metadata from PDF files.
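To make the metadata point concrete, here is the rough shape a parsed document takes: extracted text paired with a metadata mapping. The field names below are illustrative assumptions only, since the exact keys depend on the file format and parser configuration:

Python code:

```python
# Illustrative only: the exact metadata keys vary by file format and parser
# settings. A parsed document pairs the extracted text with its metadata.
parsed_doc = {
    "text": "1. Introduction\nThis report summarizes quarterly results.",
    "metadata": {
        "file_name": "report.pdf",
        "creation_date": "2024-11-07",
    },
}

print(parsed_doc["metadata"]["file_name"])
```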

Advanced Use Cases

  • Multiple File Formats: You can extend the file_extractor dictionary to include other formats like .txt, .docx, or .csv for broader document handling.
  • Custom Querying: Once indexed, the query engine can be customized to handle more complex queries, including semantic searches or filtering based on metadata.
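
The file_extractor extension can be sketched as follows. Real code would map each extension to a LlamaParse instance (as in Step 4); here placeholder strings stand in so the mapping itself is the focus:

Python code:

```python
# Placeholder values stand in for LlamaParse parser instances (see Step 4).
pdf_parser = "llama-parse (markdown)"
text_parser = "llama-parse (text)"

# One parser instance can serve several extensions, and different
# extensions can be routed to differently configured parsers.
file_extractor = {
    ".pdf": pdf_parser,
    ".docx": pdf_parser,
    ".txt": text_parser,
    ".csv": text_parser,
}

print(sorted(file_extractor))
```

For custom querying, as_query_engine() accepts parameters such as similarity_top_k to control how many chunks are retrieved per query, for example index.as_query_engine(similarity_top_k=3).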

Conclusion

Llama-Parse and Llama-Index together provide a powerful solution for document extraction and querying. By following the steps outlined above, you can easily parse a variety of document formats, index the extracted content, and query it efficiently. Whether you’re building a search engine, knowledge base, or document analysis tool, this framework is a great choice for handling unstructured data from different file types.

 


Neha Vittal Annam
