In this blog, we will walk through a practical example of document extraction using Llama-Parse, a tool built for parsing different document types, and Llama-Index, a framework for indexing and querying those documents. Specifically, we’ll use these tools to extract and query data from a PDF file.
Whether you’re working with structured or unstructured text, Llama-Parse helps extract content and convert it into usable formats, while Llama-Index enables you to build an efficient index for querying the extracted content.
What is Llama-Parse?
Llama-Parse is a Python library designed to extract text from various document formats. It supports multiple file types such as PDF, DOCX, TXT, and more. The parsed text can be returned in different formats, such as plain text or markdown, depending on your needs. This makes it very flexible and suitable for a wide range of document processing tasks.
What is Llama-Index?
Llama-Index (formerly known as GPT Index) is a data framework that helps in indexing and querying large documents. It allows you to load documents, process them, and then use advanced querying methods to retrieve information efficiently. This framework is commonly used in projects where users need to build a search engine or FAQ system from unstructured text data.
Supported File Formats
When using Llama-Parse and Llama-Index, you can extract text from a variety of file formats; some of the most common include PDF, DOCX, and TXT.
In this example, we will focus on extracting content from a PDF file.
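Before diving in, it helps to see the core idea: files are routed to a parser based on their extension. The sketch below illustrates this with plain Python and hypothetical stand-in parsers (in the real workflow, LlamaParse fills this role):

```python
from pathlib import Path

# Hypothetical stand-in parsers, purely for illustration.
def parse_pdf(path):
    return f"parsed PDF: {path}"

def parse_text(path):
    return f"parsed text: {path}"

# Map file extensions to the parser that should handle them,
# mirroring the file_extractor dictionary used later in Step 4.
file_extractor = {
    ".pdf": parse_pdf,
    ".txt": parse_text,
}

def extract(path):
    suffix = Path(path).suffix.lower()
    parser = file_extractor.get(suffix, parse_text)  # default to plain text
    return parser(path)

print(extract("report.pdf"))  # routed to parse_pdf
print(extract("notes.txt"))   # routed to parse_text
```

This is the same dispatch pattern Llama-Index applies when you hand it a file_extractor mapping.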
Use Case: Extract and Query Content from a PDF Document
Let’s break down the steps needed to extract and query text from a PDF document using Llama-Parse and Llama-Index.
Step 1: Install the Required Libraries
Before we begin, ensure that the necessary Python libraries are installed. If you haven’t done so already, you can install the required packages with pip:
Shell command:
pip install llama-index llama-parse python-dotenv
Step 2: Set Up Your Environment
You’ll need to load environment variables, which are often used to store configuration details like API keys or other settings. In this example, we’ll use the python-dotenv library to load these variables from a .env file.
Create a .env file in your project directory containing your Llama Cloud API key:
LLAMA_CLOUD_API_KEY=your_api_key
Python code:
from dotenv import load_dotenv

load_dotenv()  # load variables from the .env file into the environment
Step 3: Initialize the Llama-Parse Parser
Now, we initialize the LlamaParse object, specifying the desired output format. In this case, we’ll extract content as markdown (you can also choose plain text depending on your needs).
Python code:
from llama_parse import LlamaParse

parser = LlamaParse(result_type="markdown")  # "markdown" or "text"
Step 4: Load the PDF Document
We use the SimpleDirectoryReader from Llama-Index to load documents from a specified directory or file. We create a dictionary file_extractor that maps file extensions to the corresponding parser (in our case, .pdf files will be parsed using the LlamaParse parser).
Python code:
from llama_index.core import SimpleDirectoryReader

file_extractor = {".pdf": parser}  # map .pdf files to LlamaParse
documents = SimpleDirectoryReader(
    input_files=['/path/to/your/pdf_file.pdf'],
    file_extractor=file_extractor,
).load_data()
This will read the PDF file and parse its content using the LlamaParse parser.
Step 5: Create an Index from the Extracted Documents
Next, we use Llama-Index to create an index from the documents we just extracted. VectorStoreIndex is a suitable choice for this task: it converts each parsed document into a vector embedding that can be searched efficiently by similarity.
Python code:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
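To see why a vector index is useful, consider how retrieval by similarity works. The toy sketch below uses bag-of-words counts and cosine similarity purely for intuition; Llama-Index uses learned embeddings and a real vector store, not this:

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": bag-of-words term counts (real systems use learned vectors).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "invoices are due at the end of the month",
    "the parser extracts tables from pdf files",
    "cats enjoy sitting in cardboard boxes",
]
# "Index" each document alongside its vector.
index = [(d, embed(d)) for d in docs]

def query(q):
    qv = embed(q)
    # Return the document whose vector is most similar to the query's.
    return max(index, key=lambda pair: cosine(qv, pair[1]))[0]

print(query("extract pdf tables"))
```

Here the query about PDF tables retrieves the second document, because its vector overlaps the query's the most; this nearest-vector lookup is the core operation behind the query engine in the next step.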
Step 6: Query the Extracted Document
Now that we have indexed the content of the document, we can use the query_engine to search for specific content. For example, we might want to extract all the contents from the PDF file.
Python code:
query_engine = index.as_query_engine()

query = "get the contents from the pdf file"
response = query_engine.query(query)
print(response)
The query_engine.query(query) function performs the query on the indexed content and returns the most relevant results.
Step 7: Result
When you run the code, you will get a response that contains the extracted content from the PDF file, based on the query. The results might include the entire text content or specific sections, depending on how you structure the query.
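Putting steps 2 through 6 together, a complete script might look like the sketch below. The file path and query string are placeholders, and a valid LLAMA_CLOUD_API_KEY in your .env file is assumed:

```python
from dotenv import load_dotenv
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

load_dotenv()  # expects LLAMA_CLOUD_API_KEY in .env

# Parse the PDF into markdown with LlamaParse.
parser = LlamaParse(result_type="markdown")
documents = SimpleDirectoryReader(
    input_files=["/path/to/your/pdf_file.pdf"],
    file_extractor={".pdf": parser},
).load_data()

# Index the parsed documents and run a query.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("Summarize the main points of this document."))
```

Swapping in your own file path and query is all that's needed to adapt this to another document.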
What Can Be Extracted?
When using Llama-Parse, various types of data can be extracted from documents, including body text, headings, lists, and tables; with the markdown result type, much of this structure is preserved in the output.
Conclusion
Llama-Parse and Llama-Index together provide a powerful solution for document extraction and querying. By following the steps outlined above, you can easily parse a variety of document formats, index the extracted content, and query it efficiently. Whether you’re building a search engine, knowledge base, or document analysis tool, this framework is a great choice for handling unstructured data from different file types.
Neha Vittal Annam