
🦜🔗 LangChain

Learn how to integrate Apify with LangChain to feed vector databases and LLMs with data crawled from the web.


For more information on LangChain, visit its documentation.

In this example, we'll use the Website Content Crawler Actor, which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs and extract text content from the web pages. Then we feed the documents into a vector index and answer questions from it.

This example shows how to integrate Apify with LangChain in Python. If you prefer JavaScript, you can follow the same steps in the JavaScript LangChain documentation.

Before we start with the integration, we need to install all dependencies:

pip install apify-client langchain openai chromadb

After successful installation of all dependencies, we can start writing code.

First, import os, Document, VectorstoreIndexCreator, and ApifyWrapper into your source code:

import os

from langchain.document_loaders.base import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.utilities import ApifyWrapper

Find your Apify API token and OpenAI API key, and set them as environment variables:

os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"
os.environ["APIFY_API_TOKEN"] = "Your Apify API token"
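
Hardcoding the keys is fine for a quick test. As an alternative, the sketch below assumes the two variables were already exported in your shell and only checks that they are present before any API call is made. The helper name missing_env_vars is hypothetical and is not part of LangChain or the Apify client.

```python
import os

# Environment variables this example relies on.
REQUIRED_VARS = ("OPENAI_API_KEY", "APIFY_API_TOKEN")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names from `names` that are unset or empty in `environ`.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Example check against an explicit mapping instead of the real environment.
example_env = {"OPENAI_API_KEY": "sk-placeholder"}
missing = missing_env_vars(environ=example_env)
print("Missing:", ", ".join(missing) if missing else "none")
```

In a real script you would call missing_env_vars() without arguments and stop early if the returned list is not empty.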

Run the Actor, wait for it to finish, and fetch its results from the Apify dataset into a LangChain document loader.

Note that if you already have some results in an Apify dataset, you can load them directly using ApifyDatasetLoader, as shown in this notebook. In that notebook, you'll also find the explanation of the dataset_mapping_function, which is used to map fields from the Apify dataset records to LangChain Document fields.

apify = ApifyWrapper()

# Run the crawler and wrap its dataset in a LangChain document loader
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://python.langchain.com/en/latest/"}],
        "maxCrawlPages": 10,
        "crawlerType": "cheerio",
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)

NOTE: The Actor call can take some time, because it crawls the LangChain documentation website.
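
The dataset_mapping_function above converts each dataset record into a document. The following sketch shows the same mapping as a named function, so you can try it offline on sample records. It uses a small stand-in Document dataclass instead of the LangChain class so that it runs without any dependencies. The sample records are hypothetical and only assume the "text" and "url" fields used in this example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document, so the sketch runs without LangChain."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def map_item(item):
    """Map one dataset record to a Document, tolerating missing or empty text."""
    return Document(
        page_content=item.get("text") or "",
        metadata={"source": item.get("url", "")},
    )


# Hypothetical records shaped like the crawler output used in this example.
items = [
    {"url": "https://example.com/a", "text": "Hello"},
    {"url": "https://example.com/b", "text": None},
]
docs = [map_item(item) for item in items]

for doc in docs:
    print(repr(doc.page_content), doc.metadata["source"])
```

The `or ""` fallback mirrors the lambda in the example: a record whose text is None still yields a document with empty content.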

Initialize the vector index from the crawled documents:

index = VectorstoreIndexCreator().from_loaders([loader])

And finally, query the vector index:

query = "What is LangChain?"
result = index.query_with_sources(query)

print(result["answer"])
print(result["sources"])
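
If you want to ask several questions against the same index, one option is a small loop around query_with_sources. The sketch below assumes only what the example shows, namely that the result is a dictionary with "answer" and "sources" keys. Both ask_all and FakeIndex are hypothetical names; FakeIndex exists only so the helper can be tried without API keys.

```python
def ask_all(index, questions):
    """Query the index once per question and collect answers with their sources."""
    results = []
    for question in questions:
        result = index.query_with_sources(question)
        results.append(
            {
                "question": question,
                "answer": result["answer"].strip(),
                "sources": result["sources"],
            }
        )
    return results


class FakeIndex:
    """Hypothetical stand-in for the real index, used only to try the helper offline."""

    def query_with_sources(self, question):
        return {
            "answer": f" Stub answer to: {question}\n",
            "sources": "https://example.com",
        }


demo = ask_all(FakeIndex(), ["What is LangChain?", "What is Apify?"])
for row in demo:
    print(row["question"], "->", row["answer"], "|", row["sources"])
```

With the real index from this example, you would pass index in place of FakeIndex().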

If you want to test the whole example, you can simply create a new file, langchain_integration.py, and copy the whole code into it.

import os

from langchain.document_loaders.base import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.utilities import ApifyWrapper

os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"
os.environ["APIFY_API_TOKEN"] = "Your Apify API token"

apify = ApifyWrapper()

# Run the crawler and wrap its dataset in a LangChain document loader
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://python.langchain.com/en/latest/"}],
        "maxCrawlPages": 10,
        "crawlerType": "cheerio",
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)
index = VectorstoreIndexCreator().from_loaders([loader])
query = "What is LangChain?"
result = index.query_with_sources(query)

print(result["answer"])
print(result["sources"])

To run it, you can use the following command: python langchain_integration.py

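After running the code, you should see output similar to the following. The exact wording can differ between runs, because the answer is generated by the language model: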

LangChain is a framework for developing applications powered by language models. It provides standard, extendable interfaces, external integrations, and end-to-end implementations for off-the-shelf use. It also integrates with other LLMs, systems, and products to create a vibrant and thriving ecosystem.

https://python.langchain.com/en/latest/

LangChain is a standard interface through which you can interact with a variety of large language models (LLMs). It provides modules you can use to build language model applications. It also provides chains and agents with memory capabilities.

Resources