From d4b2e16752b719ccbd92256a2ca476f993a03ad8 Mon Sep 17 00:00:00 2001 From: bilgeyucel Date: Mon, 23 Sep 2024 12:31:03 +0300 Subject: [PATCH] Add the vibe checker cookbook to the website through index.toml --- index.toml | 5 + ...haystack_instagram_comments_analysis.ipynb | 286 +++++++++--------- 2 files changed, 148 insertions(+), 143 deletions(-) diff --git a/index.toml b/index.toml index 6236756..eae5030 100644 --- a/index.toml +++ b/index.toml @@ -233,3 +233,8 @@ topics = ["RAG"] title = "Advanced RAG: Query Decomposition and Reasoning" notebook = "query_decomposition.ipynb" topics = ["Advanced Retrieval", "RAG", "Agents"] + +[[cookbook]] +title = "Analyze Your Instagram Comments’ Vibe with Apify and Haystack" +notebook = "apify_haystack_instagram_comments_analysis.ipynb" +topics = ["Prompting", "Data Scraping"] diff --git a/notebooks/apify_haystack_instagram_comments_analysis.ipynb b/notebooks/apify_haystack_instagram_comments_analysis.ipynb index 4b5f785..98dafde 100644 --- a/notebooks/apify_haystack_instagram_comments_analysis.ipynb +++ b/notebooks/apify_haystack_instagram_comments_analysis.ipynb @@ -1,21 +1,10 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3" - }, - "language_info": { - "name": "python" - } - }, "cells": [ { "cell_type": "markdown", + "metadata": { + "id": "t1BeKtSo7KzI" + }, "source": [ "# Analyze Your Instagram Comments’ Vibe with Apify and Haystack\n", "\n", @@ -23,41 +12,35 @@ "Idea: Bilge Yücel ([deepset.ai](https://github.com/bilgeyucel))\n", "\n", "Ever wondered if your Instagram posts are truly vibrating among your audience?\n", - "In this tutorial, we'll show you how to use the [Instagram Comment Scraper](https://apify.com/apify/instagram-comment-scraper) Actor to download comments from your instagram post and analyze them using a large language model. 
All performed within the Haystack ecosystem using the [apify-haystack](https://github.com/apify/apify-haystack/tree/main) integration.\n", + "In this cookbook, we'll show you how to use the [Instagram Comment Scraper](https://apify.com/apify/instagram-comment-scraper) Actor to download comments from your Instagram post and analyze them using a large language model. All of this is performed within the Haystack ecosystem using the [apify-haystack](https://github.com/apify/apify-haystack/tree/main) integration.\n", "\n", "We'll start by using the Actor to download the comments, clean the data with the [DocumentCleaner](https://docs.haystack.deepset.ai/docs/documentcleaner) and then use the [OpenAIGenerator](https://docs.haystack.deepset.ai/docs/openaigenerator) to discover the vibe of the Instagram posts." - ], - "metadata": { - "id": "t1BeKtSo7KzI" - } + ] }, { "cell_type": "markdown", - "source": [ - "# Install dependencies" - ], "metadata": { "id": "-7zY6NIsCj_5" - } + }, + "source": [ + "# Install dependencies" + ] }, { "cell_type": "code", - "source": [ - "!pip install apify-haystack==0.1.4 haystack-ai" - ], + "execution_count": 1, "metadata": { - "id": "r5AJeMOE1Cou", "colab": { "base_uri": "https://localhost:8080/" }, - "outputId": "63663073-ccc5-4306-ae18-e2720d937407", - "collapsed": true + "collapsed": true, + "id": "r5AJeMOE1Cou", + "outputId": "63663073-ccc5-4306-ae18-e2720d937407" }, - "execution_count": 1, "outputs": [ { - "output_type": "stream", "name": "stdout", + "output_type": "stream", "text": [ "Collecting apify-haystack==0.1.4\n", " Downloading apify_haystack-0.1.4-py3-none-any.whl.metadata (5.4 kB)\n", @@ -143,38 +126,34 @@ "Successfully installed apify-client-1.8.0 apify-haystack-0.1.4 apify-shared-1.1.2 backoff-2.2.1 h11-0.14.0 haystack-ai-2.5.0 haystack-experimental-0.1.1 httpcore-1.0.5 httpx-0.27.2 jiter-0.5.0 lazy-imports-0.3.1 monotonic-1.6 openai-1.43.0 posthog-3.6.3 python-dotenv-1.0.1\n" ] } + ], + "source": [ + "!pip install apify-haystack==0.1.4 
haystack-ai" ] }, { "cell_type": "markdown", + "metadata": { + "id": "h6MmIG9K1HkK" + }, "source": [ "## Set up the API keys\n", "\n", "You need to have an Apify account and obtain [APIFY_API_TOKEN](https://docs.apify.com/platform/integrations/api).\n", "\n", "You also need an OpenAI account and [OPENAI_API_KEY](https://platform.openai.com/docs/quickstart)\n" - ], - "metadata": { - "id": "h6MmIG9K1HkK" - } + ] }, { "cell_type": "code", - "source": [ - "import os\n", - "from getpass import getpass\n", - "\n", - "os.environ[\"APIFY_API_TOKEN\"] = getpass(\"Enter YOUR APIFY_API_TOKEN\")\n", - "os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter YOUR OPENAI_API_KEY\")" - ], + "execution_count": 2, "metadata": { - "id": "yiUTwYzP36Yr", "colab": { "base_uri": "https://localhost:8080/" }, + "id": "yiUTwYzP36Yr", "outputId": "d79acadc-bd18-44d3-c812-9b40c51d5124" }, - "execution_count": 2, "outputs": [ { "name": "stdout", @@ -184,10 +163,20 @@ "Enter YOUR OPENAI_API_KEY··········\n" ] } + ], + "source": [ + "import os\n", + "from getpass import getpass\n", + "\n", + "os.environ[\"APIFY_API_TOKEN\"] = getpass(\"Enter YOUR APIFY_API_TOKEN\")\n", + "os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter YOUR OPENAI_API_KEY\")" ] }, { "cell_type": "markdown", + "metadata": { + "id": "HQzAujMc505k" + }, "source": [ "## Use the Haystack Pipeline to Orchestrate Instagram Comments Scraper, Comments Cleanup, and Analysis Using LLM\n", "\n", @@ -217,27 +206,27 @@ "]\n", "```\n", "We will convert this JSON to a Haystack Document using the `dataset_mapping_function` as follows" - ], - "metadata": { - "id": "HQzAujMc505k" - } + ] }, { "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "OZ0PAVHI_mhn" + }, + "outputs": [], "source": [ "from haystack import Document\n", "\n", "def dataset_mapping_function(dataset_item: dict) -> Document:\n", " return Document(content=dataset_item.get(\"text\"), meta={\"ownerUsername\": dataset_item.get(\"ownerUsername\")})" - ], - "metadata": { - 
"id": "OZ0PAVHI_mhn" - }, - "execution_count": 3, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "xtFquWflA5kf" + }, "source": [ "Once we understand the Actor output format and have the `dataset_mapping_function`, we can set up the Haystack component to enable interaction between Haystack and Apify.\n", "\n", @@ -247,18 +236,20 @@ "- i) when creating the `ApifyDatasetFromActorCall` class \n", "- ii) as arguments in a pipeline. \n", "- iii) as arguments to the `run()` function when calling `ApifyDatasetFromActorCall.run()` \n", - "- iv) as a combination of `i)` and `ii)` as shown in this tutorial.\n", + "- iv) as a combination of `i)` and `ii)` as shown in this cookbook.\n", "\n", "For a detailed description of the input parameters, visit the [Instagram Comments Scraper page](https://apify.com/apify/instagram-comment-scraper).\n", "\n", "Let's set up the `ApifyDatasetFromActorCall`" - ], - "metadata": { - "id": "xtFquWflA5kf" - } + ] }, { "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "SUWXxT4y55lH" + }, + "outputs": [], "source": [ "from apify_haystack import ApifyDatasetFromActorCall\n", "\n", @@ -267,25 +258,49 @@ " run_input={\"resultsLimit\": 50},\n", " dataset_mapping_function=dataset_mapping_function,\n", ")" - ], - "metadata": { - "id": "SUWXxT4y55lH" - }, - "execution_count": 4, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "BxHbPUipjrvS" + }, "source": [ "\n", "Next, we'll define a `prompt` for the LLM and connect all the components in the [Pipeline](https://docs.haystack.deepset.ai/docs/pipelines)." 
- ], - "metadata": { - "id": "BxHbPUipjrvS" - } + ] }, { "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "gdN7baGrA_lR", + "outputId": "b73b1217-3082-4da7-c824-b8671eeef78d" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "🚅 Components\n", + " - loader: ApifyDatasetFromActorCall\n", + " - cleaner: DocumentCleaner\n", + " - prompt_builder: PromptBuilder\n", + " - llm: OpenAIGenerator\n", + "🛤️ Connections\n", + " - loader.documents -> cleaner.documents (list[Document])\n", + " - cleaner.documents -> prompt_builder.documents (List[Document])\n", + " - prompt_builder.prompt -> llm.prompt (str)" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "from haystack import Pipeline\n", "from haystack.components.builders import PromptBuilder\n", @@ -318,126 +333,100 @@ "pipe.connect(\"loader\", \"cleaner\")\n", "pipe.connect(\"cleaner\", \"prompt_builder\")\n", "pipe.connect(\"prompt_builder\", \"llm\")" - ], - "metadata": { - "id": "gdN7baGrA_lR", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "b73b1217-3082-4da7-c824-b8671eeef78d" - }, - "execution_count": 5, - "outputs": [ - { - "output_type": "execute_result", - "data": { - "text/plain": [ - "\n", - "🚅 Components\n", - " - loader: ApifyDatasetFromActorCall\n", - " - cleaner: DocumentCleaner\n", - " - prompt_builder: PromptBuilder\n", - " - llm: OpenAIGenerator\n", - "🛤️ Connections\n", - " - loader.documents -> cleaner.documents (list[Document])\n", - " - cleaner.documents -> prompt_builder.documents (List[Document])\n", - " - prompt_builder.prompt -> llm.prompt (str)" - ] - }, - "metadata": {}, - "execution_count": 5 - } ] }, { "cell_type": "markdown", - "source": [ - "After that, we can run the pipeline. The execution and analysis will take approximately 30-60 seconds." 
- ], "metadata": { "id": "GxDNZ7LqAsWV" - } + }, + "source": [ + "After that, we can run the pipeline. The execution and analysis will take approximately 30-60 seconds." + ] }, { "cell_type": "code", - "source": [ - "# \\@tiffintech on How to easily keep up with tech?\n", - "url = \"https://www.instagram.com/p/C_a9jcRuJZZ/\"\n", - "\n", - "res = pipe.run({\"loader\": {\"run_input\": {\"directUrls\": [url]}}})\n", - "res.get(\"llm\", {}).get(\"replies\", [\"No response\"])[0]\n", - "\n" - ], + "execution_count": 6, "metadata": { - "id": "qfaWI6BaAko9", "colab": { "base_uri": "https://localhost:8080/", "height": 72 }, + "id": "qfaWI6BaAko9", "outputId": "25e33c1b-f8b9-4b6d-a3d9-0eb54365b820" }, - "execution_count": 6, "outputs": [ { - "output_type": "execute_result", "data": { - "text/plain": [ - "'Overall, the Instagram comments on the post reflect positive energy, excitement, and high engagement. The use of emojis such as 😂, 😍, 🙌, ❤️, and 🔥 indicate enthusiasm and excitement. Many comments express gratitude, appreciation, and eagerness to explore the resources mentioned in the post. There are also interactions between users tagging each other and discussing their interest in the topic, further increasing engagement. Overall, the post seems to be generating high energy and positive vibes from the audience.'" - ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" - } + }, + "text/plain": [ + "'Overall, the Instagram comments on the post reflect positive energy, excitement, and high engagement. The use of emojis such as 😂, 😍, 🙌, ❤️, and 🔥 indicate enthusiasm and excitement. Many comments express gratitude, appreciation, and eagerness to explore the resources mentioned in the post. There are also interactions between users tagging each other and discussing their interest in the topic, further increasing engagement. 
Overall, the post seems to be generating high energy and positive vibes from the audience.'" + ] }, + "execution_count": 6, "metadata": {}, - "execution_count": 6 + "output_type": "execute_result" } + ], + "source": [ + "# \\@tiffintech on How to easily keep up with tech?\n", + "url = \"https://www.instagram.com/p/C_a9jcRuJZZ/\"\n", + "\n", + "res = pipe.run({\"loader\": {\"run_input\": {\"directUrls\": [url]}}})\n", + "res.get(\"llm\", {}).get(\"replies\", [\"No response\"])[0]\n", + "\n" ] }, { "cell_type": "markdown", - "source": [ - "Now, let's us run the same analysis. This time with the @kamalaharris post" - ], "metadata": { "id": "jPfgD939E2TW" - } + }, + "source": [ + "Now, let's run the same analysis, this time with the @kamalaharris post" + ] }, { "cell_type": "code", - "source": [ - "# \\@kamalaharris on Affordable Care Act\n", - "url = \"https://www.instagram.com/p/C_RgBzogufK/\"\n", - "\n", - "res = pipe.run({\"loader\": {\"run_input\": {\"directUrls\": [url]}}})\n", - "res.get(\"llm\", {}).get(\"replies\", [\"No response\"])[0]" - ], + "execution_count": 7, "metadata": { - "id": "mCFb8KZOEkpW", "colab": { "base_uri": "https://localhost:8080/", "height": 72 }, + "id": "mCFb8KZOEkpW", "outputId": "f6b61f27-59f6-4898-b202-1838f8fd00f2" }, - "execution_count": 7, "outputs": [ { - "output_type": "execute_result", "data": { - "text/plain": [ - "'The comments on this post are highly polarized, with strong opinions expressed on both sides of the political spectrum. There is a mix of negative and positive sentiment, with some users expressing excitement and support for the current administration (e.g., emojis like 💙💙💙💙, Kamala 👏👏) while others criticize past policies and individuals associated with them (e.g., Trump 2024, lack of education). Overall, the engagement on this post is high, with users actively debating and defending their viewpoints. 
Despite the divisive nature of the comments, the post is generating a high level of energy and engagement.'" - ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" - } + }, + "text/plain": [ + "'The comments on this post are highly polarized, with strong opinions expressed on both sides of the political spectrum. There is a mix of negative and positive sentiment, with some users expressing excitement and support for the current administration (e.g., emojis like 💙💙💙💙, Kamala 👏👏) while others criticize past policies and individuals associated with them (e.g., Trump 2024, lack of education). Overall, the engagement on this post is high, with users actively debating and defending their viewpoints. Despite the divisive nature of the comments, the post is generating a high level of energy and engagement.'" + ] }, + "execution_count": 7, "metadata": {}, - "execution_count": 7 + "output_type": "execute_result" } + ], + "source": [ + "# \\@kamalaharris on Affordable Care Act\n", + "url = \"https://www.instagram.com/p/C_RgBzogufK/\"\n", + "\n", + "res = pipe.run({\"loader\": {\"run_input\": {\"directUrls\": [url]}}})\n", + "res.get(\"llm\", {}).get(\"replies\", [\"No response\"])[0]" ] }, { "cell_type": "markdown", + "metadata": { + "id": "45YxSr6v__fI" + }, "source": [ "The analysis shows that the first post about [How to easily keep up with tech?](https://www.instagram.com/p/C_a9jcRuJZZ/) is vibrating with high energy:\n", "\n", @@ -448,10 +437,21 @@ "*The comments on this post are generating negative energy but with high engagement. There's a strong focus on political opinions, particularly concerning insurance companies, the Affordable Care Act, Trump, and Biden. Many comments express frustration, criticism, and disagreement, with some users discussing party affiliations or support for specific politicians. There are also mentions of misinformation and conspiracy theories. 
Engagement is high, with numerous comment threads delving into various political issues. Overall, this post is vibrating with intense energy, driven by political opinions, disagreements, and active discussions.*\n", "\n", "💡 You might receive slightly different results, as the comments may have changed since the last run" - ], - "metadata": { - "id": "45YxSr6v__fI" - } + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" } - ] -} \ No newline at end of file + }, + "nbformat": 4, + "nbformat_minor": 0 +}