From 072fcc49cdae751359c3834f03362e40abfeb4c2 Mon Sep 17 00:00:00 2001 From: Abdur Rahman Date: Fri, 6 Oct 2023 18:43:40 +0530 Subject: [PATCH 1/2] Fixed Jupyter Notebook link --- contrib/notebooks/Whirlwindtour.ipynb | 685 +++++++++----------------- 1 file changed, 244 insertions(+), 441 deletions(-) diff --git a/contrib/notebooks/Whirlwindtour.ipynb b/contrib/notebooks/Whirlwindtour.ipynb index 98e1fb833..7f5dd9156 100644 --- a/contrib/notebooks/Whirlwindtour.ipynb +++ b/contrib/notebooks/Whirlwindtour.ipynb @@ -1,452 +1,255 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "f8466df2-007d-4d77-928c-6cca46abc97d", - "metadata": {}, - "source": [ - "\n", - " \"View\n", - "\n", - "\n", - " \"View\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "ba3d7197-d56d-40ce-94bc-9e8366668808", - "metadata": {}, - "source": [ - "### A whirlwind tour\n", - "\n", - "> ℹ️ **NOTE:** This tour is primarily targeted to Linux and macOS users. Though qsv works on Windows, the tour\n", - "assumes basic knowledge of command-line piping and redirection, and uses other command-line tools (curl, tee, head, etc.)\n", - "that are not installed by default on Windows.\n", - "\n", - "Let's say you're playing with some data from the\n", - "[Data Science Toolkit](https://github.com/petewarden/dstkdata), which contains\n", - "several CSV files. Maybe you're interested in the population counts of each\n", - "city in the world. So grab the 124MB, 2.7M row CSV file and start examining it:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "0495a205-d3d3-44a3-b41c-f381aee6d3c2", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1 Country\n", - "2 City\n", - "3 AccentCity\n", - "4 Region\n", - "5 Population\n", - "6 Latitude\n", - "7 Longitude\n" - ] + "metadata": { + "kernelspec": { + "name": "python", + "display_name": "Python (Pyodide)", + "language": "python" + }, + "language_info": { + "codemirror_mode": { + "name": "python", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8" } - ], - "source": [ - "!qsv headers wcp.csv" - ] - }, - { - "cell_type": "markdown", - "id": "c4b8013e-d107-43bc-88aa-443d56f7c70f", - "metadata": {}, - "source": [ - "The next thing you might want to do is get an overview of the kind of data that\n", - "appears in each column. The `stats` command will do this for you:" - ] }, - { - "cell_type": "markdown", - "id": "e0c644af-8de2-4b84-8aa0-ac550654c2f4", - "metadata": {}, - "source": [ - "!qsv stats wcp.csv | qsv table" - ] - }, - { - "cell_type": "markdown", - "id": "72360a9b-fba1-478f-b0d4-5943df9ac1c3", - "metadata": {}, - "source": [ - "Wow! That was fast! It took just 1.3 seconds to compile all that.[^1] One reason for qsv's speed\r\n", - "is that ***it mainly works in \"streaming\" mode*** - computing statistics as it \"streams\"\r\n", - "the CSV file line by line. This also means it can gather statistics on arbitrarily large files,\r\n", - "as it does not have to load the entire file into memory.[^2]\r\n", - "\r\n", - "But can we get more summary statistics? What's the variance, the modes, the distribution (quartiles), \r\n", - "and the cardinality of the data? No problem. That's why `qsv stats` has an `--everything` option to \r\n", - "compute these more \"expensive\" stats. 
Expensive - as these extended statistics can only be computed at \r\n", - "the cost of loading the entire file into memory.\r\n" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "af448479-b6f5-4ba3-89b7-c3d124b7c5c5", - "metadata": {}, - "outputs": [ + "nbformat_minor": 5, + "nbformat": 4, + "cells": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "field type sum min max range min_length max_length mean stddev variance nullcount sparsity mad lower_outer_fence lower_inner_fence q1 q2_median q3 iqr upper_inner_fence upper_outer_fence skewness cardinality mode mode_count mode_occurrences antimode antimode_count antimode_occurrences\n", - "Country String ad zw 2 2 0 0 231 ru 1 176934 cc,nf,pn,tf,tk 5 1\n", - "City String al lusayli �ykkvibaer 1 87 0 0 2008182 san jose 1 313 *PREVIEW: al lusayli, amiri-ye `olya, bab el ahmar, baqeri `olya, chahar tang-e bala, chefe muzaman... 1741957 1\n", - "AccentCity String Al Lusayli ���zler 1 87 0 0 2025934 San Antonio 1 307 *PREVIEW: Al Lusayli, Amiri-ye `Olya, Baqeri `Olya, B�b el Ahmar, Chahar Tang-e Bala, Chefe Muzam... 1762562 1\n", - "Region String 00 Z4 0 2 4 0 392 04 1 143900 *PREVIEW: H1,H6,H9,I1,I4,K5,K6,L1,L7,M8 17 1\n", - "Population Integer 2290536125 7 31480498 31480491 0 8 48730.6639 308414.0419 95119221210.8846 2652350 0.9826 8327 -69766.5 -33018 3730.5 10879 28229.5 24499 64978 101726.5 0.4164 28460 1 2652350 *PREVIEW: 100,10001,10002,100023,10003,10005,10007,10008,10009,100105 18970 1\n", - "Latitude Float 76585211.1978 -54.9333333 82.483333 137.4167 1 12 28.3717 21.9384 481.2922 0 0 15.2667 -84.7706 -35.9076 12.9553 33.8667 45.5306 32.5753 94.3935 143.2564 -0.2839 255133 50.8 1 1128 *PREVIEW: -.020556,-.025833,-.035,-.035833,-.047222,-.051944,-.054722,-.068889,-.071111,-.100833 79002 1\n", - "Longitude Float 75976506.6643 -179.9833333 180.0 359.9833 1 14 28.1462 62.4729 3902.8581 0 0 30.9583 -199.3667 -98.4917 2.3833 26.8803 69.6333 67.25 170.5083 271.3833 0.2715 407568 23.1 1 590 *PREVIEW: -.185556,-.208056,-.246944,-.248333,-.263889,-.271667,-.276389,-.285,-.287222,-.288889 162640 1\n" - ] - } - ], - "source": [ - "!qsv stats wcp.csv --everything | qsv table" - ] - }, - { - "cell_type": "markdown", - "id": "c8623cc4-1426-4214-bd4e-8a6be9a68210", - "metadata": {}, - "source": [ - "> ℹ️ **NOTE:** The `qsv table` command takes any CSV data and formats it into aligned columns\n", - "using [elastic tabstops](https://github.com/BurntSushi/tabwriter). You'll\n", - "notice that it even gets alignment right with respect to Unicode characters.\n" - ] - }, - { - "cell_type": "markdown", - "id": "8f5a7d90-fbb0-4fe9-9866-223f81ed5fc2", - "metadata": {}, - "source": [ - "So, this command took 3.22 seconds to run on my machine, but we can speed\r\n", - "it up by creating an index and re-running the command:" - ] - }, - { - "cell_type": "markdown", - "id": "ab1d059c-bb8b-42c7-aa04-f64f32e5f825", - "metadata": {}, - "source": [ - "```\n", - "qsv index wcp.csv\n", - "qsv stats wcp.csv --everything | qsv table\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "6c2c1a14-e61f-4951-9f1a-bad27840a017", - "metadata": {}, - "source": [ - "Which cuts it down to 1.95 seconds - 1.65x faster! (And creating the 21.6mb index took 0.27 seconds. \n", - "What about the first `stats` without `--everything`? 
From 1.3 seconds to 0.16 seconds with an index - 8.25x faster!)" - ] - }, - { - "cell_type": "markdown", - "id": "eb7e899a-175f-4bef-882a-7a1dae15ef1a", - "metadata": {}, - "source": [ - "Notably, the same type of \"statistics\" command in another\n", - "[CSV command line toolkit](https://csvkit.readthedocs.io/)\n", - "takes about 10 seconds to produce a *subset* of statistics on the same data set. [Visidata](https://visidata.org)\n", - "takes much longer - ~1.5 minutes to calculate a *subset* of these statistics with its Describe sheet. \n", - "Even python [pandas'](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) \n", - "`describe(include=\"all\"))` took 12 seconds to calculate a *subset* of qsv's \"streaming\" statistics.[^3]\n", - "\n", - "This is another reason for qsv's speed. Creating an index accelerated statistics gathering as it enables \n", - "***multithreading & fast I/O***.\n", - "\n", - "**For multithreading** - running `stats` with an index was 8.25x faster because it divided the file into \n", - "16 equal chunks[^1] with ~170k records each, then running stats on each chunk in parallel across 16 \n", - "logical processors and merging the results in the end. It was \"only\" 8x, and not 16x faster as there is \n", - "some overhead involved in multithreading. \n", - "\n", - "**For fast I/O** - let's say you wanted to grab the last 10 records:" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "d8f63f5d-9107-4fbd-af81-614591318690", - "metadata": {}, - "outputs": [ + "cell_type": "markdown", + "source": "\n \"View\n\n\n \"View\n\n", + "metadata": {}, + "id": "f8466df2-007d-4d77-928c-6cca46abc97d" + }, { - "name": "stdout", - "output_type": "stream", - "text": [ - "2,699,354\n" - ] - } - ], - "source": [ - "!qsv count --human-readable wcp.csv" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "ae7a1bbf-e129-4f28-9001-3c7643f41ebb", - "metadata": {}, - "outputs": [ + "cell_type": "markdown", + "source": "### A whirlwind tour\n\n> ℹ️ **NOTE:** This tour is primarily targeted to Linux and macOS users. Though qsv works on Windows, the tour\nassumes basic knowledge of command-line piping and redirection, and uses other command-line tools (curl, tee, head, etc.)\nthat are not installed by default on Windows.\n\nLet's say you're playing with some data from the\n[Data Science Toolkit](https://github.com/petewarden/dstkdata), which contains\nseveral CSV files. Maybe you're interested in the population counts of each\ncity in the world. 
So grab the 124MB, 2.7M row CSV file and start examining it:\n", + "metadata": {}, + "id": "ba3d7197-d56d-40ce-94bc-9e8366668808" + }, { - "name": "stdout", - "output_type": "stream", - "text": [ - "Country City AccentCity Region Population Latitude Longitude\n", - "zw zibalonkwe Zibalonkwe 06 -19.8333333 27.4666667\n", - "zw zibunkululu Zibunkululu 06 -19.6666667 27.6166667\n", - "zw ziga Ziga 06 -19.2166667 27.4833333\n", - "zw zikamanas village Zikamanas Village 00 -18.2166667 27.95\n", - "zw zimbabwe Zimbabwe 07 -20.2666667 30.9166667\n", - "zw zimre park Zimre Park 04 -17.8661111 31.2136111\n", - "zw ziyakamanas Ziyakamanas 00 -18.2166667 27.95\n", - "zw zizalisari Zizalisari 04 -17.7588889 31.0105556\n", - "zw zuzumba Zuzumba 06 -20.0333333 27.9333333\n", - "zw zvishavane Zvishavane 07 79876 -20.3333333 30.0333333\n" - ] - } - ], - "source": [ - "!qsv slice wcp.csv --start -10 | qsv table" - ] - }, - { - "cell_type": "markdown", - "id": "0f10b533-dc18-4cea-8936-3a58fff526fc", - "metadata": {}, - "source": [ - "`qsv count` took 0.006 seconds and `qsv slice`, 0.017 seconds! These commands are *instantaneous* \n", - "with an index because for `count` - the index already precomputed the record count, and with `slice`,\n", - "*only the sliced portion* has to be parsed - because an index allowed us to jump directly to that \n", - "part of the file. It didn't have to scan the entire file to get the last 10 records. For comparison,\n", - "without an index, it took 0.25 (41x slower) and 0.66 (39x slower) seconds respectively.\n", - "\n", - "> ℹ️ **NOTE:** Creating/updating an index itself is extremely fast as well. If you want\n", - "qsv to automatically create and update indices, set the environment var `QSV_AUTOINDEX`.\n", - "\n", - "Okay, okay! Let's switch gears and stop obsessing over how fast :rocket: qsv is... let's go back to exploring :mag_right:\n", - "the data set.\n", - "\n", - "Hmmmm... the Population column has a lot of null values. How pervasive is that?\n", - "First, let's take a look at 10 \"random\" rows with `sample`. We use the `--seed` parameter\n", - "so we get a reproducible random sample. And then, let's display only the Country,\n", - "AccentCity and Population columns with the `select` command." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "ad0597f0-22c8-4e2b-a5fa-f8cbe5392c91", - "metadata": {}, - "outputs": [ + "cell_type": "code", + "source": "!qsv headers wcp.csv", + "metadata": {}, + "execution_count": 1, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": "1 Country\n\n2 City\n\n3 AccentCity\n\n4 Region\n\n5 Population\n\n6 Latitude\n\n7 Longitude\n" + } + ], + "id": "0495a205-d3d3-44a3-b41c-f381aee6d3c2" + }, { - "name": "stdout", - "output_type": "stream", - "text": [ - "Country AccentCity Population\n", - "co Guamal 5026\n", - "af Jalat Kalay \n", - "ro Caragheorghevici \n", - "mm Padauk \n", - "tr Vakifbeloren \n", - "mx Chupadero de Alcaparra \n", - "ir Gangar Kola \n", - "tg Folo \n", - "pe Huancas \n", - "pk Dilawanwala \n" - ] - } - ], - "source": [ - "!qsv sample --seed 42 10 wcp.csv | qsv select Country,AccentCity,Population | qsv table" - ] - }, - { - "cell_type": "markdown", - "id": "4c037b4c-0896-4de3-9237-6cd86a548087", - "metadata": {}, - "source": [ - "Whoops! The sample we got don't have population counts. It's quite pervasive. Exactly how many cities have empty (NULL) population counts?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "d3090c4d-18dd-4dbc-987e-bcf4c751df4b", - "metadata": {}, - "outputs": [ + "cell_type": "markdown", + "source": "The next thing you might want to do is get an overview of the kind of data that\nappears in each column. The `stats` command will do this for you:", + "metadata": {}, + "id": "c4b8013e-d107-43bc-88aa-443d56f7c70f" + }, { - "name": "stdout", - "output_type": "stream", - "text": [ - "field value count\n", - "Country ru 176934\n", - "Country us 141989\n", - "Country cn 117508\n", - "City san jose 313\n", - "City san antonio 310\n", - "City santa rosa 288\n", - "AccentCity San Antonio 307\n", - "AccentCity Santa Rosa 288\n", - "AccentCity Santa Cruz 268\n", - "Region 04 143900\n", - "Region 02 127736\n", - "Region 03 105455\n", - "Population (NULL) 2652350\n", - "Population 2310 12\n", - "Population 983 11\n", - "Latitude 50.8 1128\n", - "Latitude 50.95 1076\n", - "Latitude 50.6 1043\n", - "Longitude 23.1 590\n", - "Longitude 23.2 586\n", - "Longitude 23.05 575\n" - ] - } - ], - "source": [ - "!qsv frequency wcp.csv --limit 3 | qsv table" - ] - }, - { - "cell_type": "markdown", - "id": "571d99d7-c867-4c74-bb6f-abd9fcf9f2a6", - "metadata": {}, - "source": [ - "(The `qsv frequency` command builds a frequency table for each column in the\n", - "CSV data. This one only took 1.8 seconds.)\n", - "\n", - "So it seems that most cities do not have a population count associated with\n", - "them at all (2,652,350 to be exact). No matter — we can adjust our previous \n", - "command so that it only shows rows with a population count:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "60f3bb63-aff3-4023-bdb1-c938016abc81", - "metadata": {}, - "outputs": [ + "cell_type": "markdown", + "source": "!qsv stats wcp.csv | qsv table", + "metadata": {}, + "id": "e0c644af-8de2-4b84-8aa0-ac550654c2f4" + }, { - "name": "stderr", - "output_type": "stream", - "text": [ - "'tee' is not recognized as an internal or external command,\n", - "operable program or batch file.\n" - ] - } - ], - "source": [ - "!qsv search --select Population '[0-9]' wcp.csv | qsv sample --seed 42 10 | qsv select Country,AccentCity,Population | tee sample.csv | qsv table" - ] - }, - { - "cell_type": "markdown", - "id": "81c813a8-1296-4df7-a3f4-2ddded7fd4fd", - "metadata": {}, - "source": [ - "> ℹ️ **NOTE:** The `tee` command reads from standard input and writes \n", - "to both standard output and one or more files at the same time. We do this so \n", - "we can create the `sample.csv` file we need for the next step, and pipe the \n", - "same data to the `qsv table` command.
Why create `sample.csv`? Even though qsv is blazing-fast, we're just doing an \n", - "initial investigation and a small 10-row sample is all we need to try out and\n", - "compose the different CLI commands needed to wrangle the data.\n" - ] - }, - { - "cell_type": "markdown", - "id": "640ad7a5-ce97-44a4-aa19-2f43a8133fcf", - "metadata": {}, - "source": [ - "Erk. Which country is `sv`? What continent? No clue, but [datawookie](https://github.com/datawookie) \n", - "has a CSV file called `country-continent.csv`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ac0b8bb8-6249-4244-b759-6fe0568e2b4b", - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "id": "aab56b5d-1a12-4086-8d69-fae8304edb66", - "metadata": {}, - "source": [ - "\n", - "```\n", - "$ curl -L https://raw.githubusercontent.com/datawookie/data-diaspora/master/spatial/country-continent-codes.csv > country_continent.csv\n", - "$ qsv headers country_continent.csv\n", - "1 # https://datahub.io/JohnSnowLabs/country-and-continent-codes-list\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "dbccfec9-385b-4494-9fa3-42d3ab940ef9", - "metadata": {}, - "source": [ - "Huh!?! That's not what we we were expecting. But if you look at the country-continent.csv file, it starts with a comment with the # character." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "4550f7fd-acb7-4a9e-9ef5-e79855324c1d", - "metadata": {}, - "outputs": [ + "cell_type": "markdown", + "source": "Wow! That was fast! It took just 1.3 seconds to compile all that.[^1] One reason for qsv's speed\nis that ***it mainly works in \"streaming\" mode*** - computing statistics as it \"streams\"\nthe CSV file line by line. This also means it can gather statistics on arbitrarily large files,\nas it does not have to load the entire file into memory.[^2]\n\nBut can we get more summary statistics? What's the variance, the modes, the distribution (quartiles), \nand the cardinality of the data? No problem. That's why `qsv stats` has an `--everything` option to \ncompute these more \"expensive\" stats. Expensive - as these extended statistics can only be computed at \nthe cost of loading the entire file into memory.\n", + "metadata": {}, + "id": "72360a9b-fba1-478f-b0d4-5943df9ac1c3" + }, + { + "cell_type": "code", + "source": "!qsv stats wcp.csv --everything | qsv table", + "metadata": {}, + "execution_count": 8, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": "field type sum min max range min_length max_length mean stddev variance nullcount sparsity mad lower_outer_fence lower_inner_fence q1 q2_median q3 iqr upper_inner_fence upper_outer_fence skewness cardinality mode mode_count mode_occurrences antimode antimode_count antimode_occurrences\n\nCountry String ad zw 2 2 0 0 231 ru 1 176934 cc,nf,pn,tf,tk 5 1\n\nCity String al lusayli �ykkvibaer 1 87 0 0 2008182 san jose 1 313 *PREVIEW: al lusayli, amiri-ye `olya, bab el ahmar, baqeri `olya, chahar tang-e bala, chefe muzaman... 1741957 1\n\nAccentCity String Al Lusayli ���zler 1 87 0 0 2025934 San Antonio 1 307 *PREVIEW: Al Lusayli, Amiri-ye `Olya, Baqeri `Olya, B�b el Ahmar, Chahar Tang-e Bala, Chefe Muzam... 
1762562 1\n\nRegion String 00 Z4 0 2 4 0 392 04 1 143900 *PREVIEW: H1,H6,H9,I1,I4,K5,K6,L1,L7,M8 17 1\n\nPopulation Integer 2290536125 7 31480498 31480491 0 8 48730.6639 308414.0419 95119221210.8846 2652350 0.9826 8327 -69766.5 -33018 3730.5 10879 28229.5 24499 64978 101726.5 0.4164 28460 1 2652350 *PREVIEW: 100,10001,10002,100023,10003,10005,10007,10008,10009,100105 18970 1\n\nLatitude Float 76585211.1978 -54.9333333 82.483333 137.4167 1 12 28.3717 21.9384 481.2922 0 0 15.2667 -84.7706 -35.9076 12.9553 33.8667 45.5306 32.5753 94.3935 143.2564 -0.2839 255133 50.8 1 1128 *PREVIEW: -.020556,-.025833,-.035,-.035833,-.047222,-.051944,-.054722,-.068889,-.071111,-.100833 79002 1\n\nLongitude Float 75976506.6643 -179.9833333 180.0 359.9833 1 14 28.1462 62.4729 3902.8581 0 0 30.9583 -199.3667 -98.4917 2.3833 26.8803 69.6333 67.25 170.5083 271.3833 0.2715 407568 23.1 1 590 *PREVIEW: -.185556,-.208056,-.246944,-.248333,-.263889,-.271667,-.276389,-.285,-.287222,-.288889 162640 1\n"
        }
      ],
      "id": "af448479-b6f5-4ba3-89b7-c3d124b7c5c5"
    },
    {
      "cell_type": "markdown",
      "source": "> ℹ️ **NOTE:** The `qsv table` command takes any CSV data and formats it into aligned columns\nusing [elastic tabstops](https://github.com/BurntSushi/tabwriter). You'll\nnotice that it even gets alignment right with respect to Unicode characters.\n",
      "metadata": {},
      "id": "c8623cc4-1426-4214-bd4e-8a6be9a68210"
    },
    {
      "cell_type": "markdown",
      "source": "So, this command took 3.22 seconds to run on my machine, but we can speed\nit up by creating an index and re-running the command:",
      "metadata": {},
      "id": "8f5a7d90-fbb0-4fe9-9866-223f81ed5fc2"
    },
    {
      "cell_type": "markdown",
      "source": "```\nqsv index wcp.csv\nqsv stats wcp.csv --everything | qsv table\n```",
      "metadata": {},
      "id": "ab1d059c-bb8b-42c7-aa04-f64f32e5f825"
    },
    {
      "cell_type": "markdown",
      "source": "Which cuts it down to 1.95 seconds - 1.65x faster! (And creating the 21.6MB index took 0.27 seconds. \nWhat about the first `stats` without `--everything`? From 1.3 seconds to 0.16 seconds with an index - 8.25x faster!)",
      "metadata": {},
      "id": "6c2c1a14-e61f-4951-9f1a-bad27840a017"
    },
    {
      "cell_type": "markdown",
      "source": "Notably, the same type of \"statistics\" command in another\n[CSV command line toolkit](https://csvkit.readthedocs.io/)\ntakes about 10 seconds to produce a *subset* of statistics on the same data set. [Visidata](https://visidata.org)\ntakes much longer - ~1.5 minutes to calculate a *subset* of these statistics with its Describe sheet. \nEven Python [pandas'](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) \n`describe(include=\"all\")` took 12 seconds to calculate a *subset* of qsv's \"streaming\" statistics.[^3]\n\nThis is another reason for qsv's speed. Creating an index accelerates statistics gathering, as it enables \n***multithreading & fast I/O***.\n\n**For multithreading** - running `stats` with an index was 8.25x faster because it divided the file into \n16 equal chunks[^1] with ~170k records each, then ran stats on each chunk in parallel across 16 \nlogical processors and merged the results at the end. It was \"only\" 8x, and not 16x, faster because there is \nsome overhead involved in multithreading. 
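To make the chunking concrete, here is a minimal sketch — assuming a 16-logical-processor machine like the one above, and assuming your build exposes the `--jobs` option `stats` uses to cap its worker threads:

```
# with wcp.csv.idx present, stats splits the file into equal chunks
# and crunches them in parallel; --jobs caps the logical processors used
qsv index wcp.csv
qsv stats wcp.csv --everything --jobs 4 | qsv table
```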
\n\n**For fast I/O** - let's say you wanted to grab the last 10 records:",
      "metadata": {},
      "id": "eb7e899a-175f-4bef-882a-7a1dae15ef1a"
    },
    {
      "cell_type": "code",
      "source": "!qsv count --human-readable wcp.csv",
      "metadata": {},
      "execution_count": 12,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "2,699,354\n"
        }
      ],
      "id": "d8f63f5d-9107-4fbd-af81-614591318690"
    },
    {
      "cell_type": "code",
      "source": "!qsv slice wcp.csv --start -10 | qsv table",
      "metadata": {},
      "execution_count": 14,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "Country City AccentCity Region Population Latitude Longitude\n\nzw zibalonkwe Zibalonkwe 06 -19.8333333 27.4666667\n\nzw zibunkululu Zibunkululu 06 -19.6666667 27.6166667\n\nzw ziga Ziga 06 -19.2166667 27.4833333\n\nzw zikamanas village Zikamanas Village 00 -18.2166667 27.95\n\nzw zimbabwe Zimbabwe 07 -20.2666667 30.9166667\n\nzw zimre park Zimre Park 04 -17.8661111 31.2136111\n\nzw ziyakamanas Ziyakamanas 00 -18.2166667 27.95\n\nzw zizalisari Zizalisari 04 -17.7588889 31.0105556\n\nzw zuzumba Zuzumba 06 -20.0333333 27.9333333\n\nzw zvishavane Zvishavane 07 79876 -20.3333333 30.0333333\n"
        }
      ],
      "id": "ae7a1bbf-e129-4f28-9001-3c7643f41ebb"
    },
    {
      "cell_type": "markdown",
      "source": "`qsv count` took 0.006 seconds and `qsv slice`, 0.017 seconds! These commands are *instantaneous* \nwith an index because for `count`, the index has already precomputed the record count, and with `slice`,\n*only the sliced portion* has to be parsed - because an index allowed us to jump directly to that \npart of the file. It didn't have to scan the entire file to get the last 10 records. For comparison,\nwithout an index, it took 0.25 (41x slower) and 0.66 (39x slower) seconds respectively.\n\n> ℹ️ **NOTE:** Creating/updating an index itself is extremely fast as well. If you want\nqsv to automatically create and update indices, set the environment variable `QSV_AUTOINDEX`.\n\nOkay, okay! Let's switch gears and stop obsessing over how fast :rocket: qsv is... let's go back to exploring :mag_right:\nthe data set.\n\nHmmmm... the Population column has a lot of null values. How pervasive is that?\nFirst, let's take a look at 10 \"random\" rows with `sample`. We use the `--seed` parameter\nso we get a reproducible random sample. And then, let's display only the Country,\nAccentCity and Population columns with the `select` command.",
      "metadata": {},
      "id": "0f10b533-dc18-4cea-8936-3a58fff526fc"
    },
    {
      "cell_type": "code",
      "source": "!qsv sample --seed 42 10 wcp.csv | qsv select Country,AccentCity,Population | qsv table",
      "metadata": {},
      "execution_count": 15,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "Country AccentCity Population\n\nco Guamal 5026\n\naf Jalat Kalay \n\nro Caragheorghevici \n\nmm Padauk \n\ntr Vakifbeloren \n\nmx Chupadero de Alcaparra \n\nir Gangar Kola \n\ntg Folo \n\npe Huancas \n\npk Dilawanwala \n"
        }
      ],
      "id": "ad0597f0-22c8-4e2b-a5fa-f8cbe5392c91"
    },
    {
      "cell_type": "markdown",
      "source": "Whoops! The sample we got doesn't have population counts. It's quite pervasive. 
Exactly how many cities have empty (NULL) population counts?", + "metadata": {}, + "id": "4c037b4c-0896-4de3-9237-6cd86a548087" + }, + { + "cell_type": "code", + "source": "!qsv frequency wcp.csv --limit 3 | qsv table", + "metadata": {}, + "execution_count": 16, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": "field value count\n\nCountry ru 176934\n\nCountry us 141989\n\nCountry cn 117508\n\nCity san jose 313\n\nCity san antonio 310\n\nCity santa rosa 288\n\nAccentCity San Antonio 307\n\nAccentCity Santa Rosa 288\n\nAccentCity Santa Cruz 268\n\nRegion 04 143900\n\nRegion 02 127736\n\nRegion 03 105455\n\nPopulation (NULL) 2652350\n\nPopulation 2310 12\n\nPopulation 983 11\n\nLatitude 50.8 1128\n\nLatitude 50.95 1076\n\nLatitude 50.6 1043\n\nLongitude 23.1 590\n\nLongitude 23.2 586\n\nLongitude 23.05 575\n" + } + ], + "id": "d3090c4d-18dd-4dbc-987e-bcf4c751df4b" + }, + { + "cell_type": "markdown", + "source": "(The `qsv frequency` command builds a frequency table for each column in the\nCSV data. This one only took 1.8 seconds.)\n\nSo it seems that most cities do not have a population count associated with\nthem at all (2,652,350 to be exact). No matter — we can adjust our previous \ncommand so that it only shows rows with a population count:\n", + "metadata": {}, + "id": "571d99d7-c867-4c74-bb6f-abd9fcf9f2a6" + }, + { + "cell_type": "code", + "source": "!qsv search --select Population '[0-9]' wcp.csv | qsv sample --seed 42 10 | qsv select Country,AccentCity,Population | tee sample.csv | qsv table", + "metadata": {}, + "execution_count": 19, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": "'tee' is not recognized as an internal or external command,\n\noperable program or batch file.\n" + } + ], + "id": "60f3bb63-aff3-4023-bdb1-c938016abc81" + }, + { + "cell_type": "markdown", + "source": "> ℹ️ **NOTE:** The `tee` command reads from standard input and writes \nto both standard output and one or more files at the same time. We do this so \nwe can create the `sample.csv` file we need for the next step, and pipe the \nsame data to the `qsv table` command.
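(If your shell has no `tee` at all — the Windows error captured above — a two-step sketch using only commands already shown in this tour gets the same result:)

```
# write the 10-row sample to disk first, then pretty-print it
qsv search --select Population '[0-9]' wcp.csv | qsv sample --seed 42 10 | qsv select Country,AccentCity,Population > sample.csv
qsv table sample.csv
```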
Why create `sample.csv`? Even though qsv is blazing-fast, we're just doing an \ninitial investigation and a small 10-row sample is all we need to try out and\ncompose the different CLI commands needed to wrangle the data.\n",
      "metadata": {},
      "id": "81c813a8-1296-4df7-a3f4-2ddded7fd4fd"
    },
    {
      "cell_type": "markdown",
      "source": "Erk. Which country is `sv`? What continent? No clue, but [datawookie](https://github.com/datawookie) \nhas a CSV file called `country-continent-codes.csv`.",
      "metadata": {},
      "id": "640ad7a5-ce97-44a4-aa19-2f43a8133fcf"
    },
    {
      "cell_type": "code",
      "source": "",
      "metadata": {},
      "execution_count": null,
      "outputs": [],
      "id": "ac0b8bb8-6249-4244-b759-6fe0568e2b4b"
    },
    {
      "cell_type": "markdown",
      "source": "\n```\n$ curl -L https://raw.githubusercontent.com/datawookie/data-diaspora/master/spatial/country-continent-codes.csv > country_continent.csv\n$ qsv headers country_continent.csv\n1 # https://datahub.io/JohnSnowLabs/country-and-continent-codes-list\n```",
      "metadata": {},
      "id": "aab56b5d-1a12-4086-8d69-fae8304edb66"
    },
    {
      "cell_type": "markdown",
      "source": "Huh!?! That's not what we were expecting. But if you look at the downloaded country_continent.csv file, it starts with a comment line beginning with the `#` character.",
      "metadata": {},
      "id": "dbccfec9-385b-4494-9fa3-42d3ab940ef9"
    },
    {
      "cell_type": "code",
      "source": "!qsv count wcp.csv",
      "metadata": {},
      "execution_count": 1,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": "2699354\n"
        }
      ],
      "id": "4550f7fd-acb7-4a9e-9ef5-e79855324c1d"
    },
    {
      "cell_type": "code",
      "source": "",
      "metadata": {},
      "execution_count": null,
      "outputs": [],
      "id": "9d723561-42bd-45fe-af2e-c17d17c41e8c"
    }
  ]
}
\ No newline at end of file
From f46978ee33d42f83710f7ca8bd13c20bc9f1acb4 Mon Sep 17 00:00:00 2001 From: Abdur Rahman Date: Fri, 6 Oct 2023 18:48:46 +0530 Subject: [PATCH 2/2] Fixed Jupyterlab Link --- contrib/notebooks/Whirlwindtour.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/contrib/notebooks/Whirlwindtour.ipynb b/contrib/notebooks/Whirlwindtour.ipynb index 7f5dd9156..b87ac6140 100644 --- a/contrib/notebooks/Whirlwindtour.ipynb +++ b/contrib/notebooks/Whirlwindtour.ipynb @@ -23,7 +23,7 @@ "cells": [ { "cell_type": "markdown", - "source": "\n \"View\n\n\n \"View\n\n", + "source": "\n \"View\n\n\n \"View\n\n", "metadata": {}, "id": "f8466df2-007d-4d77-928c-6cca46abc97d" },