added m16 class prep

chapmanbe · Nov 21, 2017 · eb20966
1 parent 8de85cc
commit eb20966
Showing 4 changed files with 800 additions and 0 deletions.
diff --git a/modules/m16_graphs/ClassPrep/RetrievePubMedData.ipynb b/modules/m16_graphs/ClassPrep/RetrievePubMedData.ipynb
@@ -0,0 +1,290 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Graph Relationships Among Researchers\n",
+    "\n",
+    "We will build graphs that describe relationships among researchers based on co-authorship. This notebook uses [Biopython](http://biopython.org/) to query PubMed and retrieve citation information for articles published by a list of researchers.\n",
+    "\n",
+    "Feel free to create your own list of researchers (including yourself!).\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Uncomment and run the cell below if you need to install Biopython"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#!conda install biopython -y"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import getpass\n",
+    "import gzip\n",
+    "import os\n",
+    "import pickle\n",
+    "from Bio import Entrez\n",
+    "from IPython.display import Image\n",
+    "import networkx as nx\n",
+    "DATADIR = os.getcwd()\n",
+    "print(os.path.exists(DATADIR))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### An Example List of BMI Faculty\n",
+    "\n",
+    "Names are not unique identifiers, so querying PubMed by name can be challenging. For example, I try to publish as \"Brian E Chapman\", but some of my papers appeared under \"Brian Chapman\". The list below was copied from a spreadsheet with tab-separated LASTNAME and FIRSTNAME columns and tweaked so that each name matches its most common published form. The code then rearranges each entry into FIRSTNAME LASTNAME form.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "faculty = [tuple(s.split(\"\\t\")) for s in \n",
+    "\"\"\"AbdelRahman\tSamir E\n",
+    "Adler\tFrederick R\n",
+    "Bray\tBruce E\n",
+    "Camp\tNicola J\n",
+    "Chapman\tBrian E\n",
+    "Chapman\tWendy W\n",
+    "Conway\tMichael A\n",
+    "Cummins\tMollie R\n",
+    "Del Fiol\tGuilherme\n",
+    "Drews\tFrank A\n",
+    "Egger\tMarlene J\n",
+    "Eilbeck\tKaren\n",
+    "Evans\tR Scott\n",
+    "Facelli\tJulio C\n",
+    "Gibson\tBryan S\n",
+    "Gouripeddi\tRamkiran\n",
+    "Haug\tPeter J\n",
+    "Huff\tStanley M\n",
+    "Hurdle\tJohn F\n",
+    "Kawamoto\tKensaku\n",
+    "Lee\tYounghee\n",
+    "Narus\tScott P\n",
+    "Nebeker\tJonathan\n",
+    "Parker\tDennis L\n",
+    "Piccolo\tStephen\n",
+    "Quinlan\tAaron\n",
+    "Samore\tMatthew H\n",
+    "Sauer\tBrian C\n",
+    "Staes\tCatherine J\n",
+    "Sward\tKatherine A\n",
+    "Weir\tCharlene R\n",
+    "Yandell\tMark\n",
+    "Dean\tJ Michael\n",
+    "Gesteland\tPer H\n",
+    "Gundlapalli\tAdi V\n",
+    "Jackson\tBrian R\n",
+    "Lincoln\tMichael J\n",
+    "Morris\tAlan H\n",
+    "Xu\tWu\"\"\".split(\"\\n\")]\n",
+    "faculty = [\"%s %s\" % (first, last) for last, first in faculty]\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Here is a shorter, alternative list\n",
+    "#### Edit and uncomment"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#faculty = [\"Brian E Chapman\", \"David Gur\", \"Wendy W Chapman\", \"Peter J Haug\", \"Dennis L Parker\", \"Matthew H Samore\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Get the PubMed IDs matching a query"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "email_string = input(\"Enter your e-mail: \").strip()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def search(query, email=''):\n",
+    "    \"\"\"Return the esearch result (up to 100 PubMed IDs) for query.\"\"\"\n",
+    "    Entrez.email = email\n",
+    "    handle = Entrez.esearch(db='pubmed', sort='relevance',\n",
+    "                            retmax='100', retmode='xml',\n",
+    "                            term=query)\n",
+    "    results = Entrez.read(handle)\n",
+    "    handle.close()\n",
+    "    return results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Fetch papers corresponding to ids"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def fetch_details(id_list, email=\"[email protected]\"):\n",
+    "    \"\"\"Fetch the full PubMed records for a list of PubMed IDs.\"\"\"\n",
+    "    ids = ','.join(id_list)\n",
+    "    Entrez.email = email\n",
+    "    handle = Entrez.efetch(db='pubmed', retmode='xml', id=ids)\n",
+    "    results = Entrez.read(handle)\n",
+    "    handle.close()\n",
+    "    return results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Get Co-authorship\n",
+    "\n",
+    "Entrez returns a lot of information. We pare it down to just the author names. We need exception handling because the returned papers don't always have the fields we want."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def get_coauthor_lists(papers):\n",
+    "    \"\"\"Map each article title to its list of 'ForeName LastName' authors.\"\"\"\n",
+    "    paper_authors = {}\n",
+    "    for p in papers:\n",
+    "        try:\n",
+    "            article = p['MedlineCitation']['Article']\n",
+    "            alist = []\n",
+    "            for a in article['AuthorList']:\n",
+    "                try:\n",
+    "                    alist.append(\"%s %s\" % (a['ForeName'], a['LastName']))\n",
+    "                except KeyError:\n",
+    "                    # Skip authors that lack a ForeName or LastName field.\n",
+    "                    pass\n",
+    "            paper_authors[article['ArticleTitle']] = alist\n",
+    "        except KeyError:  # record lacks an author list or a title\n",
+    "            pass\n",
+    "    return paper_authors"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
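+    "def get_faculty_coauthors(faculty, email=''):\n",
+    "    \"\"\"Search PubMed for one name and return its title -> authors mapping.\"\"\"\n",
+    "    id_list = search(faculty, email=email)['IdList']\n",
+    "    papers = fetch_details(id_list, email=email)[\"PubmedArticle\"]\n",
+    "    return get_coauthor_lists(papers)"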
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Author:Co-author dictionary"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "coauthors_with_ext = {f: get_faculty_coauthors(f, email=email_string) for f in faculty}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with gzip.open(\"researchers_pubmed.pickle.gzip\", \"wb\") as f0:\n",
+    "    pickle.dump(coauthors_with_ext, f0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!ls -l"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.5.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
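
Review note on RetrievePubMedData.ipynb: the notebook stops after pickling the title-to-authors mappings, and the graph itself is left for later. As a rough sketch of that next step (not part of this commit), the block below counts how many papers each pair of authors shares, using only the standard library. The names `sample_papers` and `coauthor_edge_weights` are hypothetical, and the sample data is made up to mimic the shape of one value in `coauthors_with_ext`.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stand-in for one value of coauthors_with_ext:
# article title -> list of "ForeName LastName" strings (made-up data).
sample_papers = {
    "Paper one": ["Ada Example", "Ben Sample", "Cy Placeholder"],
    "Paper two": ["Ada Example", "Ben Sample"],
    "Paper three": ["Cy Placeholder"],
}


def coauthor_edge_weights(paper_authors):
    """Count how many papers each unordered pair of authors shares."""
    weights = Counter()
    for authors in paper_authors.values():
        # sorted(set(...)) drops duplicate names and gives each pair a stable order
        for pair in combinations(sorted(set(authors)), 2):
            weights[pair] += 1
    return weights


edge_weights = coauthor_edge_weights(sample_papers)
for (a, b), n in sorted(edge_weights.items()):
    print(a, "--", b, "shared papers:", n)
```

Each `(a, b)` pair and its count could then be passed to something like networkx's `Graph.add_edge(a, b, weight=n)`, assuming a weighted, undirected co-authorship graph is what the later class notebooks want.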