{
  "metadata": {
  },
  "nbformat": 4,
  "nbformat_minor": 5,
  "cells": [
    {
      "id": "metadata",
      "cell_type": "markdown",
      "source": "<div style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em; padding: 0.5em;\">\n\n# Python - Files &amp; CSV\n\nby [Helena Rasche](https://training.galaxyproject.org/hall-of-fame/hexylena/), [Donny Vrins](https://training.galaxyproject.org/hall-of-fame/dirowa/), [Bazante Sanders](https://training.galaxyproject.org/hall-of-fame/bazante1/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Objectives**\n\n- How can I read from a file?\n- How can I parse a CSV file?\n- How can I write results out\n\n**Objectives**\n\n- Read data from a file\n- Write new data to a file\n- Use <code style=\"color: inherit\">with</code> to ensure the file is closed properly.\n- Use the CSV module to parse comma and tab separated datasets.\n\n**Time Estimation: 1H30M**\n</div>\n",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-0",
      "source": "<p>Here we’ll give a quick tutorial on how to read and write files within Python.</p>\n<blockquote class=\"agenda\" style=\"border: 2px solid #86D486;display: none; margin: 1em 0.2em\">\n<div id=\"agenda\" class=\"box-title\" aria-label=\"agenda box: \">Agenda</div>\n<p>In this tutorial, we will cover:</p>\n<ol id=\"markdown-toc\">\n<li><a href=\"#setup\" id=\"markdown-toc-setup\">Setup</a></li>\n</ol>\n</blockquote>\n<h2 id=\"setup\">Setup</h2>\n<p>For this tutorial, we assume you’re working in a notebook (Jupyter, CoCalc, etc) so we’ll run a quick “setup” step to download some CSV data:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-1",
      "source": [
        "import urllib.request\n",
        "# Download a copy of Hamlet.\n",
        "urllib.request.urlretrieve(\"https://gutenberg.org/cache/epub/1524/pg1524.txt\", \"hamlet.txt\")\n",
        "# Download some COVID data for Europe\n",
        "urllib.request.urlretrieve(\"https://opendata.ecdc.europa.eu/covid19/vaccine_tracker/csv/data.csv\", \"vaccinations.csv\")\n",
        "# And a fastq file\n",
        "urllib.request.urlretrieve(\"https://gist.github.com/hexylena/7d249607f8f763301f06c78a48c3bf6f/raw/a100e278cee1c94035a3a644b16863deee0ba2c0/example.fastq\", \"example.fastq\")"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-2",
      "source": "<p>And now we’re ready to get started learning about files!</p>\n<h2 id=\"reading-from-a-file\">Reading from a file</h2>\n<p>Reading from and writing to files in Python is very straightforward, we use <code style=\"color: inherit\">open()</code> to open a file to read from it. Files are accessed through something called a <strong>file <a href=\"https://en.wikipedia.org/wiki/Handle_(computing)\">handle</a></strong>, you’re not accessing the file itself, you’re opening a connection to the file, and then you can read lines from through that file handle. When you open a file handle, you must specify it’s mode:</p>\n<table>\n<thead>\n<tr>\n<th>Mode</th>\n<th>Purpose</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><code style=\"color: inherit\">r</code></td>\n<td>We will <strong>read</strong> from this file handle</td>\n</tr>\n<tr>\n<td><code style=\"color: inherit\">w</code></td>\n<td>We will <strong>write</strong> to this file handle</td>\n</tr>\n<tr>\n<td><code style=\"color: inherit\">a</code></td>\n<td>We will <strong>append</strong> to this file handle (we cannot access earlier contents!)</td>\n</tr>\n</tbody>\n</table>\n<p>You will mostly use <code style=\"color: inherit\">r</code> and <code style=\"color: inherit\">w</code>, <code style=\"color: inherit\">a</code> is especially useful for writing to program logs where you don’t really care what was written before, you just want to add your new logs to the end of the file.</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-3",
      "source": [
        "with open('hamlet.txt', 'r') as handle:\n",
        "    # readlines reads every line of a file into a giant list!\n",
        "    lines = handle.readlines()"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-4",
      "source": "<p>Here we introduce a new bit of syntax, the <code style=\"color: inherit\">with</code> block. Technically <code style=\"color: inherit\">with</code> begins a “context manager” which allows python to setup some things before the block, run some contents in the block, and automatically handle cleanup of this block. When you open a file, you <em>must</em> close it when you’re done with it (otherwise bad things can happen!) and <code style=\"color: inherit\">with</code> prevents most of those issues.</p>\n<p>In the above code snippet after the second line, the file (referred to by <code style=\"color: inherit\">handle</code>) is automatically closed.</p>\n<blockquote class=\"code-2col\">\n<blockquote class=\"code-in\" style=\"border: 2px solid #86D486; margin: 1em 0.2em\">\n<div id=\"code-in-using-with\" class=\"box-title\" aria-label=\"code-in box: Using &lt;code&gt;with&lt;/code&gt;\"><i class=\"far fa-keyboard\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Input: Using <code style=\"color: inherit\">with</code></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">with open('file.txt', 'r') as handle:\n    print(handle.readlines())\n</code></pre></div>    </div>\n</blockquote>\n<blockquote class=\"code-out\" style=\"border: 2px solid #fb99d0; margin: 1em 0.2em\">\n<div id=\"code-out-not-using-with\" class=\"box-title\" aria-label=\"code-out box: Not using &lt;code&gt;with&lt;/code&gt;\"><i class=\"fas fa-laptop-code\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Output: Not using <code style=\"color: inherit\">with</code></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">handle = open('file.txt', 'r')\nprint(handle.readlines())\nhandle.close() # Important!\n</code></pre></div>    </div>\n</blockquote>\n</blockquote>\n<p>Let’s see what’s in our file. `</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-5",
      "source": [
        "print(lines[0])\n",
        "print(lines[1])\n",
        "print(lines[2])"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-6",
      "source": "<p>Notice how it prints out a blank line afterwards! This is due to a <code style=\"color: inherit\">\\n</code>, a newline. A newline just tells the computer “please put content on the next line”. We can see it by using the <code style=\"color: inherit\">repr()</code> function:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-7",
      "source": [
        "print(repr(lines[0]))"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-8",
      "source": "<p>Every line that’s read in ends in a newline currently. This is done because if we wanted to write it back out, we would need to preserve those newlines, or all of the content would be on one giant line. Let’s try writing out a file, it’s <em>just</em> like reading in a file!</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-9",
      "source": [
        "with open('hamlet-copy.txt', 'w') as handle:\n",
        "    # readlines reads every line of a file into a giant list!\n",
        "    for line in lines:\n",
        "        handle.write(line)"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-10",
      "source": "<p>Check this file in your folder, does it look right? Is it identical in size to the original?</p>\n<h3 id=\"use-case-summarisation\">Use Case: Summarisation</h3>\n<p>Let’s use the file’s contents for something useful. This file specifically is the play Hamlet, by Shakespeare. The contents are formatted with a speaker indicated with all capital letters, followed by their lines (potentially spread over multiple lines of the file.)</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">HAMLET. Madam, how like you this play?\n\nQUEEN. The lady protests too much, methinks.\n\nHAMLET. O, but she’ll keep her word.\n</code></pre></div></div>\n<p>So let’s count up how many times each character speaks! (Roughly)</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-11",
      "source": [
        "with open('hamlet.txt', 'r') as handle:\n",
        "    # readlines reads every line of a file into a giant list!\n",
        "    lines = handle.readlines()\n",
        "\n",
        "speakers = {}"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-12",
      "source": "<p>Here we’ve initialised the <code style=\"color: inherit\">lines</code> variable with the contents of the text, and setup <code style=\"color: inherit\">speakers</code> as a dictionary that will let us track how many times each character speaks. Next let’s define a function to check if a character is speaking on that line. It is if it meets two conditions: the first word is all caps, and it ends with a <code style=\"color: inherit\">.</code>.</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-13",
      "source": [
        "def is_speaker(word):\n",
        "    return word == word.upper() and word[-1] == '.'"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-14",
      "source": "<p>We can use that function later to check if a line starts with a speaker</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-15",
      "source": [
        "# Loop over every line we read in\n",
        "for line in lines:\n",
        "    # Split by default splits on whitespace.\n",
        "    words = line.split()\n",
        "\n",
        "    # Are there words on this line? We can't access the first word if we\n",
        "    # haven't any words.\n",
        "    if len(words) == 0:\n",
        "        continue\n",
        "\n",
        "    # Check if the first word is uppercase, and the last character is a `.`,\n",
        "    # then it's a character speaking.\n",
        "    if is_speaker(words[0]):\n",
        "        # Give this an easier to remember and understand name.\n",
        "        speaker = words[0]\n",
        "\n",
        "        # Have we seen this speaker before? If not, we should add them to the\n",
        "        # speakers dictionary. Hint: Try removing this to see why we do this.\n",
        "        if speaker not in speakers:\n",
        "            speakers[speaker] = 0\n",
        "\n",
        "        # Increment the number of times we've seen them speak.\n",
        "        speakers[speaker] = speakers[speaker] + 1"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-16",
      "source": "<p>Ok! We’ve done a couple things here that all fall into the category of <strong>defensive programming</strong>. As programmers, we often accept input from users or from unknown sources. That input may be wrong, it may have bad data, it may be trying to attack us. So we respond by checking very carefully if things match our expectations, and rejecting the input otherwise. We did a couple things here for that:</p>\n<ul>\n<li>Checking if the input matches our expectations exactly (capitals, <code style=\"color: inherit\">.</code>)</li>\n<li>Checking if the line is empty, using <code style=\"color: inherit\">continue</code> to skip it if it was</li>\n<li>Checking if the speaker is already known in the dictionary, adding it otherwise.</li>\n</ul>\n<p>Let’s see who was the most chatty:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-17",
      "source": [
        "for key, value in speakers.items():\n",
        "    print(key, value)"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-18",
      "source": "<p>We’ve clearly caught a number of values that aren’t expected, some section headers (the numeric values, and some rare values we don’t expect.)</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-19",
      "source": [
        "for key, value in speakers.items():\n",
        "    if value > 1:\n",
        "        print(key, value)"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-20",
      "source": "<p>Hamlet, the titular character, has the vast majority of turns speaking throughout the play:</p>\n<table>\n<thead>\n<tr>\n<th>Character</th>\n<th>Turns speaking</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>HAMLET</td>\n<td>358</td>\n</tr>\n<tr>\n<td>HORATIO</td>\n<td>107</td>\n</tr>\n<tr>\n<td>KING</td>\n<td>102</td>\n</tr>\n<tr>\n<td>POLONIUS</td>\n<td>86</td>\n</tr>\n<tr>\n<td>QUEEN</td>\n<td>69</td>\n</tr>\n</tbody>\n</table>\n<h2 id=\"writing-files\">Writing files</h2>\n<p>Writing a file out is exactly like reading a file, we just use the different file mode <code style=\"color: inherit\">w</code> to indicate we wish to write to a file:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-21",
      "source": [
        "with open('hello.txt', 'w') as handle:\n",
        "    # Let's write it out a few times\n",
        "    handle.write(\"Hello, world!\")\n",
        "    handle.write(\"Hello, world!\")\n",
        "    handle.write(\"Hello, world!\")\n",
        "    handle.write(\"Hello, world!\")"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-22",
      "source": "<p>Check the file’s contents in your folder, does it look like you expect? Remember your <code style=\"color: inherit\">\\n</code>s!</p>\n<h3 id=\"use-case-transformation\">Use Case: Transformation</h3>\n<p>A common use case is transforming one file’s contents into another file format or file type. Let’s do a very simple example of that, taking a FASTQ file and transforming it into a FASTA file. Remember first that a FASTQ file looks like:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">@M00970:337:000000000-BR5KF:1:1102:17745:1557 1:N:0:CGCAGAAC+ACAGAGTT\nGTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA\n+\nGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(\n</code></pre></div></div>\n<table>\n<thead>\n<tr>\n<th>Line</th>\n<th>Contents</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>1</td>\n<td>Identifier, prefixed with <code style=\"color: inherit\">@</code></td>\n</tr>\n<tr>\n<td>2</td>\n<td>Sequence</td>\n</tr>\n<tr>\n<td>3</td>\n<td><code style=\"color: inherit\">+</code></td>\n</tr>\n<tr>\n<td>4</td>\n<td>Quality scores</td>\n</tr>\n</tbody>\n</table>\n<p>And a fasta file looks like this, where <code style=\"color: inherit\">&gt;</code> indicates a sequence identifier, and it is followed by one or more lines of ACTGs.</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">&gt;M00970:337:000000000-BR5KF:1:1102:17745:1557 1:N:0:CGCAGAAC+ACAGAGTT\nGTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA\n</code></pre></div></div>\n<p>In the setup portion we downloaded a FASTQ file, now let’s extract all of the sequences from this file, and write them out as a FASTA file. Why would you want to do this? Sometimes after sequencing a sample (especially metagenomics), you want to blast the sequences to figure out which organisms they belong to. A common way to do that is BLAST which accepts fasta formatted sequences. So we’ll write something to convert these formats, removing the <code style=\"color: inherit\">+</code> and quality score lines.</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-23",
      "source": [
        "with open('example.fastq', 'r') as handle:\n",
        "    data = handle.readlines()\n",
        "\n",
        "i = 0\n",
        "with open('example-converted.fasta', 'w') as handle:\n",
        "    for line in data:\n",
        "        # Since fastq files have groups of 4 lines, we can use that to extract\n",
        "        # specific lines of every read:\n",
        "        if i % 4 == 0:\n",
        "            handle.write(\">\" + line[1:])\n",
        "        if i % 4 == 1:\n",
        "            handle.write(line)\n",
        "        i = i + 1"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-24",
      "source": "<p>That’s it! Check if your <code style=\"color: inherit\">fasta</code> file looks correct.</p>\n<blockquote class=\"tip\" style=\"border: 2px solid #FFE19E; margin: 1em 0.2em\">\n<div id=\"tip-enumerate\" class=\"box-title\"><button type=\"button\" aria-controls=\"tip-enumerate-contents\" aria-expanded=\"true\" aria-label=\"Toggle tip box: &lt;code&gt;enumerate&lt;/code&gt;\"><i class=\"far fa-lightbulb\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Tip: <code style=\"color: inherit\">enumerate</code><span role=\"button\" class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>If you want to know which item number you’re on while you’re looping over a list, you can use the function <code style=\"color: inherit\">enumerate()</code>\nTry out this code to see how it works:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">for index, item in enumerate(['a', 'b', 'c']):\n    print(index, item)\n</code></pre></div>  </div>\n<p>Try using it to clean up the above code.</p>\n</blockquote>\n<h2 id=\"reading-csv-data\">Reading CSV data</h2>\n<p>If you’re reading data from a comma separated value (CSV) or tab separated value (TSV) file, you should use the built in <code style=\"color: inherit\">csv</code> module to do this. You might ask yourself “Why, csv parsing is easy” and that is a common thought! It would be so simple to do something like</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-25",
      "source": [
        "# Please don't do this :)\n",
        "with open('vaccinations.csv', 'r') as handle:\n",
        "    first_lines = handle.readlines()[0:10]\n",
        "    for line in first_lines:\n",
        "        print(line.split(','))"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-26",
      "source": "<p>But you would be wrong! This code has a subtle bug that you might not see until someone generates data that specifically affects it, with “quoted” columns. If you have a table like</p>\n<table>\n<thead>\n<tr>\n<th>Patient</th>\n<th>Location</th>\n<th>Disease Indications</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Helena</td>\n<td>Den Haag, the Netherlands</td>\n<td>Z87.890</td>\n</tr>\n<tr>\n<td>Bob</td>\n<td>Little Rock, Arkansas, USA</td>\n<td>Z72.53</td>\n</tr>\n<tr>\n<td>Jane</td>\n<td>London, UK</td>\n<td>Z86.16</td>\n</tr>\n</tbody>\n</table>\n<p>This would probably be exported as a CSV file from Excel that looks like:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">Patient,Location,Disease Indications\nHelena,\"Den Haag, the Netherlands\",Z87.890\nBob,\"Little Rock, Arkansas, USA\",Z72.53\nJane,\"London, UK\",Z86.16\n</code></pre></div></div>\n<p>Note that some columns are quoted. What do you think will happen with the following code?</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-27",
      "source": [
        "csv_data = \"\"\"\n",
        "Patient,Location,Disease Indications\n",
        "Helena,\"Den Haag, the Netherlands\",Z87.890\n",
        "Bob,\"Little Rock, Arkansas, USA\",Z72.53\n",
        "Jane,\"London, UK\",Z86.16\n",
        "\"\"\".strip().split('\\n')\n",
        "\n",
        "# Please don't do this :)\n",
        "for line in csv_data:\n",
        "    print(line.split(','))"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-28",
      "source": "<p>Does that look right? Maybe not. Instead we can use the <code style=\"color: inherit\">csv</code> module to work around this and properly process CSV files:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-29",
      "source": [
        "import csv\n",
        "\n",
        "csv_data = \"\"\"\n",
        "Patient,Location,Disease Indications\n",
        "Helena,\"Den Haag, the Netherlands\",Z87.890\n",
        "Bob,\"Little Rock, Arkansas, USA\",Z72.53\n",
        "Jane,\"London, UK\",Z86.16\n",
        "\"\"\".strip().split('\\n')\n",
        "\n",
        "# Please DO this :)\n",
        "csv_reader = csv.reader(csv_data, delimiter=\",\", quotechar='\"')\n",
        "for row in csv_reader:\n",
        "    print(row)"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-30",
      "source": "<p>That looks a lot better! Now we’ve properly handled the quoted columns that contain one or more <code style=\"color: inherit\">,</code> in the middle of our file. This is actually one of the motivating factors in using the TSV format, the <kbd>tab</kbd> character is much more rare in data than <kbd>,</kbd>. There is less chance for confusion with poorly written software.</p>\n<p>Let’s read in some statistics about vaccinations:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-31",
      "source": [
        "import csv\n",
        "\n",
        "vax = []\n",
        "with open('vaccinations.csv', 'r') as handle:\n",
        "    csv_reader = csv.reader(handle, delimiter=\",\", quotechar='\"')\n",
        "    for row in csv_reader:\n",
        "        # Skip our header row\n",
        "        if row[0] == 'YearWeekISO':\n",
        "            continue\n",
        "        # Otherwise load in the data\n",
        "        vax.append(row)\n",
        "\n",
        "print(vax[0:10])"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-32",
      "source": "<p>Here we have a 2 dimensional array, a list of lists. Each row is an entry in the main list, and each column is an entry in each of those children.</p>\n<p>Our columns are:</p>\n<table>\n<thead>\n<tr>\n<th>Column</th>\n<th>Value</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>0</td>\n<td>YearWeekISO</td>\n</tr>\n<tr>\n<td>1</td>\n<td>ReportingCountry</td>\n</tr>\n<tr>\n<td>2</td>\n<td>Denominator</td>\n</tr>\n<tr>\n<td>3</td>\n<td>NumberDosesReceived</td>\n</tr>\n<tr>\n<td>4</td>\n<td>NumberDosesExported</td>\n</tr>\n<tr>\n<td>5</td>\n<td>FirstDose</td>\n</tr>\n<tr>\n<td>6</td>\n<td>FirstDoseRefused</td>\n</tr>\n<tr>\n<td>7</td>\n<td>SecondDose</td>\n</tr>\n<tr>\n<td>8</td>\n<td>DoseAdditional1</td>\n</tr>\n<tr>\n<td>9</td>\n<td>UnknownDose</td>\n</tr>\n<tr>\n<td>10</td>\n<td>Region</td>\n</tr>\n<tr>\n<td>11</td>\n<td>TargetGroup</td>\n</tr>\n<tr>\n<td>12</td>\n<td>Vaccine</td>\n</tr>\n<tr>\n<td>13</td>\n<td>Population</td>\n</tr>\n</tbody>\n</table>\n<p>Let’s subset the data to make it a bit easier to work with, maybe we’ll just use the Dutch data (please feel free to choose another column though!)</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-33",
      "source": [
        "country = 'NL'\n",
        "\n",
        "subset = []\n",
        "for row in vax:\n",
        "    # Here we select for a country\n",
        "    if row[1] == country:\n",
        "        subset.append(row)\n",
        "print(subset[0:10])"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-34",
      "source": "<p>That should be easier to work with, now we only have one country’s data. Let’s do some exercises with this data:</p>\n<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div id=\"question-which-vaccines-were-given\" class=\"box-title\" aria-label=\"question box: Which vaccines were given?\"><i class=\"far fa-question-circle\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Question: Which vaccines were given?</div>\n<p>Which vaccines were given? Use the <code style=\"color: inherit\">subset</code> to examine which vaccines were given in the Netherlands. <em>Tip</em>: if <code style=\"color: inherit\">x</code> is a list, <code style=\"color: inherit\">set(x)</code> will return the unique values in that list.</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em; padding: 0.5em;\"><summary>👁 View solution</summary>\n<div id=\"solution\" class=\"box-title\"><button type=\"button\" aria-controls=\"solution-contents\" aria-expanded=\"true\" aria-label=\"Toggle solution box: \"><i class=\"far fa-eye\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Solution<span role=\"button\" class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>To figure out which vaccines were given, we can look at column 12:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">vaccines = []\nfor row in subset:\n    vaccines.append(row[12])\nprint(set(vaccines))\n</code></pre></div>    </div>\n<p>We can use the <code style=\"color: inherit\">set</code> function to convert the list to a set, and show only the unique values.</p>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-35",
      "source": [
        "# Try things out here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-36",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div id=\"question-how-many-of-each-vaccine-were-given\" class=\"box-title\" aria-label=\"question box: How many of each vaccine were given?\"><i class=\"far fa-question-circle\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Question: How many of each vaccine were given?</div>\n<p>How many of each were given?</p>\n<p><em>Tip</em>: use the accumulator pattern.\n<em>Tip</em>: Columns 5, 7, and 8 have doses being given out to patients.</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em; padding: 0.5em;\"><summary>👁 View solution</summary>\n<div id=\"solution-1\" class=\"box-title\"><button type=\"button\" aria-controls=\"solution-1-contents\" aria-expanded=\"true\" aria-label=\"Toggle solution box: \"><i class=\"far fa-eye\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Solution<span role=\"button\" class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">doses = {}\nfor row in subset:\n    brand = row[12]\n    if brand not in doses:\n        doses[brand] = 0\n    doses[brand] = doses[brand] + int(row[5]) + int(row[7]) + int(row[8])\nprint(doses)\n</code></pre></div>    </div>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-37",
      "source": [
        "# Try things out here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-38",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div id=\"question-how-many-of-each-vaccine-were-exported-received\" class=\"box-title\" aria-label=\"question box: How many of each vaccine were exported? received?\"><i class=\"far fa-question-circle\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Question: How many of each vaccine were exported? received?</div>\n<p>How many of each were exported? received?</p>\n<p><em>Tip</em>: you only need to loop once.\n<em>Tip</em>: you will need to handle an edge case here. Try and find it out!</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em; padding: 0.5em;\"><summary>👁 View solution</summary>\n<div id=\"solution-2\" class=\"box-title\"><button type=\"button\" aria-controls=\"solution-2-contents\" aria-expanded=\"true\" aria-label=\"Toggle solution box: \"><i class=\"far fa-eye\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Solution<span role=\"button\" class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">export = {}\nreceived = {}\nfor row in subset:\n    brand = row[12]\n    if brand not in export:\n        export[brand] = 0\n    if brand not in received:\n        received[brand] = 0\n    if row[4]:\n        export[brand] = export[brand] + int(row[4])\n    if row[3]:\n        received[brand] = received[brand] + int(row[3])\nprint(export)\nprint(received)\n</code></pre></div>    </div>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-39",
      "source": [
        "# Try things out here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-40",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div id=\"question-when-was-the-first-dose-received\" class=\"box-title\" aria-label=\"question box: When was the first dose received?\"><i class=\"far fa-question-circle\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Question: When was the first dose received?</div>\n<p><em>Tip</em>: use <code style=\"color: inherit\">break</code>, and check how many doses were given!</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em; padding: 0.5em;\"><summary>👁 View solution</summary>\n<div id=\"solution-3\" class=\"box-title\"><button type=\"button\" aria-controls=\"solution-3-contents\" aria-expanded=\"true\" aria-label=\"Toggle solution box: \"><i class=\"far fa-eye\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Solution<span role=\"button\" class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">for row in subset:\n    if row[5] and int(row[5]) &gt; 0:\n        print(f\"On {row[0]}, {row[5]} doses of {row[12]} were given\")\n        break\n</code></pre></div>    </div>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-41",
      "source": [
        "# Try things out here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-42",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div id=\"question-transform-the-data-for-plotting\" class=\"box-title\" aria-label=\"question box: Transform the data for plotting\"><i class=\"far fa-question-circle\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Question: Transform the data for plotting</div>\n<p>Let’s say you want to plot the fraction of the population that has been vaccinated by the various points in time.</p>\n<ul>\n<li>Subset the data further for TargetGroup (column 11) set to ‘ALL’</li>\n<li>Create an accumulator, to count how many doses have been given their FirstDose at each week</li>\n<li>Use column 13 (population) to calculate the fraction of the population that has been given one of those doses at each week</li>\n</ul>\n<p>The output should be a list of percentages ranging from [0.0 to 1.0].</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em; padding: 0.5em;\"><summary>👁 View solution</summary>\n<div id=\"solution-4\" class=\"box-title\"><button type=\"button\" aria-controls=\"solution-4-contents\" aria-expanded=\"true\" aria-label=\"Toggle solution box: \"><i class=\"far fa-eye\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Solution<span role=\"button\" class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">total_doses = 0\npercent_vaccinated_per_week = []\nfor row in subset:\n    if row[11] != 'ALL':\n        continue\n    total_doses = total_doses + int(row[5])\n    percent_vaccinated_per_week.append(total_doses / int(row[13]))\n</code></pre></div>    </div>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-43",
      "source": [
        "# Try things out here!\n",
        "percent_vaccinated_per_week = []\n",
        "# Write code here!\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "# When you're done, you should have a 'results' variable\n",
        "# You may need to `pip install matplotlib`\n",
        "%matplotlib inline\n",
        "import matplotlib.pyplot as plt\n",
        "plt.plot(percent_vaccinated_per_week)\n",
        "plt.xlabel('Week')\n",
        "plt.ylabel('Percent Vaccinated')"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-44",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div id=\"question-which-vaccines-were-given-1\" class=\"box-title\" aria-label=\"question box: Which vaccines were given?\"><i class=\"far fa-question-circle\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Question: Which vaccines were given?</div>\n<p>Write this out to a new file, with two columns. The index and the Percent Vaccinated. It should be a comma separated file, and should have a header. Save this as a csv file named <code style=\"color: inherit\">weekly-percent-vax.csv</code></p>\n<p><em>Tip</em>: use a <code style=\"color: inherit\">csvwriter</code>, it works exactly like a <code style=\"color: inherit\">csvreader</code>. You can use the <code style=\"color: inherit\">writerow()</code> function to write out a row.\n<em>Tip</em>: Use <code style=\"color: inherit\">enumerate()</code> to get a list of items with indexes.</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em; padding: 0.5em;\"><summary>👁 View solution</summary>\n<div id=\"solution-5\" class=\"box-title\"><button type=\"button\" aria-controls=\"solution-5-contents\" aria-expanded=\"true\" aria-label=\"Toggle solution box: \"><i class=\"far fa-eye\" aria-hidden=\"true\"></i><span class=\"visually-hidden\"></span> Solution<span role=\"button\" class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">with open('weekly-percent-vax.csv', 'w') as handle:\n    writer = csv.writer(csv_data, delimiter=\",\", quotechar='\"')\n    for row in enumerate(percent_vaccinated_per_week):\n        writer.writerow(row)\n</code></pre></div>    </div>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-45",
      "source": [
        "# Try things out here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-46",
      "source": "<p>Congratulations on getting this far! Hopefully you feel more comfortable working with files.</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "cell_type": "markdown",
      "id": "final-ending-cell",
      "metadata": {
        "editable": false,
        "collapsed": false
      },
      "source": [
        "# Key Points\n\n",
        "- File reading requires a mode: read, write, and append\n",
        "- Use the CSV module to parse CSV files.\n",
        "- do NOT attempt to do it yourself, it will encounter edge cases that the CSV module handles for you\n",
        "- Use a `with` block to open a file.\n",
        "\n# Congratulations on successfully completing this tutorial!\n\n",
        "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-files/tutorial.html#feedback) and check there for further resources!\n"
      ]
    }
  ]
}