{ "metadata": { }, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
\n\nComment\nThis tutorial is significantly based on the Carpentries Databases and SQL lesson, which is licensed CC-BY 4.0.
\nAbigail Cabunoc and Sheldon McKay (eds): “Software Carpentry: Using Databases and SQL.” Version 2017.08, August 2017,\ngithub.com/swcarpentry/sql-novice-survey, https://doi.org/10.5281/zenodo.838776
\nAdaptations have been made to make this work better in a GTN/Galaxy environment.
\n
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "# This preamble sets up the sql \"magic\" for jupyter. Use %%sql in your cells to write sql!\n", "!python3 -m pip install ipython-sql sqlalchemy\n", "!wget -c http://swcarpentry.github.io/sql-novice-survey/files/survey.db\n", "import sqlalchemy\n", "engine = sqlalchemy.create_engine(\"sqlite:///survey.db\")\n", "%load_ext sql\n", "%sql sqlite:///survey.db\n", "%config SqlMagic.displaycon=False" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-2", "source": "Agenda\nIn this tutorial, we will cover:
\n
We now want to calculate ranges and averages for our data.\nWe know how to select all of the dates from the Visited
table:
but to combine them,\nwe must use an aggregation function\nsuch as min
or max
.\nEach of these functions takes a set of records as input,\nand produces a single record as output:
min
and max
are just two of\nthe aggregation functions built into SQL.\nThree others are avg
,\ncount
,\nand sum
:
We used count(reading)
here,\nbut we could just as easily have counted quant
\nor any other field in the table,\nor even used count(*)
,\nsince the function doesn’t care about the values themselves,\njust how many values there are.
SQL lets us do several aggregations at once.\nWe can,\nfor example,\nfind the range of sensible salinity measurements:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "%%sql\n", "SELECT min(reading), max(reading) FROM Survey WHERE quant = 'sal' AND reading <= 1.0;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-16", "source": "We can also combine aggregated results with raw results,\nalthough the output might surprise you:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "%%sql\n", "SELECT person, count(*) FROM Survey WHERE quant = 'sal' AND reading <= 1.0;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-18", "source": "Why does Lake’s name appear rather than Roerich’s or Dyer’s?\nThe answer is that when it has to aggregate a field,\nbut isn’t told how to,\nthe database manager chooses an actual value from the input set.\nIt might use the first one processed,\nthe last one,\nor something else entirely.
\nAnother important fact is that when there are no values to aggregate —\nfor example, where there are no rows satisfying the WHERE
clause —\naggregation’s result is “don’t know”\nrather than zero or some other arbitrary value:
One final important feature of aggregation functions is that\nthey are inconsistent with the rest of SQL in a very useful way.\nIf we add two values,\nand one of them is null,\nthe result is null.\nBy extension,\nif we use sum
to add all the values in a set,\nand any of those values are null,\nthe result should also be null.\nIt’s much more useful,\nthough,\nfor aggregation functions to ignore null values\nand only combine those that are non-null.\nThis behavior lets us write our queries as:
instead of always having to filter explicitly:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-23", "source": [ "%%sql\n", "SELECT min(dated) FROM Visited WHERE dated IS NOT NULL;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-24", "source": "Aggregating all records at once doesn’t always make sense.\nFor example,\nsuppose we suspect that there is a systematic bias in our data,\nand that some scientists’ radiation readings are higher than others.\nWe know that this doesn’t work:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "%%sql\n", "SELECT person, count(reading), round(avg(reading), 2)\n", "FROM Survey\n", "WHERE quant = 'rad';" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-26", "source": "because the database manager selects a single arbitrary scientist’s name\nrather than aggregating separately for each scientist.\nSince there are only five scientists,\nwe could write five queries of the form:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-27", "source": [ "%%sql\n", "SELECT person, count(reading), round(avg(reading), 2)\n", "FROM Survey\n", "WHERE quant = 'rad'\n", "AND person = 'dyer';" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-28", "source": "but this would be tedious,\nand if we ever had a data set with fifty or five hundred scientists,\nthe chances of us getting all of those queries right is small.
\nWhat we need to do is\ntell the database manager to aggregate the hours for each scientist separately\nusing a GROUP BY
clause:
GROUP BY
does exactly what its name implies:\ngroups all the records with the same value for the specified field together\nso that aggregation can process each batch separately.\nSince all the records in each batch have the same value for person
,\nit no longer matters that the database manager\nis picking an arbitrary one to display\nalongside the aggregated reading
values.
\n\n\n\n
GROUP BY
is basically just a pivot table for Excel users, it lets you build\nnice summary tables which aggregate your results.And if you didn’t already know the Excel equivalent, now you know what to\nlook for when you need it!
\n
Just as we can sort by multiple criteria at once,\nwe can also group by multiple criteria.\nTo get the average reading by scientist and quantity measured,\nfor example,\nwe just add another field to the GROUP BY
clause:
Note that we have added quant
to the list of fields displayed,\nsince the results wouldn’t make much sense otherwise.
Let’s go one step further and remove all the entries\nwhere we don’t know who took the measurement:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-33", "source": [ "%%sql\n", "SELECT person, quant, count(reading), round(avg(reading), 2)\n", "FROM Survey\n", "WHERE person IS NOT NULL\n", "GROUP BY person, quant\n", "ORDER BY person, quant;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-34", "source": "Looking more closely,\nthis query:
\nSurvey
table where the person
field was not null;person
and quant
values in each subset were the same;person
, and then within each sub-group by quant
; andreading
in each, and chose a person
and quant
value from each (it doesn’t matter which ones, since they’re all equal).\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-35", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-36", "source": "Question: Counting Temperature Readings\nHow many temperature readings did Frank Pabodie record,\nand what was their average value?
\n\n👁 View solution
\n\n\nSELECT count(reading), avg(reading) FROM Survey WHERE quant = 'temp' AND person = 'pb';\n
\n\n
\n\n \n\n\ncount(reading) \navg(reading) \n\n \n\n2 \n-20.0 \n
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-37", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-38", "source": "Averaging with NULL
\nThe average of a set of values is the sum of the values\ndivided by the number of values.\nDoes this mean that the
\navg
function returns 2.0 or 3.0\nwhen given the values 1.0,null
, and 5.0?\n👁 View solution
\n\nThe answer is 3.0.\n
\nNULL
is not a value; it is the absence of a value.\nAs such it is not included in the calculation.You can confirm this, by executing this code:
\n\nSELECT AVG(a) FROM (\n SELECT 1 AS a\n UNION ALL SELECT NULL\n UNION ALL SELECT 5);\n
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-39", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-40", "source": "Question: What Does This Query Do?\nWe want to calculate the difference between\neach individual radiation reading\nand the average of all the radiation readings.\nWe write the query:
\n\nSELECT reading - avg(reading) FROM Survey WHERE quant = 'rad';\n
What does this actually produce, and can you think of why?
\n\n👁 View solution
\n\nThe query produces only one row of results when we what we really want is a result for each of the readings.\nThe
\navg()
function produces only a single value, and because it is run first, the table is reduced to a single row.\nThereading
value is simply an arbitrary one.To achieve what we wanted, we would have to run two queries:
\n\nSELECT avg(reading) FROM Survey WHERE quant='rad';\n
This produces the average value (6.5625), which we can then insert into a second query:
\n\nSELECT reading - 6.5625 FROM Survey WHERE quant = 'rad';\n
This produces what we want, but we can combine this into a single query using subqueries.
\n\nSELECT reading - (SELECT avg(reading) FROM Survey WHERE quant='rad') FROM Survey WHERE quant = 'rad';\n
This way we don’t have execute two queries.
\nIn summary what we have done is to replace
\navg(reading)
with(SELECT avg(reading) FROM Survey WHERE quant='rad')
in the original query.
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-41", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-42", "source": "Question: Ordering When Concatenating\nThe function
\ngroup_concat(field, separator)
\nconcatenates all the values in a field\nusing the specified separator character\n(or ‘,’ if the separator isn’t specified).\nUse this to produce a one-line list of scientists’ names,\nsuch as:\nWilliam Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth\n
Can you find a way to order the list by surname?
\n
In order to submit our data to a web site\nthat aggregates historical meteorological data,\nwe might need to format it as\nlatitude, longitude, date, quantity, and reading.\nHowever,\nour latitudes and longitudes are in the Site
table,\nwhile the dates of measurements are in the Visited
table\nand the readings themselves are in the Survey
table.\nWe need to combine these tables somehow.
This figure shows the relations between the tables:
\nThe SQL command to do this is JOIN
.\nTo see how it works,\nlet’s start by joining the Site
and Visited
tables:
JOIN
creates\nthe cross product\nof two tables,\ni.e.,\nit joins each record of one table with each record of the other table\nto give all possible combinations.\nSince there are three records in Site
\nand eight in Visited
,\nthe join’s output has 24 records (3 * 8 = 24) .\nAnd since each table has three fields,\nthe output has six fields (3 + 3 = 6).
What the join hasn’t done is\nfigure out if the records being joined have anything to do with each other.\nIt has no way of knowing whether they do or not until we tell it how.\nTo do that,\nwe add a clause specifying that\nwe’re only interested in combinations that have the same site name,\nthus we need to use a filter:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-45", "source": [ "%%sql\n", "SELECT * FROM Site JOIN Visited ON Site.name = Visited.site;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-46", "source": "ON
is very similar to WHERE
,\nand for all the queries in this lesson you can use them interchangeably.\nThere are differences in how they affect [outer joins][outer],\nbut that’s beyond the scope of this lesson.\nOnce we add this to our query,\nthe database manager throws away records\nthat combined information about two different sites,\nleaving us with just the ones we want.
Notice that we used Table.field
to specify field names\nin the output of the join.\nWe do this because tables can have fields with the same name,\nand we need to be specific which ones we’re talking about.\nFor example,\nif we joined the Person
and Visited
tables,\nthe result would inherit a field called id
\nfrom each of the original tables.
We can now use the same dotted notation\nto select the three columns we actually want\nout of our join:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-47", "source": [ "%%sql\n", "SELECT Site.lat, Site.long, Visited.dated\n", "FROM Site JOIN Visited\n", "ON Site.name = Visited.site;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-48", "source": "If joining two tables is good,\njoining many tables must be better.\nIn fact,\nwe can join any number of tables\nsimply by adding more JOIN
clauses to our query,\nand more ON
tests to filter out combinations of records\nthat don’t make sense:
We can tell which records from Site
, Visited
, and Survey
\ncorrespond with each other\nbecause those tables contain\nprimary keys\nand foreign keys.\nA primary key is a value,\nor combination of values,\nthat uniquely identifies each record in a table.\nA foreign key is a value (or combination of values) from one table\nthat identifies a unique record in another table.\nAnother way of saying this is that\na foreign key is the primary key of one table\nthat appears in some other table.\nIn our database,\nPerson.id
is the primary key in the Person
table,\nwhile Survey.person
is a foreign key\nrelating the Survey
table’s entries\nto entries in Person
.
Most database designers believe that\nevery table should have a well-defined primary key.\nThey also believe that this key should be separate from the data itself,\nso that if we ever need to change the data,\nwe only need to make one change in one place.\nOne easy way to do this is\nto create an arbitrary, unique ID for each record\nas we add it to the database.\nThis is actually very common:\nthose IDs have names like “student numbers” and “patient numbers”,\nand they almost always turn out to have originally been\na unique record identifier in some database system or other.\nAs the query below demonstrates,\nSQLite [automatically numbers records][rowid] as they’re added to tables,\nand we can use those record numbers in queries:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-51", "source": [ "%%sql\n", "SELECT rowid, * FROM Person;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-52", "source": "\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-53", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-54", "source": "Question: Listing Radiation Readings\nWrite a query that lists all radiation readings from the DR-1 site.
\n\n👁 View solution
\n\n\nSELECT Survey.reading\nFROM Site JOIN Visited JOIN Survey\nON Site.name = Visited.site\nAND Visited.id = Survey.taken\nWHERE Site.name = 'DR-1'\nAND Survey.quant = 'rad';\n
\n\n
\n\n \n\n\nreading \n\n \n9.82 \n\n \n7.8 \n\n \n\n11.25 \n
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-55", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-56", "source": "Question: Where's Frank?\nWrite a query that lists all sites visited by people named “Frank”.
\n\n👁 View solution
\n\n\nSELECT DISTINCT Site.name\nFROM Site JOIN Visited JOIN Survey JOIN Person\nON Site.name = Visited.site\nAND Visited.id = Survey.taken\nAND Survey.person = Person.id\nWHERE Person.personal = 'Frank';\n
\n\n
\n\n \n\n\nname \n\n \n\nDR-3 \n
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-57", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-58", "source": "Question: Reading Queries\nDescribe in your own words what the following query produces:
\n\nSELECT Site.name FROM Site JOIN Visited\nON Site.lat < -49.0 AND Site.name = Visited.site AND Visited.dated >= '1932-01-01';\n
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-59", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-60", "source": "Question: Who Has Been Where?\nWrite a query that shows each site with exact location (lat, long) ordered by visited date,\nfollowed by personal name and family name of the person who visited the site\nand the type of measurement taken and its reading. Please avoid all null values.\nTip: you should get 15 records with 8 fields.
\n\n👁 View solution
\n\n\nSELECT Site.name, Site.lat, Site.long, Person.personal, Person.family, Survey.quant, Survey.reading, Visited.dated\nFROM Site JOIN Visited JOIN Survey JOIN Person\nON Site.name = Visited.site\nAND Visited.id = Survey.taken\nAND Survey.person = Person.id\nWHERE Survey.person IS NOT NULL\nAND Visited.dated IS NOT NULL\nORDER BY Visited.dated;\n
\n\n
\n\n \n\n\nname \nlat \nlong \npersonal \nfamily \nquant \nreading \ndated \n\n \nDR-1 \n-49.85 \n-128.57 \nWilliam \nDyer \nrad \n9.82 \n1927-02-08 \n\n \nDR-1 \n-49.85 \n-128.57 \nWilliam \nDyer \nsal \n0.13 \n1927-02-08 \n\n \nDR-1 \n-49.85 \n-128.57 \nWilliam \nDyer \nrad \n7.8 \n1927-02-10 \n\n \nDR-1 \n-49.85 \n-128.57 \nWilliam \nDyer \nsal \n0.09 \n1927-02-10 \n\n \nDR-3 \n-47.15 \n-126.72 \nAnderson \nLake \nsal \n0.05 \n1930-01-07 \n\n \nDR-3 \n-47.15 \n-126.72 \nFrank \nPabodie \nrad \n8.41 \n1930-01-07 \n\n \nDR-3 \n-47.15 \n-126.72 \nFrank \nPabodie \ntemp \n-21.5 \n1930-01-07 \n\n \nDR-3 \n-47.15 \n-126.72 \nFrank \nPabodie \nrad \n7.22 \n1930-01-12 \n\n \nDR-3 \n-47.15 \n-126.72 \nAnderson \nLake \nsal \n0.1 \n1930-02-26 \n\n \nDR-3 \n-47.15 \n-126.72 \nFrank \nPabodie \nrad \n4.35 \n1930-02-26 \n\n \nDR-3 \n-47.15 \n-126.72 \nFrank \nPabodie \ntemp \n-18.5 \n1930-02-26 \n\n \nMSK-4 \n-48.87 \n-123.4 \nAnderson \nLake \nrad \n1.46 \n1932-01-14 \n\n \nMSK-4 \n-48.87 \n-123.4 \nAnderson \nLake \nsal \n0.21 \n1932-01-14 \n\n \nMSK-4 \n-48.87 \n-123.4 \nValentina \nRoerich \nsal \n22.5 \n1932-01-14 \n\n \n\nDR-1 \n-49.85 \n-128.57 \nValentina \nRoerich \nrad \n11.25 \n1932-03-22 \n
A good visual explanation of joins can be found in the SQL Join Visualizer
\nNow that we have seen how joins work, we can see why the relational\nmodel is so useful and how best to use it. The first rule is that\nevery value should be atomic, i.e., not\ncontain parts that we might want to work with separately. We store\npersonal and family names in separate columns instead of putting the\nentire name in one column so that we don’t have to use substring\noperations to get the name’s components. More importantly, we store\nthe two parts of the name separately because splitting on spaces is\nunreliable: just think of a name like “Eloise St. Cyr” or “Jan Mikkel\nSteubart”.
\nThe second rule is that every record should have a unique primary key.\nThis can be a serial number that has no intrinsic meaning,\none of the values in the record (like the id
field in the Person
table),\nor even a combination of values:\nthe triple (taken, person, quant)
from the Survey
table uniquely identifies every measurement.
The third rule is that there should be no redundant information.\nFor example,\nwe could get rid of the Site
table and rewrite the Visited
table like this:
id | \nlat | \nlong | \ndated | \n
---|---|---|---|
619 | \n-49.85 | \n-128.57 | \n1927-02-08 | \n
622 | \n-49.85 | \n-128.57 | \n1927-02-10 | \n
734 | \n-47.15 | \n-126.72 | \n1930-01-07 | \n
735 | \n-47.15 | \n-126.72 | \n1930-01-12 | \n
751 | \n-47.15 | \n-126.72 | \n1930-02-26 | \n
752 | \n-47.15 | \n-126.72 | \nNone | \n
837 | \n-48.87 | \n-123.40 | \n1932-01-14 | \n
844 | \n-49.85 | \n-128.57 | \n1932-03-22 | \n
In fact,\nwe could use a single table that recorded all the information about each reading in each row,\njust as a spreadsheet would.\nThe problem is that it’s very hard to keep data organized this way consistent:\nif we realize that the date of a particular visit to a particular site is wrong,\nwe have to change multiple records in the database.\nWhat’s worse,\nwe may have to guess which records to change,\nsince other sites may also have been visited on that date.
\nThe fourth rule is that the units for every value should be stored explicitly.\nOur database doesn’t do this,\nand that’s a problem:\nRoerich’s salinity measurements are several orders of magnitude larger than anyone else’s,\nbut we don’t know if that means she was using parts per million instead of parts per thousand,\nor whether there actually was a saline anomaly at that site in 1932.
\nStepping back,\ndata and the tools used to store it have a symbiotic relationship:\nwe use tables and joins because it’s efficient,\nprovided our data is organized a certain way,\nbut organize our data that way because we have tools to manipulate it efficiently.\nAs anthropologists say,\nthe tool shapes the hand that shapes the tool.
\n\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-61", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-62", "source": "Question: Identifying Atomic Values\nWhich of the following are atomic values? Which are not? Why?
\n\n
\n- New Zealand
\n- 87 Turing Avenue
\n- January 25, 1971
\n- the XY coordinate (0.5, 3.3)
\n\n👁 View solution
\n\nNew Zealand is the only clear-cut atomic value.
\nThe address and the XY coordinate contain more than one piece of information\nwhich should be stored separately:
\n\n
\n- House number, street name
\n- X coordinate, Y coordinate
\nThe date entry is less clear cut, because it contains month, day, and year elements.\nHowever, there is a
\nDATE
datatype in SQL, and dates should be stored using this format.\nIf we need to work with the month, day, or year separately, we can use the SQL functions available for our database software\n(for exampleEXTRACT
orSTRFTIME
for SQLite).
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-63", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-64", "source": "Question: Identifying a Primary Key\nWhat is the primary key in this table?\nI.e., what value or combination of values uniquely identifies a record?
\n\n\n
\n\n \n\n\nlatitude \nlongitude \ndate \ntemperature \n\n \n\n57.3 \n-22.5 \n2015-01-09 \n-14.2 \n\n👁 View solution
\n\nLatitude, longitude, and date are all required to uniquely identify the temperature record.
\n
So far we have only looked at how to get information out of a database,\nboth because that is more frequent than adding information,\nand because most other operations only make sense\nonce queries are understood.\nIf we want to create and modify data,\nwe need to know two other sets of commands.
\nThe first pair are [CREATE TABLE
][create-table] and [DROP TABLE
][drop-table].\nWhile they are written as two words,\nthey are actually single commands.\nThe first one creates a new table;\nits arguments are the names and types of the table’s columns.\nFor example,\nthe following statements create the four tables in our survey database:
We can get rid of one of our tables using:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-67", "source": [ "%%sql\n", "DROP TABLE Survey;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-68", "source": "Be very careful when doing this:\nif you drop the wrong table, hope that the person maintaining the database has a backup,\nbut it’s better not to have to rely on it.
\nDifferent database systems support different data types for table columns,\nbut most provide the following:
\ndata type | \nuse | \n
---|---|
INTEGER | \na signed integer | \n
REAL | \na floating point number | \n
TEXT | \na character string | \n
BLOB | \na “binary large object”, such as an image | \n
Most databases also support Booleans and date/time values;\nSQLite uses the integers 0 and 1 for the former,\nand represents the latter as text or numeric fields.
\nAn increasing number of databases also support geographic data types,\nsuch as latitude and longitude.\nKeeping track of what particular systems do or do not offer,\nand what names they give different data types,\nis an unending portability headache.
\n\n\n\nSQLite is fantastic for small databases or embedded into applications where\nyou want to be able to use SQL to query and process data.
\nHowever for any real analysis PostgreSQL is usually the best choice, it\nscales incredibly well and can meet a wide range of use cases. It has good\ndata type support.
\n
\n\n\nUse Postgres. The PostGIS library is fantastic and industry standard for storing geographic data in a database.
\n
When we create a table,\nwe can specify several kinds of constraints on its columns.\nFor example,\na better definition for the Survey
table would be:
Once again,\nexactly what constraints are available\nand what they’re called\ndepends on which database manager we are using.
\nOnce tables have been created,\nwe can add, change, and remove records using our other set of commands,\nINSERT
, UPDATE
, and DELETE
.
Here is an example of inserting rows into the Site
table:
We can also insert values into one table directly from another:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-73", "source": [ "%%sql\n", "CREATE TABLE JustLatLong(lat real, long real);\n", "INSERT INTO JustLatLong SELECT lat, long FROM Site;" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-74", "source": "Modifying existing records is done using the UPDATE
statement.\nTo do this we tell the database which table we want to update,\nwhat we want to change the values to for any or all of the fields,\nand under what conditions we should update the values.
For example, if we made a mistake when entering the lat and long values\nof the last INSERT
statement above, we can correct it with an update:
Be careful to not forget the WHERE
clause or the update statement will\nmodify all of the records in the database.
Deleting records can be a bit trickier,\nbecause we have to ensure that the database remains internally consistent.\nIf all we care about is a single table,\nwe can use the DELETE
command with a WHERE
clause\nthat matches the records we want to discard.\nFor example,\nonce we realize that Frank Danforth didn’t take any measurements,\nwe can remove him from the Person
table like this:
But what if we removed Anderson Lake instead?\nOur Survey
table would still contain seven records\nof measurements he’d taken,\nbut that’s never supposed to happen:\nSurvey.person
is a foreign key into the Person
table,\nand all our queries assume there will be a row in the latter\nmatching every value in the former.
This problem is called referential integrity:\nwe need to ensure that all references between tables can always be resolved correctly.\nOne way to do this is to delete all the records\nthat use 'lake'
as a foreign key\nbefore deleting the record that uses it as a primary key.\nIf our database manager supports it,\nwe can automate this\nusing cascading delete.\nHowever,\nthis technique is outside the scope of this chapter.
\n\n\nMany applications use a hybrid storage model\ninstead of putting everything into a database:\nthe actual data (such as astronomical images) is stored in files,\nwhile the database stores the files’ names,\ntheir modification dates,\nthe region of the sky they cover,\ntheir spectral characteristics,\nand so on.\nThis is also how most music player software is built:\nthe database inside the application keeps track of the MP3 files,\nbut the files themselves live on disk.
\n
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-79", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-80", "source": "Question: Replacing NULL\nWrite an SQL statement to replace all uses of
\nnull
in\nSurvey.person
with the string'unknown'
.\n👁 View solution
\n\n\nUPDATE Survey SET person = 'unknown' WHERE person IS NULL;\n
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-81", "source": [ "%%sql\n", "-- Try solutions here!" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "sql" ], "id": "" } } }, { "id": "cell-82", "source": "", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Use aggregation functions to combine multiple values.\n", "- Aggregation functions ignore `null` values.\n", "- Aggregation happens after filtering.\n", "- Use GROUP BY to combine subsets separately.\n", "- If no aggregation function is specified for a field, the query may return an arbitrary value for that field.\n", "- Use JOIN to combine data from two tables.\n", "- Use table.field notation to refer to fields when doing joins.\n", "- Every fact should be represented in a database exactly once.\n", "- A join produces all combinations of records from one table with records from another.\n", "- A primary key is a field (or set of fields) whose values uniquely identify the records in a table.\n", "- A foreign key is a field (or set of fields) in one table whose values are a primary key in another table.\n", "- We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables.\n", "- The most common join condition is matching keys.\n", "- Every value in a database should be atomic.\n", "- Every record should have a unique primary key.\n", "- A database should not contain redundant information.\n", "- Units and similar metadata should be stored with the data.\n", "- Use CREATE and DROP to create and delete tables.\n", "- Use INSERT to add data.\n", "- Use UPDATE to modify existing data.\n", "- Use DELETE to remove data.\n", "- It is simpler and safer to modify data when every record has a unique primary key.\n", "- Do not create dangling references by deleting records that other records refer to.\n", "- General-purpose languages have libraries for accessing databases.\n", "- To connect to a database, a program must use a library specific to that database manager.\n", "- These libraries use a connection-and-cursor model.\n", "- Programs can read query results in batches or all at once.\n", "- Queries should be written using parameter substitution, not string formatting.\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/sql-advanced/tutorial.html#feedback) and check there for further resources!\n" ] } ] }Question: Backing Up with SQL\nSQLite has several administrative commands that aren’t part of the\nSQL standard. One of them is
\n.dump
, which prints the SQL commands\nneeded to re-create the database. Another is.read
, which reads a\nfile created by.dump
and restores the database. A colleague of\nyours thinks that storing dump files (which are text) in version\ncontrol is a good way to track and manage changes to the database.\nWhat are the pros and cons of this approach? (Hint: records aren’t\nstored in any particular order.)\n👁 View solution
\n\nAdvantages
\n\n
\n- A version control system will be able to show differences between versions\nof the dump file; something it can’t do for binary files like databases
\n- A VCS only saves changes between versions, rather than a complete copy of\neach version (save disk space)
\n- The version control log will explain the reason for the changes in each version\nof the database
\nDisadvantages
\n\n
\n- Artificial differences between commits because records don’t have a fixed order
\n