{ "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "r" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.1.0" } }, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
\n\n# SQL with R\n\nby [The Carpentries](https://training.galaxyproject.org/hall-of-fame/carpentries/), [Helena Rasche](https://training.galaxyproject.org/hall-of-fame/hexylena/), [Avans Hogeschool](https://training.galaxyproject.org/hall-of-fame/avans-atgm/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Objectives**\n\n- How can I access databases from programs written in R?\n\n**Objectives**\n\n- Write short programs that execute SQL queries.\n- Trace the execution of a program that contains an SQL query.\n- Explain why most database applications are written in a general-purpose language rather than in SQL.\n\n**Time Estimation: 45M**\n
\n", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-0", "source": "
\n
Comment
\n

This tutorial is significantly based on the Carpentries Databases and SQL lesson, which is licensed CC-BY 4.0.

\n

Abigail Cabunoc and Sheldon McKay (eds): “Software Carpentry: Using Databases and SQL.” Version 2017.08, August 2017,\ngithub.com/swcarpentry/sql-novice-survey, https://doi.org/10.5281/zenodo.838776

\n

Adaptations have been made to make this work better in a GTN/Galaxy environment.

\n
\n
\n
Agenda
\n

In this tutorial, we will cover:

\n
\n

For this tutorial we need to download a database that we will use for the queries.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "download.file(\"http://swcarpentry.github.io/sql-novice-survey/files/survey.db\", destfile=\"survey.db\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "r" ], "id": "" } } }, { "id": "cell-2", "source": "

Programming with Databases - R

\n

Let’s have a look at how to access a database from\na data analysis language like R.\nOther languages use almost exactly the same model:\nlibrary and function names may differ,\nbut the concepts are the same.

\n

Here’s a short R program that selects latitudes and longitudes\nfrom an SQLite database stored in a file called survey.db:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-3", "source": [ "library(RSQLite)\n", "connection <- dbConnect(SQLite(), \"survey.db\")\n", "results <- dbGetQuery(connection, \"SELECT Site.lat, Site.long FROM Site;\")\n", "print(results)\n", "dbDisconnect(connection)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "r" ], "id": "" } } }, { "id": "cell-4", "source": "

The program starts by importing the RSQLite library.\nIf we were connecting to MySQL, DB2, or some other database,\nwe would import a different library,\nbut all of them provide the same functions,\nso that the rest of our program does not have to change\n(at least, not much)\nif we switch from one database to another.

\n

Line 2 establishes a connection to the database.\nSince we’re using SQLite,\nall we need to specify is the name of the database file.\nOther systems may require us to provide a username and password as well.

\n

On line 3, we retrieve the results from an SQL query.\nIt’s our job to make sure that SQL is properly formatted;\nif it isn’t,\nor if something goes wrong when it is being executed,\nthe database will report an error.\nThis result is a dataframe with one row for each entry and one column for each column in the database.

\n

Finally, the last line closes our connection,\nsince the database can only keep a limited number of these open at one time.\nSince establishing a connection takes time,\nthough,\nwe shouldn’t open a connection,\ndo one operation,\nthen close the connection,\nonly to reopen it a few microseconds later to do another operation.\nInstead,\nit’s normal to create one connection that stays open for the lifetime of the program.

\n

Queries in real applications will often depend on values provided by users.\nFor example,\nthis function takes a user’s ID as a parameter and returns their name:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "library(RSQLite)\n", "\n", "connection <- dbConnect(SQLite(), \"survey.db\")\n", "\n", "getName <- function(personID) {\n", " query <- paste0(\"SELECT personal || ' ' || family FROM Person WHERE id =='\",\n", " personID, \"';\")\n", " return(dbGetQuery(connection, query))\n", "}\n", "\n", "print(paste(\"full name for dyer:\", getName('dyer')))\n", "\n", "dbDisconnect(connection)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "r" ], "id": "" } } }, { "id": "cell-6", "source": "

We use string concatenation on the first line of this function\nto construct a query containing the user ID we have been given.\nThis seems simple enough,\nbut what happens if someone gives us this string as input?

\n
dyer'; DROP TABLE Survey; SELECT '\n
\n

It looks like there’s garbage after the user’s ID,\nbut it is very carefully chosen garbage.\nIf we insert this string into our query,\nthe result is:

\n
SELECT personal || ' ' || family FROM Person WHERE id='dyer'; DROP TABLE Survey; SELECT '';\n
\n

If we execute this,\nit will erase one of the tables in our database.

\n

This is called an SQL injection attack,\nand it has been used to attack thousands of programs over the years.\nIn particular,\nmany web sites that take data from users insert values directly into queries\nwithout checking them carefully first.\nA very relevant XKCD that explains the\ndangers of using raw input in queries a little more succinctly:

\n

\"A

\n

Since an unscrupulous parent might try to smuggle commands into our queries in many different ways,\nthe safest way to deal with this threat is\nto replace characters like quotes with their escaped equivalents,\nso that we can safely put whatever the user gives us inside a string.\nWe can do this by using a prepared statement\ninstead of formatting our statements as strings.\nHere’s what our example program looks like if we do this:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "library(RSQLite)\n", "connection <- dbConnect(SQLite(), \"survey.db\")\n", "\n", "getName <- function(personID) {\n", " query <- \"SELECT personal || ' ' || family FROM Person WHERE id == ?\"\n", " return(dbGetPreparedQuery(connection, query, data.frame(personID)))\n", "}\n", "\n", "print(paste(\"full name for dyer:\", getName('dyer')))\n", "\n", "dbDisconnect(connection)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "r" ], "id": "" } } }, { "id": "cell-8", "source": "

The key changes are in the query string and the dbGetQuery call (we use dbGetPreparedQuery instead).\nInstead of formatting the query ourselves,\nwe put question marks in the query template where we want to insert values.\nWhen we call dbGetPreparedQuery,\nwe provide a dataframe\nthat contains as many values as there are question marks in the query.\nThe library matches values to question marks in order,\nand translates any special characters in the values\ninto their escaped equivalents\nso that they are safe to use.

\n
\n
Question: Filling a Table vs. Printing Values
\n

Write an R program that creates a new database in a file called\noriginal.db containing a single table called Pressure, with a\nsingle field called reading, and inserts 100,000 random numbers\nbetween 10.0 and 25.0. How long does it take this program to run?\nHow long does it take to run a program that simply writes those\nrandom numbers to a file?

\n
\n
\n
Question: Filtering in SQL vs. Filtering in R
\n

Write an R program that creates a new database called\nbackup.db with the same structure as original.db and copies all\nthe values greater than 20.0 from original.db to backup.db.\nWhich is faster: filtering values in the query, or reading\neverything into memory and filtering in R?

\n
\n

Database helper functions in R

\n

R’s database interface packages (like RSQLite) all share\na common set of helper functions useful for exploring databases and\nreading/writing entire tables at once.

\n

To view all tables in a database, we can use dbListTables():

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "connection <- dbConnect(SQLite(), \"survey.db\")\n", "dbListTables(connection)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "r" ], "id": "" } } }, { "id": "cell-10", "source": "

To view all column names of a table, use dbListFields():

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-11", "source": [ "dbListFields(connection, \"Survey\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "r" ], "id": "" } } }, { "id": "cell-12", "source": "

To read an entire table as a dataframe, use dbReadTable():

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "dbReadTable(connection, \"Person\")" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "r" ], "id": "" } } }, { "id": "cell-14", "source": "

Finally to write an entire table to a database, you can use dbWriteTable().\nNote that we will always want to use the row.names = FALSE argument or R\nwill write the row names as a separate column.\nIn this example we will write R’s built-in iris dataset as a table in survey.db.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "dbWriteTable(connection, \"iris\", iris, row.names = FALSE)\n", "head(dbReadTable(connection, \"iris\"))" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "r" ], "id": "" } } }, { "id": "cell-16", "source": "

And as always, remember to close the database connection when done!

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "dbDisconnect(connection)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "r" ], "id": "" } } }, { "id": "cell-18", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Data analysis languages have libraries for accessing databases.\n", "- To connect to a database, a program must use a library specific to that database manager.\n", "- R's libraries can be used to directly query or read from a database.\n", "- Programs can read query results in batches or all at once.\n", "- Queries should be written using parameter substitution, not string formatting.\n", "- R has multiple helper functions to make working with databases easier.\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/sql-r/tutorial.html#feedback) and check there for further resources!\n" ] } ] }