How to Use SQLite from C/C++ in Xcode

This is a brief tutorial on getting SQLite up and running in an Xcode C++ project. If you are going to use it in a real project, you will obviously want to wrap it up in a class and make it friendlier to use. This is just a small example showing the basics of how SQLite works from C++ (really C).

First, you’ll want to install the library. This is easiest to do using a package manager. I use MacPorts, but Homebrew will work just as well (with corresponding changes to the library and header paths in your Xcode project). To install, run:

sudo port install sqlite3

Let’s create a basic test database:

$ sqlite3 test.db
SQLite version 3.10.2 2016-01-20 15:27:19
Enter ".help" for usage hints.
sqlite> .mode column
sqlite> .headers on
sqlite> create table test (id integer primary key, col1 float, col2 float);
sqlite> insert into test values (1, 1.0, 2.0);
sqlite> insert into test values (2, 2.0, 3.0);
sqlite> select * from test;
id          col1        col2
----------  ----------  ----------
1           1.0         2.0
2           2.0         3.0

Next, create a command-line C++ project in Xcode. Then, add the SQLite3 library by clicking on your project, then Build Phases, then expanding Link Binary With Libraries. Click the + (plus) button and add the file /opt/local/lib/libsqlite3.a.


Next, under Build Settings, edit Header Search Paths to include /opt/local/include.


Now we can actually get to the code. All it does is open the database, prepare a basic SELECT statement, loop through the results (printing the column names and the data for each column), and then finalize the statement and close the connection. It assumes the first column is an integer and the rest are doubles:

#include <iostream>
#include <sqlite3.h>

int main(int argc, const char * argv[]) {

    sqlite3 *db;
    int rc;
    
    rc = sqlite3_open("/path/to/db/test.db", &db);
    if (rc != SQLITE_OK) {
        std::cout << "Can't open database: " << sqlite3_errmsg(db) << std::endl;
        sqlite3_close(db);  // per the docs, close the handle even on a failed open
        return 1;
    }
    
    sqlite3_stmt *stmt;
    
    rc = sqlite3_prepare_v2(db, "SELECT * FROM test", -1, &stmt, NULL);
    if (rc != SQLITE_OK) {
        std::cout << "SQL error: " << sqlite3_errstr(rc) << std::endl;
        sqlite3_close(db);
        return 1;
    }
    
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        int cnt = sqlite3_column_count(stmt);
        
        for (int i=0; i < cnt; i++) {
            std::cout << sqlite3_column_name(stmt, i) << std::endl;
        }
        std::cout << std::endl;
        
        for (int i=0; i < cnt; i++) {
            switch (i) {
                case 0:
                    std::cout << sqlite3_column_int(stmt, i) << std::endl;
                    break;
                    
                default:
                    std::cout << sqlite3_column_double(stmt, i) << std::endl;
                    break;
            }

        }
        
    }
    
    sqlite3_finalize(stmt);
    
    sqlite3_close(db);
    
    return 0;
}

And the results:

id
col1
col2

1
1
2
id
col1
col2

2
2
3

Generating Linear Weights from the Retrosheet Database

Similar to the previous article on generating a run expectancy matrix, generating linear weights from the Retrosheet database is relatively straightforward (and builds on the previous article).

First, we generate the run expectancies for all base/out situations (here we are just running data from 1993-2015 — the modern expansion era):

# Only process complete innings. Exclude home innings in 9th inning or later.
DROP TABLE IF EXISTS tmp_complete_innings;
CREATE TEMPORARY TABLE tmp_complete_innings
  SELECT game_id, inn_ct, bat_home_id
    FROM events
   WHERE inn_end_fl = 'T'
         AND (event_outs_ct + outs_ct = 3)   # make sure it's a complete half-inning
         AND NOT (inn_ct >= 9 and bat_home_id=1) # exclude home innings from 9th on 
;
ALTER TABLE tmp_complete_innings ADD INDEX (game_id, inn_ct, bat_home_id);

DROP TABLE IF EXISTS tmp_re;
CREATE TABLE tmp_re
  SELECT start_bases_cd,
         cds.bases_long_desc, 
         outs_ct, 
         AVG(event_runs_ct + fate_runs_ct) AS RE
    FROM events e
    JOIN tmp_complete_innings inns USING (game_id, inn_ct, bat_home_id)
    JOIN bases_cds cds ON e.start_bases_cd = cds.bases_cd
   WHERE year_id BETWEEN 1993 AND 2015
GROUP BY start_bases_cd, cds.bases_long_desc, outs_ct
;
ALTER TABLE tmp_re ADD INDEX (start_bases_cd, outs_ct);

Now, we use those run expectancies to find the average value of each event type. For now, we are averaging across all base/out situations, but this method can be easily modified to generate linear weights for each event in each base/out situation as well:

  SELECT e.event_cd,
         ecds.event_name,
         AVG(postevent.RE - preevent.RE + e.event_runs_ct) AS linwgt
    FROM events e
    JOIN tmp_complete_innings inns USING (game_id, inn_ct, bat_home_id)
    # find the RE value before this event
    JOIN tmp_re preevent USING (start_bases_cd, outs_ct)
    # find the RE value after this event (note that we add any outs made on the play)
    JOIN tmp_re postevent ON e.end_bases_cd = postevent.start_bases_cd 
                       AND (e.outs_ct + e.event_outs_ct) = postevent.outs_ct
    # get pretty names for event codes
    JOIN event_cds ecds USING (event_cd)
   WHERE e.event_cd IN ( 20, 21, 22, 23, 16, 9, 8, 6, 3, 2, 4, 14, 15 ) 
         AND year_id BETWEEN 1993 AND 2015
GROUP BY event_cd
;

With the results:

+----------+------------------+-------------+
| event_cd | event_name       | linwgt      |
+----------+------------------+-------------+
|        2 | Generic out      | -0.26870669 |
|        3 | Strikeout        | -0.29438586 |
|        4 | Stolen Base      |  0.18298537 |
|        6 | Caught Stealing  | -0.55215834 |
|        8 | Pickoff          | -0.24726686 |
|        9 | Wild Pitch       |  0.26448573 |
|       14 | Walk             |  0.31750011 |
|       15 | Intentional Walk |  0.17602767 |
|       16 | Hit by Pitch     |  0.34045109 |
|       20 | Single           |  0.46835426 |
|       21 | Double           |  0.77295931 |
|       22 | Triple           |  1.06242835 |
|       23 | Home Run         |  1.39543685 |
+----------+------------------+-------------+

A few notes:

  1. This was a quick and dirty run. The fact that strikeouts don’t come out equal to generic outs (among other variations) shows that I need to filter out some events to get “cleaner” results (e.g., some strikeout events also include baserunner action on the play).
  2. The year range you run this over will cause variations between these results and other published linear weights.

Recreating Tango’s Run Expectancy Matrix

Here’s a simple example of how to use your own personal Retrosheet database: recreating Tango’s run expectancy matrix from 1999-2002.

First, build a temporary table of all innings for which we want to process data. We only want to include complete innings and we want to exclude home innings in the 9th inning or later.

DROP TABLE IF EXISTS tmp_complete_innings;
CREATE TEMPORARY TABLE tmp_complete_innings
  SELECT game_id, inn_ct, bat_home_id
    FROM events
   WHERE inn_end_fl = 'T'
         # make sure it's a complete half-inning
         AND (event_outs_ct + outs_ct = 3) 
         # exclude home innings from 9th on
         AND NOT (inn_ct >= 9 and bat_home_id=1)  
;
ALTER TABLE tmp_complete_innings 
            ADD INDEX (game_id, inn_ct, bat_home_id);

Now, we simply find the averages for every base/out situation in the innings we specified above.

  SELECT cds.bases_long_desc, 
         outs_ct, 
         AVG(event_runs_ct + fate_runs_ct) AS RE24
    FROM events e
    JOIN tmp_complete_innings inns USING (game_id, inn_ct, bat_home_id)
    JOIN bases_cds cds ON e.start_bases_cd = cds.bases_cd
   WHERE year_id BETWEEN 1999 AND 2002
GROUP BY start_bases_cd, cds.bases_long_desc, outs_ct
;

And the results:

+-----------------+---------+--------+
| bases_long_desc | outs_ct | RE24   |
+-----------------+---------+--------+
| Bases Empty     |       0 | 0.5553 |
| Bases Empty     |       1 | 0.2972 |
| Bases Empty     |       2 | 0.1166 |
| 1B              |       0 | 0.9529 |
| 1B              |       1 | 0.5725 |
| 1B              |       2 | 0.2509 |
| 2B              |       0 | 1.1885 |
| 2B              |       1 | 0.7243 |
| 2B              |       2 | 0.3444 |
| 1B & 2B         |       0 | 1.5711 |
| 1B & 2B         |       1 | 0.9706 |
| 1B & 2B         |       2 | 0.4649 |
| 3B              |       0 | 1.4841 |
| 3B              |       1 | 0.9834 |
| 3B              |       2 | 0.3874 |
| 1B & 3B         |       0 | 1.9024 |
| 1B & 3B         |       1 | 1.2400 |
| 1B & 3B         |       2 | 0.5359 |
| 2B & 3B         |       0 | 2.0513 |
| 2B & 3B         |       1 | 1.4673 |
| 2B & 3B         |       2 | 0.6341 |
| Bases Loaded    |       0 | 2.4163 |
| Bases Loaded    |       1 | 1.6499 |
| Bases Loaded    |       2 | 0.8128 |
+-----------------+---------+--------+

Setting Up a Personal Retrosheet Database

Setting up a basic Retrosheet database is fairly straightforward (with a little knowledge of databases and scripting). Here’s how I did it:

  1. Install MySQL (this is left as an exercise for the reader).
  2. Install Python 3.x (ditto).
  3. Install the Chadwick tools for processing Retrosheet data. I’m using version 0.6.5 (https://sourceforge.net/projects/chadwick/files/chadwick-0.6/chadwick-0.6.5/). To compile it yourself on OS X/Linux:

./configure --prefix="/usr/local"
make
sudo make install

  4. Create a database for the data, e.g.:

mysql -u root -p -e "create schema retrosheet"

  5. Create a directory to do all of your processing (something like ~/Projects/retrosheet on OS X/Linux, or C:\Projects\retrosheet on Windows).
  6. Download the season event files by decade from here: http://www.retrosheet.org/game.htm (or get them from GitHub here: https://github.com/chadwickbureau/retrosheet).
  7. Unzip all of the event files into a subdirectory called data (i.e. ~/Projects/retrosheet/data).
  8. Run the following script, which converts all of the event files into two big files, playbyplay.csv and substitutions.csv.
    from glob import glob
    from os import chdir, getcwd
    from subprocess import call

    data_dir = "/Users/hvs/Projects/retrosheet/data"
    prev_dir = getcwd()

    chdir(data_dir)
    files = glob("*.EV[ANP]")

    for filename in files:
        print("Processing {}".format(filename))
        year = filename[:4]
        call("cwevent -q -y {} -f 0-96 -x 0-62 {} >> playbyplay.csv".format(year, filename), shell=True)
        call("cwsub -q -y {} {} >> substitutions.csv".format(year, filename), shell=True)

    chdir(prev_dir)
    
  9. Create the events table in your database with this script: https://github.com/cantpitch/baseball/blob/master/retrosheet_schema.sql
  10. Load your playbyplay.csv into the events table with this script: https://github.com/cantpitch/baseball/blob/master/retrosheet_load.sql (you’ll need to update the file location in that script).
  11. Some helpful lookup tables can be built with this script: https://github.com/cantpitch/baseball/blob/master/lookup_tables.sql