
Let’s imagine you have a dataset in CSV format you would like to run some SQL queries against.

The dataset is small enough to store locally on your machine, but large enough to make importing it into a database non-trivial.

In this post, we will take a ~12GB CSV file and set up a local environment that lets us query it using SQL.


As always, there are a myriad of different ways to achieve our objective.

Our plan is:

  • Download the CSV file
  • Install and run PostgreSQL
  • Install and run pgAdmin
  • Create a table with the appropriate columns using pgAdmin
  • Import the CSV into the table using pgAdmin
  • Query the table using pgAdmin
  • Do the above using Docker and Docker Compose

Get the data

Our dataset is the Seattle Public Library’s collection inventory, courtesy of City of Seattle Open Data and Kaggle:

  • Log in to Kaggle
  • Navigate to the dataset via the link
  • Download the zip
  • Unzip the folder
  • You should see an 11.8GB file, library-collection-inventory.csv, whose rows look like:


3011076,"A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield, Frederick Gardner, Megan Petasky, and Allen Tam.","O'Ryan, Ellie",...
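Before designing a table for the data, it helps to look at the header row. Here is a quick sketch using a tiny stand-in CSV with the same comma-and-quote conventions (the real file is library-collection-inventory.csv; the column names below are illustrative, not the full header):

```shell
# Build a tiny stand-in for the real CSV (same quoting conventions)
printf '%s\n' \
  'BibNum,Title,Author' \
  '3011076,"A tale of two friends","ORyan, Ellie"' > sample.csv

# The header row: these become the column names of our table
head -n 1 sample.csv

# Naive field count for the header row (no quoted commas appear there)
head -n 1 sample.csv | awk -F',' '{print NF}'
```

On the real file, `head -n 1 library-collection-inventory.csv` gives you the full column list without reading the other ~12GB.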

Import the data using Docker

In this section, we’ll adapt this guide, which describes how to set up PostgreSQL locally with Docker.

Pull the images

  • Download the PostgreSQL image docker pull postgres
  • Check the image is available docker images | grep postgres
  • Download the pgAdmin image docker pull dpage/pgadmin4
  • Check the image is available docker images | grep pgadmin

Start the containers


docker run \
--mount 'type=volume,source=seattle-dev-vol,target=/var/lib/postgresql/data' \
--detach \
--name seattle-dev-db \
--rm \
--env POSTGRES_PASSWORD=mypostgrespw \
--publish 5432:5432 \
postgres

Check the container is running docker ps.

Check a local volume seattle-dev-vol has been created docker volume ls.

This volume is mounted in the container at /var/lib/postgresql/data.

We could have done a bind instead of a volume mount, but volume mounts are the preferred mechanism for persisting data generated by and used by Docker containers.


docker run \
--name seattle-dev-db-admin \
--publish 80:80 \
--env '[email protected]' \
--env 'PGADMIN_DEFAULT_PASSWORD=mypgadminpw' \
--rm \
--detach \
--mount 'type=bind,source=/path/to/csv/dir,target=/var/lib/pgadmin4/storage/' \
dpage/pgadmin4

Check the container is running docker ps.

pgAdmin lets you upload files but ours is too large (we can increase the maximum file size setting, but the request times out).

To make our CSV available to pgAdmin, we use a bind mount; this suits our case better than a volume because the file already exists on the host and only needs to be mounted in once (the first time we run the container).

The correct container path is actually /var/lib/pgadmin/storage/ but we don’t have permissions to write there (we’ll take care of this later).

Test the containers


  • Open a Bourne shell inside the container docker exec --interactive --tty seattle-dev-db sh
  • Open a PostgreSQL shell psql --username postgres
  • Display databases \l
  • Display current database \c
  • Display tables \dt
  • Exit the PostgreSQL shell exit
  • Exit the Bourne shell exit


  • Navigate to http://localhost in a browser
  • Log in using email [email protected] and password mypgadminpw

Create a table

  • Get the internal IP address <internal-ip> of the database, docker inspect seattle-dev-db -f "{{json .NetworkSettings.Networks }}" | grep IPAddress
  • Navigate to http://localhost in a browser
  • Open the create server dialogue by right-clicking Servers in the sidebar
  • Enter myserver for the server name, <internal-ip> for the host, postgres for the username, and mypostgrespw for the password
  • In the sidebar, Servers -> myserver -> Databases -> postgres -> Schemas -> public -> Tables -> Create -> Table to open the create table dialogue
  • Create a table book with the columns below

[Screenshot: column definitions for the book table in pgAdmin]
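If you prefer SQL to the dialogue, a CREATE TABLE along these lines should work. The column names and types here are assumptions based on the dataset's CSV header, so check them against your download before running it:

```sql
-- Column names/types are assumed from the dataset's header row
CREATE TABLE book (
    "BibNum"          integer,
    "Title"           text,
    "Author"          text,
    "ISBN"            text,
    "PublicationYear" text,
    "Publisher"       text,
    "Subjects"        text,
    "ItemType"        text,
    "ItemCollection"  text,
    "FloatingItem"    text,
    "ItemLocation"    text,
    "ReportDate"      date,
    "ItemCount"       integer
);
```

Quoting the identifiers preserves their mixed case, which is why later queries write `"Title"` rather than `title`.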

Populate the table

  • docker exec --interactive --tty seattle-dev-db-admin sh
  • cp -r /var/lib/pgadmin4/storage/. /var/lib/pgadmin/storage/ (takes a few moments)
  • From book in the sidebar, open the Import/Export dialogue:
    • Filename - select the CSV via the file browser
    • Format csv
    • Encoding utf-8
    • Header Yes
    • Delimiter ,
    • Quote "
    • Escape "
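The dialogue settings above correspond to options of PostgreSQL's COPY command. If the CSV were readable from inside the database container (it isn't in our setup, so the path below is hypothetical), the equivalent SQL would be roughly:

```sql
-- Hypothetical path: COPY reads from the database server's filesystem
COPY book
FROM '/some/path/library-collection-inventory.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '"', ESCAPE '"', ENCODING 'UTF8');
```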

After a few moments (minutes!):

[Screenshot: the import completing in pgAdmin]

Query the data using Docker Compose

Create a directory with one file, docker-compose.yml:

version: "3.9"

services:
  db:
    image: postgres
    volumes:
      - seattle-dev-vol:/var/lib/postgresql/data
    environment:
      - POSTGRES_PASSWORD=mypostgrespw
    ports:
      - "5432:5432"
  admin:
    image: dpage/pgadmin4
    environment:
      - [email protected]
      - PGADMIN_DEFAULT_PASSWORD=mypgadminpw
    ports:
      - "80:80"
    depends_on:
      - db

volumes:
  seattle-dev-vol:
    external: true

Start the services docker-compose up.

Navigate to http://localhost; if you get CSRF errors, make sure all previous tabs of pgAdmin have been closed.

If you need to recreate a database server in pgAdmin, you can set the host to be the Docker Compose service name db rather than the internal IP address like before.

In pgAdmin, you can run queries via the Query Tool, e.g.

SELECT "Title" FROM book LIMIT 3;
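A couple more examples in the same vein (column names assumed from the dataset; adjust to match your table):

```sql
-- Total rows imported
SELECT COUNT(*) FROM book;

-- Most common publishers (assumes a "Publisher" column)
SELECT "Publisher", COUNT(*) AS titles
FROM book
GROUP BY "Publisher"
ORDER BY titles DESC
LIMIT 10;
```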

You can also run queries by opening a Bourne shell in the database container docker-compose exec db sh then a Postgres shell psql --username postgres.

To stop the containers, docker-compose stop.

To stop the containers and remove them, docker-compose down.

If you run docker-compose down then docker-compose up, you will have to recreate the database server in pgAdmin.


Clean up

  • Delete the CSV library-collection-inventory.csv
  • Delete the Docker volume docker volume rm seattle-dev-vol
  • Stop and remove the containers docker-compose down
  • Remove the images docker image rm postgres dpage/pgadmin4