The pleasure and perils of DuckDB.
tl;dr - Come on in, the water is fine… and, also, here there be dragons.
Over the past months, I’ve been experimenting with DuckDB, an “in-process SQL OLAP database management system.” As an analytics database, DuckDB may well be the best tool out there. Not only can it be installed via pip, it can read and write Parquet, Pandas, and Arrow, it can also read JSON (including globbing multiple files at once), and (with the httpfs extension) work with remote files in HTTPS or other systems which implement the S3 API. I tell you, it’s a thing of beauty and has become by go-to tool for those times when you get a data file from a client and have to try to answer the, “What is in this and how well structured is it,” question.
Ironically, it’s DuckDB’s success which has bitten me. Given my success with it as an analytics tool, and the fact that it’s so easy to make it part of a python application, I started using it as an operational database for low volume data application I built to help me track resume versions and applications I submitted as part of a job hunt. Like a good DBA, I built out tables for primary entities, linking them together with foreign key relations, and then I got to work populating the database as I went.
All was well until I needed to update the version of the Resume I had recorded as being sent with some Applications I submitted. That’s when I ran into known issues with DuckDB’s indexing. The upshot of those issues is that you cannot update the value of a column which is a foreign key to another table.
In my case, with a table structure like shown below, the update update Applications set application_date = '2024-01-01' where id <= 2;
succeeds, becuase application_date is not a reference to any other table. But update Applications set resume_id = 2 where id = 3;
will fail with an error message saying …
Error: Constraint Error: Violates foreign key constraint because key “application_id: 3” is still referenced by a foreign key in a different table.
Since application_id
is only on the ApplicationEvents table, I thought maybe I would be able to apply the update to Applications where there had been no ApplictionEvent record (i.e., I hadn’t heard anything back yet). However, even updating Applications where there was no matching ApplicationEvent record failed, though with a different error:
Error: Constraint Error: Duplicate key “id: 5” violates primary key constraint. If this is an unexpected constraint violation please double check with the known index limitations section in our documentation (docs - sql - indexes).
Now, granted, DuckDB is currently in version 0.9.2, so it’s unreasonable to expect perfection. It’s explictly not ready for production use. But foreign and primary key support is pretty fundamental to a well structured database and this issue has been known since at least 0.8.1. I’m really having to rethink my use of DuckDB. It’s still great for investigation and initial analytics work but, for now, I don’t have faith I can use it for more than a developer tool.
Table structure of my db
/* Basicaly an enum of the responses I track for an Application. */
create or replace table EventTypes (
name varchar primary key,
description varchar,
is_terminal boolean
);
/* A listing of resume versions, to allow tracking conversion rates for different formats or versions tailored to a role type. */
create or replace table Resumes (
id integer primary key default nextval('resume_id'),
name varchar,
filepath varchar,
notes varchar
);
/* A submission to a role at a company. */
create or replace table Applications (
id integer primary key default nextval('application_id'),
company_name varchar not null,
url varchar not null,
role varchar not null,
salary_min_usd integer,
salary_max_usd integer,
resume_id integer not null,
application_date date default current_date,
/* foreign key definitions */
foreign key (resume_id) references Resumes (id)
);
/* Follow up events related to an Application, like a call back from the recruiter, a rejection, or an offer. */
create or replace table ApplicationEvents (
id integer primary key default nextval('applicationevent_id'),
application_id integer not null,
eventtype_name varchar not null,
event_date date default current_date,
notes text,
/* foreign key definitions */
foreign key (application_id) references Applications (id),
foreign key (eventtype_name) references EventTypes (name)
);