Handling very large (>2GB) files

As we gear up to tackle the Provider DAG Stability milestone, @stacimc and I were looking over the existing issues and came across this one: Filesize exceeds postgres integer column maximum size. The specific details are in the linked issue, but in summary: we were trying to ingest a Wikimedia record which referenced an audio clip over 8 hours in length. The filesize for this record exceeded the maximum value a Postgres integer column can hold (2,147,483,647 bytes, roughly 2GB) and broke the ingestion.
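For context, a quick back-of-the-envelope check (not from the issue itself, just arithmetic for illustration) shows how easily a clip of that length clears the 32-bit ceiling. The ~600 kbit/s bitrate below is an assumption, not the actual record's bitrate:

```python
# Postgres "integer" columns are signed 32-bit, so the largest filesize
# they can store is 2**31 - 1 bytes, i.e. 2,147,483,647 (about 2GB).
PG_INT_MAX = 2**31 - 1

# An 8-hour clip at a hypothetical 600 kbit/s already overflows the column.
seconds = 8 * 3600
bitrate_bits_per_sec = 600_000  # assumption for illustration only
filesize_bytes = seconds * bitrate_bits_per_sec // 8

print(f"{filesize_bytes:,} bytes")  # 2,160,000,000 bytes
print(filesize_bytes > PG_INT_MAX)  # True
```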

After discussing this, we came up with a few options:

  1. Reject any records that exceed this 2GB size limit at the MediaStore level (a rough sketch of this approach follows the list). It seems unlikely that users would want audio records this large, especially since we don’t make any distinction on length beyond “> 10 minutes” in the search filters.
  2. Set values greater than this column maximum to NULL. Records that exceed that size will not have filesize information stored, but all other information will be available.
  3. Alter the column to use a Postgres bigint type. This would require migrations on both the catalog and the API, and could be extremely cumbersome.

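To make option 1 a bit more concrete, here’s a minimal sketch of what the rejection could look like. The class shape, method name, and parameters below are stand-ins for illustration, loosely modeled on the idea of a MediaStore rather than copied from the actual catalog code:

```python
import logging

logger = logging.getLogger(__name__)

# Max value of a signed 32-bit Postgres integer column.
PG_INT_MAX = 2**31 - 1


class MediaStore:
    """Illustrative stand-in; not the actual catalog MediaStore class."""

    def add_item(self, foreign_identifier: str, filesize: int | None = None, **media_data):
        # Option 1: drop the whole record when its filesize would overflow
        # the integer column, before it ever reaches Postgres.
        if filesize is not None and filesize > PG_INT_MAX:
            logger.warning(
                "Rejecting record %s: filesize %d exceeds Postgres integer max",
                foreign_identifier,
                filesize,
            )
            return None
        # (Option 2 would instead set `filesize = None` here and keep the record.)
        ...  # normal cleaning and buffering of the record continues as usual
```

As the comment notes, options 1 and 2 would differ only in that one branch: rejecting the record entirely versus nulling out the filesize and keeping everything else.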
Of the three options above, Staci and I thought the first would be the most appropriate and the easiest to execute. It also wouldn’t preclude us from accepting larger file sizes in the future, should we wish to take a different approach to including them in the catalog.

What do folks think? Does that seem like a suitable next step? Are there other alternatives we haven’t considered that might be worth pursuing?

#data-cleaning, #postgres