Welcome!

Openverse is a search engine for openly-licensed media.

The Openverse team builds the Openverse Catalog, API, and front-end application, as well as integrations between Openverse and WordPress. Follow this site for updates and discussions on the project.

You can also come chat with us in #openverse on the Make WP Chat. We have a weekly developer chat at 15:00 UTC on Mondays.

If you’re a new contributor, welcome! Have a look at our good first issues or our guide for new contributors.

Today we were able to merge some massive and significant changes contributed by @beccawidom to the iNaturalist DAG! This PR includes a number of changes, namely:

The transformation steps have changed from “CSV -> Postgres -> TSV -> Postgres” to “CSV -> Postgres -> Postgres”. This significantly reduces disk usage, run time, and processing overhead, and was a necessary change in order to process all of the iNaturalist data in a reasonable timeframe. It also serves as a proof-of-concept for future bulk data imports, since the transformation and data cleaning steps now happen entirely in SQL (an Openverse first!).
Images are now connected with the Catalog of Life, which provides English vernacular names. This should help improve search relevancy over the current scientific names.
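
To illustrate what “cleaning entirely in SQL” can look like, here is a minimal sketch. It is not the iNaturalist DAG's actual code: it uses Python's built-in `sqlite3` as a stand-in for Postgres, and the table names (`staging_photos`, `image`) and columns are hypothetical.

```python
import sqlite3

# sqlite3 stands in for Postgres; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging_photos (photo_id INTEGER, taxon_name TEXT, license TEXT);
    CREATE TABLE image (foreign_identifier TEXT, title TEXT, license TEXT);
    """
)

# Step 1 ("CSV -> Postgres"): raw rows are loaded as-is into a staging table.
raw_rows = [
    (1, "  Trifolium hybridum ", "CC-BY"),
    (2, "", "cc0"),  # no usable title, so it is filtered out below
    (3, "Amanita muscaria", "CC-BY-NC"),
]
conn.executemany("INSERT INTO staging_photos VALUES (?, ?, ?)", raw_rows)

# Step 2 ("Postgres -> Postgres"): cleaning and transformation happen in a
# single SQL statement, with no intermediate TSV file written to disk.
conn.execute(
    """
    INSERT INTO image (foreign_identifier, title, license)
    SELECT CAST(photo_id AS TEXT), TRIM(taxon_name), LOWER(license)
    FROM staging_photos
    WHERE TRIM(taxon_name) <> ''
    """
)

cleaned = conn.execute(
    "SELECT foreign_identifier, title, license FROM image ORDER BY foreign_identifier"
).fetchall()
print(cleaned)
```

The point of the pattern is that the data never leaves the database between the raw load and the final table, which is where the disk and time savings come from.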

I want to take a moment to celebrate this huge accomplishment, and the tremendous work @beccawidom poured into it. Thank you!

Now that this DAG is ready to be run once again, we’re faced with the impressive and daunting notion that we could, in a matter of days, increase the size of the image catalog by ~137 million (a roughly 23.3% increase in size). With that information, it’s important to consider the implications of including this data.

We have a weekly image data refresh process which transfers images from the catalog into our API for public use. Presently, this data refresh takes around 47 hours without the popularity recalculation and 60 hours with it. If we assume these times scale linearly with the size of the catalog, we can expect them to become about 58 hours and 74 hours respectively. Since the refresh runs weekly (168 hours), this still leaves roughly 94 to 110 hours in the week before data refreshes would start queuing up behind ones that are still running.
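
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the linear-scaling assumption; the function name is made up for illustration, and real refresh times may not scale linearly.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def projected_hours(current_hours: float, growth_fraction: float) -> float:
    """Scale a refresh duration linearly with catalog growth (an assumption)."""
    return current_hours * (1 + growth_fraction)


growth = 0.233  # ~23.3% more images, taken from the estimate in this post

without_popularity = projected_hours(47, growth)
with_popularity = projected_hours(60, growth)

# Hours left in the week after a single refresh finishes.
slack_with = HOURS_PER_WEEK - with_popularity
slack_without = HOURS_PER_WEEK - without_popularity

print(round(without_popularity), round(with_popularity))
print(round(slack_with), round(slack_without))
```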

Here are some steps we can take to monitor the process:

Take a manual database snapshot of the catalog prior to enabling the iNaturalist DAG.
Enable the DAG shortly after the weekly data refresh has completed. This will allow iNaturalist to run without other significant database operations occurring.
Disable the DAG after the run while we verify the following steps.
Monitor the next scheduled image data refresh closely for significant aberrations in step duration.
Make a number of searches after the data refresh is complete to see how results are affected. We can make a number of searches which we would expect to return iNaturalist data (e.g. cat, mushroom, alligator) and some we expect should not (e.g. computer, transistor, book).
Re-enable the iNaturalist DAG.
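
The search check in the steps above could be made a little more systematic along these lines. This is a rough sketch, not an existing tool: it assumes each search result is a dict with a `provider` field whose value is `"inaturalist"` for this data, the sample results are hand-written stand-ins for real API responses, and the 20% threshold is arbitrary.

```python
from collections import Counter


def provider_share(results: list[dict], provider: str) -> float:
    """Fraction of results whose "provider" field equals `provider`."""
    if not results:
        return 0.0
    counts = Counter(r.get("provider") for r in results)
    return counts[provider] / len(results)


# Hand-written samples standing in for one page of search results per query.
sample_results = {
    "mushroom": [
        {"provider": "inaturalist"},
        {"provider": "inaturalist"},
        {"provider": "inaturalist"},
        {"provider": "flickr"},
    ],
    "computer": [
        {"provider": "flickr"},
        {"provider": "inaturalist"},
        {"provider": "flickr"},
        {"provider": "flickr"},
    ],
}

# Queries where we would NOT expect iNaturalist results to show up much.
unexpected_queries = ["computer"]
THRESHOLD = 0.2  # arbitrary cut-off chosen for illustration

shares = {q: provider_share(rs, "inaturalist") for q, rs in sample_results.items()}
flagged = [q for q in unexpected_queries if shares[q] > THRESHOLD]

print(shares)
print(flagged)
```

Running the same queries before and after the data refresh and comparing the shares would give a simple before/after picture.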

One of our big-picture goals for 2023 is search relevancy, and a key piece required for making improvements in that area is understanding how our existing document scoring works. I’m not sure that we can predict how adding this much data will affect our result relevancy. In the case where we notice result relevancy is negatively impacted (e.g. unrelated queries are flooded with iNaturalist results), there are a few actions we can take to mitigate this:

Alter the weight of the provider in the API (@sarayourfriend had mentioned this as an option).
Set the authority boost of the provider in the ingestion server and reindex the images.
Disable the iNaturalist provider in the API.

We would like to do all we can to avoid the last option. I don’t presume that the iNaturalist data will require taking the above actions, but I wanted to outline them and open up space in case other folks have mitigation ideas.
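
As a toy illustration of the first mitigation, here is what down-weighting a single provider does to result ordering. This is not how the Openverse API scores documents; the scores, the multiplicative weighting, and the `rerank` helper are all invented for the example.

```python
def rerank(results, provider_weights):
    """Sort results by base score multiplied by a per-provider weight."""

    def weighted(result):
        # Providers without an explicit weight keep their base score.
        return result["score"] * provider_weights.get(result["provider"], 1.0)

    return sorted(results, key=weighted, reverse=True)


# Made-up results for a query where iNaturalist crowds out another provider.
results = [
    {"title": "Felis catus", "provider": "inaturalist", "score": 2.0},
    {"title": "Laptop on desk", "provider": "flickr", "score": 1.8},
    {"title": "Canis lupus", "provider": "inaturalist", "score": 1.9},
]

neutral = [r["title"] for r in rerank(results, {})]
down_weighted = [r["title"] for r in rerank(results, {"inaturalist": 0.8})]

print(neutral)
print(down_weighted)
```

With the hypothetical weight of 0.8, the non-iNaturalist result moves to the top while the iNaturalist results stay in the list, which is why this kind of adjustment is gentler than disabling the provider outright.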

We’re incredibly excited for the addition of this data!

#catalog #database

stacimc 12:36 am on January 14, 2023

So exciting! What an incredible addition 🙂
Nate Angell 12:46 am on January 14, 2023

Huge congrats and thanks for this work! It seems like many technical issues were overcome and the end result is an even more comprehensive and useful Openverse.
sarayourfriend 4:30 am on January 14, 2023

As Staci said, super exciting!

We do not currently have a way to accurately measure the actual impact of iNaturalist on real user usage of our API. I wrote an issue that I’d like to propose we complete and have ready for at least two weeks before the iNaturalist data refresh occurs: https://github.com/WordPress/openverse-api/issues/1084

I suggest having two weeks of that analysis available beforehand because it seems like a minimally disruptive amount of time for us to get the “before” data needed to tell what the effect was. It’s not necessarily statistically sufficient, but any period long enough to be sufficient would be disruptive.
- Krystle Salazar 9:32 pm on January 25, 2023
  
  Thanks for adding this @sarayourfriend, with these changes, we will have a more quantitative and objective way of measuring how iNaturalist affects the search results. I agree with the two weeks period as well, sounds reasonable.
beccawidom 3:06 pm on January 14, 2023

Thanks so much, @madison! All the maintainers (looking at you too @stacimc, @sarayourfriend, Krystle, Olga…) have been fantastic to work with. Thank you all so much for your kindness and generosity with your time and expertise!

In pay-it-forward fashion, I’d love to invite other folks to contribute here, and I wonder if these might be good first issues: https://github.com/WordPress/openverse-catalog/issues/810 (could be useful to get done sooner rather than later, in case it could help with thumbnail caching?), and https://github.com/WordPress/openverse-catalog/issues/966 . I’d be happy to help in any way that I can!

I’m also curious about how different metadata might play in. For example, there is a creator we’ll show as “russiannaturalistbrazil”. Will their images start showing up under searches for Russian? There’s an interesting mix currently with that search. If that creator’s iNaturalist photos started showing up there, I don’t know if it would be good or bad / relevant or irrelevant.
- sarayourfriend 9:25 pm on January 14, 2023
  
  Regarding the username you mentioned, it should not affect search in the way you wondered. The creator field is only searched if the field is specifically queried, but not in a general query. More details here: https://wordpress.github.io/openverse-api/reference/search_algorithm.html#general-query-searching
  - beccawidom 1:11 am on January 15, 2023
    
    Ah! Cool. I’d love to understand more about how the popularity metrics relate in importance to the title vs tags too.
    
    FWIW, we don’t get individual titles or descriptions from iNaturalist. So the “title” that we’ll see is the most specific taxonomic information available, and the “tags” (less important to search results) are more general taxonomic categories. For example, this photo on Openverse will have a title of “Trifolium hybridum”, and these tags: Angiospermae, Fabaceae, Fabales, Faboideae, Flowers, Plants, Tracheophyta, Trifolieae, Trifolium. Catalog of Life does not have vernacular names for all of the species we get from iNaturalist, and the iNat public dataset does not have all of the vernacular names that they use (“Alsike Clover” in this case). We could try using this name dataset from the US government instead, but Catalog of Life does have some non-English names, and I haven’t looked at how comprehensive this alternative might be.
    
    One risk of setting title and tags up this way is that it’s possible that lower quality images will have less specific classifications (e.g. “Flower” as the title) on average, and more common search terms. But this was the best approach I could think of with the available metadata from iNaturalist. Maybe in future we should consider using the specificity of the taxonomy as a stand-in for popularity? But at this point it’s pure conjecture.
    
    It will be interesting to see what comes of this new data collection!