https://lnkd.in/gjURFX8K >> Secondary indices and materialized views are the async by-products of tables in a lakehouse. ClickHouse, Materialize, RisingWave, Lance, and StarRocks already have serious support for various types of indices to accelerate queries. >> A faster lakehouse requires an optimized storage layout plus a unique index to accelerate both Merge-On-Write and queries. >> I'm not aware of any cloud database that uses in-memory (or local block storage) state to deduplicate and Merge-On-Read against committed Iceberg/Delta files. If you know of any company or project of this kind, please kindly leave a comment here. #lakehouse #partition #index #lsm #bucket #merge #deduplicate #paimon #starrocks #puffin #hyperspace #compact #kafka
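To make the idea concrete, here is a toy sketch of the Merge-On-Read pattern the post describes: committed base files are merged at query time with a newer in-memory buffer, and the buffer wins on primary-key collisions. All names are illustrative and not tied to any real engine's API.

```python
# Toy Merge-On-Read deduplication sketch (illustrative only):
# base rows come from committed files, the in-memory buffer holds
# not-yet-compacted writes, and the newest write wins per primary key.

def merge_on_read(base_rows, mem_buffer):
    """base_rows: list of (pk, value) pairs from committed files.
    mem_buffer: dict of pk -> value for buffered (newer) writes."""
    merged = {}
    for pk, value in base_rows:   # start from committed data
        merged[pk] = value
    merged.update(mem_buffer)     # newer buffered rows override old ones
    return sorted(merged.items())

base = [(1, "a"), (2, "b"), (3, "c")]
buf = {2: "b2", 4: "d"}
print(merge_on_read(base, buf))  # [(1, 'a'), (2, 'b2'), (3, 'c'), (4, 'd')]
```

A real engine does this per file or row group with a persistent index rather than one dict, but the read-time "last write wins" merge is the same shape.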
I believe Apache Amoro is doing this with its Mixed Iceberg format: it uses a tiered combination of Merge-On-Read equality and positional deletes plus a smart compaction service to enforce a primary-key constraint, allowing deduplication in both streaming and batch scenarios. https://github.com/apache/amoro
Not a cloud database, but at Upsolver we ingest and dedupe streams into Iceberg. We leverage equality deletes instead of position deletes to speed up writes. We also contributed MoR performance improvements to Trino and Presto to take advantage of equality deletes and saw massive read improvements. More detail here - https://youtu.be/j0ax6bwMYrQ?si=uV4gGJ5OSqZGlRJQ
Add Spice AI - https://github.com/spiceai/spiceai to that list.
Apache Hudi PMC member
I'd like to point out that Apache Hudi has a very rich set of indexing support covering both reads and writes. For example, there is a record-level index for fast point lookups https://hudi.apache.org/blog/2023/11/01/record-level-index/, writer-side indexes catering to different data patterns https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-410, and functional indexes for flexible indexing needs https://github.com/apache/hudi/blob/master/rfc/rfc-63/rfc-63.md
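The point-lookup benefit of a record-level index is easy to see in miniature: a key-to-file mapping lets the reader prune to the one file that holds the key instead of scanning every file. This is a hypothetical sketch of the idea, not Hudi's implementation (Hudi stores its record-level index in the metadata table):

```python
# Toy record-level index: maps each record key to the data file
# containing it, so a point lookup prunes the scan to one file.
# Names and layout are illustrative only.

record_index = {
    "user#001": "file_a.parquet",
    "user#002": "file_a.parquet",
    "user#003": "file_b.parquet",
}

def files_to_scan(keys, index, all_files):
    """With the index, only files holding the requested keys are read;
    without a hit, fall back to scanning everything."""
    hits = {index[k] for k in keys if k in index}
    return sorted(hits) if hits else list(all_files)

all_files = ["file_a.parquet", "file_b.parquet", "file_c.parquet"]
print(files_to_scan(["user#003"], record_index, all_files))  # ['file_b.parquet']
```

The win grows with file count: a point lookup touches O(1) files instead of O(N), which is what makes record-level indexing attractive for key-based workloads.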