https://lnkd.in/gjURFX8K >> Secondary indices and materialized views are the async by-products of tables in a lakehouse. ClickHouse, Materialize, RisingWave, Lance, and StarRocks already have serious support for various types of indices to accelerate queries. >> A faster lakehouse requires an optimized storage layout plus a unique index to accelerate both Merge-On-Write and queries. >> I'm not aware of any cloud database that uses in-memory (or local block storage) state to deduplicate and Merge-On-Read against committed Iceberg/Delta files. If you know of any company or project of this kind, please kindly leave a comment here. #lakehouse #partition #index #lsm #bucket #merge #deduplicate #paimon #starrocks #puffin #hyperspace #compact #kafka
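To make the idea concrete, here is a toy sketch of the Merge-On-Read pattern the post describes: committed base files are merged at query time with a newer in-memory buffer, and the buffer wins on primary-key collisions. All names are illustrative and not tied to any real engine's API.

```python
# Toy Merge-On-Read deduplication sketch (illustrative only):
# base rows come from committed files, the in-memory buffer holds
# not-yet-compacted writes, and the newest write wins per primary key.

def merge_on_read(base_rows, mem_buffer):
    """base_rows: list of (pk, value) pairs from committed files.
    mem_buffer: dict of pk -> value for buffered (newer) writes."""
    merged = {}
    for pk, value in base_rows:   # start from committed data
        merged[pk] = value
    merged.update(mem_buffer)     # newer buffered rows override old ones
    return sorted(merged.items())

base = [(1, "a"), (2, "b"), (3, "c")]
buf = {2: "b2", 4: "d"}
print(merge_on_read(base, buf))  # [(1, 'a'), (2, 'b2'), (3, 'c'), (4, 'd')]
```

A real engine does this per file or row group with a persistent index rather than one dict, but the read-time "last write wins" merge is the same shape.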
I believe Apache Amoro is doing this with its Mixed Iceberg format: it uses a tiered combination of Merge-On-Read equality and positional deletes plus a smart compaction service to enforce a primary-key constraint, allowing deduplication in both streaming and batch scenarios. https://github.com/apache/amoro
Not a cloud database, but at Upsolver we ingest and dedupe streams into Iceberg. We leverage equality deletes instead of position deletes to speed up writes. We also contributed MoR performance improvements to Trino and Presto to take advantage of equality deletes and saw massive read improvements. More detail here - https://youtu.be/j0ax6bwMYrQ?si=uV4gGJ5OSqZGlRJQ
Add Spice AI - https://github.com/spiceai/spiceai to that list.
Apache Hudi PMC member
I'd like to point out that Apache Hudi has a very rich set of indexing support covering both reads and writes. For example, there is a record-level index for fast point lookups https://hudi.apache.org/blog/2023/11/01/record-level-index/, writer-side indexes catering to different data patterns https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-410, and functional indexes for flexible indexing needs https://github.com/apache/hudi/blob/master/rfc/rfc-63/rfc-63.md
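The point-lookup benefit of a record-level index is easy to see in miniature: a key-to-file mapping lets the reader prune to the one file that holds the key instead of scanning every file. This is a hypothetical sketch of the idea, not Hudi's implementation (Hudi stores its record-level index in the metadata table):

```python
# Toy record-level index: maps each record key to the data file
# containing it, so a point lookup prunes the scan to one file.
# Names and layout are illustrative only.

record_index = {
    "user#001": "file_a.parquet",
    "user#002": "file_a.parquet",
    "user#003": "file_b.parquet",
}

def files_to_scan(keys, index, all_files):
    """With the index, only files holding the requested keys are read;
    without a hit, fall back to scanning everything."""
    hits = {index[k] for k in keys if k in index}
    return sorted(hits) if hits else list(all_files)

all_files = ["file_a.parquet", "file_b.parquet", "file_c.parquet"]
print(files_to_scan(["user#003"], record_index, all_files))  # ['file_b.parquet']
```

The win grows with file count: a point lookup touches O(1) files instead of O(N), which is what makes record-level indexing attractive for key-based workloads.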