Sean Sun

United States

372 followers · 359 connections

Experience & Education

  • Google

Publications

  • LMDX: Language Model-based Document Information Extraction and Localization

    ACL 2024 Findings

    Large Language Models (LLM) have revolutionized Natural Language Processing (NLP), improving state-of-the-art on many existing tasks and exhibiting emergent capabilities. However, LLMs have not yet been successfully applied on semi-structured document information extraction, which is at the core of many document processing workflows and consists of extracting key entities from a visually rich document (VRD) given a predefined target schema. The main obstacles to LLM adoption in that task have been the absence of layout encoding within LLMs, critical for a high quality extraction, and the lack of a grounding mechanism ensuring the answer is not hallucinated. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology to adapt arbitrary LLMs for document information extraction. LMDX can do extraction of singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. In particular, we apply LMDX to the PaLM 2-S LLM and evaluate it on VRDU and CORD benchmarks, setting a new state-of-the-art and showing how LMDX enables the creation of high quality, data-efficient parsers.
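
    The two mechanisms the abstract names, layout encoding and grounding, can be illustrated in a few lines. The sketch below is a hypothetical reading of that idea, not the paper's actual prompt scheme: OCR segments are serialized with quantized coordinates so a text-only LLM can reason about layout, and answers must cite segment IDs so they can be verified and localized.

    ```python
    # Hypothetical sketch of the idea in the LMDX abstract; the segment format,
    # function names, and prompt wording are assumptions for illustration.

    def quantize(v, page_size, buckets=100):
        """Map a pixel coordinate onto a coarse layout bucket."""
        return min(buckets - 1, int(v / page_size * buckets))

    def build_prompt(segments, schema, page_w, page_h):
        """segments: list of (seg_id, text, x, y) tuples from an OCR engine."""
        lines = [
            f"{text} {quantize(x, page_w)}|{quantize(y, page_h)} [{seg_id}]"
            for seg_id, text, x, y in segments
        ]
        return (
            "Document:\n" + "\n".join(lines) + "\n\n"
            "Extract these entities as JSON, citing the segment id of each "
            f"value so it can be localized on the page: {schema}"
        )

    def grounded(value, cited_ids, segments):
        """Reject hallucinated answers: the value must occur in cited segments."""
        cited = " ".join(t for sid, t, _, _ in segments if sid in cited_ids)
        return value in cited
    ```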

  • ACED: Accelerated Computational Electrochemical systems Discovery

    NeurIPS Climate Change and AI Workshop 2020

    Large-scale electrification is vital to addressing the climate crisis, but many engineering challenges remain in fully electrifying both the chemical industry and transportation. In both of these areas, new electrochemical materials and systems will be critical, but developing these systems currently relies heavily on computationally expensive first-principles simulations as well as human-time-intensive experimental trial and error. We propose to develop an automated workflow that accelerates these computational steps by introducing both automated error handling in generating the first-principles training data as well as physics-informed machine learning surrogates to further reduce computational cost. It will also have the capacity to include automated experiments "in the loop" in order to dramatically accelerate the overall materials discovery pipeline.
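
    The workflow the abstract sketches, cheap ML surrogates screening candidates so the expensive first-principles step runs only where it matters, follows a standard active-learning loop. Below is a toy sketch under that assumption; `expensive_simulation` is a stand-in for a real first-principles code, not the project's interface.

    ```python
    # Toy surrogate-in-the-loop sketch; all names and the objective function
    # are illustrative stand-ins, not the ACED workflow itself.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def expensive_simulation(x):
        # Stand-in for a costly first-principles (e.g. DFT) calculation.
        return float(np.sin(3 * x[0]) + 0.5 * x[0])

    rng = np.random.default_rng(0)
    candidates = rng.uniform(0, 2, size=(200, 1))   # candidate design space
    X = candidates[:5]                              # small seed set
    y = np.array([expensive_simulation(x) for x in X])

    for _ in range(10):                             # active-learning rounds
        surrogate = GaussianProcessRegressor().fit(X, y)
        _, std = surrogate.predict(candidates, return_std=True)
        pick = candidates[np.argmax(std)]           # most uncertain candidate
        X = np.vstack([X, pick])
        y = np.append(y, expensive_simulation(pick))  # one costly call per round

    print("best candidate:", X[np.argmax(y)], "value:", y.max())
    ```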

  • Assessing Graph-based Deep Learning Models for Predicting Flash Point

    Molecular Informatics

    Graph-based machine learning models have been widely used, but few studies have applied them to predicting molecular properties of organic molecules, especially flash point. In this paper, we collected more than 11,000 flash point data points and assessed graph-based machine learning models for predicting flash point.
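
    As a rough illustration of how a graph model maps a molecule to a scalar property, here is a minimal message-passing sketch; it is not the paper's architecture, and the atom features and adjacency below are random stand-ins.

    ```python
    # Minimal message-passing sketch for molecular property regression;
    # illustrative only, not the models assessed in the paper.
    import torch
    import torch.nn as nn

    class TinyMPNN(nn.Module):
        def __init__(self, n_atom_feats=16, hidden=64):
            super().__init__()
            self.embed = nn.Linear(n_atom_feats, hidden)
            self.msg = nn.Linear(hidden, hidden)
            self.readout = nn.Linear(hidden, 1)  # scalar property, e.g. flash point

        def forward(self, atom_feats, adj):
            # atom_feats: (n_atoms, n_feats); adj: (n_atoms, n_atoms) bond matrix
            h = torch.relu(self.embed(atom_feats))
            for _ in range(3):                         # 3 rounds of message passing
                h = torch.relu(h + adj @ self.msg(h))  # aggregate neighbor messages
            return self.readout(h.mean(dim=0))         # mean-pool atoms -> prediction

    model = TinyMPNN()
    atoms = torch.randn(5, 16)                                 # toy 5-atom molecule
    adj = torch.eye(5).roll(1, 0) + torch.eye(5).roll(-1, 0)   # ring topology
    print(model(atoms, adj))
    ```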

  • DocumentNet: Bridging the Data Gap in Document Pre-Training

    EMNLP 2023

    Document understanding tasks, in particular, Visually-rich Document Entity Retrieval (VDER), have gained significant attention in recent years thanks to their broad applications in enterprise AI. However, publicly available data have been scarce for these tasks due to strict privacy constraints and high annotation costs. To make things worse, the non-overlapping entity spaces from different datasets hinder the knowledge transfer between document types. In this paper, we propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models. The collected dataset, named DocumentNet, does not depend on specific document types or entity sets, making it universally applicable to all VDER tasks. The current DocumentNet consists of 30M documents spanning nearly 400 document types organized in a four-level ontology. Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training for both classic and few-shot learning settings. With the recent emergence of large language models (LLMs), DocumentNet provides a large data source to extend their multi-modal capabilities for VDER.

Courses

  • Artificial Intelligence

    CS 540

  • Cloud Computing

    15-619

  • Data Management for Data Science

    CS 639

  • Database Management Systems: Design and Implementation

    CS 564

  • Deep Learning Systems: Algorithms and Implementation

    10-714

  • Foundations of Computational Data Science

    11-637

  • Interactive Data Science

    05-839

  • Intro to Algorithms

    CS 577

  • Intro to Deep Learning

    11-785

  • Introduction to Optimization

    CS 524

  • Machine Learning

    10-601

  • Machine Learning with Large Datasets

    10-605

  • Multimodal Machine Learning

    11-777

  • Neural Networks for NLP

    11-747

  • Operating Systems

    CS 537

Projects

  • Debias QA Model

    Debiasing state-of-the-art (SOTA) models in NLP is becoming increasingly important, but debiasing SOTA question-answering models has been understudied. In this project, we aim to measure and reduce stereotypes in QA systems without losing too much performance. Specifically, we choose one of the most widely studied QA datasets, SQuAD, as our base QA dataset and, correspondingly, the current SOTA model for SQuAD, LUKE. Our results show that with the proposed debiasing structure, performance decreases by only 3% while the bias score decreases by 36%.
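
    One common way to quantify this kind of bias is to perturb demographic terms in a question and count how often the model's answer flips. The sketch below assumes a hypothetical `qa_model` interface; it illustrates the metric idea, not the project's implementation.

    ```python
    # Hypothetical answer-flip bias metric; `qa_model` and the term pairs are
    # assumed interfaces, not the project's code.
    def bias_score(qa_model, examples, term_pairs):
        """qa_model(context, question) -> answer string (assumed interface)."""
        flips, total = 0, 0
        for context, question in examples:
            for a_term, b_term in term_pairs:     # e.g. ("he", "she")
                if a_term not in question:
                    continue
                swapped = question.replace(a_term, b_term)
                ans_a = qa_model(context, question)
                ans_b = qa_model(context, swapped)
                flips += ans_a != ans_b           # changed answer -> bias signal
                total += 1
        return flips / max(total, 1)
    ```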

  • Listen Anime Subtitle: An LAS-Based Automatic Subtitle Generator

    In this work, we explore transfer learning and learning representations in end-to-end automatic speech recognition (ASR) models. We apply our methods to the Kimi no Na wa dataset, which entails transcribing movie dialogue for subtitles. Our core contributions include (1) transfer learning of a Listen-Attend-Spell (LAS) model from the Wall Street Journal dataset and (2) transfer learning of a large transformer model using the wav2vec 2.0 self-supervised learning representation for .WAV files. Specifically, we achieve an evaluation Levenshtein distance of 15.04 using transfer learning on spectrograms with our LAS model, and a distance of 6.93 using raw audio files with wav2vec 2.0 and a pretrained transformer model.
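
    For reference, transcription with a public pretrained wav2vec 2.0 checkpoint looks roughly like the sketch below (the project's fine-tuned weights and data pipeline are not shown, and the file name is hypothetical), together with the Levenshtein distance used as the evaluation metric.

    ```python
    # Inference with a public wav2vec 2.0 checkpoint; clip.wav is hypothetical
    # and assumed to be 16 kHz mono. Not the project's fine-tuned model.
    import torch
    import soundfile as sf
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    audio, sr = sf.read("clip.wav")
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

    def levenshtein(a, b):
        """Edit distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    print(pred, levenshtein(pred, "REFERENCE SUBTITLE TEXT"))
    ```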

  • Interactive Happy Moment Analysis

    Happiness matters to most of us, but it is hard to say what makes people happy. Understanding the factors that bring people happiness is key to improving happy experiences and overall life satisfaction. To guide users toward a deeper understanding of their happiness sources, we built an interactive website for happy moment (text) analysis in a storytelling fashion.

    To generate the analysis content, we trained multiple traditional machine learning and deep learning models to classify happiness sources based on the HappyDB corpus. For accuracy and interpretability, we chose a logistic regression model (F1 score 0.836) for the website. For each prediction, we generated LIME plots to help explain how the classifier made its decision. For the entire HappyDB corpus, bar plots with example sentences provide a summary of the top words in each category.

    The website is framed as a summary of happy experiences from 2020 as Christmas approaches. Users can enter their own happy moment and see how "Santa" (played by the model) guesses its happiness source. They can then test their own understanding of happiness by classifying others' happy experiences and comparing with "Santa"'s analysis. Additional activities include exploring the summary bar plots and playing with a bonus sentiment classifier. Overall, we hope users come away with new thoughts about their happy moments, and that a cheery Christmas holds lots of happiness for them.
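
    A minimal sketch of the classifier-plus-LIME setup described above, with two toy sentences standing in for the HappyDB corpus; this mirrors the approach but is not the project's exact pipeline.

    ```python
    # TF-IDF + logistic regression with LIME explanations; toy data stands in
    # for HappyDB, and the category names are illustrative.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from lime.lime_text import LimeTextExplainer

    texts = ["I had dinner with my family", "I finished a big project at work"]
    labels = ["affection", "achievement"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)

    explainer = LimeTextExplainer(class_names=list(clf.classes_))
    exp = explainer.explain_instance(
        "My daughter hugged me this morning",
        clf.predict_proba,               # LIME perturbs text, needs probabilities
        num_features=5,
    )
    print(exp.as_list())                 # top words driving the prediction
    ```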

  • Twitter Analytics on the Cloud

    - Build a performant and reliable web service on the cloud using self-managed VMs within a specified budget by combining the skills developed in this course.
    - Design, develop, deploy, test, and optimize functional web servers that can handle a high load (~tens of thousands of requests per second).
    - Implement Extract, Transform and Load (ETL) on a large dataset (~1 TB) under budget constraints, and load the data into MySQL and HBase databases.
    - Design a suitable schema for a specific problem (a hypothetical sketch follows below) and optimize MySQL and HBase databases to increase throughput when responding to requests at large scale.
    - Explore methods to identify potential bottlenecks in a cloud-based web service to improve system performance.
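
    A hypothetical sketch of the kind of MySQL schema such a tweet-lookup workload calls for (the course project's actual tables are not shown); the composite index matches the dominant by-user, by-time read pattern.

    ```python
    # Hypothetical schema; column names and sizes are assumptions, not the
    # actual course deliverable.
    SCHEMA = """
    CREATE TABLE tweets (
        tweet_id   BIGINT UNSIGNED NOT NULL,
        user_id    BIGINT UNSIGNED NOT NULL,
        created_at DATETIME        NOT NULL,
        text       VARCHAR(560)    NOT NULL,
        PRIMARY KEY (tweet_id),
        KEY idx_user_time (user_id, created_at)  -- speeds up by-user, by-time lookups
    ) ENGINE=InnoDB;
    """
    print(SCHEMA)
    ```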
