Sean Sun

United States

372 followers · 359 connections

Experience & Education

  • Google

Publications

  • LMDX: Language Model-based Document Information Extraction and Localization

    ACL 2024 Findings

    Large Language Models (LLM) have revolutionized Natural Language Processing (NLP), improving state-of-the-art on many existing tasks and exhibiting emergent capabilities. However, LLMs have not yet been successfully applied on semi-structured document information extraction, which is at the core of many document processing workflows and consists of extracting key entities from a visually rich document (VRD) given a predefined target schema. The main obstacles to LLM adoption in that task have been the absence of layout encoding within LLMs, critical for a high quality extraction, and the lack of a grounding mechanism ensuring the answer is not hallucinated. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology to adapt arbitrary LLMs for document information extraction. LMDX can do extraction of singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. In particular, we apply LMDX to the PaLM 2-S LLM and evaluate it on VRDU and CORD benchmarks, setting a new state-of-the-art and showing how LMDX enables the creation of high quality, data-efficient parsers.
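
    The two mechanisms the abstract names, layout encoding and grounding, can be illustrated in a few lines. The sketch below is a hypothetical reading of that idea, not the paper's actual prompt scheme: OCR segments are serialized with quantized coordinates so a text-only LLM can reason about layout, and answers must cite segment IDs so they can be verified and localized.

    ```python
    # Hypothetical sketch of the idea in the LMDX abstract; the segment format,
    # function names, and prompt wording are assumptions for illustration.

    def quantize(v, page_size, buckets=100):
        """Map a pixel coordinate onto a coarse layout bucket."""
        return min(buckets - 1, int(v / page_size * buckets))

    def build_prompt(segments, schema, page_w, page_h):
        """segments: list of (seg_id, text, x, y) tuples from an OCR engine."""
        lines = [
            f"{text} {quantize(x, page_w)}|{quantize(y, page_h)} [{seg_id}]"
            for seg_id, text, x, y in segments
        ]
        return (
            "Document:\n" + "\n".join(lines) + "\n\n"
            "Extract these entities as JSON, citing the segment id of each "
            f"value so it can be localized on the page: {schema}"
        )

    def grounded(value, cited_ids, segments):
        """Reject hallucinated answers: the value must occur in cited segments."""
        cited = " ".join(t for sid, t, _, _ in segments if sid in cited_ids)
        return value in cited
    ```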

  • ACED: Accelerated Computational Electrochemical systems Discovery

    NeurIPS Climate Change and AI Workshop 2020

    Large-scale electrification is vital to addressing the climate crisis, but many engineering challenges remain in fully electrifying both the chemical industry and transportation. In both of these areas, new electrochemical materials and systems will be critical, but developing these systems currently relies heavily on computationally expensive first-principles simulations as well as human-time-intensive experimental trial and error. We propose to develop an automated workflow that accelerates these computational steps by introducing both automated error handling in generating the first-principles training data as well as physics-informed machine learning surrogates to further reduce computational cost. It will also have the capacity to include automated experiments "in the loop" in order to dramatically accelerate the overall materials discovery pipeline.
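
    The workflow the abstract sketches, cheap ML surrogates screening candidates so the expensive first-principles step runs only where it matters, follows a standard active-learning loop. Below is a toy sketch under that assumption; `expensive_simulation` is a stand-in for a real first-principles code, not the project's interface.

    ```python
    # Toy surrogate-in-the-loop sketch; all names and the objective function
    # are illustrative stand-ins, not the ACED workflow itself.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def expensive_simulation(x):
        # Stand-in for a costly first-principles (e.g. DFT) calculation.
        return float(np.sin(3 * x[0]) + 0.5 * x[0])

    rng = np.random.default_rng(0)
    candidates = rng.uniform(0, 2, size=(200, 1))   # candidate design space
    X = candidates[:5]                              # small seed set
    y = np.array([expensive_simulation(x) for x in X])

    for _ in range(10):                             # active-learning rounds
        surrogate = GaussianProcessRegressor().fit(X, y)
        _, std = surrogate.predict(candidates, return_std=True)
        pick = candidates[np.argmax(std)]           # most uncertain candidate
        X = np.vstack([X, pick])
        y = np.append(y, expensive_simulation(pick))  # one costly call per round

    print("best candidate:", X[np.argmax(y)], "value:", y.max())
    ```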

  • Assessing Graph-based Deep Learning Models for Predicting Flash Point

    Molecular Informatics

    Graph-based machine learning models have been widely used, but few studies have applied them to predicting molecular properties of organic molecules, especially flash point. In this paper, we collected more than 11,000 flash point data points and assessed graph-based machine learning models for predicting flash point.
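
    As a rough illustration of how a graph model maps a molecule to a scalar property, here is a minimal message-passing sketch; it is not the paper's architecture, and the atom features and adjacency below are random stand-ins.

    ```python
    # Minimal message-passing sketch for molecular property regression;
    # illustrative only, not the models assessed in the paper.
    import torch
    import torch.nn as nn

    class TinyMPNN(nn.Module):
        def __init__(self, n_atom_feats=16, hidden=64):
            super().__init__()
            self.embed = nn.Linear(n_atom_feats, hidden)
            self.msg = nn.Linear(hidden, hidden)
            self.readout = nn.Linear(hidden, 1)  # scalar property, e.g. flash point

        def forward(self, atom_feats, adj):
            # atom_feats: (n_atoms, n_feats); adj: (n_atoms, n_atoms) bond matrix
            h = torch.relu(self.embed(atom_feats))
            for _ in range(3):                         # 3 rounds of message passing
                h = torch.relu(h + adj @ self.msg(h))  # aggregate neighbor messages
            return self.readout(h.mean(dim=0))         # mean-pool atoms -> prediction

    model = TinyMPNN()
    atoms = torch.randn(5, 16)                                 # toy 5-atom molecule
    adj = torch.eye(5).roll(1, 0) + torch.eye(5).roll(-1, 0)   # ring topology
    print(model(atoms, adj))
    ```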

  • DocumentNet: Bridging the Data Gap in Document Pre-Training

    EMNLP 2023

    Document understanding tasks, in particular, Visually-rich Document Entity Retrieval (VDER), have gained significant attention in recent years thanks to their broad applications in enterprise AI. However, publicly available data have been scarce for these tasks due to strict privacy constraints and high annotation costs. To make things worse, the non-overlapping entity spaces from different datasets hinder the knowledge transfer between document types. In this paper, we propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models. The collected dataset, named DocumentNet, does not depend on specific document types or entity sets, making it universally applicable to all VDER tasks. The current DocumentNet consists of 30M documents spanning nearly 400 document types organized in a four-level ontology. Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training for both classic and few-shot learning settings. With the recent emergence of large language models (LLMs), DocumentNet provides a large data source to extend their multi-modal capabilities for VDER.

Courses

  • Artificial Intelligence

    CS 540

  • Cloud Computing

    15-619

  • Data Management for Data Science

    CS 639

  • Database Management Systems: Design and Implementation

    CS 564

  • Deep Learning Systems: Algorithms and Implementation

    10-714

  • Foundations of Computational Data Science

    11-637

  • Interactive Data Science

    05-839

  • Intro to Algorithms

    CS 577

  • Intro to Deep Learning

    11-785

  • Introduction to Optimization

    CS 524

  • Machine Learning

    10-601

  • Machine Learning with Large Datasets

    10-605

  • Multimodal Machine Learning

    11-777

  • Neural Networks for NLP

    11-747

  • Operating Systems

    CS 537

Projects

  • Debias QA Model

    Debiasing state-of-the-art (SOTA) models in NLP is becoming increasingly important, but debiasing SOTA question-answering models has been understudied. In this project, we aim to measure and reduce stereotypes in QA systems without losing too much performance. Specifically, we choose one of the most widely studied QA datasets, SQuAD, as our base QA dataset and, correspondingly, the current SOTA model for SQuAD, LUKE. Our results show that with the proposed debiasing structure, performance decreases by only 3% while the bias score decreases by 36%.
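
    One common way to quantify this kind of bias is to perturb demographic terms in a question and count how often the model's answer flips. The sketch below assumes a hypothetical `qa_model` interface; it illustrates the metric idea, not the project's implementation.

    ```python
    # Hypothetical answer-flip bias metric; `qa_model` and the term pairs are
    # assumed interfaces, not the project's code.
    def bias_score(qa_model, examples, term_pairs):
        """qa_model(context, question) -> answer string (assumed interface)."""
        flips, total = 0, 0
        for context, question in examples:
            for a_term, b_term in term_pairs:     # e.g. ("he", "she")
                if a_term not in question:
                    continue
                swapped = question.replace(a_term, b_term)
                ans_a = qa_model(context, question)
                ans_b = qa_model(context, swapped)
                flips += ans_a != ans_b           # changed answer -> bias signal
                total += 1
        return flips / max(total, 1)
    ```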

  • Listen Anime Subtitle: An LAS-Based Automatic Subtitle Generator

    In this work, we explore transfer learning and learning representations in end-to-end automatic speech recognition (ASR) models. We apply our methods to the Kimi no Na wa dataset, which entails transcribing movie dialogue for subtitles. Our core contributions include (1) transfer learning of a Listen-Attend-Spell (LAS) model from the Wall Street Journal dataset and (2) transfer learning of a large transformer model using the wav2vec 2.0 self-supervised learning representation for .WAV files. Specifically, we achieve an evaluation Levenshtein distance of 15.04 using transfer learning on spectrograms with our LAS model, and a distance of 6.93 using raw audio files with wav2vec 2.0 and a pretrained transformer model.
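
    For reference, transcription with a public pretrained wav2vec 2.0 checkpoint looks roughly like the sketch below (the project's fine-tuned weights and data pipeline are not shown, and the file name is hypothetical), together with the Levenshtein distance used as the evaluation metric.

    ```python
    # Inference with a public wav2vec 2.0 checkpoint; clip.wav is hypothetical
    # and assumed to be 16 kHz mono. Not the project's fine-tuned model.
    import torch
    import soundfile as sf
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    audio, sr = sf.read("clip.wav")
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

    def levenshtein(a, b):
        """Edit distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    print(pred, levenshtein(pred, "REFERENCE SUBTITLE TEXT"))
    ```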

  • Interactive Happy Moment Analysis

    Happiness matters to most of us, but it is hard to say what makes people happy. Understanding the factors that bring people happiness is key to improving happy experiences and overall life satisfaction. To guide users toward a deeper understanding of their happiness sources, we built an interactive website for happy moment (text) analysis in a storytelling fashion.

    To generate the analysis content, we trained multiple traditional machine learning and deep learning models to classify happiness sources based on the HappyDB corpus. For accuracy and interpretability, we chose a logistic regression model (F1 score 0.836) for the website. For each prediction, we generated LIME plots to help explain how the classifier made its decision. For the entire HappyDB corpus, bar plots with example sentences provide a summary of the top words in each category.

    The website is framed as a summary of happy experiences from 2020 as Christmas approaches. Users can enter their own happy moment and see how "Santa" (played by the model) guesses its happiness source. They can then test their own understanding of happiness by classifying others' happy experiences and comparing with "Santa"'s analysis. Additional activities include exploring the summary bar plots and playing with a bonus sentiment classifier. Overall, we hope users come away with new thoughts about their happy moments, and that a cheery Christmas holds lots of happiness for them.
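
    A minimal sketch of the classifier-plus-LIME setup described above, with two toy sentences standing in for the HappyDB corpus; this mirrors the approach but is not the project's exact pipeline.

    ```python
    # TF-IDF + logistic regression with LIME explanations; toy data stands in
    # for HappyDB, and the category names are illustrative.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from lime.lime_text import LimeTextExplainer

    texts = ["I had dinner with my family", "I finished a big project at work"]
    labels = ["affection", "achievement"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)

    explainer = LimeTextExplainer(class_names=list(clf.classes_))
    exp = explainer.explain_instance(
        "My daughter hugged me this morning",
        clf.predict_proba,               # LIME perturbs text, needs probabilities
        num_features=5,
    )
    print(exp.as_list())                 # top words driving the prediction
    ```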

  • Twitter Analytics on the Cloud

    - Build a performant and reliable web service on the cloud using self-managed VMs within a specified budget by combining the skills developed in this course.
    - Design, develop, deploy, test, and optimize functional web servers that can handle a high load (~tens of thousands of requests per second).
    - Implement Extract, Transform and Load (ETL) on a large dataset (~1 TB) under budget constraints, and load the data into MySQL and HBase databases.
    - Design a suitable schema for a specific problem (a hypothetical sketch follows below) and optimize MySQL and HBase databases to increase throughput when responding to requests at large scale.
    - Explore methods to identify potential bottlenecks in a cloud-based web service to improve system performance.
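
    A hypothetical sketch of the kind of MySQL schema such a tweet-lookup workload calls for (the course project's actual tables are not shown); the composite index matches the dominant by-user, by-time read pattern.

    ```python
    # Hypothetical schema; column names and sizes are assumptions, not the
    # actual course deliverable.
    SCHEMA = """
    CREATE TABLE tweets (
        tweet_id   BIGINT UNSIGNED NOT NULL,
        user_id    BIGINT UNSIGNED NOT NULL,
        created_at DATETIME        NOT NULL,
        text       VARCHAR(560)    NOT NULL,
        PRIMARY KEY (tweet_id),
        KEY idx_user_time (user_id, created_at)  -- speeds up by-user, by-time lookups
    ) ENGINE=InnoDB;
    """
    print(SCHEMA)
    ```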
