Which language do you use at work?

Published in Tapad Engineering · Feb 1, 2019

by Ju Yang, Data Scientist

“Which language do you use at work?” I get this question quite often from my blog readers. My short answer is usually, “Python for research and Scala for production,” but there is more to it than that. This post offers a detailed look at the various languages we use at Tapad, the occasions for their use, and how the team collaborates on projects using these languages.

SQL for data preprocessing

The first step to researching a new idea or testing a hypothesis is to collect relevant data.

Engineers at Tapad have built well-established ETL (Extract, Transform and Load) pipelines and various sophisticated aggregation jobs to process and store data in Google BigQuery. After rounds of group discussion, literature research, and planning meetings, we kick off a project. To get started, I prefer to have in-person discussions with members of the engineering team who are familiar with the data schema and specifications to learn about the types of data I could use, and where I can find this data.

Once I learn where the data is stored and how it is structured, I usually conduct an exploratory data analysis (EDA) on BigQuery using SQL. EDA gives me an overview of the data, including distributions, trends, and outliers, and, in the simplest format, surfaces counts of the properties of interest. For example, data ingestion errors could leave empty (NaN) values in a table, which may in turn show up as an unusually high count of the value “others,” suggesting that “others” is the default “fall-back” placeholder.
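
As a minimal sketch of this kind of check, the snippet below runs a value-count query from Python with the BigQuery client library; the project, table, and column names are hypothetical, and only the overall pattern matters.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Count how often each value of a (hypothetical) column appears, so that
# NULLs or an oversized "others" bucket stand out immediately.
query = """
SELECT
  IFNULL(device_type, 'NULL') AS device_type,
  COUNT(*) AS row_count
FROM `my_project.my_dataset.events`
GROUP BY device_type
ORDER BY row_count DESC
"""

df = client.query(query).to_dataframe()
print(df.head(20))
```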

When in doubt, I investigate and discuss it with colleagues until I am certain that I understand the data and its meaning.

Over the course of a project, I often work with more than one dataset; in fact, I am often interested in the combination of several datasets. To combine them, I write SQL queries to join, filter, and select the data of interest, then run a “sanity check” on the results.
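
A hypothetical join with a follow-up sanity check might look like the following; the table names, columns, and date filter are invented for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Join two hypothetical tables and keep only the columns of interest.
join_query = """
SELECT
  e.device_id,
  e.event_time,
  p.segment
FROM `my_project.my_dataset.events` AS e
JOIN `my_project.my_dataset.profiles` AS p
  ON e.device_id = p.device_id
WHERE e.event_time >= '2019-01-01'
"""
joined = client.query(join_query).to_dataframe()

# Sanity check: compare row counts and distinct keys against expectations.
print(len(joined), joined["device_id"].nunique())
```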

In addition, for preliminary research, I don’t often need the whole dataset (which easily surpasses 10 TB!), so I create a randomly sampled dataset using SQL and export it (less than 10 GB) to Google Cloud Storage. From there, I download the data to a local computer for further research.
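
A minimal sketch of this sampling-and-export step, assuming made-up project, dataset, and bucket names (a ~0.1% sample of 10 TB is on the order of 10 GB):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Materialize a ~0.1% random sample of the full table into a smaller table.
sample_query = """
CREATE OR REPLACE TABLE `my_project.my_dataset.events_sample` AS
SELECT *
FROM `my_project.my_dataset.events`
WHERE RAND() < 0.001
"""
client.query(sample_query).result()

# Export the sampled table to Google Cloud Storage as sharded, gzipped CSVs.
job_config = bigquery.ExtractJobConfig(compression="GZIP")
client.extract_table(
    "my_project.my_dataset.events_sample",
    "gs://my-bucket/samples/events_sample-*.csv.gz",
    job_config=job_config,
).result()
```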

Finally, for transparency, reproducibility, and visibility, I copy and paste all the queries I used in data preprocessing into a Google Doc and record intermediate results in a Google Spreadsheet. I then organize the queries and results in a readable order and summarize the analysis and visualizations on the company’s internal wiki, Confluence, accompanied by a problem statement or background description. Current and future colleagues can easily pick up where I left off and continue working on the project.

Git for version control

Messing up a master branch and losing a code revision history are two of the worst crimes an engineer can commit. Git version control is essential for safeguarding against both. Git allows teammates to work on separate branches simultaneously without disturbing the underlying master branch, and gives the team a full copy of the code history. After I am done working on a new feature, it is gratifying to create a pull request, complete code reviews, and merge my branch into master, conflict-free!

In addition, I use bash commands in the terminal to download, upload, and organize files, run shell scripts, and modify basic configurations.

Python and R for research

I use Python as my playground during research and development to test and explore new ideas. In particular, I like using Jupyter Notebook with all kinds of (un)common data science packages (pandas, numpy, scipy, matplotlib, sklearn, etc.). At this stage, my main goal is to explore the data, write decent working code, implement machine learning models, and test new machine learning algorithms and feature engineering methods published in the latest research. The data I use at this stage is usually a sample of the original data from BigQuery, and the sample is relatively small (a few GB) so that I can run the code on my local computer quickly.
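
As a rough sketch of what a notebook cell at this stage might contain (the file name, feature columns, and model choice are all hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Load the locally downloaded sample.
df = pd.read_csv("events_sample.csv")

# Quick EDA: shape, dtypes, and value distributions.
print(df.shape)
print(df.describe(include="all"))

X = df[["feature_a", "feature_b", "feature_c"]]  # hypothetical features
y = df["label"]                                  # hypothetical target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a baseline model and evaluate it on the held-out split.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```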

A lot of exciting data mining, modeling, and machine learning research happens at this stage. The versatility and ease of use of Python make it possible to get preliminary results within a few days (or even hours), and Jupyter Notebook, with its visualizations, makes it easy to share analysis and results with coworkers for feedback and further iteration.

On occasion, team members may decide to use R in addition to Python. Usually, this happens if a team member wants to try some new packages.

Scala for research and production

While Python and sklearn are convenient for research and prototyping, results from a small subset of the data may not be representative of the whole dataset. Python and sklearn are also not scalable enough to handle large datasets, and are prone to crashing in the middle of training jobs if the input data exceeds a few GB.

This is why our current machine learning pipeline is written in Scala, a statically typed language that combines object-oriented and functional programming and, paired with the Spark framework, scales well to very large datasets.

In practice, after conducting an initial assessment and analysis of a research topic in Python, I compose the code in Scala and integrate it into the current pipeline. I can then run Spark jobs to generate features and train machine learning models. I started using Scala in production in June 2018, and have found it to be an extremely powerful and enjoyable tool for data science projects. The “map” function, in particular, has been useful, as it writes the loops for you! Scala is also statically typed and type-safe, which is a great property for production code.

This whole process involves collaboration and code pairing with engineers, debugging, unit testing, code reviews, refactoring, and more debugging.

We also have machine learning modules written in Python, which provide a more flexible and diverse selection of machine learning algorithms. My internship project was in fact written in Python and ran at scale on the cluster as Spark jobs. However, Scala is still the dominant language used for production at Tapad.
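
As a hedged illustration of that pattern, a Python module submitted as a Spark job might look roughly like this; the paths, columns, and model choice are hypothetical, not Tapad’s actual pipeline.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("train-model").getOrCreate()

# Read the full dataset from distributed storage instead of a local sample.
df = spark.read.parquet("gs://my-bucket/features/")

# Assemble hypothetical feature columns into the single vector column
# that Spark ML estimators expect.
assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b", "feature_c"],
    outputCol="features",
)
train = assembler.transform(df).select("features", "label")

# Fit a model on the cluster and persist it for downstream use.
model = LogisticRegression(labelCol="label").fit(train)
model.save("gs://my-bucket/models/logreg")
```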

Effectively communicating beyond the screen

Even more important than choosing the right programming language for each step of a project is being able to communicate effectively with team members about the project(s) you are working on. In fact, speaking the same language as your audience and conveying ideas and results clearly are vital to any project’s success.

For instance, if I am talking with data science colleagues about technical issues, it is generally okay to use some jargon and acronyms. However, I don’t assume that everyone is acquainted with these shorthand terms, even if I’m speaking with very senior, experienced data scientists. After all, nobody knows everything, and jargon or acronyms are often specific to individual projects rather than recognized across the industry.

It is doubly important to avoid jargon or verbal shortcuts when I am talking with coworkers from other teams. Although some people think that using technical acronyms and “data science-y” terms makes them look smart and professional, the fact is that no one likes to feel stupid for not understanding terminology. It is more powerful and efficient to speak clearly, using language that everyone understands from the outset.

The purpose of a conversation in any format, one-on-one or in a group meeting, is not to show off how much we know, but to listen, learn, collaborate, and share feedback. If the people we are speaking with feel shut down, lost, or unable to contribute, the meeting turns into a monologue, which hinders teamwork and hurts our own potential for growth. As Albert Einstein put it, “If you can’t explain it to a six-year-old, you don’t understand it yourself.”

Ju Yang is a Data Scientist at Tapad, who began her career as an intern at the company. Prior to joining Tapad, she graduated from Columbia University. She currently lives in NYC, loves travel, and is interested in everything sci-fi. || Follow Tapad Engineering on Twitter and Medium.
