
Which R is the "best": base, Tidyverse or data.table?

Viewed 4k times · 35 replies · 45 votes

[Image: Base, Tidyverse, and data.table face off in a boxing ring]

As the R ecosystem continues to evolve, a question that keeps being asked is "Should I use base R, Tidyverse, or data.table?"

There may be some aspects in which one approach may be objectively better than the others. For example, see this main site Q&A: data.table vs dplyr: can one do something well the other can't or does poorly?

In the end, the decision of which ecosystem to use involves a trade-off between competing factors. Failing a personalized recommendation, which one is right for the average user?

Your colleague is getting into data science or statistics for the long term. Which ecosystem should they focus on and why?

Base R, Tidyverse or data.table?

Some points to consider:

  • Ease of learning
  • Code readability (conveying the purpose of your code to yourself and others)
  • Performance (computational speed and memory efficiency)
  • Concision

35 replies

Reply 77091037 · 43 votes

Learn all of them, choose one, or two, or all of them, or the one your coworker uses, or the one that you think might not break your code in a month, or Python, or DYOR by reading all the other X-vs-Y arguments scattered across the internet already.

Do we have enough moderators to stop all the wars that are going to break out in these discussion forums if it's going to be questions like this?

Reply 77105984 · 11 votes

As long as they do not leak out from one discussion thread, why stop them? You don't have to click on a clearly inflammatory thread, and I for one found this thread very interesting for many reasons. First and foremost, I am teaching students, and I need to explain to them why I teach X and Y but not Z.

Reply 77091205 · 17 votes

My recommendation is to start with the fundamentals of base R and then switch to the Tidyverse at an early stage. Tidyverse functions are usually easier to understand for new users, more readable, and often more performant than base R. The Tidyverse also borrows a number of concepts and function names from other languages, such as SQL, which facilitates venturing beyond R at some point. Learning just the key concepts of base R before adopting the Tidyverse is similar to how many people learn JavaScript for front-end development: learn some basic vanilla JavaScript and adopt a framework (React, Svelte, etc.) at an early stage.
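To make the readability point concrete, here is a small sketch of my own (using the built-in mtcars data, which is my choice of example, not the answerer's): the same grouped summary written in base R and in dplyr.

```r
library(dplyr)

# Base R: mean mpg per number of cylinders
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# dplyr: the same result, reading top to bottom much like a SQL query
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
```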

Any casual R user may want to stick with the Tidyverse. I would only deviate from that when exploring more advanced tasks, like package development, web application development, or large data handling.

For package and web application development, it is a good idea to keep the number of dependencies low, which means relying heavily on base R.

Large data tasks make data.table an attractive choice. The syntax, and thereby code readability, is horrendous, but the performance, in terms of both computational speed and memory efficiency, is definitely better than that of base R and the Tidyverse. There is a Tidyverse wrapper for data.table, but anyone considering using data.table in the long run might want to use the package's own syntax.
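For comparison, a minimal sketch of the same kind of grouped summary in data.table's own DT[i, j, by] form (again just an illustration on mtcars):

```r
library(data.table)

dt <- as.data.table(mtcars)

# data.table: mean mpg per number of cylinders, in one DT[i, j, by] call
dt[, .(mean_mpg = mean(mpg)), by = cyl]
```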

In conclusion, I think that the Tidyverse is the best choice for most R users. For the average user, there is no strong reason to use base R, which is still stuck in the 90s, or data.table, which has an awful syntax.

Reply 77096600 · 2 votes

"Any casual R user may want to stick with the Tidyverse. I would only deviate from that when exploring more advanced tasks, .... or large data handling." I do not understand why that favors Tidyverse, I mean asume some basic operations and you learn Tidyverse, then your data set increases... so then you argue it is a good thing that you have to enter a learning curve twice and have to switch your library for basic data manipulation not because your complexity increased, just because you have more rows and columns to handle?

Reply 77097166 · 7 votes

@Merjin van Tilborg, most users never get to these advanced tasks. For the typical new R user, R is their first programming language. I personally favor data.table. But seeing how much my students struggle with base R, the Tidyverse, and the basics of programming, recommending data.table to a new user just makes that person more likely to give up on data science. Even the bulk of more tenured empirical researchers barely write basic code. And with data sets spanning at most a few million rows, they do not feel any need to explore this field any further. Collaborating with them using the Tidyverse is much more pleasant than trying to force them into another system, which simply leads to them not writing code or making a ton of mistakes that you have to correct later.

The good aspect of the Tidyverse is that people with little affinity to programming or people from other languages look at the syntax and quickly grasp what it does. It offers a good trade-off between ease of use and performance for the average user. Just like most people do not need to learn to write C in Vim, there is no need to complicate data analysis in R for most social scientists, geographers, and biostatisticians.

Reply 77097959 · 0 votes

@Chr I am not talking about more complex or advanced tasks, just a simple Tidyverse script that works great on a small dataset; when I want the exact same simple tasks on a bigger data set, I run into issues. Believe me, I have been there. I almost gave up on R when I had to rewrite much of what I had learned, once I understood there were alternatives that keep runtime low and achieve the same result.

Reply 77098072 · 1 vote

@Merjin van Tilborg, yes, though a large share of users never works with data of a size where there is a substantial difference in execution time. I personally work on a project where the rows do not fit into a single data.frame/data.table, but that is much larger than what the average user works with. The question explicitly asks about the average user.

Reply 77099190 · 3 votes

@Merjin van Tilborg Have you tried the dtplyr package? You can stick with dplyr code, which gets translated to data.table before execution. I have timed it on reasonably large datasets: the translation costs no relevant time, and the runtime is virtually the same as when using data.table directly.
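For anyone curious, a minimal sketch of the workflow described above (illustrative only; actual timings depend on the data):

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Wrap a data.table in lazy_dt(); dplyr verbs are then translated
# to data.table code and only run when the result is collected.
lazy <- lazy_dt(as.data.table(mtcars))

lazy %>%
  filter(cyl == 6) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  as_tibble()   # triggers execution of the translated data.table code

# show_query() on the lazy pipeline prints the generated data.table expression
```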

Reply 77104628 · 3 votes

@Wolf, that would have been an option years ago, before I switched to data.table, if I had still been at that stage. By now I have actually learned to appreciate data.table's syntax. I went through that learning curve already, so there is no need to switch back to the tidyverse for things I learned in data.table. For those who are not yet familiar with data.table and encounter performance issues after learning the tidyverse, dtplyr can certainly be a good option.

In general, I get downvotes because I do not consider data.table's syntax awful, or because I disagree that it is more efficient to learn a package twice just because one never expected to jump into bigger data. This is a discussion where all opinions are valid, so downvoting for disagreeing is kind of odd.

Reply 78038007 · 0 votes

"The good aspect of the Tidyverse is that people with little affinity to programming or people from other languages look at the syntax and quickly grasp what it does."

I just want to point out that this is not entirely true, at least in my experience. I agree that novice programmers usually find tidyverse syntax "legible", but more seasoned programmers told me that they find it confusing.

Reply 77091276 · 17 votes

What is an "average user"? Is it an undergrad or even a high-school student taking their first semester of statistics with no prior programming experience? Perhaps one approach to this could be a tailored list, e.g. "if you are learning statistics for a semester/year, then ______" versus "if you are working with larger data and/or need to squeeze out as much speed as possible, then data.table" versus something else.

Reply 77091622 · 4 votes
  1. They aren't mutually exclusive. You can use all of them.
  2. R is fundamentally fragmented because it's so vast. If you're choosing to pick one of these things to get away from fragmentation, you're probably doing it wrong.
  3. If R were re-architected from being Rcpp-centric to being LLVM-centric, that might be a pretty surprising opportunity right there. What would it look like if it ran natively on a GPU? LLVM reliably gives speedups on the order of 1000x versus many libraries (MATLAB, Python, R). It can also run on a variety of hardware and tends to adapt to things like core count and memory pretty well.
Reply 77091775 · 16 votes

While learning data analysis, everything seems to fit into a data frame. And both data.table and the tidyverse are objectively better at handling data frames than base R. So it might be tempting to stick with one of those and forget base.

However, in the real world not everything is a data frame. Images, graphs, and the like are matrices, arrays, lists, or S4 objects, and then knowing base R is unavoidable.
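A few toy examples of the kind of non-data-frame objects meant here, all handled with base R tools (my own illustrations):

```r
# A numeric matrix and base matrix algebra
m <- matrix(rnorm(9), nrow = 3)
m %*% t(m)

# A fitted model is a list-like object, inspected with base tools
fit <- lm(mpg ~ wt, data = mtcars)
str(fit, max.level = 1)
coef(fit)["wt"]

# Plain lists and the base apply family
lst <- list(model = fit, data = mtcars)
lapply(lst, class)
```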

So my best advice would be to get to know some base R, and then, for data frame specific tasks, choose which flavour of data frame manipulation looks most appealing.

A very similar view is expressed here by Brian Caffo.

And also by the creator of data.table, Matt Dowle, here.

Reply 77106022 · 2 votes

Bioconductor (mind the spelling) implements its own type of data frame (DataFrame), which is incompatible with tidyverse tibbles. And there are plenty of DataFrames in Bioconductor. I do agree that knowing base R is necessary, but I would strongly argue that knowing at least the tidyverse or data.table also is. As an omics expert who uses Bioconductor daily, I can't imagine my work without the tidyverse.

Reply 77091906 · 8 votes

I do not think it is a fair question, as the tidyverse includes much more than just data manipulation: it includes ggplot2 and lubridate as well. But for the things data.table can do, I wish I had learned it right from the start.

I learned the tidyverse first, on smaller data, and after a while I had to redo everything I had learned when I had to deal with larger data files and more complex manipulation across many columns.

Syntax and readability are really a non-issue for me; there is a steeper learning curve maybe, but once you get used to it I feel writing data.table is faster and more readable. It is also more intuitive: I do not have to follow a chain to read that I filter and then perform an action. I would rather read, in one line or statement, that I perform an action on filtered data. If others need to read it, we can still comment the code.

Another thing I noticed: although stringr in practice is just a wrapper around stringi, on bulk data it comes at a very high cost in processing time (and yes, sometimes base can beat it). Also, for reading large CSV files, fread is so much faster.
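As a rough illustration of the fread point (using a synthetic file; exact timings will vary a lot by machine and file contents):

```r
library(data.table)

# Write a ~1e6-row CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = runif(1e6), y = sample(letters, 1e6, replace = TRUE)),
          tmp, row.names = FALSE)

system.time(base_df <- read.csv(tmp))  # base reader
system.time(dt <- fread(tmp))          # data.table::fread, typically much faster
```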

I hope this makes some sense beyond code readability, but for me, rewriting away from the tidyverse meant a reduction from over an hour of processing time to a few minutes for my use cases.

Reply 77093050 · 2 votes

There is no simple answer. There is a compromise, and depending on which factors are most important in each case, or for each user, my recommendation would be different. In addition, personal preference is a valid reason for choosing among those computer languages or "ecosystems" that are good enough for the job. We are most productive if we feel comfortable and enjoy the work we do, so it matters a lot that the tools we use feel good to us. However, one important point is that the Tidyverse and data.table are built on top of R, which means that one needs a good understanding of R to debug problems with either of them. So starting by properly learning R is worth the effort.

For someone doing data analysis for a living, updates and deprecations may be tolerable because it is easier to keep up to date with the changes. For the occasional user, like many researchers in academia or part-time maintainers of R packages, such changes can be a pain. I very much appreciate stability and backwards compatibility (I love TeX and LaTeX) but still use the Tidyverse for some data analysis scripts, though rarely pure "Tidyverse". I used data.table for a while some years back, but for the rather small data sets I work with, it did not feel worth the effort of learning it in depth.

On the other hand, trying data.table and learning to use the Tidyverse, even if nowadays I use it rather selectively, has taught me new ways of solving problems and of thinking about data analysis. So, spending time with the three "ecosystems" was time well spent. One thing that I learnt by using R for the last 25 years is that many of the apparent performance bottlenecks in R can be avoided within R itself if one knows how. Performance of R has also improved over the years.

What I wrote above is just my opinion at this time. It is based on how I use these ecosystems. Without actually planning it, my routine has become to use base R unless there is a clear advantage in using Tidyverse packages and functions. I do not currently use data.table, but I would at least try it again if I had to improve performance for large data sets. @Chr I disagree with you about casual users, but mainly because of the constant evolution of the Tidyverse. R is mostly in the 90s, but its design is no longer in flux. Again, a compromise. @Spacedman I agree with you, I think. These "ecosystems" are tools. Any tool needs to match the task, but also fit the user of the tool.

Reply 77095824 · 5 votes

For my daily work, I use only data.table: it is faster on heavy datasets, and since I (and my colleagues) need it for heavy datasets anyway, the syntax is fine for small ones too, so there is no need for us to use tidyverse syntax. Base R is, in any case, unavoidable: one needs to know about base R objects regardless. The drawback of this choice: it is not so easy for us to understand the tidyverse syntax in answers on Stack Overflow. :)

Reply 77097300 · 4 votes

I prefer Tidyverse syntax because it's easier to both write and read. I value being able to translate my ideas into code quickly, and I also appreciate that I can show Tidyverse code to a non-programmer teammate and most often they'll be able to understand what's going on. I believe writing readable code is really important for data scientists, as we often work with other professionals and social scientists. Tidyverse syntax therefore allows us to write readable, understandable, and clean code, which is a must in order to avoid giving the impression that code is some obscure and inaccessible thing to the rest of the team.

Reply 77098346 · 3 votes

Maybe a point to consider about data.table: https://github.com/Rdatatable/data.table/issues/5656

Reply 77105330 · 2 votes

The title of that Github issue is needlessly dramatic - I'd encourage you to review issues such as #5676 and #5686 that reflect the broad group of contributors and users who are invested in the continued success of the data.table project.

Reply 77099243 · 3 votes

Although this is already a never-ending discussion, we should also add polars and arrow to this arena. They are the promising newcomers in the R ecosystem.

Also, there are some packages that, IMHO, combine the best of both worlds: for instance, dtplyr, tidytable, or tidypolars.

Reply 77104681 · 3 votes

I'm with Norm Matloff, who has a comprehensive critique.

Reply 77106363 · 0 votes

I think his critique is very clearly directed at a specific teaching style, and it is hard to disagree with him on many of the points he raises. However, I think on other points he is wrong.

For example, when he writes that it is not possible to do `x[col,row] subset(blood.glucose

Reply 77106189 · 5 votes

Basic R operations are important and everybody should be familiar with them. But base R code, in connection with data science, quickly starts to result in monstrous, unmaintainable code. Think about it: how do you turn a long data set into a wide data set without pivot_wider?

I have been using only pure base R code for about fifteen years of my life in countless projects. I was very reluctant to start using the tidyverse (or even ggplot2), because my learning curve predates these packages. Then I turned to data.table, but I never really liked it. Eventually, a couple of years ago, I grudgingly accepted that coding with pipes and all the goodies that come out of the box in tidyverse packages is really useful, and nowadays I can't live without them. I still try to keep the tidyverse to a minimum (preferably zero) in my own packages, but I do use it a lot in the code I write for scientific projects.

In teaching, I start with basics, so in line with what Norm Matloff writes, my students definitely know `[`, `$`, logical subscripts, and all that. However, we quickly introduce them to basic tidyverse functions for reading and writing files, filtering the data, mutating columns, selecting columns, tidying the output of statistical objects, sanitizing data frames, and more. We cherry-pick the stuff that I consider most useful. We teach pipes very late or not at all, but we strongly recommend choosing tibbles over data.frame and the tidyverse file-reading functions over the built-in ones, because of their consistency and verbosity.

What I really dislike in the tidyverse is variable selection. I know the rationale for it, and I accept that it makes sense from a certain point of view, but I really hate it. Not least because it is such a pain to write functions around it, and because it puzzles my students.

Reply 77328438 · 4 votes

"Think about that: how do you do turn a long data set into a wide data set without pivot_wider?"

There's the stats::reshape function included in the base install of R which works just fine and can do what you ask in 1 or 2 lines depending on the dataset. There's literally hundreds (maybe thousands?!) of questions here on Stackoverflow showing how to do this in base R.
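For reference, a toy example of both routes (my own illustration), which give equivalent wide tables up to column naming:

```r
library(tidyr)

# A small long-format data set
long <- data.frame(
  id    = c(1, 1, 2, 2),
  time  = c("t1", "t2", "t1", "t2"),
  value = c(10, 11, 20, 21)
)

# tidyr
pivot_wider(long, names_from = time, values_from = value)

# base R (stats::reshape), as the comment above suggests
reshape(long, idvar = "id", timevar = "time", direction = "wide")
```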

Reply 77115414 · 2 votes

I try to steer people away from base R when first starting out. I find the tidyverse to be a much easier entry point for new users. I think data.table is slick in being so compact, but harder to read out of the gate. There is dplyr compatibility with data.table via dtplyr, so users can get the best of both worlds, but still I'd start with the tidyverse syntax. I think a lot of Stack Overflow answers and tutorials in the wild focus on the tidyverse, and that's another reason I point people in that direction.

Reply 77193711 · 4 votes

Some people like the Tidyverse, some people like base R. I use both. But I think it is a mistake to steer people away from the basics of a language, especially at the beginning.

When you learn a language, you have to understand the basics; otherwise you will miss a lot of things. I have solved a lot of Tidyverse issues thanks to my understanding of base R. And when I was learning R and the Tidyverse and ran into a Tidyverse issue, I tried to fill the gaps in my base R knowledge.

Reply 77748872 · 3 votes

I hire people for statistics roles, and I make a point to ask about tidyverse vs BASE when they say R programming is a strength. The best answer I'm looking for from candidates is "I can do either." The worst answer is along the lines of complaining about BASE. In part, that's because a lot of what we develop in house is in BASE. The worst possible new hire is someone with a limited range of ability and a chip on their shoulder.

Unfortunately, my take on dplyr is that it's a completely different programming language: when candidates say they only use dplyr, I don't feel that they know R. dplyr is more like Bash than it is like R in terms of how arguments are piped through several steps to arrive at a result.

At the end of the day, a great programmer needs to be able to explain how they got their results and check their steps carefully, so a case could be made for any language there. If you merge datasets, do you check that the merge is unique, and if it's not, are you getting the desired outputs? (SAS will do a flat join and BASE R will do a Cartesian join; what does dplyr do?) I mention this to nuance how you point people in the future.
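On the join question, here is a small sketch of what happens with duplicate keys. The base merge() behavior is standard; the dplyr side assumes a recent version of the package (the relationship argument appeared in dplyr 1.1.0).

```r
library(dplyr)

left  <- data.frame(id = c(1, 1, 2), x = c("a", "b", "c"))
right <- data.frame(id = c(1, 1, 2), y = c("p", "q", "r"))

# Base R: merge() expands every combination of matching keys (5 rows here)
merge(left, right, by = "id")

# dplyr: left_join() also expands the matches; since 1.1.0 it warns about
# many-to-many joins unless the relationship is declared explicitly
left_join(left, right, by = "id", relationship = "many-to-many")
```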

Reply 77117698 · 2 votes

I am using R for statistics and econometrics, so for my purposes I find the easystats set of packages more useful than the tidyverse.

Reply 77269055 · 7 votes

In my opinion, you need all of them. It's analogous to a triangle; you move towards whichever vertex you require at a given moment. Personally, I prefer data manipulation with tidyverse, but I switch to data.table when speed is a priority. One advantage of using base R is its independence from other packages, especially when creating a package of your own.

Reply 77689334 · 0 votes

The new kid: tidytable!

It allows us to use dplyr syntax with data.table speed: https://www.rdocumentation.org/packages/tidytable/versions/0.10.2

Reply 77748930 · 0 votes

Stack Overflow is a dplyr community. If you ask this question here, you're going to get only one kind of answer.

I admit I don't like dplyr, but that's personal preference. I've managed R and SAS programmers, and a saying in the SAS community that facilitates collaboration is often lost on R programmers: small steps for small minds. When a production programmer gets stumped and needs me to step in to solve issues, it's problematic when their work is buried 10 or 12 levels deep, whether those levels are pipe operators (%>%) or brackets or parentheses enclosing lapply/apply/sapply calls.

Anyone aspiring to be a professional programmer should do themselves a favor and search job ads in their area. I expect that the posting will be candid about what "flavor" of R they want their programmers to use.

Reply 78689932 · 0 votes

You should learn at least base R and the tidyverse. data.table is great because it is fast and performant, but its syntax is more difficult to grasp, and the majority of people do not know data.table while they do know the tidyverse.

And of course you should know base R. Learning an ecosystem or a framework without understanding the basics of the language does not make sense to me.

Reply 78696591 · 0 votes

Actually the question makes little sense.

Do not forget that the tidyverse is not only wrangling packages like dplyr, tidyr, or purrr. There are also packages like ggplot2, stringr, and lubridate. We can compare all of this to base R, since base R can also work with strings and dates and can also produce plots. But data.table can't. So obviously you can't use just data.table.

And you can't use only the tidyverse either, because there is no tidyverse equivalent for every base R function (there is no tidyverse set.seed(), mean(), rnorm(), ...). So no matter how hard you fall in love with the tidyverse, you will always use the base R language, too.

Reply 78768193 · 0 votes

I've never studied data.table thoroughly, but from reading around it seems unquestioned that it is the clear performance winner for operations on large datasets.

As for the other two, dplyr and the tidyverse are very terse and readable, but with small to average datasets I noted that base R is almost always much faster than dplyr, up to 50x in my tests.

My guess is as follows:

  • Ease of learning and code readability: tidyverse

  • Performance, normal datasets (up to hundreds of entries): base R

  • Performance, large datasets (thousands of entries): data.table

  • Concision: not a huge difference
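A minimal sketch of how one could reproduce that kind of small-data timing comparison, using the microbenchmark package (my choice of tool; absolute numbers and ratios will vary a lot by machine and data):

```r
library(dplyr)
library(microbenchmark)

# Small data frame: per-call overhead dominates here, which is where
# base R subsetting tends to look much faster than the dplyr verbs.
df <- data.frame(g = sample(letters[1:5], 500, replace = TRUE), x = runif(500))

microbenchmark(
  base  = df[df$g == "a", ],
  dplyr = filter(df, g == "a"),
  times = 100
)
```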