
UPDATE July 26, 2024

Thank you for your patience as we worked to get clarity around a few issues. We have two updates.

First, our product team has regretfully informed us that, within the existing time constraints, it will not be possible to bundle the dumps for a site and its meta together. If you want the downloads for both a site and its meta, you will need to make the request on each site separately. We know the initial announcement implied that the two files would be bundled, and we’re sorry that won’t be the case right now. The bundling was initially assumed to be a low lift, but it turned out not to be, so it was taken out of scope for this release.

The second update we have is regarding the “Click to agree” language. Many users have found that language to be unnecessarily bureaucratic and difficult to parse and understand, within the context of a CC BY-SA 3.0 license. Philippe committed to asking our lawyers to take a look, and they did.

Both the product and legal teams have signed off on this modified language:

"I understand that this file is being provided to me for my own use and for projects that do not include training a large language model (LLM), and that should I distribute this file for the purpose of LLM training, Stack Overflow reserves the right to decline to allow me access to future downloads of this data dump."

This makes it easier for users to understand the agreement they’re making by downloading the file and, most importantly, clarifies that there is a separate path for someone who would like to use the data to train an LLM (using the “contact us” link below the heading). The “Become a partner” form allows someone to specify interest in non-commercial LLM-training use as well as commercial use. The updated image of the data dump access page (below) reflects the new language.

Image of data dump access page with updated text

We’ll update this post again in the coming weeks when the data dump access page is available for use.


Original Announcement

12 July, 2024

Today, we are announcing some changes to the data dump process. I’m going to start with an important statement: this is primarily a change in where the data dump is accessed. Moving forward, we’ll be providing the data dump from a section of the user profile on each Stack Exchange site.

There are a number of reasons for this: first, this is an attempt to put commercial pressure on LLM manufacturers to join us and our existing partners in the “socially responsible AI” usage that we’re advocating for - to get them to give back to the communities whose data they consume.

Second, we want to make the process of accessing data dumps quicker and more efficient. While the Internet Archive has been a great partner to us, as you may know, people both inside and outside the company have encountered challenges uploading and downloading the dumps at any reasonable speed.

Lastly, with the heightened ease of obtaining a dump, we’re curious to see what community members will build with this information.

I want to emphasize a few things:

  1. You may download the dumps, free of charge, as you always have, just in a different and, we hope, more convenient location. The CC BY-SA license is unchanged.

  2. The dump file will be provided in an “instant” format - we will generate a URL on the backend from which the site’s data can be downloaded.

  3. We are requesting that if you intend to use the dump for a commercial purpose, you consider joining the socially responsible AI movement and giving back to the community. For commercial purposes, there will be a link on the form to route you to the Stack Overflow team to discuss more information and next steps.

  4. We are requiring that all partners in socially responsible AI comply with the CC BY-SA attribution requirements, attributing content to the community members who contributed it. All reuses of the data require attribution and it’s a part of the license.

Mockup screenshot showing "Data dump access" in the Access section of the profile settings on Stack Overflow. The text of the page is transcribed below.

Data dump access

You can access this site's data for non-commercial use. For commercial purposes, please contact us.


A new data dump is available every three months. Learn more about the data dump process.

(checkbox) I agree that I will use this file for non-commercial use. I will not use it for any other purpose, and I will not transfer it to others without permission from Stack Overflow. I certify that I am not downloading this file on behalf of my employer, for use in a for-profit enterprise. I have read and agree to the Terms of Service and the Acceptable Use Policy, and have read and understand Stack Overflow's privacy notice.

(button) Download data

(Above: The proposed location of the data dump access within the account settings; please note that some of the placement/layout/language is subject to change.)

At the same time, we are also reviewing our stance on data scrapers who pull down the site, including those who then use the data to impersonate Stack Overflow. We are adding additional checks for bots, and are hardening that infrastructure. This has been a longstanding request from the community, and we’re pleased to be able to resource this change. More details to come on this one.

Unfortunately, this new process will not be ready until mid-August. Because this is an important change, we want to use it for the July data dump release, which means that we will miss our previous commitment date of July 31 for the data dump. If you have an urgent and critical need for an updated version of the data, please contact us using the contact form, and we will see if there’s a way to accommodate you on the original schedule. The meta post will be updated once the new data dump process is available.


FAQ

Why are you updating the data dump process?

We are attempting to protect the long-term viability of the Stack Exchange network and to ensure that it is a comprehensive and well-attributed resource for generations to come.

We want to modernize the data dump process and make it more efficient for users in the community. By placing data dump access within the Stack Exchange site profile, users will be able to quickly access and track each quarter’s data dump.

Simultaneously, we know that companies have scraped or otherwise ingested Stack Overflow and Stack Exchange data to train models without proper attribution — models that they are monetizing or using for commercial purposes. We know this is happening because the companies themselves, or independent researchers, have disclosed this information. Assuring long-term viability of the community requires resources, and the commercialization process exists to assure that those resources flow back into this community. This also allows us to advocate on behalf of the contributing community members for appropriate attribution.

It’s important to note that this is not about preventing the everyday person from accessing the data dumps, nor is it about putting up a wall that is impervious to workarounds (which would be futile). This is part of laying out the pathway for companies to go through the proper channels and be above board in how they access the data. These companies want to be doing it the “right” way. In the rapidly changing legal and ethical landscape, being compliant and having a defined partnership for data access is important to both their brand and their bottom line.

This change should have little practical impact on the everyday person accessing the data dumps. It is ultimately a change in service of a very important goal. This is a trade-off in response to a rapidly changing world - a world in which we are now dealing with things like LLMs, which were not even a consideration when the CC BY-SA license was originally adopted.

How do users access data dump files?

To get the data file for a Stack Exchange site, users will go to a location within the settings page on that site. They will be provided with a link to the data file, and can also continue to access the data through the Stack Exchange API or the Data Explorer.

Please note that each Stack Exchange site profile will only provide the data for that specific site and its Meta site. If you wish to access the data for additional Stack Exchange sites, you will need to go through the same process on each site.

How is a request for commercial use made?

Commercial use requests can be made by filling out the “Become a partner” form on this page. Requests will be passed to the team responsible for working with partners.

Will Stack Overflow be uploading the data dump to archive.org?

Stack Overflow is no longer uploading the data dump to archive.org. We initially began to host the data dump on archive.org when our prior host went out of business. Now, we have developed the in-house capacity to host these ourselves. We would really rather users do not upload the file to archive.org or similar data pile sites. Assuring the viability of the Network takes resources, and companies that profit off the back of the work of this community should feel an obligation (“socially responsible AI”) to give back to the communities whose work they use to create the models that they are marketing commercially. Our hope is that because the process for individuals to request the dumps is lightweight and quick, you won’t feel the need to undermine these efforts to encourage commercial re-users to contribute back.

Can a user give the file to their employer for them to train a language model on?

When organizations are able to skip out on their obligations to contribute back, the whole internet suffers. Without Stack Overflow as a resource, many of the world’s millions of technologists would not be able to find the answers to their programming questions.  And we know that these organizations are not contributing voluntarily - so putting this light commercial pressure on them is the best method to encourage socially conscious behavior.  It’s important to say that when you breach the agreement that you make when downloading the dumps file, we do have the option to decline to provide you with future versions of the data dumps.  But we really don’t want to have to do that.

A user has an unusual need for the full data dump file. How can they get that fulfilled?

We're scoping and building a process to allow for this, but it will not be in the initial release of the feature. Look for this in v2. For now, if you have a critical need for the full file, please write to us using the contact form and we will try to assist.

  • 238
    Note that the "I will not transfer it to others without permission from Stack Overflow" does not comply with CC BY-SA 4.0, specifically the "No additional restrictions" part which says "You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits." specifically the "ShareAlike" clause does not allow that restriction. Commented Jul 12 at 13:56
  • 45
    What happens if an asteroid hits one of your data centres and the other centre is flooded by a subsequent tsunami? A third party host was a nice safeguard for the community to ensure that we will be able to continue accessing the library of knowledge we build - no matter what happens to SO. Commented Jul 12 at 14:08
  • 36
    So... why not just post this on Monday, then, if you know you aren't going to be releasing the data dump on July 31st anyway? Or just accept it is late and state that July 31st dump will be the last one under the latest process.
    – TylerH
    Commented Jul 12 at 14:29
  • 30
    @Philippe Is there a reason that the company can't meet its previous commitments by posting the Q2 data dump to Archive.org by the end of the month? Something other than the SLT's desire not to?
    – AMtwo
    Commented Jul 12 at 17:59
  • 81
    Given that this has been talked about internally since early 2023, and work was only just prioritized to begin ....essentially today. I'm really struggling to understand the importance of not delivering the latest dump to the Internet Archive on schedule while work continues. If it was that important to not post to the Archive anymore, it begs the question as to why work only just started this week after a year of discussion. And further the decision to delay delivery appears arbitrary given lack of explanation.
    – AMtwo
    Commented Jul 12 at 18:47
  • 57
    @Philippe, you do realize that there is nothing you can do to prevent your community from simply maintaining the archive.org mirror for you? I guarantee that the moment the dumps are made available there will be dozens of people immediately downloading them and mirroring them to archive.org. This whole worthless exercise is just a waste of development time.
    – Maxwell175
    Commented Jul 12 at 23:15
  • 36
    @Philippe so far I am only observing the opposite. It is going to be harder than ever to get this data due to the login requirement and whatever mechanisms you put in place to make it harder to download. SE data dumps are some of the most useful tools for new developers and data scientists to work with. Aggregating data across many sites, running complex queries, ingesting into some database engine. There are thousands of people who rely on this data to learn and do research.
    – Maxwell175
    Commented Jul 13 at 0:42
  • 63
    In less than a month, you went from "we fixed some community's issues" (and people were hopeful) to "the license to your content is just ignorable suggestions" (and people are, rightfully, not happy). Commented Jul 13 at 4:41
  • 36
    You [SE] posted this in around a day after posting it to mods, so I didn't have the chance to reply internally. This revision is certainly better than rev. 1 and 2 presented internally, but the omission of a way to download the entire data dump is far too convenient to be accidental. This drastically increases the difficulty of archiving the data dump in its entirety, which makes it harder to preserve the future of the content if the company goes under (or prosus decides to axe the public Q&A, or anything results in the Q&A disappearing from the internet abruptly) Commented Jul 13 at 17:13
  • 46
    There's no way around it anymore now that you [SE] have pushed this through in spite of internal warnings; you [SE] have made Stack Exchange, Inc. the biggest threat to the sustainability and future of the community. genAI is in second place, because the ongoing damage you [SE] are doing to the community, and to data archival and preservation efforts beats any damage genAI could do by scraping the data. While this hasn't been said anywhere, it's clear to all of us that the only reason you [SE] are doing this isn't to protect the community, but to protect your [SE's] revenue Commented Jul 13 at 17:14
  • 45
    Here we go again. Every promise from the company is just a lie, every time. Nobody ever learns.
    – OrangeDog
    Commented Jul 14 at 11:47
  • 35
    @Philippe "My hope is that over time, they see that there's no need, but that will take time, I understand" - that's a very funny thing to say when the SE management always manages to find new ways to make me less and less fond of this company. I thought we hit rock bottom years ago but it somehow manages to get worse, so over time it got more and more important to have an archive that is completely separate from the company. For me, it has reached the point where I feel the best thing that could happen to SE right now would be firing everyone from CEO down to your level immediately.
    – l4mpi
    Commented Jul 15 at 8:47
  • 70
    The only way this ends is the community uploading all CC-BY-SA content to the Internet Archive anyway, this process only creates anger and frustration for all involved. So why do this? The content was explicitly CC-BY-SA from the start to prevent exactly this sort of plan from being effective. Moves like this make me deeply sad. This isn't what Stack was built to be. Commented Jul 16 at 2:37
  • 30
    I just find it very frustrating that yet another company announcement has been marred by PR-speak instead of a seemingly earnest attempt to interact with the community. You seem to have known that this would spark a lot of discontent, especially given the reassurance that you opened the post with– but this whole thing comes across like "creative rebranding"; you're trying to sell us on something that you knew we wouldn't like. I see the dumps as representative of the idea that our content transcends the company– so the company suddenly deciding who gets to use it feels very negative to me.
    – zcoop98
    Commented Jul 16 at 21:26
  • 38
    @Philippe What exactly were you doing in this week of meetings? You've come back to us after two weeks with a worse version of what was announced, and a bit of wordsmithing that addresses none of the real concerns raised. And I can't believe you're telling us with a straight face you're unable to give us the dumps in a single dump due to technical reasons... despite already being able to do that. It's not new functionality, it's existing functionality.
    – ert
    Commented Jul 26 at 19:54

27 Answers


On July 26, Stack Overflow updated the post to include different modifications to the CC BY-SA license than the prior version. This remains a breach of the license terms, as the CC BY-SA license does not allow additional restrictions, as previously outlined below in this answer and in other answers.


There is a lot going on here... I'll come out and say it: after a first reading, this feels like a great big distraction, because Senior Leadership is treating the data dump like a boogeyman when that's wholly unfounded.

That said, let me try to organize my thoughts some.

Missing the July Dump deadline.

Just over a year ago when I was still staff at the company, I was personally in the unenviable position of having been instructed by the Stack Overflow CEO to disable the Data Dump, and to not re-enable it because he wanted to end the dump. That decision ultimately snowballed until Stack Overflow made commitments to continue the data dump quarterly. Data Superstar Aaron ultimately made some improvements and there was a shift made to the delivery schedule, to make it align better with quarterly boundaries. This is all excellent news for those of us who use the data dumps, and/or are proponents of equal data access, and/or are defenders of the open data commitments made by and for the community.

Now, just one quarter after the company's most recent commitment to a schedule, it's shifting, again. For no reason. Apparently undoing the most recent schedule-shift by bumping (at least) a month.

WHY CAN'T THE DUMP BE POSTED TO ARCHIVE.ORG ONE MORE TIME?

There is no rational reason given as to why to delay the July dump. There are no blockers preventing the company from continuing the existing process until the new process is ready.

Stack Overflow has had plenty of time

As evidenced by the Data Dump's recent history, the company has had plenty of time to pursue changes. It certainly seems like you set an arbitrary deadline, missed it, and are now going to cause arbitrary delays because YOU did not prioritize the work well enough to meet your own deadline. I won't belabor why that's problematic.

If you want to build good will with the community, I'd suggest that the data dump promised to the community by July 31, 2024 still be delivered by that deadline, regardless of the state of the new process you want (but do not need) to use.

Stack Overflow will be violating the BY-SA license

Policing commercial use goes against the Creative Commons BY-SA license

The Creative Commons license is very clear (emphasis mine).

Share — copy and redistribute the material in any medium or format for any purpose, even commercially.

Ethical AI is a great talking point, but I'm not sure of the ethics behind preventing commercial use of something when it is legally, explicitly available for commercial use. The freedom to use it for commercial purposes obviously comes with the need to follow all license terms. The Creative Commons license explicitly says this:

The licensor cannot revoke these freedoms as long as you follow the license terms.

In essence, Stack Overflow is limited to ensuring that downstream users provide Attribution, and that the data continues to be shared under the same open license. Again, quoting from the BY-SA license:

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

For folks reading along, this last quote from the license does NOT disallow requiring a username/password to limit access, but does prohibit any sort of DRM, "watermarking", or other means to limit use.

This license quote DOES prohibit the checkbox included in the data download mockup. By requiring a user to "agree that [they] will use this file for non-commercial use...and [they] will not transfer it without permission..." Stack Overflow is violating the license terms. Period.

A breach of the license terms results in automatic termination

From the Creative Commons FAQ:

All CC licenses are non-exclusive: creators and owners can enter into additional, different licensing arrangements for the same material at any time (often referred to as “dual-licensing” or “multi-licensing”). However, CC licenses are not revocable once granted unless there has been a breach, and even then the license is terminated only for the breaching licensee.

If Stack Overflow violates the CC BY-SA license, the users who created the content (i.e., "us," not the Company) can terminate the license granted to the Company. This could be DEVASTATING to the community. Particularly on smaller SE sites, a small set of users forcefully revoking Stack Overflow's ability to use the data under CC BY-SA could set a site back years.

I would be excited to see the Company helping to enforce BY attribution and SA share-alike licensing by others on the internet. However, it is incredibly disheartening for individual contributors to see the Company being the potential breacher, rather than defender.

The data dump is licensed under CC BY-SA explicitly.

Quoting from the Stack Overflow Terms of Service:

From time to time, Stack Overflow may make available compilations of all the Subscriber Content on the public Network (the “Creative Commons Data Dump”). The Creative Commons Data Dump is licensed under the CC BY-SA license. By downloading the Creative Commons Data Dump, you agree to be bound by the terms of that license.

Users grant Stack Overflow the CC BY-SA license. Stack Overflow uses that license to "remix" the individual posts into the compiled data dump, and appropriately covers the entire derivative work under the same CC BY-SA license. By requiring any individual to agree not to use it for commercial purposes, Stack Overflow would be violating the "no additional restrictions" clause, and be subject to automatic revocation of the CC BY-SA license from the grantor (users, authors, content creators).

"We would really rather users do not upload the file to archive.org or similar data pile sites"

I can appreciate that the company might rather this not happen. But unfortunately, the CC BY-SA license means that the Company can't restrict this. Stack Overflow could try changing the actual format, and licensing that anthology format differently, and place restrictions on that new archive product, but the content itself will always be free to be uploaded to someplace like Archive.org.

"When organizations are able to skip out on their obligations to contribute back..."

Unfortunately, this is not an obligation that is covered by the CC BY-SA license. And even more unfortunately, adding this restriction on top of the BY-SA license would be a breach of that original license, and thus Stack Overflow would be the entity in legal hot water, not the downstream users that Stack Overflow is trying to police.

Let's assume Stack Overflow proceeds...

even in a unique format to circumvent CC BY-SA...

Because many of us are technologists and software developers, I can almost guarantee that someone will create a process that:

  • downloads the Data Dump for individual, hobby use
  • ingests the data dump, reformatting it as XML if necessary
  • uploads the result, in a backwards-compatible format, to archive.org
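To show how low the bar is, here is a minimal sketch of the "reformat as XML" step. The record fields below are hypothetical placeholders, not the actual dump schema; the point is only that re-emitting records in the familiar `<posts><row .../>` layout takes a few lines of standard-library Python:

```python
import xml.etree.ElementTree as ET

# Hypothetical records as they might arrive in some other dump format.
# Field names here are illustrative only, not the real dump schema.
records = [
    {"Id": "1", "PostTypeId": "1", "Body": "How do I parse XML in Python?"},
    {"Id": "2", "PostTypeId": "2", "Body": "Use xml.etree.ElementTree."},
]

def to_posts_xml(rows):
    """Re-emit records in a <posts><row .../></posts> layout."""
    root = ET.Element("posts")
    for row in rows:
        # Each record becomes one <row> element with its fields as attributes.
        ET.SubElement(root, "row", attrib=row)
    return ET.tostring(root, encoding="unicode")

xml_doc = to_posts_xml(records)
print(xml_doc)
```

Anyone with the data and an afternoon could build the full pipeline around a conversion like this, which is exactly why format changes can't meaningfully restrict redistribution of CC BY-SA content.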

Stop Gaslighting me

It’s important to say that when you breach the agreement that you make when downloading the dumps file, we do have the option to decline to provide you with future versions of the data dumps. But we really don’t want to have to do that.

  • User-generated content is licensed under CC BY-SA to the entire world.
  • Stack Overflow compiles those CC BY-SA licensed creations into a data dump, and then licenses the product that is the data dump under the same license.
  • The "Creative Commons Data Dump," being licensed under the CC BY-SA license, permits any use, including commercial, so long as derivative works honor the "SA" (Share Alike: continued BY-SA licensing) and the "BY" (Attribution).
  • By attempting to enforce a non-commercial limitation on a BY-SA licensed creation, Stack Overflow is breaching the agreement with the community.

It’s important to say that when you breach the agreement that you make when creating the dumps file, we do have the option to revoke the license we granted when we posted content on the site. But we really don’t want to have to do that.

Dual licensing

Nearly everything I said above has a big caveat... At the time of posting, users granted the entire world a CC BY-SA license, and additionally granted a second license to Stack Overflow (emphasis mine):

You agree that any and all content, including without limitation any and all text, graphics, logos, tools, photographs, images, illustrations, software or source code, audio and video, animations, and product feedback (collectively, “Content”) that you provide to the public Network (collectively, “Subscriber Content”), is perpetually and irrevocably licensed to Stack Overflow on a worldwide, royalty-free, non-exclusive basis pursuant to Creative Commons licensing....and you grant Stack Overflow the perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit such Subscriber Content, even if such Subscriber Content has been contributed and subsequently removed by you

Stack Overflow could use their perpetual, irrevocable license to do anything they want with the data. Stack Overflow doesn't have to release the "Creative Commons Data Dump" under the CC BY-SA license -- except that it does. Stack Overflow could change the TOS to stop licensing the Data Dump under CC BY-SA, and then add any restrictions they like. This would essentially be a proprietary "box" filled with freely reusable data--folks would be bound by the Company's terms for the "box", but they could take the CC BY-SA contents out of the box, throw away the box, and use the contents of the box in any way that meets the terms of the CC BY-SA license (including putting that data into a CC BY-SA box).

However, at the end of the day, everyone in the world can use the post content however they like, so long as they continue to follow the CC BY-SA license restrictions, regardless of how they access that data.

My promise to Stack Overflow

I intend to vigorously defend my rights as a content creator on the Stack Exchange Network. I will ensure that my content (which I licensed to the entire world under CC BY-SA) continues to be used properly according to the license terms. If/when someone breaches the license terms, I will revoke that license and defend my content's copyright.

  • 44
    On one hand, glad I was right in my mistrust about management and their bad intentions. On the other hand, really sad to be right. Anyway, thanks for shedding light over the dark truth! Commented Jul 13 at 7:38
  • 75
    @ShadowWizard The company blatantly lied about its intentions last year when it turned off the data dump. The intention at the time was absolutely to discontinue it. Unfortunately, I wasn't in a position to say that quite so flatly, and was overly cautious with what I said publicly.
    – AMtwo
    Commented Jul 13 at 16:13
  • 78
    @ShadowWizard I have the transcript of the DM conversation from Mar 28, 2023 with the CEO where I was instructed to turn off the data dump, and pushed back until I was promised the Community would be informed before the next due date. Of course they never did a proactive announcement...
    – AMtwo
    Commented Jul 13 at 16:16
  • 5
    @TylerH The Data Dump page on Internet Archive has some ambiguous language around the license, but the site TOS explicitly defines the Data Dump itself as being licensed as a CC BY-SA creative work. I agree that they could produce A data dump in a different format with a different license, and/or change the TOS to stop licensing the Data Dump directly as CC BY-SA. But that's not part of the plan.
    – AMtwo
    Commented Jul 16 at 17:03
  • 41
    All the commentary about licensing (in your comments @TylerH and in other answers and comments here) serve to illustrate what is probably the real intent of the language used in the UI mock-up: to muddy the waters sufficiently to discourage use and distribution, whether or not that is allowable legally or enforceable practically. Which is itself worthy of condemnation and derision: the licensing situation surrounding the content has already had a long history of being muddied by poor communication and planning - introducing further confusion is beyond foolish.
    – Shog9
    Commented Jul 16 at 17:30
  • 2
    @endolith The current "without limitation" version has been in place since May 2018 (meta.stackexchange.com/q/309746/132874). Before that, it was "You grant Stack Exchange the perpetual and irrevocable right and license to use, copy, cache, publish, display, distribute, modify, create derivative works and store such Subscriber Content and, except as otherwise set forth herein, to allow others to do so in any medium now known or hereinafter developed (“Content License”) in order to provide the Services..."
    – goldPseudo
    Commented Jul 16 at 22:06
  • 14
    @Araucaria-Nothereanymore. I'm confused. Isn't that the "BY" part of CC BY-SA? Where in here do you think AMTwo supports unattributed commercial use? Commented Jul 17 at 19:54
  • 3
    @Araucaria-Nothereanymore. Nothing in the post supports using your content in any other way than that allowed by the license you agreed to when you contributed it. (which requires attribution regardless of whether the use is commercial or not)
    – ColleenV
    Commented Jul 18 at 11:18
  • 7
    @Michaelcomelately Honestly, that has been the plan all along--but the announcement goes out of its way to deceive folks into thinking otherwise. If the company is going to do something shitty, they need to at least be honest about it.
    – AMtwo
    Commented Jul 18 at 13:35
  • 4
    @Araucaria-Nothereanymore. The company is claiming that they are going to attempt to enforce the attribution requirements for the AI models trained on your content, and this post supports the licensing which requires attribution and expresses that they will vigorously defend their rights as a content creator. I'm not sure why you're downvoting it. I am skeptical that these claims of genAI models being able to properly attribute things are true, but nothing in this post is supporting giving your content away without attribution.
    – ColleenV
    Commented Jul 19 at 13:58
  • 5
    @Araucaria-Nothereanymore. So your argument is that because you want LLMs to attribute your contributions, the dumps should stop, so LLMs can't get them? By extension, nobody should be allowed to get them? If the dumps stop, the LLMs are just going to scrape the web site, and still get your contributions, and can choose whether or not to attribute. The existence of the dump doesn't affect their ability to do that, and web scrapers are not hard to build, so the only people hurt by this are users who want the dump for non-LLM reasons and the site that gets increasingly clobbered by scrapers. Commented Jul 19 at 14:28
  • 11
    @Araucaria-Nothereanymore. Publishing the dump for all the legitimate users carries - and has always carried - the risk that someone might take the dump and use it in ways that violate the license. If you want to avoid that completely, not only do you have to stop producing the dump, you also have to take the site down, because the dump is not the only way LLMs can take the data and violate the license. What you’re saying is, because someone stole your TV once, nobody else can have a TV ever. Commented Jul 19 at 20:18
  • 8
    @araucaria-Nothereanymore. The Company should be helping ensure that if the Data Dump is used by anyone, attribution and forward licensing is in compliance with the license. Instead they are breaching the terms themselves, AND ALSO ignoring attribution and forward licensing.
    – AMtwo
    Commented Jul 20 at 18:51
  • 11
    @Araucaria-Nothereanymore. There's no stealing when downloading or publishing the dumps verbatim elsewhere, since the dumps themselves contain sufficient attribution. It's stealing from you zero times. If you are uncomfortable with your work being downloaded or copied in this way, or if you expect future recompense for commercial uses of your work, then you should not publish any of your work on this site. Every contribution is covered by CC BY-SA, which very much allows for this usage. Neither you, nor SE, can retroactively add restrictions to this license, no matter how much you want to. Commented Jul 21 at 19:43
  • 20
    I will join the revocation pledge. I licensed my data to StackOverflow "pursuant to Creative Commons licensing terms (CC BY-SA 4.0)" as in the ToS. It is not clear to me that Stack Exchange has, via its ToS, received a separate additional license from me outside of the CC BY-SA provisions (3 and 4.0 depending on the time of contribution). Should the Data Dumps be made available under a license with additional restrictions, I too will see Stack Exchange as having breached the license I granted it, and I expect to have my contributions removed immediately. I hope it doesn't come to that. Commented Jul 23 at 5:10
223
+150

How do you plan to enforce "I agree that I will use this file for non-commercial use. I will not use it for any other purpose, and I will not transfer it to others without permission from Stack Overflow." when the CC BY-SA license explicitly forbids adding downstream restrictions?

  • No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.
33
  • 7
    This depends on if, when we contribute content, SE gets our contribution under one license or two. This is a question that has been asked and ignored. If it is two licenses, the second license may give them the "perpetual and irrevocable right and license to access, use, process, copy, distribute, export, display and to commercially exploit", which may include commercial distribution under alternative license terms that include downstream restrictions. Commented Jul 12 at 14:14
  • 69
    @ThomasOwens If they're planning to distribute it under a different license, then prefacing it with "the CC BY-SA license is unchanged" feels disingenuous.
    – goldPseudo
    Commented Jul 12 at 14:16
  • 11
    It is disingenuous, yes. Something here is wrong, but I'm not sure what it is. Commented Jul 12 at 14:19
  • 18
    @dan1st Once I get the data, assuming it's still licensed CC BY-SA, then I have the explicit license right to share it with whoever I want, however I want, as long as the license conditions are met. I shouldn't be forced to rely on the fact that it's also available on a third-party site that can go down at any time for any reason.
    – goldPseudo
    Commented Jul 12 at 14:24
  • 20
    goldPseudo, you do have that right. We are asking politely that you not, but realistically speaking, there's nothing we can do to prevent you from sharing it, once you have it in your hands. We recognize that.
    – Philippe StaffMod
    Commented Jul 12 at 14:36
  • 131
    @Philippe putting a check box and stating "I agree that I will use this file for non-commercial use. I will not use it for any other purpose, and I will not transfer it to others without permission from Stack Overflow." sounds very much like an additional restriction to me rather than "asking politely that you not". If you don't want that to come across like a legal agreement remove that checkbox and state it differently. Commented Jul 12 at 14:41
  • 55
    TLDR the checkbox makes it seem as if there's more restrictions than there are. it's slimy.
    – Kevin B
    Commented Jul 12 at 14:45
  • 98
    @Philippe The checkbox is not "asking politely"; it's violating the terms of the license by making my agreement to that statement a requirement of getting access to the data. Also, you can enforce it by refusing to allow people (whose access to the dump you're tracking) to create accounts on your system. This is just flat-out evil. If the people who contributed to SE can't get out of the licensing agreement because they didn't understand that their content would be used to train AI, neither can the company because they didn't realize how much money they'd lose out on.
    – ColleenV
    Commented Jul 12 at 15:39
  • 25
    Again, folks, the question has been posed to the legal team. I'll revert back when I hear more from them. Until then, I hear you. This one is logged and will be followed up on.
    – Philippe StaffMod
    Commented Jul 12 at 15:50
  • 64
    @Philippe If you want to make that fact more visible, you could edit your post instead of leaving it buried in the comments.
    – ColleenV
    Commented Jul 12 at 15:59
  • 24
    @FedericoPoloni SE has a history of making illegal decisions. I suspect that they never ran it past the lawyers, again.
    – OrangeDog
    Commented Jul 14 at 11:36
  • 14
    @Philippe -- Any chance you've heard back from the lawyers yet?
    – AMtwo
    Commented Jul 17 at 22:29
  • 10
    Quick update: still talking to the lawyer-y folks. I'm looking for a resolution on those questions sometime early next week, most likely.
    – Philippe StaffMod
    Commented Jul 20 at 1:01
  • 16
    Even if the lawyers come back with a semi-plausible legal reason why this approach is OK, it's still unethical, which is a mind-bending level of hypocrisy. Commented Jul 20 at 13:48
  • 15
    Given the tight timeline on the development of this feature (being ready in the next 3 weeks?) the extremely long delays in feedback loops being closed is not helpful. It's hard to take the company's promises seriously when it takes nearly 2 weeks (and still waiting) to get an answer to a question that was asked multiple times by multiple people immediately following the announcement. It is an important (and dare I say, obvious) question that deserves either a transparent answer, or transparency into why it is so difficult to answer.
    – AMtwo
    Commented Jul 24 at 15:01
181

to get the data file for a Stack Exchange site, there will be a location within the settings page on each Stack Exchange site. Users will be provided with the link to the data file, and can also continue to access the data through the Stack Exchange API or Data Explorer.

At best, this is extremely inconvenient; at worst, it guarantees no one will ever again have a consistent "dump".

I'm going to guess: no one involved in making this decision has ever downloaded and worked with the full data dump. It's already slow and fairly inconvenient; the one bright spot is that a decent torrent client lets you start it and do other stuff while waiting. Best-case, you devote a fast enough pipe to this that the hundreds of extra clicks necessary are rewarded with shorter turnaround... But somehow, I doubt it. So instead, we get a tedious click game.

Oh well. I can script it. There'll probably be more failed or interrupted downloads, but I can script around that too. Annoying, but feasible.
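To make "script around that" concrete, a minimal retry wrapper is all it takes. This is a sketch only: the actual download call is left hypothetical, since the new per-site endpoints aren't public yet.

```python
# Sketch of "scripting around" failed or interrupted downloads: retry a
# callable a few times before giving up. The download call itself is left
# hypothetical, since the new per-site endpoints aren't public yet.
import time

def with_retries(fn, attempts=5, delay=1.0):
    """Call fn() until it succeeds, retrying up to `attempts` times."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(delay)  # brief pause between tries

# Hypothetical usage (site_dump_url is a placeholder, not a real endpoint):
#   with_retries(lambda: urlretrieve(site_dump_url, "site.7z"))
```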

We are adding additional checks for bots, and are hardening that infrastructure.

Oh... You thought of that too. So this is literally, "Let's make using the data our users have contributed over the years as painful as possible" then.

Thanks for all the lies.

5
  • 61
    "no one involved in making this decision has ever downloaded and worked with the full data dump" ... As usual, you are painfully correct.
    – AMtwo
    Commented Jul 12 at 19:24
  • 33
    It’s about time folks stop believing the company has the community’s interests at heart. I am grateful you did not use a euphemism to describe their actions. Commented Jul 12 at 20:23
  • 22
    Yeah, I kinda suspect that's not too far off, @AMtwo - that this is a "for show" type of thing, to ease investor concerns and ... Maybe even discourage clueless customers from seeking alternatives. But... There are alternatives, and they tend to be a lot better (if not cheaper) than the data dumps, which are primarily of value to small fry (like... Me) to learn and experiment with. SO's best long-term hope is to provide something of real value for folks who can pay, something that can't be replicated by distributed scrapers who won't care or even notice any of this.
    – Shog9
    Commented Jul 13 at 3:58
  • 6
    I, too, would request that a magnet link remain. Having to manually download the dumps, and then create a .torrent out of them, and coordinate sharing of that magnet link just adds unnecessary tediousness.
    – Ian Boyd
    Commented Jul 17 at 20:01
  • 9
    @GeorgeStocker "It’s about time folks stop believing the company has the community’s interests at heart." Most people here stopped believing the company some time ago. Looking around, everyone is saying more or less the same. It's just that the company is not much impressed by it. It's only impressed by a 1000 downvotes or a moderator strike or stuff of a similar magnitude, if at all. Commented Jul 20 at 21:15
118

we want to help make the process of accessing data dumps quicker and more efficient

Umm. OK, I won't laugh out loud, but instead give some simple facts.

This is the process today:

  1. Anyone can download all 180+ data dumps (one for each site) without having to create an account on any of those sites.
  2. The current data dumps are public and accessible to everyone.

This is how the process is going to be after the change:

  1. In order to download a data dump of a certain site, one would be required to create an account there.
  2. This is not public and restricts access to the data dumps.

Now some interesting quirks.

  1. Suppose one is suspended on a site. Would it mean they won't be able to download the data dump?
  2. In 6-8 months/years SE suddenly announces: "Hear hear, we decided to limit the download for users with over 50k reputation, this will surely boost their morale a lot".

So no, it is neither quicker nor more efficient, in the bigger picture. The total opposite.

12
  • 50
    @Spevacus "Please note that users will only be provided the data for the specific Stack Exchange site and its Meta site for the corresponding Stack Exchange site profile. If you wish to access the data for additional Stack Exchange sites, you will need to go through the same process for each site." Commented Jul 12 at 14:44
  • "We're scoping and building a process to allow for this, but it will not be in the initial release of the feature." - The ability to download from all sites at once is something that's already being looked into.
    – lyxal
    Commented Jul 12 at 14:50
  • 3
    I believe that's the case. You do need to click through every site. Mods got a preview of a few previous versions and frankly, while annoying, it's more a series of potholes and speedbumps than a brick wall. It's not ideal, but it's the least-worst compromise that could be found. Commented Jul 12 at 14:51
  • 11
    @lyxal sure. And how many features we had that were broken and we got a "we're looking into that" with no followup? What guarantee do we have SE won't just "forget" about finishing this?
    – VLAZ
    Commented Jul 12 at 14:54
  • 1
    @VLAZ SE has been pretty responsive in listening to feedback on this process so far (as JG said, the original proposed process was way worse than this). From what I've seen, it looks like it could already be in the works.
    – lyxal
    Commented Jul 12 at 14:58
  • 13
    @JourneymanGeek it's a wall standing in a place that had no wall before. Simple as that, and enough to decide it's a change for the worse, not for the good. If you believe SE is doing it with good intentions, great. I don't. Commented Jul 12 at 15:05
  • 1
    @VLAZ in the short term, I've got some ideas on having community-accessible backups - the simplest way would be to have someone request a full copy, for the explicit goal of having a backup of the entire database. Longer term, we can figure out hosting and distribution, but in the short term, even without the Internet Archive, it can serve the purposes of a non-SE-controlled backup. Commented Jul 12 at 15:08
  • 3
    @ShadowWizard I mean, I don't like this, but I also get the system well enough to know obvious flaws. Commented Jul 12 at 15:09
  • 5
    @JourneymanGeek, If you request the backup for that purpose, I'll see that we get it to you.
    – Philippe StaffMod
    Commented Jul 12 at 15:54
  • 5
    @VLAZ - sure, it has happened in the past that things got dropped. But I'll note that we just dedicated a sprint to locating and cataloguing and validating all the bug reports so that we know what all those things are, and we committed to four sprints a year to work on them. I'm not saying you should forget years of history because of that, but C'mon.... maybe a little bit of judicious suspension of disbelief?
    – Philippe StaffMod
    Commented Jul 12 at 15:58
  • 18
    @Philippe there's a lot of .. uh.. organisational-ADD-caused trauma floating around. Maybe we can split the difference and be pleasantly surprised instead of the other sort as these things happen. I'd also add that a certain amount of complaining means we just might hope things change for the better. Much like small dogs and little children, it's the quiet that's dangerous. Commented Jul 12 at 16:03
  • 5
    @Philippe if you are that confident it will be resolved in such a short time, surely it wouldn't be much of a commitment for the company to keep publishing the existing dumps until the new process is actually in place and fully implemented. After all, that won't be a problem unless it takes you 6 months to implement half a solution and 6 to 8 years to actually get to fixing day-0 bugs. Commented Jul 20 at 13:15
108

Posted Tuesday 2023-06-13 22:32:13Z:

Much has been written lately of the company’s decision to pause the distribution of the anonymized data dump that has historically been posted.

Our intention was never to stop posting the data dump permanently, only to pause it while we begin to collect more information on how it was being used and by whom - especially in light of the rise of large language models (LLMs) and questions around how generative AI models are handling attribution.

You have been engaging on this topic disingenuously for a year.

It was your intention to turn off the dumps a year ago, and now you're trying to make them as inconvenient as possible.

6
  • 26
    You are 100% correct. You've said concisely what took me an hour to write in my comment.
    – AMtwo
    Commented Jul 12 at 17:26
  • 7
    If you're speaking of me personally, you're wrong. I actually have no decision making role in this. My role is to advise, but this decision is made by others. So what I want is frankly irrelevant. (For the record, I'm a huge supporter of the internet archive, and I have friends there.) If by "you" you mean the company, I still disagree. Nobody is turning off the data dumps. They will still exist, and you will still have access to them.
    – Philippe StaffMod
    Commented Jul 12 at 17:56
  • 73
    @Philippe, I can confirm that this statement (from your post last year) is patently false: "Our intention was never to stop posting the data dump permanently" ...I was in the room, and you were not when PC told me the intent was to stop the dump--so perhaps whoever you are getting your information from is the disingenuous one.
    – AMtwo
    Commented Jul 12 at 18:02
  • 43
    @Philippe If by "you" in "you will still have access", you mean the same people that have access to them right now, and not just the subset of people who can access them and also have accounts on SE, that's not entirely accurate. People can't have an SE account without agreeing to the ToS. I don't have to agree to any terms to download from archive.org.
    – ColleenV
    Commented Jul 12 at 18:03
  • 32
    @Philippe "my role is to advise" -skip a bit- "I'm a huge supporter of the internet archive"... well, so it looks like your advice doesn't have much, or any, impact on those who do make the decisions. No matter how you paint it, the decision they made is against the "open for everyone" principle, big time. Commented Jul 13 at 7:46
  • 5
    I'd argue - from the outside, and from the perspective of a lot of former staff who 'get' how the community feels - I see Punyon here, as well as comments from Nick, that the actions so far are not that different from the time the company tried to cut off access to the data dump. Considering it's been a year since the last time, and there's no actual sign of a release, only a delay, it's not that different from our perspective. Commented Jul 17 at 1:18
107

this is primarily only a change in location for where the data dump is accessed.

No, this is false and misleading. You are instead trying to quietly change the license of the data dump, and you disguise that change as a simple change of data host.

[Screenshot of the new download page, with a checkbox agreeing to non-commercial use only]

This is not the same as the previously agreed CC BY-SA license, which allows commercial use (e.g., in the event one needs to fork Stack Exchange if the ship sinks, as for example Yahoo! Answers did).

We usually avoid posting on Fridays, but with the data dump scheduled for the end of July, we wanted to share this information with the community as soon as possible.

No, this is also misleading. Here's the real reason: "The changes to the Data Dump were posted on Friday, because you needed to rush out the announcement so that Prashanth could brag about it at a conference [a few days later]!" [AMtwo].

with the heightened ease of receiving a dump

Once again, this is misleading. It's not heightened; it's significantly decreased, since the new way has no command-line interface and covers only 1 out of 184 network sites (which includes their corresponding metas). I've downloaded and uploaded the Stack Exchange dumps dozens of times from/to archive.org; it works fine. Downloading is a one-liner:

ia download stackexchange --retries=100

Or to download all images:

ia download stack-exchange-images --retries=100

The new dump method instead requires 920 clicks (=5*184) and still doesn't even include images.

8
  • 13
    The Data Dump has NEVER included images--It only includes the URLs to the images hosted elsewhere. The stack-exchange-images archive is created & maintained by data archivists, not by Stack Overflow. I suspect that after this change, the stackexchange archive will similarly continue to be maintained by data archivists without the help of the company.
    – AMtwo
    Commented Jul 12 at 17:18
  • 13
    @AMtwo thanks, yes I know I created stack-exchange-images, just wanted to point out limitations of the new dump that SE Inc. claims is better than using archive.org Commented Jul 12 at 17:20
  • 4
    Oh yeah. It's definitely not better.... and I am furious about it. I spent my entire lunch hour writing up a lengthy answer on this post.
    – AMtwo
    Commented Jul 12 at 17:21
  • 4
    "and it's for only 1 out of around 380 network sites+metas" - strictly speaking, it's for 2. From the question: "Please note that users will only be provided the data for the specific Stack Exchange site and its Meta site for the corresponding Stack Exchange site profile" -- so you'd be getting the main + meta dump in one go, so it's 1/183 or so instead (strictly speaking, 184, but area51 has never been included in the data dump). It's still really bad that one download becomes 183, but it's at least slightly better than ~365 or so Commented Jul 14 at 13:55
  • 3
    Fun fact: the license only applies to the dump, not to the post contents within. If you were to process the dump and re-host all the content thereof, anybody hitting your site would not have inherited that license, only the CC BY-SA license of the actual content.
    – Joshua
    Commented Jul 16 at 10:21
  • @Joshua how much process is needed? eg is adding a space enough? Commented Jul 16 at 19:00
  • 1
    @FranckDernoncourt: More like; replace the packaging schema with one of your own design. When I wrote my content I was imagining building an HTML render of the whole as flat files.
    – Joshua
    Commented Jul 16 at 21:53
  • 1
    @Joshua ok converting to JSON: {"SE dump": files.xml} :) Commented Jul 16 at 21:54
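On Joshua's re-packaging point: each table in a dump (Posts.xml, Users.xml, ...) stores one <row .../> element per record, with every field as an XML attribute, so swapping the packaging schema really is a few lines. A hedged sketch (the sample row is invented for illustration):

```python
# Sketch of re-packaging dump records: each table in a Stack Exchange dump
# (Posts.xml, Users.xml, ...) holds one <row .../> element per record, with
# all fields as XML attributes. Here we emit them as JSON Lines instead.
# The sample row below is invented for illustration.
import json
import xml.etree.ElementTree as ET

def rows_to_jsonl(xml_text):
    """Yield one JSON object (as a string) per <row> element."""
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        yield json.dumps(dict(row.attrib))

sample = '<posts><row Id="1" PostTypeId="1" Score="42"/></posts>'
for line in rows_to_jsonl(sample):
    print(line)
```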
107

The company cannot be trusted to own the server hosting the data dumps.

You (read here and later: the company, just so I can skip a spammy round of "I didn't do anything") want to make access to the data dumps faster. Delightful! Regardless, you must keep the old, independently hosted archive path, because the company can, and the company WILL pull the plug on the data dumps. When that happens, we must not lose access to at least all past data dumps until that point.

Always worth coming back to Joel and Jeff's time:

Oh, expropriation of community content that... We created Stack Overflow to be against it. If there's anything that's more in the DNA of Stack Overflow than that, I don't know what it is. That's one of our most core things. You can see this all over the place in the design of Stack Overflow.

First of all, from day one, we use the CC-wiki license. And it's basically a license, it says that we don't own the content that's on there, which is why we make those database dumps that are available.

Because we wanted to make sure that if no matter what happens, literally no matter who we sell to, or raise money from, or turn the site over to, and even if they take Stack Overflow, and make it an evil site where you have to pay to look at things and there's pop-up ads and pop-under ads, and you know, dancing chariots of fire that cross the screen and punch the monkey, and, man, I can take so many evil things anyway. And it just becomes a big gigantic spam site.

Doesn't matter because just take the latest CC-wiki download that we provided and go start your own site saying, you know what, this is gonna be the clean version. And I think a lot of people will follow you. We very, very deliberately built Stack Overflow in a way that there wouldn't be any chance of locking and we're pretty much doing the same thing with Stack Exchange.

You are making it very easy to pull access to our own content that brings you profit. Even if we trusted the company now, this would make it not just possible, but trivial, for some future nefarious company leadership to backstab the community. And guess what: we already have the nefarious company leadership in the present.

Disabling data dumps uploaded to archive.org or some other third-party, truly independent service is unacceptable.

6
  • 11
    After reading much of this QA, I reached the same conclusion. How can anyone trust Stack Exchange management at this point? Commented Jul 14 at 23:19
  • 16
    I don't think one needs to look to the future for "nefarious company leadership" NOR to find the backstabbing.
    – AMtwo
    Commented Jul 15 at 20:00
  • 3
    Wow, that is a brilliant quote to bring up. And so prescient! Commented Jul 16 at 6:15
  • 6
    @SteveBennett credit goes to curious. They dug that up the last time the company tried this. Commented Jul 16 at 7:18
  • 1
    that quote reminds me of this blog: danfaggella.com/idiot -- pretty much nails what they're doing now (or have been doing all these years, for that matter) Commented Jul 18 at 18:37
  • 6
    @NordTheStarWizard except the original founders indeed did it mostly altruistically, for the knowledge base, built on the community. Things turned south when the endless torrent of VC money started. Commented Jul 18 at 19:22
72

In my other answer, I mostly focus on licensing and why there are materially problematic issues with the current plan. In this answer, my focus is on why the announcement is problematic from a moral standpoint.

The story of how LLMs changed the Company

GitHub Copilot, the start of the LLM revolution

When GitHub Copilot came out in 2021, a lot of developers saw it as transformative. A portion of software developers no longer needed to Bingoogle their questions, because they had a Copilot that could direct them without leaving the IDE. Stack Overflow, however, maintained its course--focused on Stack Overflow for Teams to monetize the product. The Company did not pivot to begin working on any sort of VS Code extension, or other LLM/AI-powered features.

ChatGPT, the commoditization of LLMs

When ChatGPT landed in November 2022, however, the Company noticed. And the Company realized that they were being left behind. This has led to a frantic attempt to catch up, but also to attempts at kneecapping the competition. Notably, I believe the only AI-powered or LLM-powered products delivered to date are limited to Stack Overflow for Teams, and are not publicly available for individual use.

On March 28, 2023, the Data Dump was disabled at the direction of Prashanth Chandrasekar, the Stack Overflow CEO. At the time, I was on the DBRE team at Stack Overflow, and I was the person the CEO directed to disable the upload to Archive.org. The conversation ended thus. This is a direct transcript of our conversation--I've removed a single back-and-forth for brevity, represented as ...:

PC: If we disable, how long does it take to re-establish the link if we want to open it back up?

Me: Disabling/re-enabling is just a simple switch. It's a scheduled job that runs once per quarter--so we can just turn it back on and/or kick it off ad-hoc anytime we want.

But if we don't upload on schedule, we're likely to have someone notice and ask about it on Meta. So we would need to be prepared to respond--or better proactively explain it on a Meta post of our own.

The Community Team probably just needs to be in the loop to not be caught off guard with the customer service side of it.

...

PC: And remind me, we know who is accessing the data in Archive.org?

Me: We don't really know. The internet archive doesn't require users to be logged in to download it.

PC: OK, then lets auto disable it for now so we don't forget and we can come up with a plan with the community team in the coming weeks.

Just over a month later, on May 10, 58 employees, including me, were laid off. I was no longer able to influence community notification.

Broken Promises

Data Dump Disaster (Data Dumpster Fire?)

A month after that, the Community was surprised when the Data Dump wasn't published per usual, despite the assurances made to me that the CM team would be brought in. I take issue with the subsequent statements by the Stack Overflow CTO and by Philippe that it was always going to be re-enabled, but I was satisfied with the outcome (the Data Dump was brought back).

In Philippe's announcement, he made two promises:

We will continue to work toward the creation of certain guardrails (for large AI/LLM companies) for both the dumps and the API, but again - we have no intention of restricting/charging community members or other responsible users of the dumps or the API from accessing them.

As part of this project, API users should be on the lookout for a very brief survey that will be coming out (it will be announced here and on stackapps.com) that asks about the features that you most use/would like to see in the API or data dumps moving forward so that we can plan for those, as well as collect general input.

Survey? What survey?

The company did send a survey about the API, without mention of the Data Dump. Speaking for myself--a consumer of the Data Dump, but not of the API--I didn't respond to the survey, and could only hope that another survey would be sent for the Data Dump. It wasn't. As such, the Community was never solicited for feedback on how we use the Data Dump--we've not had our chance to share our user stories that would drive changes to the Data Dump.

I can only assume that the reason Community Users were never surveyed about our Data Dump usage is because the company doesn't care how we use it. Their intent has been clear for the last year--they couldn't eliminate the Data Dump completely as they originally hoped, so they will restrict it as much as they can get away with. The direction to do so comes from the CEO.

Maybe someone just conflated the API and the Data Dump? It seems unlikely that anyone familiar with the two products would conflate them, given how dramatically different they are. I've noticed that the API often gets name-dropped when the Data Dump is being discussed, and vice versa. However, they serve very different purposes--the API is great for "point lookups" on individual posts, while the Data Dump is best for set-based queries and analysis.

Fast Forward to today...

Senior Leadership has been thinking about this since at least March 2023. In June 2023, Philippe explicitly promised that they were working on "guardrails" for the data dump. More than a year later, that work hasn't been completed--the feature was only sent to engineering teams last week.

Despite the Company "working on" a new data dump process for over a year, it still isn't ready. Additionally, the assurance that the old process would stay in place until the new one is ready has gone out the window.

  • By turning off the old process before the new process is ready, the company has broken a promise made just last year.
  • By moving the data dump off of the Internet Archive, and onto Stack Overflow hosted infrastructure, the company has broken a promise made by the Company's Founders to archive the data with a 3rd party to ensure access to the data stays open.
  • By surveying only API use, and not Data Dump use, the company has broken another promise made just last year to "[ask] about the features that you most use/would like to see in the API or data dumps moving forward"--while this was done for the API, it wasn't done for the Data Dump.
  • By attempting to place additional restrictions (e.g., non-commercial use), the company is violating their own Terms of Service, which require the Data Dump to be licensed under the CC BY-SA license.
  • By failing to use Community feedback in creating this data dump plan over the course of the last year, the Company has alienated the content creators who contribute the data that makes Stack Overflow unique.

My proposal for moving forward.

Keeping in mind the company's core values... Screenshot of Stack Overflow's Core values: Adopt a customer-first mindset, Be flexible and inclusive, Be transparent, Empower people to deliver outstanding results, Keep community at our center, Learn, share, grow.

The Company has fallen woefully short on the values of "Be transparent" and "Keep community at our center." I would propose that the company take the following steps, which would be more in line with the Company's core values:

  • Recognize that the Data Dump is a product intended to support open data, and the rights of the content creators who maintain ownership & copyright, while also licensing creations liberally and openly.
  • Embrace a problem-oriented solution, where features and decisions can be mapped to a user story.
  • Re-enable the existing Data Dump upload to the Internet Archive until such time that a suitable replacement is complete and ready. Ideally, the two methods would overlap for one quarter to allow a transition period.
  • Survey the users of the product (Community users, researchers, etc. who use the Data Dump) about how they use the current Data Dump, and which current or future "features" are important to them.
  • Share with the community any requirements that the Company has, in addition to the end user requirements.
  • Improve transparency around licensing, and how licensing has impacted decision making. (Responses to this question show many contributors are deeply committed to open data, and permissive licensing of their content.)

Harsh reality

Every Question & Answer on Stack Overflow is licensed under the CC BY-SA license, which allows commercial use, and already requires both attribution and "share-alike" to maintain the openness of the data. The company seems focused on restricting commercial use, which is not only futile--an impossible game of whack-a-mole--but also a violation of the license terms. The company's only rational, legal basis to prevent abuse of the data is to pursue use that violates the BY-SA license.

A Data Dump will end up on the Internet Archive, even if the Company doesn't post it.

The Company will get sued if they attempt to prevent commercial use of data that is explicitly licensed for commercial use. And the Company will lose such a lawsuit.

The Company has further hurt their relationship with content creators by continuing to repeat the same mistakes year-after-year.

Contributions are already on a precipitous decline, both overall, and from high-rep users; one can assume this change will hasten decline for contributors who are passionate about their creations' licensing & future use. I sometimes think of "votes cast" as a measure of engagement (people who care enough to give feedback), or helpfulness (This worked/didn't work), or even just presence--the number of votes cast is also in free-fall.

26
  • 1
    I agree with this post except that last query. I generally agree contributions have reduced, but that query doesn't consider that it's taking a user's reputation from the current date, while their reputation in the past would have been lower. Most high-rep users have gained their reputation over time. Commented Jul 16 at 18:00
  • 1
    @AbdulAzizBarkat It's not a perfect measure, but it's a simple measure. If you have an alternate SEDE query that you think tells the story better, please drop a link in comment here.
    – AMtwo
    Commented Jul 16 at 18:32
  • 1
    Reputation level can be a bit of a red herring for a number of reasons. The old GPT on the platform: Data, actions, and outcomes post seemed to care most about Users who post 3 or more answers in a given week produce about half the answers. If I were trying to make a point, I would support it with data on the users producing the most content %-wise if I could figure that out. That said, I appreciate all of this post.
    – ColleenV
    Commented Jul 16 at 18:48
  • 7
    @ColleenV To be honest, it doesn't matter, because the Company isn't going to engage meaningfully with my posts. That last sentence was a bit of a throw-away... I'm surprised that the only comments here are all about my last throw-away comment. Though, if I had access to historical rep data, I could pull some interesting stats - but that doesn't get published in SEDE or the Data Dump.
    – AMtwo
    Commented Jul 16 at 19:22
  • 5
    I think it was because all of the people actually reading it agree with the opinions and can't contest the facts. I have also given up on the company. As an engineer though I can't resist weighing in on a data problem :) At some point, people are going to have to realize the company they work for is not the good guy any longer. They need to move on before it steals too many pieces of their soul. I don't want them to discontinue the data dump, but at this point, it seems like it would be more humane for everyone to acknowledge the reality of things.
    – ColleenV
    Commented Jul 16 at 19:30
  • 4
    I did like the idea in the post from last year about the company offering the "raw" data dump on archive.org, but also offering monetized data sets tailored for different purposes. They get paid for the real value they have to add as a company, and contributors can still feel confident their work is benefitting the public in general.
    – ColleenV
    Commented Jul 16 at 19:33
  • 5
    @ColleenV I think a lot of the employees have realized that. From what I gather from talking to my friends who still work there, morale is low, and folks recognize the company is no longer what it used to be.... but it still pays the bills. For now.
    – AMtwo
    Commented Jul 16 at 19:42
  • 4
    Nah they’re late. Not by the letter but definitely by the spirit. They’re changing things after the dump was supposed to be generated. July could’ve been on the archive a week ago. Commented Jul 16 at 21:44
  • 6
    Has the company lived by any of those values though? I can personally give examples of the company not living by 4 of those 6 values, and apparently failing at one. The only one I can't comment on is the first, because I'm not a customer. Commented Jul 17 at 0:11
  • 3
    "I'm surprised that's the only comments here are all about my last throw-away comment." I guess meta people are stickler for stats :P I do agree it detracts from the post though. To be somewhat accurate, the rep threshold will have to scale with expected amount of rep by account age or something like that. I have no idea how to do that with SEDE.
    – Passer By
    Commented Jul 17 at 12:26
  • 8
    @ColleenV, I honestly don't think the Company cares about Public Platform users at all. Full stop. Senior leadership is myopically focused on revenue, and nothing more. If they can't monetize it, they aren't doing it. They don't even care about future Q&A or curation because they aren't looking that far out
    – AMtwo
    Commented Jul 17 at 18:58
  • 2
    @SPArcheon-onstrike Which question are you referring to?
    – AMtwo
    Commented Jul 18 at 17:40
  • 2
    Folks who gave feedback on the final data point in my post -- I've reframed it and added a second data point to support the original point that engagement is falling off already, and trotting out bad ideas will only hurt that engagement more.
    – AMtwo
    Commented Jul 18 at 18:09
  • 3
    @AMtwo asking if you have data about who is accessing the dump anonymously on a separate, non-company-controlled site Commented Jul 18 at 18:14
  • 3
    @SPArcheon-onstrike Oh yeah... That conversation was....enlightening
    – AMtwo
    Commented Jul 18 at 18:54
69

A few things to point out:

  1. The text in the screenshot around requesting access is not consistent with saying that the CC BY-SA license is unchanged. The Creative Commons has a FAQ entry about entering into separate or supplemental agreements with users. Specifically, see the section on supplemental agreements, which states that "problems arise when licensors design those terms or arrangements to serve not as separate, alternative licensing arrangements but as supplemental terms having the effect of changing the standard terms within the CC license." It goes on to say that, to avoid confusion, Creative Commons "must insist that in these instances licensors not use our trademarks, names, and logos in connection with their custom licensing arrangement." One solution would be to license the data dump CC BY-NC-SA. However, this would require an update to the Terms of Service. In addition, there are still open questions about whether creating the data dump is sufficiently creative to allow it to be licensed separately from the works contained in the data dump, which all allow commercial use.
  2. There is no explanation as to how you require your "partners in socially responsible AI" to comply with the CC BY-SA license attribution requirements. Enforcement is the responsibility of the work's author, and no clauses in the terms of service or any other agreement authorize the company to be our agent and enforce our rights. In addition, Creative Commons has an entire section of their FAQ dedicated to AI, including a nice flowchart. The CC license, including the attribution clauses, is not triggered if an exception or limitation applies. Specifically, with respect to training generative AI, Creative Commons believes that training is fair use. Other groups also hold similar stances. Until there is judicial precedent or legislation that says otherwise, it's unclear why the company is taking a different stance.
  3. When do you plan to provide more information about your current stance, as well as the planned future stance, on data scrapers? Personally, I'm interested in this topic. For example, I've inquired about robots.txt and other technological measures to block crawlers and how changes are decided and communicated. It would also be good to know if this stance change will allow the company to take action against all abuses, such as the YouTube channels that are walking a fine line (and perhaps crossing the line) with compliance to the license and monetizing content. The current stance is that the company requires authors and content owners to protect their work, so it would be interesting to understand where the line is now and the basis for enforcing our rights.
  4. Although I don't have a problem with SE hosting the data dumps in their own infrastructure, some improvements would make this better:
    • The current implementation requires that a person have an account on each site that they obtain a data dump from. This can "leak" PII to network moderators for the entire network. The ability to download dumps from the network profile should be added.
    • In addition to moving single-site downloads to the network profile, allowing bulk downloads from 2 or more sites in the network or the entire network would be good.
    • The API should be updated to allow for programmatically obtaining download URL(s) for data dumps.
    • One of the reasons for posting the data dumps on a third-party location was to ensure their availability should something happen to the company or network. Consider making older data dumps available on third-party locations, such as the Internet Archive. After some length of time (perhaps 2 or 4 quarters), submit the data dump to one or more third parties and allow users to do the same with no risk.
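As a purely hypothetical sketch of the API suggestion in the list above: no such endpoint exists in the Stack Exchange API today, and the `data-dumps` route, the `quarter` parameter, and the `data_dump_url` helper below are all invented for illustration (only the API root is real). Programmatic access might look something like:

```python
# Hypothetical sketch only: the Stack Exchange API has no data-dump endpoint
# today. The route and `quarter` parameter are invented for illustration;
# only the API root is real.
API_ROOT = "https://api.stackexchange.com/2.3"

def data_dump_url(site: str, quarter: str) -> str:
    """Build the (hypothetical) download-URL request for one site's quarterly dump."""
    return f"{API_ROOT}/data-dumps/{site}?quarter={quarter}"

# Example: the URL a script might request for Stack Overflow's 2024 Q3 dump.
print(data_dump_url("stackoverflow", "2024Q3"))
```

Such an endpoint could return short-lived, signed download URLs, which would let the company keep its gated-access policy while still supporting automation.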
16
  • 7
    Somewhat in keeping with 4. - it was really convenient having torrents for the data dumps. Ensures the download is as fast as possible (assuming seeders are around, which they usually are). Also, it ensures all users share the same data. With individualised downloads, neither can the download speed be ensured (I've downloaded from archive.org via HTTP and it took days at about a megabyte per second down), nor can I really be sure the download I get is the same as somebody else's.
    – VLAZ
    Commented Jul 12 at 14:14
  • 3
    Thomas, let's pretend that I've written software that incorporates code that is licensed CC BY-SA. I'm not obligated to give my new code to everyone who asks, am I? I can vet who I want to provide it to. That's the same thing that the company is doing here. We aren't relicensing it. We're vetting using a process.
    – Philippe StaffMod
    Commented Jul 12 at 14:15
  • 43
    @Philippe You shouldn't use CC licenses for software. The GPL would be a better example. But, yes, if you distribute the software you are obligated to distribute the code to someone who asks. The CC BY-SA is viral. But that's not the same as the wording here, since you are, in fact, requiring downloaders to accept terms and conditions beyond what is in the CC license. Commented Jul 12 at 14:18
  • 14
    I understand. I've taken this to the legal team and will report back as soon as they've had a chance to evaluate and get back to me.
    – Philippe StaffMod
    Commented Jul 12 at 14:27
  • 2
    @Philippe Thanks. The direct links to the CC FAQ should help. It explains it quite well and links to other pages that explain their stance. Commented Jul 12 at 14:30
  • 3
    Side note about your previous comment: If you and you alone are writing code licensed in some way, you aren't bound by your own license. This should only be valid for works of other people. When using a GPL (or similar - I don't know how exactly it works with CC BY-SA) library, you'd be bound to the license terms of that library.
    – dan1st
    Commented Jul 12 at 14:40
  • 2
    @dan1st Yes, that's true. The license doesn't apply to the author and only applies to other people upon distribution, which allows the author to control who gets a copy - no copy, no license. But CC is special in that they at least ask that their names and trademarks aren't used in cases where their licenses are gated behind something else - the CC license is behind an agreement that "modifies or conflicts with the CC license". It's no longer under CC terms, so CC insists (unsure if they would take enforcement action) that their names and trademarks not be used. Commented Jul 13 at 9:57
  • So in other words, it wouldn't violate the license but it might violate the trademark? Are you saying I wouldn't be allowed to say something I made is CC-licensed if I only gave it to a limited number of people? If so, where does it say that?
    – dan1st
    Commented Jul 13 at 10:36
  • 3
    @dan1st No. The checkbox about agreeing to only use the file for non-commercial use and not transferring it to others adds terms on top of CC BY-SA. The non-commercial part can be resolved by making the data dump CC BY-NC-SA. However, preventing the recipient from distributing however they see fit is incompatible with all the CC licenses - it's a supplemental agreement that has the effect of changing the standard terms within the CC license. Commented Jul 13 at 10:42
  • 2
    @dan1st The relevant CC FAQ entry explains it well. Everything from "Supplemental agreements:" to the end is relevant. Commented Jul 13 at 10:43
  • @ThomasOwens only the copyright holders (us) can add a dual license like that. Stack Exchange cannot retroactively change the licence without our consent. We've been through this.
    – OrangeDog
    Commented Jul 21 at 22:33
  • 1
    @OrangeDog What if we do dual-license it by posting it on this site according to the terms of service?
    – dan1st
    Commented Jul 22 at 4:38
  • @OrangeDog In addition to what dan1st says - there is an open question as to whether the current Terms of Service introduce a dual license from contributors to the company, which has yet to be addressed - it's also important to note that a collection or compilation is separate from the works. The company can release a CC BY-NC-SA collection full of CC BY-SA works. But I don't see what even that would accomplish. Commented Jul 22 at 11:07
  • 4
    @Philippe Are there any updates to these questions? In a different comment, you were expecting an answer early this week. If there are no answers yet, is there a timeline for when we can expect answers to these questions? Commented Jul 24 at 14:12
  • 9
    Yes - I've been at a meeting with our senior leadership team all week. I have some answers, and I'll be posting them today or tomorrow.
    – Philippe StaffMod
    Commented Jul 25 at 19:01
63

Having assumed you were acting in good faith, I spent a full working day going through the third draft, pointing out wording that was likely to result in… well, *gestures at this webpage*. And I don't think I'm the mod who spent the most time on this.

You said you would read my notes carefully. Obviously, you didn't. In the section labelled “for PR purposes”, you stopped at the first bullet point! The second:

  • explicitly mention the measures you have in place "to ensure [our] contributions outlive [the company]".

anticipates Andras Deak -- Слава Україні's answer. There's a reason I wrote those things!!!

I'd make this post a systematic blow-by-blow of avoidable complaints, but going through the feedback from other moderators can't be done in a public space without their permission. Suffice to say, most of what's written in the other answers here is stuff that moderators raised weeks ago, that we were told would be addressed.


I also wrote:

Can we give feedback on the alt text for those images, even though the images aren't ready yet? I don't like editing official Stack Exchange announcements, but also, images need proper alt text.

Firstly: here's your alt text:

The proposed location of the data dump access within the account settings; please note that some of the placement/layout/language is subject to change.

This is not acceptable for what the image is – but secondly, this image is the only place in this post that says:

  • that people who download the dump are required to explicitly agree to something; and
  • the text of that agreement.

Here's the text:

Data dump access

You can access this site's data for non-commercial use. For commercial purposes, please contact us.


A new data dump is available every three months. Learn more about the data dump process.

I agree that I will use this file for non-commercial use. I will not use it for any other purpose, and I will not transfer it to others without permission from Stack Overflow. I certify that I am not downloading this file on behalf of my employer, for use in a for-profit enterprise. I have read and agree to the Terms of Service and the Acceptable Use Policy, and have read and understand Stack Overflow's privacy notice.

This is the exact problem that was raised on 2024-07-09. What was going through the pseudo-mind of your dysfunctional organisation when your colleagues in an actual decision-making role decided to rewrite the entire announcement to remove all mention of the federal crimes the company plans to commit, without changing the process, but include that information in a screenshot, which you hid from the moderators whom you asked to review the process?

What was even the point of the review, then?

This wasn't even subterfuge. It was gross incompetence. To the extent that your colleagues' jobs are to steward the Stack Exchange network – a community resource – they are not good at their jobs.

To the extent their jobs are to protect the company's number-goes-up, they are also not good at their jobs. Number has gone down.


I was going to demand you schedule a meeting to read a thing to the people who made this decision, and then I noticed that you already did. Have you read its predecessor to them? You should.

Philippe, you care. Somebody clearly doesn't. Please stop letting these buffoons hide behind you. Let the cowards take responsibility for their own terrible decisions.

I keep believing you because you care. At least – as a favour to me – make it clear when you're not in a position to call the shots, so I can avoid wasting my time. The alternative is that I stop believing you.

12
  • 9
    "Having assumed you were acting in good faith, I spent a full working day [...]" - after years and years and years of broken promises, flat-out lies, clueless mismanagement, and focus on projects that are actively against the wishes of the community, why the hell would you assume good faith from any part of SE management?
    – l4mpi
    Commented Jul 15 at 14:46
  • 3
    Your input on that draft was extremely thorough and well thought-out. Even though I'm certain this wasn't on his timeline, Philippe owes you a serious apology for not making it clear you'd be wasting your time. Commented Jul 15 at 20:09
  • 15
    @wizzwizz4 Ignored feedback that accurately foretold Community backlash is part of the DNA of Leadership at Stack Overflow these days.
    – AMtwo
    Commented Jul 15 at 20:10
  • 1
    @l4mpi Philippe is really excellent, except when acting in his capacity as part of SE management. (The one thing I was holding a grudge against him about – which required deliberate effort! – I later learned he wasn't even consulted about.) The CM team often bring stuff to mods for review, and we've had really productive discussions about it. (Sometimes I've even contributed! Mostly not, though.) If it was just the CMs (including Philippe), devs, designers, and supporting staff (e.g. sysadmins, lawyers), I think everything would be alright. (Conclusion: management's sabotaging the company.)
    – wizzwizz4
    Commented Jul 16 at 1:43
  • 1
    As for why staff present stuff to mods internally, before showing it to the community? I think it's partly psychological. Posting stuff on the moderator team, you're embarrassing yourself in front of at most 527 people (plus Community), and – importantly – you're not committing to anything. The feedback we give is only rarely judgemental, and provides most of the perspectives available on meta. (There are a few meta users who consistently spot things neither mods nor staff do, but that's part of why we don't make decisions behind closed doors.)
    – wizzwizz4
    Commented Jul 16 at 2:04
  • 1
    When something official-looking gets posted on meta, many people (myself included) tend to take it as a promise or a threat. The usual staff reaction is to work harder on making announcements PR-safe, which (a) distracts from the actual content of those announcements, and (b) makes the community backlash stronger. In principle, the solution is for staff to write high-content low-PR stuff more often, and for us to read the PR-ified posts as intended and react less extremely / more nicely.
    – wizzwizz4
    Commented Jul 16 at 2:21
  • 3
    This is a lot of work from a lot of people, but it's the best way I know to fix the community / company communications breakdown, and hopefully remove the need for moderator pre-review. However, there's one problem: company leadership keeps pushing out PR-ified announcements of terrible things. This is a whacking great albatross made of Plutonium-244, tied around all of our necks. Not only does it undo all the hard community-building work about a thousand times faster than it can be done, but it actively punishes those who try. (Both staff and non, in different ways.)
    – wizzwizz4
    Commented Jul 16 at 2:28
  • @BryanKrause All that to say: a lot of the time, it's both important and rewarding to review what staff present moderators with. I find it really hard to tell the difference between that, and this – and while morally, it might be on Philippe for not distinguishing between "helpful CM-ish Philippe" and "stooge for the evil corporate overlords Philippe", pragmatically I could just consult the other mods who can tell the difference. This is good practice for me not being a naïve, trusting fool; people have paid bigger prices than a day's labour for such lessons, and thought it worthwhile.
    – wizzwizz4
    Commented Jul 16 at 2:34
  • 6
    @wizzwizz4 that's like saying Prashanth is amazing except for everything he did as CEO. As AMtwo describes, the data dump was intended to be cancelled last year, which directly contradicts Philippe's statements. So he's either actively lying to us, or so uninformed and blindly accepting of statements from other management that he's not even aware of the situation, or both. That makes him a liability and not an asset to this community. It's not like this is the first time, either, so at this point the best-case scenario is that he thinks trying to mislead us with PR BS is fine - which it is not.
    – l4mpi
    Commented Jul 16 at 8:48
  • 1
    @l4mpi Or he was uninformed last year, and just hasn't corrected the answer he wrote back then. That's somewhat negligent, but I've probably left incorrect stuff in answers too. It's not lying levels of bad. I don't think Philippe's indispensable the way Robert Cartaino, JNat or Catija are, but he does help with coordination work behind the scenes; whereas I think Prashanth's only the CEO. But yeah, by being complicit with the "PR BS" (even when he disagrees with it!!!), Philippe's doing way more harm than he does good.
    – wizzwizz4
    Commented Jul 16 at 14:39
  • 23
    After watching public meltdowns in a few different companies over the years, I've come to believe this is something of an occupational hazard for CMs, @wizzwizz4. The job should not require and does not benefit from this sort of duplicity, but... the pressures of the role inevitably lead to it as an outcome. The blind fear of insufficiently-controlled release of information leads to useless communication efforts leads to speculation, hasty correction, and ultimately a chaotic and confused trickle of information mixed with misinformation... leading to the very thing that was feared.
    – Shog9
    Commented Jul 16 at 15:32
  • 5
    "To the extent that your colleagues' jobs are to steward the Stack Exchange network – a community resource ­– they are not good at their jobs." so very true. I didn't understand what community management means and why it's important until Stack Exchange demonstrated a negative example. Excuse me, I meant examples.
    – Passer By
    Commented Jul 17 at 9:18
50

We would really rather users do not upload the file to archive.org or similar data pile sites. Assuring the viability of the Network takes resources, and companies that profit off the back of the work of this community should feel an obligation (“socially responsible AI”) to give back to the communities whose work they use to create the models that they are marketing commercially. Our hope is that because the process for individuals to request the dumps is lightweight and quick, you won’t feel the need to undermine these efforts to encourage commercial re-users to contribute back.

You (the company) don't understand.

The Creative Commons Share-Alike license is not anti-commercial.

No generative AI that absorbs CC BY-SA works is complying with the license. This is my opinion; I'm aware that the CC org itself seems to disagree, and some courts might rule differently. But mine is probably the common opinion of most people releasing things under that license. (Surveys to the contrary would be interesting to see!)

But we contributing members of the SE network are not against the commercial exploitation of our content. The SE company profits off our contributions! So can other companies, so long as they comply with the license.

There is no implication of "giving back" in the CC BY-SA license; instead, it has a legal requirement to license derivative works under the same/similar license. As long as you do that, if you can make a profit from our contributions, then the CC BY-SA license is an open invitation to do so! We contribute to our SE sites because we want to help people learn about hundreds of topics. If companies can do so in new, innovative ways, while complying with our license, then that's great. There is no implication or expectation that they somehow contribute back to the SE community in the process.

2
  • 14
    +1 What the AI would need to do is state: "This answer was produced based on answers A, B and C from curiousdannii, Philippe, goldPseudo on the Stack Overflow site". That would provide the required attribution for the actual authors, and also give traffic to the SE questions. That might also show the all-knowing bot actually copying answers given by someone else in a broken way, which I agree the AI companies might not fancy, but ignoring the license is not the way.
    – Ángel
    Commented Jul 14 at 14:29
  • 9
    Similarly, it might not be easy for them to determine the source of the answers they give, as it is coded now, but it's not our task to solve that. If they have no license to use the content, they must refrain from doing so.
    – Ángel
    Commented Jul 14 at 14:30
46

I have to join this discussion a little late, due to being mostly offline in the last few days.

Let me start with a personal consideration. I won't sugarcoat it. You have reached a new low. This is pathetically bad.

In the last few months I have asked you multiple times if you were trying to remove anonymous access (this is just one example), and you conveniently ignored that (like you ignored my questions about the options in the data request page...). Now I come here to see I was right.

The company is trying to put a login wall on the availability of the data dump. Given this post's passive-aggressive threat that "when you breach the agreement that you make when downloading the dumps file, we do have the option to decline to provide you with future versions of the data dumps," I suspect the company thinks they can watermark the dumps to identify who posted a downloaded file to a public-facing site... and not get their attempt utterly destroyed the second that the community - mostly made of devs - decides to revolt. Also, please remember that the worst option is RPA, and despite you thinking you can detect that easily... trust me, you don't want to force people to organize.

I will forego here any argument about the legality and morality of the company actions. I think others have already done a great job.

I will instead point out what - in my opinion - the company is actually trying to do, because I haven't seen this mentioned much.

The company's recent actions clearly point at two parallel intents:

  1. The company wants to artificially inflate the userbase numbers. Constantly showing a "pwetty plez, make account" prompt in the face of anonymous users clearly shows how you hope to increase user "activation." There is no doubt about that.

    Yet what I am not sure about is whether everyone in the company really thinks this is going to work, or if this is just a "mad cell" trying to scam the top management with bigger numbers that won't translate to actual contributions: (Look at these shiny numbers! 1k more users on the site... Yeah, they didn't post anything, but that is 1k more users still!!! Our plan to annoy them to the point of creating an account and never posting anything worked great!)

  2. It is also clear that the company doesn't care at all about the site's purpose. The only value Prosus sees in the site is selling the data / using the data as training material. Considering the recent OpenAI partnership, I think it is pretty safe to assume that Prosus is desperate to get exclusivity on the training material the community is making available, and is willing to do whatever it takes toward that goal, at least until some clamorous mishap ends this story in disaster (and to be honest I think they are really getting there).

    Like before, I am also not sure of the actual reality here: does Prosus really think they can restrict access to the content - freely reproducible under a CC BY-SA license - without someone making it available somewhere else? Or are they just pretending to be able to, in order to mislead their OpenAI partner into thinking they have the "strategic advantage of exclusive access to Stack Data"? To be honest, I don't care; for us users the result is still the same.

I think there is only one reasonable course of action. Starting today, I will officially stop contributing new Q/A content to the network and performing any curation activity. I will only stay active on posting on Meta.SE.

7
  • 12
    Kudos, while I won't join the strike yet, I do agree with every word here, especially the part where the company ignores any question they don't feel comfortable to answer. (But what could they say? "Sure, we're thinking very hard how to block access to the data dump!") Commented Jul 18 at 12:11
  • @ShadowWizard I think that at least in the data request case it could be formalized as a violation of the GDPR principles. I know that one of the requirements for a cookie consent window to be considered "valid" is that it should be clear and not obfuscate its purpose. I think that the same rules apply to a data request module. Having a data request module but obfuscating what each option does is probably like having no module at all. Commented Jul 18 at 12:43
  • 4
    I would join you, except I stopped contributing (mostly) a while ago. I will stop contributing harder though, in solidarity :) I did give in and add some alt-text to an image recently. No more!
    – ColleenV
    Commented Jul 20 at 21:06
  • 1
    @ColleenV plot twist: SE will use AI to auto create those alt texts that will be missing now. :D Commented Jul 22 at 9:05
  • 2
    @ShadowWizard Great, so they’ll be even more difficult to find and correct. AI will describe the image, not the information the author intended to communicate. It will parse the text from an image, even though the point of it was that the font or color is messed up. Every time I think I’ve heard the dumbest use of genAI possible, some new one comes along. genAI is a wonderful tool for humans to use; it’s currently terrible at fully automated anything. Oh, it was a joke lol. I need more coffee.
    – ColleenV
    Commented Jul 22 at 11:15
  • 11
    I think Stack Overflow's Senior Leadership Team is the cell of employees trying to shine up a turd to show Prosus the company is worth the $3B Prosus paid a couple of years back. Prosus supposedly is invested for the long term, but they don't want long term money pits. They want to get some sort of return on their investment.
    – AMtwo
    Commented Jul 22 at 12:03
  • @ColleenV is it really a joke? I'm not sure anymore: lol Commented Jul 22 at 17:43
45

I have many, many thoughts on this, so this answer might end up being a bit fragmented. I know this isn't novel among the other voices here, but I think this is a very, very bad idea.

Self-Interest & Hosting Burden

Something that dumping to the Internet Archive does is remove Stack's self-interest from the equation of hosting the data. That is, Stack no longer pays for the burden of exfiltrating that much data to every user who wants it, and Stack isn't in unilateral control of who can or cannot access my content in accordance with its license.

I don't mean to suggest anything about Stack's ability to pay for bandwidth costs; my point is that by taking sole ownership of hosting the dump, especially when this very announcement makes clear that they don't really want the dump to be accessible, the Company is taking on paying for something that it doesn't want to support in the first place. In other words, this move creates a clear and tangible incentive for the Company to discontinue the dumps whenever the cost begins to seem unreasonable, even more than already exists.

That's bad.

The reason the dumps were done to begin with was a goodwill stake in the ground, a tangible means of saying "the Company is a steward of the data, not its owner or arbiter". By putting yourselves in control of who gets to exfiltrate the data, even if today it's wholly unrestricted, you've created a massive temptation to decide who does or doesn't get to have it, or to disable it altogether.

In my view, completely regardless of any tangible benefits like download speed, it's a terrible, awful, no good, very bad idea to put yourselves in that position if you have any inkling of care for the goodwill the dumps were designed to facilitate all those years ago.

OverflowAPI & The Dump

This is a little bitter, but it's staggering to me that the Company apparently thinks that its OverflowAPI product offers so little value that the raw data dump would even register as a potential threat.

Does the Company really believe this little in its own engineering capacity?

The data dump is giant, unwieldy, unfiltered, and delayed – if OverflowAPI is so bad or limited that a client would even consider the dumps as a reasonable alternative, then there's incredibly vast room for improvement.

The dumps should be 99.99% orthogonal to the existence and value of OverflowAPI. Your product must offer more value than the data alone, precisely because the data is already out there; that's the inconvenient (for the Company) truth of the CC-BY-SA license that's baked into the foundation of the site.

Hiding the dump isn't going to change that.

Hypocrisy of Expecting "The High Road"

The Company is stomping on a lot of goodwill to protect its interests here (it's seemingly been a theme lately).

If you, the Company, are willing to trash the years of goodwill represented by the openness of the Data Dump; willing to take arguably one of the most significant remaining pieces of consistent goodwill offered to the Community, and crush it, lock it away, in order to protect your own interests...

...then how can you possibly expect other companies to take the proverbial high road and pay for your API instead of scraping the site or using the content dump that they have a legal right to use under license, if it's not in their interest?

It feels absurd to come all this way, stomp all over what the Data Dump stands for, clearly and demonstrably misunderstand what the license even means and allows, and then still, for some reason, expect other companies, who care about their own bottom line(s), to "take the high road" when you arguably won't even take it yourself.

There's a reason folks are really bitter about this move.

2
  • 6
    I'd note, and strongly feel, that trading ever-dwindling amounts of trust and goodwill for short-term perceived gain for their SaaS products is a bit of a tradition at this point. As is cutting back resources when said SaaS products underperform. Commented Jul 17 at 0:08
  • 3
    "stomping on a lot of goodwill" -- not a very big pile to stomp all over at this point in time. Probably more of a Stockholm syndrome in many cases too (definitely my case, but it's still so damn useful, no matter how much those wonderful people in management try to achieve the opposite).
    – Dan Mašek
    Commented Jul 17 at 0:15
40

We, the community, want to contribute content under a Creative Commons Attribution ShareAlike license, to share with the whole world. We're inspired by Wikipedia and other free culture initiatives. Stack Exchange, Inc is making it difficult to access the knowledge that we want to be freely shared with the world, and that's a threat to our mission. We trusted you, Stack Exchange, to steward this data well, even though you never paid any one of your users or moderators for our contributions, not even one dollar. Freely you have received, freely give: that's the mission we all signed up for.

Of course, we want you to make money! I also want all the contributors and moderators to make money. We don't hold you responsible to work out how to pay us for our work, so don't restrict our knowledge base and our data dumps because you are not making as much money as you would like, or because you envy NVidia, whose market cap is $3.18 trillion, more than the entire market of many countries.

I'll quote Zoe:

There's no way around it anymore now that you [SE] have pushed this through in spite of internal warnings; you [SE] have made Stack Exchange, Inc. the biggest threat to the sustainability and future of the community. genAI is in second place, because the ongoing damage you [SE] are doing to the community, and to data archival and preservation efforts beats any damage genAI could do by scraping the data. While this hasn't been said anywhere, it's clear to all of us that the only reason you [SE] are doing this isn't to protect the community, but to protect your [SE's] revenue

31

So there are a few 'transformative' versions of the data dump that respect the CC licence and exist specifically because of the open nature of the network - Brent Ozar's MSSQL conversion comes to mind. There's an older dump on BigQuery by Google (who is an SE partner), which appears to have been last updated in '22. There was also an offline SE instance by Kiwix, which is (was?) officially supported by SO Inc., who lists it as a partner.

All of these examples are within the spirit and letter of the rules, and two of them are officially supported by SE. They are either educational (Brent), non-profit (Kiwix), or, in the case of Google, a current partner.

Considering they respect the CC licence but downstream users may not, how would this new policy affect someone who wants to share the data for non-profit or educational use but could potentially have their work used by LLMs, with or without their knowledge?
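For context on what such format shifts involve: the public dumps are 7z archives of XML files (Posts.xml, Users.xml, etc.) where each record is a `row` element with attribute fields. A minimal sketch of a conversion into SQLite, in the spirit of the projects above (the `row` attribute layout is the dump's real format, but the table and column names below are my own, not any project's actual schema):

```python
# Minimal sketch: load one site's Posts.xml from a Stack Exchange data dump
# into SQLite. The <row .../> attribute format is the dump's real layout,
# but the table and column names here are illustrative only.
import sqlite3
import xml.etree.ElementTree as ET

def load_posts(xml_path, db_path):
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS posts (
        id INTEGER PRIMARY KEY, post_type INTEGER, score INTEGER, body TEXT)""")
    # iterparse streams the file, so a multi-gigabyte dump never has to
    # fit in memory all at once
    for _, elem in ET.iterparse(xml_path):
        if elem.tag == "row":
            con.execute(
                "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)",
                (int(elem.get("Id")),
                 int(elem.get("PostTypeId", 0)),
                 int(elem.get("Score", 0)),
                 elem.get("Body", "")),
            )
            elem.clear()  # release the parsed element's memory
    con.commit()
    return con
```

Streaming with `iterparse` rather than loading the whole tree is what makes this workable at dump scale; Stack Overflow's Posts.xml alone runs to tens of gigabytes uncompressed.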

11
  • 3
    Add seqlite.puny.engineering to the list Commented Jul 13 at 9:59
  • 1
    It wasn't meant to be exhaustive, but rather illustrative :D. Aka, there's probably a ton more Commented Jul 13 at 10:08
  • 1
    ... I didn't recognise you in the comments, but yeah, I think there will be community backups :D Commented Jul 13 at 11:16
  • 4
    @JourneymanGeek - All three of those are more than welcome to continue to get the data dumps. We certainly can't and wouldn't consider declining to provide them to someone based on something that a downstream user MIGHT do.
    – Philippe StaffMod
    Commented Jul 13 at 12:33
  • 16
    @Philippe I'd appreciate if you're able to clarify: are they welcome to get the data dumps, or are they welcome to continue to distribute their transformations? One of the conditions shown in the screenshot is "I will not transfer it to others without permission from Stack Overflow." You also state that if someone doesn't follow that condition, the company has the option to decline to provide them with future dumps. Projects like SEqlite amount to transferring a format-shifted version of the data dumps to the public. How much transformation is required to make transferring the dumps ok? Commented Jul 13 at 21:42
  • 9
    Also it appears that preparing something like SEqlite would require clicking through 364 separate download forms (assuming "additional checks for bots" manages to successfully prevent automation), which poses a rather large practical impediment. Saying that someone is "welcome" to do something if only they click through hundreds of forms instead of the single download available now is...not particularly welcoming. Commented Jul 14 at 3:24
  • @ZachLipton In these cases - someone can request the dumps - as per meta.stackexchange.com/questions/401324/… . As long as one or more people are fine with requesting a full dump (and we're yet to see the process, and how all of this does/doesn't scale) and are willing to share a download somehow, there's no need. Commented Jul 14 at 3:28
  • "are they welcome to continue to distribute their transformations?" If it's truly CC-BY-SA, isn't that compulsory? "ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original." Or is it saying that iff distributed, distributions must use the same license? Commented Jul 17 at 16:07
  • 1
    @Michaelcomelately your second statement is correct, the license doesn't require redistribution, but if one does distribute they need to use the same license. Commented Jul 17 at 16:51
  • @AbdulAzizBarkat Noted, thanks. Commented Jul 17 at 18:14
  • 2
    In 'this' context though the redistribution is implicit - all these cases are about redistribution in different forms. I recently came across a defunct FOSS LLM dataset that contained info from SE dumps. That would be an interesting case but I am waiting on the next set of clarifications before I ask yet another question(tm) Commented Jul 17 at 23:40
30

Is the download going to be from a torrent? When I have downloaded the dumps in the past, the only way I've gotten them to work was through BitTorrent. The dumps are just too big to download through other protocols, in my experience.
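On the mechanics of large downloads: plain HTTP can handle files this size if the client resumes interrupted transfers with a Range request, which archive.org's servers honor for the current dumps. A rough sketch (the URL and file names in any real use would come from the dump host; nothing here is an official endpoint):

```python
# Sketch: resuming an interrupted HTTP download via a Range request,
# which archive.org (like most static file hosts) supports.
import os
import urllib.request

def resume_headers(partial_path):
    """Build the Range header needed to continue a partial download."""
    if os.path.exists(partial_path):
        done = os.path.getsize(partial_path)
        if done:
            # Ask the server for the remaining bytes only
            return {"Range": f"bytes={done}-"}
    return {}

def resume_download(url, path, chunk=1 << 20):
    """Append the rest of `url` onto whatever is already in `path`."""
    req = urllib.request.Request(url, headers=resume_headers(path))
    with urllib.request.urlopen(req) as resp, open(path, "ab") as out:
        while True:
            block = resp.read(chunk)
            if not block:
                break
            out.write(block)
```

From the command line, `curl -C -` and `wget -c` do the same thing.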

6
  • 3
    "The dumps are just too big to download through other protocols in my experience." No, you can easily download them via CLI: meta.stackexchange.com/a/306594/178179 Commented Jul 12 at 19:44
  • 7
    @FranckDernoncourt Good for you. But there are people who have worse Internet connection than what you have available, and those should be accommodated as well.
    – Dan Mašek
    Commented Jul 13 at 22:19
  • 7
    @DanMašek archive.org servers allow you to resume downloads. Commented Jul 13 at 23:24
  • 2
    It's not going to be on a torrent, because they cannot restrict access to logged-in users in that way. Commented Jul 15 at 16:49
  • 1
    I'd think they could gate access to the torrent behind login. If that isn't technically possible, it wouldn't have to be a torrent. But it would need a server powerful enough to push it out in a reasonable amount of time and the ability to resume downloads when clients inevitably get disconnected halfway through. Commented Jul 15 at 17:21
  • 2
  • One could maybe upload them to the Internet Archive again to make them that accessible again. It's probably a good place to store them anyway, however inconvenient it is that others have to do that instead of the company. Commented Jul 20 at 21:10
30

I understand that this file is being provided to me for my own use and for projects that do not include training a large language model (LLM), and that should I distribute this file for the purpose of LLM training, Stack Overflow reserves the right to decline to allow me access to future downloads of this data dump.

This has moved from questionably (il)legal to almost certainly unethical.

Most of the issues surrounding Section 2(a)(5)(C) of the CC BY-SA license, which states that "you may not offer or impose any additional or different terms or conditions on ... the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material" have been cleared up.

The key changes to the wording make it clear that the downloader can exercise all of the rights granted to us by CC BY-SA. The only condition is that if we use the data in certain ways (such as training an LLM), we may lose access to further updates to the data set. This seems to be permissible, since it's already established that a distributor can establish criteria for distribution and is not obligated to distribute CC-licensed content that they hold.

The only open licensing/contracts/legal question that I see is the last part: "should I distribute this file for the purpose of LLM training, Stack Overflow reserves the right to decline to allow me access to future downloads of this data dump".

Although the Creative Commons licenses allow the licensor to make preferences and non-binding requests, this appears to be a clickwrap contractual obligation. While on the surface it doesn't impose any additional terms or restrictions on sharing the version of the data dump I have, it does require me to establish control over, or otherwise accept responsibility for, third parties in order to guarantee continued access to the data dumps. This seems like a questionable contractual term, at best.

Depending on how current court cases go, this may also become a moot point. My understanding is that copyright law generally takes precedence over contract law in many, if not most, cases. If the US courts find that training an AI model is fair use in general or a particular instance of training an AI model is fair use, then this agreement could become meaningless.

The claim that there is a "heightened ease of receiving a dump" is also blatantly false. At least in its current state, not only is it more difficult to receive the data, there are also privacy implications, in that a downloader needs to have an account on each site and expose their PII to every mod on the network. The fact that main and meta dumps are separate also increases the number of downloads. There are solutions to this, such as moving the dumps to a person's Stack Exchange network profile instead of (or in addition to) individual sites. Claiming that this is easier before it actually is easier is wrong, though, especially since we don't know how many data dump releases it will take until it truly becomes easier to access the data.

More broadly, though, these changes violate some of the fundamental principles of open source and the commons.

Consider some words from the early days of Stack Overflow:

Everything contributed to the Stack Exchange network of websites is licensed under Creative Commons Attribution - Share Alike. This means it belongs to everyone, and can be freely reused (even commercially!), so long as it follows our simple rules of attribution. That's our contract with the community -- it's your generously contributed content that makes these websites worth visiting in the first place!

To emphasize: freely reused, even commercially, by everyone.

This is the promise that was made that led us all to contribute.

Some people may not like how people use their work. However, a principle of open source is a lack of discrimination. We do not discriminate based on the person, group, or endeavor. The work is out, in the open, for everyone to use.

I do understand the company wanting to monetize the content to the extent possible. However, this isn't the way to do it. Take a lesson from Wikimedia.

Instead of adding terms and conditions to the data dumps, sell more frequent data dumps (if that's something that the AI companies want). Instead of quarterly public data dumps, allow people and companies to buy monthly data dumps.

I highly suspect that the company is doing something similar with OverflowAPI. Without seeing documentation, I suspect that the API isn't that different from the public Stack Exchange API, except maybe with even higher rate limits and perhaps a streaming API.

The underlying problem is that the company is unilaterally making decisions on how to manage, share, and protect user-contributed data without involving the people who own the data. There is no significant effort to understand our concerns or to hear out different viewpoints on who is accessing our contributions, how they access them, and what they are doing with them.

The company doesn't put any significant effort into the data dump, with respect to selecting, arranging, or otherwise curating the content. It's a raw dump of the data that people have contributed, and it's not even provided to consumers in the most usable way. Some people have even found errors in the data.

I highly suspect that, when you wanted to start to put these "guardrails" up, if you came to the community and said things like:

  • We want to reduce the free API daily request quota from 10,000 requests per day to 2,500-5,000 requests per day and reduce the number of requests per second from 30 to 10, but we will work with trusted community members to ensure that public applications will continue to work.
  • We will slow down the production of free public data dumps to once or twice a year in an attempt to encourage commercial adoption. However, if you are a community member or academic researcher, we will work with you to review your use case and deliver more frequent data.

We wouldn't be where we are today.
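As a footnote to the hypothetical quota reduction above: the public API already tells clients how to pace themselves via the `backoff` and `quota_remaining` fields in its responses, so even a reduced limit would be straightforward to respect client-side. A sketch of such a throttle (the field names are real API response fields; the pacing numbers are illustrative, not official limits):

```python
# Sketch of a client-side throttle for the public Stack Exchange API.
# `backoff` and `quota_remaining` are real fields in API responses;
# the specific wait durations chosen here are illustrative.
import time

def pace_request(resp_json, min_interval=0.1, sleep=time.sleep):
    """Sleep as the API asks, and return how long we chose to wait."""
    wait = min_interval  # baseline: stay at ~10 requests/second or fewer
    # The API may ask clients to back off; honoring it avoids throttle bans.
    wait = max(wait, resp_json.get("backoff", 0))
    if resp_json.get("quota_remaining", 1) <= 0:
        wait = max(wait, 60)  # daily quota exhausted: pause before retrying
    sleep(wait)
    return wait
```

Injecting `sleep` as a parameter keeps the pacing logic testable without actually waiting.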

17
  • 3
    I don't find this argument convincing. The "don't do things with the data that we don't like" clause is not about what you are permitted to do with the underlying data, but what SE will do if you do certain things with the data: they will stop making it easy to obtain. CC-BY-SA does not require SE to make copies easily available. Imagine that someone publishes some SE answers into a book---is that person then required to sell that book to anyone who wants it? I can't imagine that anyone would be compelled to do... Commented Jul 27 at 14:41
  • 1
    @XanderHenderson if someone included my answers in a book, yes I'd expect the whole book to be CC licensed as a result of the share alike clause. Regardless of if they can legally get away with whatever the final version of this looks like it still undermines the principles the whole site was built on and people contributed freely under. Commented Jul 27 at 19:54
  • 5
    @Flexo-Savethedatadump You miss my point---the content is still CC-BY-SA. But that doesn't mean that the physical book itself is free, nor that the person selling the book is required to sell that book to any punter who comes along. The only thing the license prevents you from doing is preventing others from doing the same. Commented Jul 27 at 20:05
  • 1
    @Flexo-Savethedatadump If your answers were included in a book but were otherwise unmodified, the book would become a collection or compilation. In the United States and under the CC BY-SA license, a collection or compilation does not have to have the same license as the constituent work. There are already books for sale that are made up of SE answers. Commented Jul 27 at 20:09
  • 1
    All the CC-BY-SA books I've seen have had freely available PDF versions as well as print versions. Commented Jul 27 at 20:10
  • 2
    @Flexo-Savethedatadump (1) There is nothing in the CC-BY-SA license, so far as I can tell, which requires that the material be provided as a .pdf. (2) The analogy that I am making is that the physical form of the book is not the same as the content in the book. All of the material in the data dumps is still available on the SE site, and can still be accessed (e.g. via SEDE). SE are providing an additional service by wrapping all of those data into a handy data dump. It is not the data in the dump that SE are limiting access to, but that handy form factor. Commented Jul 27 at 20:17
  • 1
    By way of analogy, if I compile material into a physical book, I am not required by CC-BY-SA to sell that physical book to anyone who wants it. I might only print 500 copies, and when they run out, they are out. I am not required to print more. The manner in which the data is transmitted (a physical book, a .pdf, a data dump, etc.) is distinct from the data itself. Commented Jul 27 at 20:18
  • 3
    And, again, just to be clear, I am not arguing that how SE has handled the discussion of data dumps has been good. My point is that this legalistic argument about the CC-BY-SA license is a poor argument. SE is not legally obligated to provide data dumps. Even if they provide data dumps, they are not legally obliged to provide them to anyone who asks. They can put up any barriers they like to accessing the data dumps (from making them one-click available to anyone, to not providing them at all). The only thing SE cannot do is dictate how the data are used afterwards. Commented Jul 27 at 20:22
  • 3
    But if they don't like you, for whatever reason, they can stop providing you with data dumps. They can say "If you use the data in a manner we don't like, we aren't going to just give you future versions of those data any longer." There is nothing in the license which requires them to just hand over the data dumps. Which is why I would prefer that the conversation focus on the underlying intention of the data dumps and the bad blood generated by SE behaving in the way they are behaving. The legal argument about the license just seems to miss the mark... Commented Jul 27 at 20:24
  • 13
    To emphasize: freely reused, even commercially, by everyone. This is the promise that was made that led us all to contribute. SE can’t have it both ways. Either we’re volunteering to build a library of knowledge for everyone, or we’re volunteering to work for SE in exchange for imaginary internet points and precious little appreciation.
    – ColleenV
    Commented Jul 28 at 1:39
  • 3
    @testing-for-ya I'm not saying that there was no effort to build the tooling. However, the data dump has no curation beyond what is done on the sites (which is, by far, mostly contributors and volunteer moderators), no selection (everything public gets into the dump), and no arrangement (data is ordered by dates and times, not by any creative organization). Once the tooling is built, there's little month-over-month effort. The change to move away from IA is supposed to remove the last bits of effort, which is the babysitting to make sure the upload worked properly. Commented Jul 28 at 10:57
  • @ThomasOwens Everything public does not get into the dump (it's an explicit set of tables, and only some columns, and not always all rows, based on this), and not everything that gets into Data Explorer gets into the dump (e.g. deleted questions) - there are even fewer tables there. As recently as this year there have been many improvements shared publicly to both SEDE (a bunch of views to make our queries easier, a more reliable cutoff date, and much faster delivery time) and the dump (such as quarter alignment and data fixes). I don't think… Commented Jul 28 at 12:25
  • @ThomasOwens …the OP meant it this way, but the wording was ambiguous and you’ve reinforced that it can be taken this way: that staff just sit on their thumbs every time the dump comes out and no other effort goes into it. From what I’ve seen here and what I can only imagine does not get broadcasted, I think this is a very naive and even disrespectful accounting of the efforts of employees in delivering these things (and not management, as it’s becoming more clear). Commented Jul 28 at 12:29
  • 2
    @testing-for-ya You are conflating the effort to build tools for the data dump with the creative effort to create the data dump. There is no creative effort on the part of Stack Exchange to create the data dump - no selection or arrangement of content, curation, or anything else creative. There was technical work to allow the dumps to be created and there is technical babysitting, but migrating to self-hosting from upload to IA should remove most, if not all, of that. In addition, the tooling is being neglected because there are cases of obvious data integrity issues in the data dump. Commented Jul 28 at 13:55
  • 1
    @ThomasOwens My suggestion was that you and the OP have conflated those with wording too generic. Suggesting there is “no significant effort” is not a fair characterization. Also in my observations the staff have been pretty quick to fix data issues (the recent reports about rows pointed to deleted posts and purged users seem to have already been fixed in the current Data Explorer refresh), and this type of work is also not zero effort. It perhaps doesn’t fit a much narrower definition of “creative” work, but that’s not what the OP said. Commented Jul 28 at 14:24
23

The data dump should be easily accessible; that's what open source data is all about. The fact that you're putting roadblocks in the way of researchers will just make legitimate users not access the data. I'm completely against the "ask for access" mentality because there's no guarantee you'll give access in a timely manner, if at all.

Oh, and as Franck's post points out: you are in fact relicensing! But I never gave you permission to relicense my contributions, and the license was irrevocable when I granted it.

22

The data dump is a hedge against the company failing or turning evil. It was that promise to keep the content available to the global public that made a lot of us willing to donate our time and expertise not only creating that content, but curating it. Placing the data dump behind a link where the company can arbitrarily deny people access breaks that promise. Even if the company were currently trustworthy, it would still be a problem because we can’t know whether the company would remain trustworthy.

How long before there is a reputation requirement to access the data dump? I assume access will be blocked from tor to prevent people from getting around being denied access. Yeah, this is going to work just fine in the short term. Stack Overflow has plenty of meat left on its bones, so you will be able to dine off its corpse for probably many years to come. My prediction is that the folks who are willing to donate to the company in exchange for internet points are not going to be as reliable partners as the people who do it to build a repository of knowledge for everyone regardless of whether they can afford to subscribe to Google’s AI service.

Ending public access to the data dumps will be a hard thing to undo. I hope you looked at the data and can see that the community still hasn’t recovered from the strike. This could very well kill the golden-egg-laying goose. There are ways to monetize the data dump without breaking the promise to the people that created the value you’re trying to sell.

2
  • This is very much my current line of thought... Commented Jul 30 at 21:35
  • 7
    An aside: The curation of the content is just as (if not more) valuable as the content, especially for AI. There is very little reward for curating content on SE compared to asking or answering questions. Many people do it because they care about the site their community is built around. Everything the company does that demoralizes highly-engaged users erodes that connection and takes away reasons to put up with the clunky tooling and the flak that curators inevitably get when they try to keep content quality high. How are the voting stats lately?
    – ColleenV
    Commented Jul 31 at 17:18
18

The "pretty please don't use the data dump commercially, thank u" checkbox and all the potential legal issues around it have been mentioned ad nauseam already. But given the WeAreDevelopers summary that was posted last week1, and how this change's primary purpose - according to the post - is "[...] to still be accessible by the majority of users while restricting access to those using them for commercial purposes", it's worth drawing attention to this part of the announcement:

I’m going to start with an important statement: this is primarily only a change in location for where the data dump is accessed. (emphasis not mine)

To quote Tyler the Creator, "So that was a f***ing lie." If you as a company have ulterior motives regarding the changes you make, at least try not to brag about them at a conference?

I won't repeat myself and preach to the choir more than I need to, because quite frankly everyone else has already made the point far more eloquently than I can. But to me, this confirms what everyone is afraid of with these changes. Stack Exchange is seeking to add downstream conditions beyond the CC BY-SA license that the data dump is licensed under. This is potentially a breach of the license, and can potentially result in the license being terminated.

1 And very conveniently, was not featured! Even though last year's post about the same event was featured for over 3 weeks.

1
  • 4
    Well, yes they lie to us, but that's nothing new, sadly. Commented Jul 25 at 10:39
16

Frankly I'm impressed by all those assuming good faith.

This announcement is the worst form of gaslighting going. Don't try and pretend any of this is being done on our behalf.

16

Today, we are announcing some changes to the data dump process. I’m going to start with an important statement: this is primarily only a change in location for where the data dump is accessed. Moving forward, we’ll be providing the data dump from a section of the site user profile on a Stack Exchange profile.

There are a number of reasons for this: first, this is an attempt to put commercial pressure on LLM manufacturers to join us and our existing partners in the “socially responsible AI” usage that we’re advocating for - to get them to give back to the communities whose data they consume.

Second, we want to help make the process of accessing data dumps quicker and more efficient. While the Internet Archive has been a great partner to us, as you may know, both internally and externally, people have encountered challenges with uploading and downloading the dumps with any reasonable speed.

Lastly with the heightened ease of receiving a dump, we’re curious to see what community members will build with this information.

It finally happened. It took a year and a month from when you first tried it, but nevertheless, here we are.

Always talking about the "greater good" or "protecting users": it's just so unbelievably disingenuous and extremely insulting, as though your buzzwords will convince or fool us.

You don't care about what people have or will build from this, this is nothing but an attempt to push companies towards your commercial offering - whatever shape that may take.

When organizations are able to skip out on their obligations to contribute back, the whole internet suffers. Without Stack Overflow as a resource, many of the world’s millions of technologists would not be able to find the answers to their programming questions. And we know that these organizations are not contributing voluntarily - so putting this light commercial pressure on them is the best method to encourage socially conscious behavior. It’s important to say that when you breach the agreement that you make when downloading the dump file, we do have the option to decline to provide you with future versions of the data dumps. But we really don’t want to have to do that.

Utter hypocrites.

Management should take their own advice first and foremost considering how they treat their own power users, you know, the ones that actually contributed towards making this resource what it is.

My answer from when the June 2023 data dump was flagged as missing seems particularly prescient:

We're quickly scrambling because this would be an amazing opportunity for us to make some more money off your backs.

We're probably going to degrade the experience for people within the community who used to use these dumps because we need to somehow get companies to pay us to use this data in a more friendly way.

12

A bit late to the party, but...

This could create legal problems for Stack Exchange.

Specifically, this has to do with the CC BY-SA license's requirement to remove one's name from content if one asks (i.e., post dissociation).

As quoted in the Creative Commons FAQ that is linked to in the post:

if the licensor does not like how the material has been modified or used, CC licenses require that the licensee remove the attribution information upon request [...] in 4.0, this also applies to the unmodified work.

When a post is dissociated from a user upon them making a request under said clause of the CC license, prior data dumps are not updated to reflect the change as they're historical content.

If the data dumps are hosted by someone else, this is perfectly legal, as Stack Exchange wouldn't have control over them and the host would technically be a different "licensee" under the above quote. However, if the data dumps are hosted in-house, then Stack Exchange would be under legal obligation to go through prior data dumps and remove the requester's name from them since they're the same licensee.

Having someone else (e.g., the Internet Archive) host the dumps would remove the need for SE to do this on every data dump and thus greatly reduce the workload on SE staff. Yes, I'm aware that "there's no such thing as true dissociation on the Internet", but this is a legal issue and thus something SE's obligated to do even if it doesn't technically matter.

6
  • 3
    The full legal code says If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable. It's important to read the full code and not the summaries. The last five words in the clause - to the extent reasonably practicable - are key. Of course, there are legal concerns, but attribution removal isn't one, regardless of where the content is hosted. Commented Jul 19 at 9:43
  • 7
    @ThomasOwens to be fair, I would say that it is "practical" to remove the info from old dumps. IMHO the intent of that line is to rule out impossible tasks like "go door by door to each one who downloaded your data dump and fix the local offline copies they have on their hard disks", not to cover up for bad decisions someone made with their infrastructure. It is "reasonably possible" to fix a file; the fact that it would be extremely costly if they stored every single dump since the start of time is not part of the picture. Commented Jul 19 at 11:55
  • 1
    [cont.] Otherwise it would be very easy to work around the rules by carefully crafting technical limitations in your own infrastructure that have no reason to exist other than being abused to claim you "can't" fulfill your legal obligations. That said... I must be missing something here. WHY would they have to store every single past data dump? If anything, this seems the perfect devious plan to silently remove that and make only the last dump available. After all, the vexatious UI mockup does NOT include any date selection. Commented Jul 19 at 11:58
  • 5
    @SPArcheon-onstrike The term "reasonably practical" isn't well-defined, but if you define a data dump as a point-in-time snapshot of the user contributed content, I wouldn't consider it reasonably practical to invest the time and effort to go through every single snapshot (or even the most recent snapshot) and remove the attribution and then repackage it. Commented Jul 19 at 12:01
  • 3
    There are no prior dumps to be updated. In the post: "The dump file will be provided in an “instant” format - we will generate a URL on the backend for the data from the site to be downloaded."
    – ert
    Commented Jul 19 at 16:16
  • 1
    @SPArcheon-onstrike yeah, the "isn't practical" lasts until a court order tells the billion-dollar company it must, and damages are awarded. I think it's practical to delete something.
    – bad_coder
    Commented Jul 19 at 16:39
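To make the dissociation obligation discussed in the answer above concrete: in the published dump format, attribution lives in OwnerUserId / OwnerDisplayName attributes on the row elements of Posts.xml. A minimal sketch of scrubbing one user's attribution, assuming that layout - the function and sample data are hypothetical illustrations, not anything SE has said it runs:

```python
import xml.etree.ElementTree as ET

def scrub_user(posts_xml: str, user_id: str) -> str:
    """Drop attribution attributes for one user from a Posts.xml dump.

    Dump rows carry attribution in OwnerUserId / OwnerDisplayName;
    removing those attributes is the dissociation step in question.
    """
    root = ET.fromstring(posts_xml)
    for row in root.iter("row"):
        if row.get("OwnerUserId") == user_id:
            row.attrib.pop("OwnerUserId", None)
            row.attrib.pop("OwnerDisplayName", None)
    return ET.tostring(root, encoding="unicode")

# hypothetical two-row dump fragment
sample = (
    '<posts>'
    '<row Id="1" OwnerUserId="42" OwnerDisplayName="alice" Score="3" />'
    '<row Id="2" OwnerUserId="7" Score="1" />'
    '</posts>'
)
scrubbed = scrub_user(sample, "42")  # row 1 loses its attribution, row 2 keeps it
```

The point of the sketch is that the operation itself is mechanical; the "reasonably practicable" debate above is about doing it across every historical snapshot, not about any one file being hard to edit.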
11

Please clarify "and" in the new checkbox wording

this file is being provided to me for my own use and for projects that do not include training a large language model

Is it just me, or is "and" a bit ambiguous here?

I assume you mean for projects that do not include training an LLM, unless it is for personal use only, but this can also be interpreted as only for personal use, and additionally this use must not include training an LLM.

I'm fairly sure you mean the former, but either way I suggest updating the wording to one of those two options.


Assuming you meant the former:

The new checkbox wording is much better

The reason why I care about the data dumps is because they (in principle) allow bootstrapping a new community/site elsewhere, if SE becomes evil (or too evil to use). I don't care about LLMs being trained on it.

I even thought about proposing this exact thing, but figured SE wouldn't listen.

4
  • For me it's clearly the latter; what makes you think it means the former? They don't want the data to be used to train an LLM - be it a personal or commercial LLM. Does it make sense? Sure, they're committed to a specific LLM by now. Is it fair or good? Not really. Commented Jul 27 at 6:19
  • 1
    @ShadowWizard Since this wording was born after the legal team had a look at the previous one, I expect the new wording to be more relaxed, not less (in this answer, the former reading is the more relaxed one, the latter the less relaxed). Also note the second sentence, "should I distribute this file for the purpose of LLM training..." - there's no penalty for non-personal use, only for LLM use. Commented Jul 27 at 9:05
  • 4
    Overall, I feel SE for some reason just can't fathom that anybody would need the dump for anything other than LLM training, they have this false dichotomy between "personal use" and "LLM training". Commented Jul 27 at 9:08
  • 5
    Either way, any extra limitations SE tries to tack on the data dump are unenforceable anyways, as CC-BY-SA expressly prohibits adding any limitations not already included in it. You can safely ignore SE's legal team's baseless scaremongering.
    – Kryomaani
    Commented Jul 28 at 1:04
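On the bootstrapping point in the answer above: each per-site dump is a handful of flat XML files (Posts.xml, Users.xml, and so on), so importing content into a new system is mostly a streaming parse. A rough sketch assuming the standard row/attribute layout - an illustration, not an official import tool:

```python
import xml.etree.ElementTree as ET
from io import StringIO

def load_questions(posts_xml_stream):
    """Yield (id, title) for each question row in a Posts.xml dump.

    PostTypeId "1" marks questions, "2" answers; iterparse streams the
    file, so even multi-gigabyte dumps can be read in bounded memory.
    """
    for _, elem in ET.iterparse(posts_xml_stream):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            yield elem.get("Id"), elem.get("Title")
        elem.clear()  # release parsed rows as we go

# hypothetical dump fragment: one question and one answer
sample = StringIO(
    '<posts>'
    '<row Id="1" PostTypeId="1" Title="How do I parse XML?" />'
    '<row Id="2" PostTypeId="2" ParentId="1" />'
    '</posts>'
)
questions = list(load_questions(sample))
```

In a real migration the same loop would read the site's Posts.xml from the downloaded archive and write rows into the new site's database.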
6

this is an attempt to put commercial pressure on LLM manufacturers to join us and our existing partners in the “socially responsible AI” usage that we’re advocating for - to get them to give back to the communities whose data they consume

You do understand that creating an account and contributing are two different things, right?

11
  • Of course we understand that. However, what I was trying to say there was that the partners in socially responsible AI are also contributing back to the community's work in other ways (feeding back data, for instance), or by paying for the data. That's obviously very different from those who just create an account, though those users are welcome too.
    – Philippe StaffMod
    Commented Jul 13 at 12:35
  • 4
    Yet again, how are they contributing, when they will likely just make an account to get the data? Why would they spend their time contributing when they don't need to (or contribute in any other way)? @Philippe
    – Starship
    Commented Jul 13 at 18:06
  • 15
    @Starship it's important to realise they mean "contribute to our growth statistics", not "contribute to society".
    – OrangeDog
    Commented Jul 14 at 11:43
  • Yep, that's my point. A registered user account doesn't contribute anything by itself. It would be different if they said you need 1000 rep to use it, because then they'd at least contribute something back to the community @OrangeDog
    – Starship
    Commented Jul 14 at 13:59
  • @Starship assuming they don't produce a bunch of automatic low-quality answers just to bypass such requirements (times the number of AI creators), actually lowering the quality of the corpus.
    – Ángel
    Commented Jul 14 at 14:22
  • AI content is banned...so they would likely get suspended/lose rep for making lots of AI answers, and wouldn't get to 1000 rep or something like that. @Ángel
    – Starship
    Commented Jul 15 at 2:12
    @Starship AI content is banned... against the company's explicit wishes. The last strike to keep that ban worked, but there's no guarantee the next one will maintain it effectively.
    – OrangeDog
    Commented Jul 15 at 19:03
    @OrangeDog (hopefully) the company has learned the lesson that you can't build a community-made Q&A site without a community to make it.
    – Starship
    Commented Jul 16 at 1:55
  • 2
    @Starship all the evidence suggests they haven't
    – OrangeDog
    Commented Jul 16 at 8:44
  • Well they did indeed give in so... hopefully they learned something @OrangeDog
    – Starship
    Commented Jul 16 at 13:17
  • 1
    (meant to write: I don't understand the downvotes, it's a valid point. Philippe's post contains many invalid arguments to justify the thinly veiled license change) Commented Jul 18 at 17:11
5

If data dumps are not available, we can't check that your AI is in fact "socially responsible" as you call it. I don't trust you that much.

You claim that attribution is non-negotiable: https://stackoverflow.blog/2024/02/29/defining-socially-responsible-ai-how-we-select-api-partners/

but AIs are known for misattributing or not attributing things. So unless you can prove that your AI does not have this problem - by making data dumps available, so we can check all of the AI's answers against all the data it was trained on - I call your "socially responsible" label a fraud that you are trying to hide by disabling data dumps.

-23

Well, this looks sane! As an AI developer myself, I'm totally for this type of licensing. And, to avoid the bots, my suggestion is to adopt Wikipedia's approach: they offer regular pre-made dumps for download, so there's no need to scrape anything. Cloaking data from AI, or poisoning it, sounds like 21st-century Luddism to me - why can protein neural networks learn from something, but others cannot? It's OK!

6
  • 18
    Nothing changes in the licensing. That always has been and still is CC BY-SA. And the data dumps have been published on Archive.org since 2009. You're missing the point. This recent move of SE makes obtaining the data harder even for your use case.
    – rene
    Commented Jul 20 at 16:44
  • well, my use case is totally unaffected ;) Commented Jul 21 at 20:05
  • 10
    Wait until they learn what AI you develop, see your commercial success, and then restrict your access, forcing you into a commercial license. Looks sane to me.
    – rene
    Commented Jul 22 at 4:53
  • @rene as far as I can see, the dumps are not regular at the link you provided Commented Jul 22 at 20:22
  • It is software that needs to be run by a company and hosted by a third party. There are reasons why the quarterly cadence wasn't met: meta.stackexchange.com/… and the torrents are a community effort. Archive.org only has/had the latest.
    – rene
    Commented Jul 23 at 5:08
  • well, it's actually not an issue at all: the company already runs backup procedures regularly, so once a month they can produce an incremental dump and update a full dump. Twelve monthly dumps are rotated and the full dump is refreshed. After that it's served by BitTorrent and IPFS nodes. No extra cost or load at all! Yes, they would need one server for that, for sure, but compared to the server pool they already have and maintain, it's smaller than a water drop compared to the ocean. Commented Jul 26 at 9:33
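The rotation scheme described in the comment above can be sketched in a few lines. Everything here is hypothetical - the dump-YYYY-MM.7z filenames and the scratch-directory demo are illustrative assumptions, not SE's actual infrastructure:

```python
import tempfile
from pathlib import Path

def rotate_monthly_dumps(dump_dir: Path, keep: int = 12) -> list:
    """Keep only the `keep` newest monthly dumps, deleting older ones.

    Assumes names like dump-YYYY-MM.7z, so lexical order equals
    chronological order; returns the paths that were removed.
    """
    dumps = sorted(dump_dir.glob("dump-*.7z"))
    stale = dumps[:-keep] if len(dumps) > keep else []
    for old in stale:
        old.unlink()
    return stale

# demo on a scratch directory: 12 dumps from 2023 plus 2 from 2024
scratch = Path(tempfile.mkdtemp())
for name in [f"dump-2023-{m:02d}.7z" for m in range(1, 13)] + \
            ["dump-2024-01.7z", "dump-2024-02.7z"]:
    (scratch / name).touch()
removed = rotate_monthly_dumps(scratch)  # drops the two oldest 2023 dumps
```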
