- cross-posted to:
- reddit@lemmy.world
As long as the previous collections of archives are still intact, we probably don’t need all of their new spam posts in the Wayback Machine anyway.
LOL I should have scrolled down. You said what I said, with fewer words, first.
It is my understanding that if you block the Wayback Machine from indexing your site, it will also delist the site’s history.
The ability to block crawling is separate from the ability to delist old pages; the latter usually happens after domains change owners.
They do archive sites against the owners’ wishes when they consider a site important for public archiving, like some news sites. They are under no obligation to delete the archives, and I hope they don’t.
Parties have archived the data from Pushshift, which covers a lot of Reddit history.
kagis
https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4
Subreddit comments/submissions 2005-06 to 2024-12
This is the top 40,000 subreddits from Reddit’s history in separate files. You can use your torrent client to download only the subreddits you’re interested in.
I mean, that won’t have the past half year or some low-traffic subreddits, but…
Not that Reddit isn’t hot garbage right now, and has been for a while actually, but a lot of people here have glossed over the reason why Reddit instituted this policy.
AI companies are scraping the Wayback Machine. This is something that should concern all of us.
Why?
Circumventing sites with ‘no ai scraping’ rules
And what do I care about Reddit getting paid?
If the IA doesn’t complain about being used, then it’s fine by me. The ideal outcome would be if the Archive could make some arrangement where it scrapes the data and provides it to everyone. That way, sites get scraped once instead of being constantly hammered.
There are plenty of sites out there, not owned by major conglomerates, that have robots.txt and no-scrape rules; AI companies can use the Wayback Machine as a way to circumvent those policies.
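For context, the kind of policy being circumvented is usually just a robots.txt file. A hedged example (the crawler tokens below are ones various AI vendors have published; treat the exact names as illustrative and check each vendor’s current documentation):

```text
# Example robots.txt: block known AI crawlers, allow everyone else.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
```

The catch, as described above: robots.txt is purely advisory, and it does nothing if the same content is fetched from an archive’s copy instead of the origin site.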
This isn’t about reddit, it’s about AI companies stealing everything on the internet and then selling it back to you while taking your job away.
This is why we can’t have nice things. Tell you what: I will have as much support for you as you have for blue-collar workers. Sound fair?
Since I’m a union worker, sounds good.
Ahh, the next Ronald Reagan.
Fuck Spez
People who posted on Reddit (speaking in the past tense, because who would continue to do so now that we have better things?) never intended for it to be of limited access. Reddit was a publicly accessible place, and people shared their thoughts and comments there because it was “the front page of the internet”, the place of choice to share things with the world. That being scraped should not be a problem. But clearly Reddit didn’t want to give you a platform to share your thoughts with the world; it wanted you to donate your thoughts, claim them as its property, and capitalize on them.
I don’t know… I mean, I agree. But I’m seeing a lot of demands that instances should prevent scraping. Ok, it could be astroturf; a campaign by Reddit/data brokers to neutralize the free competition. But you have seen all those deleted posts on Reddit. Those are some special little minds.
You’re right, there are probably some anti-AI/anti-scraping folks on there as well as here. Personally, I most definitely hate intellectual property more than I do generative AI. But you’re right, different people there will feel differently. The point still stands, though: for those who thought they were sharing their thoughts with the world, the ideas they donated were taken from them.
what’s a reddit?
You use it to scratch your butt, I think.
Fuck Reddit and Fuck Spez.
So reddit will become even less valuable
That means big news is coming, and the media doesn’t want to fuck up the reporting that is coming. Reddit is preparing for mass submission of articles.
Reddit warned my account (first warning in 10 years) and deleted the comment when I told an American he can strike peacefully to show the government they are against it.
I got a warning for “recommending violence” from an AI; the human who reviewed it agreed and didn’t remove the warning, haha.
Reddit is just afraid that its censorship will go public.
I was on Reddit for like 15 years, then got all my warnings and a ban in like a month or two earlier this year. Oh well, lol.
I just replied “Liar, or fucking liar” to every Republican lie I saw. It only took 2 days for a permaban. I feel that if they can lie, we should at least be able to call them out on it.
I was on Reddit for 11 years before getting banned due to Zionists. I have a throwaway Reddit account now for porn and other shit, but I don’t post.
Is that even possible?
Technologically, no. Reddit sends out the data to tens of millions of users as part of its normal operations. It would need to try to block whoever collects that data for the IA. Reddit has the very short end of the stick.
The problem is that evading such counter-measures may be criminal in the US. Obviously, EU laws are much harsher.
Slightly related, can you explain how (a few times for me) an archived page I tried to revisit got erased?
I don’t know their take-down policy. Could be privacy, could be copyright.
I think they are shielded by Section 230 under US law. That means that if they don’t honor take-down requests, they become liable just like the original uploader. So it depends on whether they think they can defend something as fair use. IDK what they do with requests under non-US laws.
Thanks for your detailed explanation.
When I look that up it’s specifically about ‘defamatory, illegal, or harmful content’.
That would be understandable to take down.
Never encountered that myself, the cases I’m referring to were totally legal content AFAIK.
Only very damaging or proof of something.
As a hypothetical example, let’s say an organisation posts it’s associated with Epstein in 1999 which now obviously is very inconvenient.
They understandably remove it from their website, but it should still be on the archive if it was captured before.
However, in similar controversial real cases it wasn’t.
So it appears certain forces have more influence to get them to remove content beyond what’s legally required.
Since then I always screenshot the archive page.

Hmm. There are many things that could cause legal trouble for the Wayback Machine. I wouldn’t jump to conclusions.
You can see on Lemmy that many people would prefer to outlaw scraping, fair use, and all that. Well, not for the “good guys” obviously, but the law doesn’t work on vibes. The IA would be legally impossible in most countries. In the EU, it would be a major crime because of copyright and GDPR. It’s only the traditional US commitment to free speech and fair use that makes it possible at all.
The IA exists in a legally precarious position. That’s not because of any shady backroom dealing. If the crowd in this community had its way, it would be gone.
I know the EU has different (stricter) laws and that they vary between states. (Germany being particularly awful)
There is, however, some complicated form of fair-use policy.
If the IA hosts music and books that might be problematic.
But I’m talking about archived webpages and information previously available to the public with zero commercial value that has been removed.
And this includes American sites.
It is still “intellectual property”. Maybe the policy is just to honor removal requests when the content doesn’t seem to be of public interest. ’Cause why not, right? Look at all the people here on Lemmy angry that their worthless posts are scraped, or deleting them on Reddit. Honoring takedown requests is certainly the path of least resistance.
Not to mention all of Asia, South America, Africa…
Good plan. Keep locking down your big tech platforms, and we’ll all be over here letting folks know where they can find freedom.
‘freedom’ as long as the mod agrees with you.
Or… let them stay on Reddit. I like lemmy much better, and it’s possibly due to the people that are not present and the lack of commercial interest.
I think if the fediverse was ever to become more mainstream, it would naturally splinter. For example, the corporate stuff would be big, and those people who value the small-instance experience we have now would probably de-federate from it. There would always be small fediverses, even if the big fediverses got REALLY big.
No harm in that. To each their own. :-) Everyone gets to decide at least.
Just make your own invite-only server if you’re so worried about it. Digital freedom should be for everyone, not just a few antisocial nerds.
I’m not worried about anything.
Well, clearly you are, or you wouldn’t suggest that most people should stay on (what I think we both agree to be) an inferior platform that affords them fewer freedoms.
If you’re worried that somehow that would bring unwanted attention or a bad crowd, you can always sequester yourself in a more niche server. That’s the whole point of this federated system to begin with: giving us more control of our digital presence.
Careful. Lemmy is too small to draw the attention of sophisticated, persistent abuse. As a company, Reddit has struggled with revenue and we’ve all seen those struggles quite publicly. Lemmy instances with those same challenges would probably just fold and close up.
Federated networks give you freedom, but the potential for abuse is proportional to that freedom. At the same time, federation taken as a whole is far more expensive.
I’m sure it would persist even after an event of malicious activity. It may just turn out like email with servers needing to be added to an allowlist at worst and more moderation. I think scalability might be the limiting factor at some point though and as a result we could end up with several disconnected islands of server clusters instead of globally meshed servers.
Lemmy instances with those same challenges would probably just fold and close up.
Can confirm. I set up a Pixelfed instance for my city with the goal of moving people from Insta to it. After about three months, user signups went from 1-10 a week to a hundred a week.
No way did that many business owners sign up. And yep, all spam.
After a while, my random weekend project in Spring became a full time job. I closed it last month.
I’ve thought of doing something similar, and I think that, while federated spam is hard to deal with, signup spam is manageable if you somehow restrict signups to the actual community you want to support. Open signup on the web is a nightmare.
For a city, an interesting idea might be to only allow signups on a dedicated, physical Wi-Fi AP placed somewhere strategic in your city. People would literally have to go to a physical location to sign up. Piggy-backing on a library system would be another option if you could somehow get them to buy in.
It’s another move to protect against AI scraping that isn’t paying them for access.
Wasn’t Reddit complaining a couple of years ago that too many AI bot crawls were stressing its servers?
Doesn’t the internet archive relieve that stress?
Doesn’t the internet archive relieve that stress?
I think that was probably the real reason for the block. The Internet Archive is too functional, scalable, and accessible a service; it exposes Reddit’s lame excuses about needing to gatekeep access to the community-created content on its website, and makes Reddit look totally stupid unless it comes up with an excuse to block the Internet Archive.
Given that the Internet Archive is the de facto standard way to cite material as seen on a given date — they’re a trustworthy party that will probably persist for a long time — that’s going to make it harder to cite content on Reddit.
Damn, guess if you want reddit data to train your AI that you’ll need to pay Spez for access.
It’s important for people writing papers and such who need to cite material.
I wonder if there’s some way to use the TLS certificate to get a cryptographically-signed copy of a webpage, with a timestamp, that someone could later validate as having been downloaded on that date. I don’t know if existing TLS libraries are capable of that. Like, a Web browser menu option: “Store cryptographically-signed webpage”. Absent a later certificate compromise, I’d think that’d at least provide people a way to credibly say “this is really what was on that webpage on August 15th, 2026”. Like, you’d have to save a copy of the TLS session and then have libraries that could read and validate an already-generated session. The timestamp is already embedded in the session.
Some protocols, like OTR, are designed to specifically not allow that, but AFAIK, TLS could.
EDIT: Well, technically the timestamp is gonna be during the handshake, not tied to the HTTP request internal to the TLS session. It might be possible to game that by establishing a TLS session, holding it open without activity, and issuing a request much later. I’d think that that’d potentially be disallowed by Web servers one way or another, since otherwise you could probably do a denial-of-service attack by holding open a lot of sessions for a long time.
EDIT2: Oh, wait, no, shouldn’t be an issue, because the HTTP Date response header is gonna have a timestamp tied to the response.
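The record-keeping half of this idea can at least be sketched today, without the third-party-verifiable signature being discussed. A minimal Python sketch (the function names are invented for illustration): fetch a page over TLS, then store the server certificate, the HTTP Date header, and a hash of the body. On its own this only documents what you saw; it is not proof anyone else can validate.

```python
import hashlib
import http.client
import ssl
from datetime import datetime, timezone

def build_record(url, http_date, cert_der, body):
    """Assemble the evidence record: what was fetched, when, and over which cert."""
    return {
        "url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "http_date": http_date,  # server-asserted time from the Date header
        "cert_sha256": hashlib.sha256(cert_der).hexdigest(),
        "body_sha256": hashlib.sha256(body).hexdigest(),
    }

def snapshot(host, path="/"):
    """Fetch a page over TLS, keeping the server certificate (DER bytes),
    the Date header, and a hash of the body. This documents what *you* saw;
    by itself it is not a third-party-verifiable proof."""
    ctx = ssl.create_default_context()
    conn = http.client.HTTPSConnection(host, context=ctx, timeout=10)
    try:
        conn.request("GET", path, headers={"User-Agent": "snapshot-sketch"})
        resp = conn.getresponse()
        body = resp.read()
        # conn.sock is the wrapped SSLSocket after the request is sent
        cert_der = conn.sock.getpeercert(binary_form=True)
        return build_record(f"https://{host}{path}",
                            resp.getheader("Date"), cert_der, body)
    finally:
        conn.close()
```

The missing piece remains exactly what the comment above identifies: TLS records are protected with symmetric session keys, so a saved session transcript convinces you, not a skeptical third party.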
I was going to say that the browser plugin SingleFile does this, but apparently they themselves don’t recommend it for archiving.
Don’t forget, Reddit is legally allowed to train on your content, but not the other way around. It’s consistent with US law, where corporate tax is half of income tax.
I am new to Lemmy, is there a fuckreddit sub?
Why would you want to spend more time thinking about a dead site?
I just like to laugh at things I dislike. And I also like to see how bad it’s getting. I was in the undelete sub and it was amazing.
If you seek a pleasant public forum, look about you.
Yes
Yes.
Hi welcome to Lemmy, we hate reddit here.
In a way, the entire lemmy community is the fuckreddit sub
This is a great site to search for communities. Doesn’t seem like there is one.
Damn you Spez.