There are a bunch of reasons why this could happen. First, it’s possible to “attack” some simpler image-classification models: if you collect a large enough sample of their outputs, you can mathematically derive a way to perturb any image so that it won’t be correctly identified. There have also been reports that even simpler processing, such as blending a real photo of a wall with a synthetic image at a very low opacity, can trip up detectors that haven’t been trained to be more discerning. But it all comes down to how you construct the training dataset, and I don’t think any of this is a good enough reason to give up on using machine learning for synthetic-media detection in general; in fact, this example gives me the idea of using autogenerated captions as an additional input to the classification model. The challenge there, as in general, is keeping such a model from assuming that all anime is synthetic, since “AI artists” seem to be overly focused on anime and related styles…
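Just to illustrate, the low-opacity blending trick those reports describe is trivially cheap to do. Here’s a minimal sketch using NumPy arrays as stand-ins for real images (the shapes and the `alpha` value are arbitrary choices for the example, not anything from the reports):

```python
import numpy as np

def low_alpha_blend(real, synthetic, alpha=0.05):
    """Blend a synthetic image into a real photo at low opacity.

    The result is visually near-identical to the real photo, but the
    faint synthetic signal may shift a fragile detector's score.
    """
    return (1.0 - alpha) * real + alpha * synthetic

# Stand-ins for a real photo and a generated image (values in [0, 1]).
rng = np.random.default_rng(0)
real = rng.random((64, 64, 3))
fake = rng.random((64, 64, 3))

mixed = low_alpha_blend(real, fake, alpha=0.05)

# Per-pixel change is bounded by alpha, so to a human it's the same photo.
assert np.abs(mixed - real).max() <= 0.05 + 1e-9
```

The point of the sketch is just that the perturbation is bounded by `alpha`, so a detector has to be trained on this kind of contamination to catch it — which is the dataset-construction problem again.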
My wife once hit me in front of my kids because she didn’t like my pointing out a double standard in how she was treating them. The one she was favoring recently started hitting the other one in a similar manner–basically just to silence her when she said something he didn’t like–and when I pointed out the similarity to my wife’s actions and suggested he had learned it from her she got mad and claimed that rather than hitting me she had “hit my hand away” which is a lie and she knows it. It is 100% classic spousal abuse and gaslighting, and yet due to the sheer size difference between us–I’m a foot taller–I feel ridiculous calling it that, and don’t want to find out what else my son learns is OK from his mom if I’m not around, so here I am still married to her, mostly trying to forget the abuse when it’s not actively happening. She’s been abusive, but I’m not really in any physical danger, so staying seems like the rational option in my situation… I imagine that’s relatively common among men.