The Octonaut

The Octonaut@mander.xyz · 1 day ago

Cisco make TVs?

The Octonaut@mander.xyz · 3 days ago

Though Meta and Google have withdrawn of their own accord, a spokesperson for Mardi Gras said the companies would not now meet the festival’s criteria for partners.

That’s not what happened.

The Octonaut@mander.xyz · 8 days ago

Ah yes. Those are 7000 internal nuclear missiles, for internal use. Phew, I was worried the rest of the world might be as affected by America’s actions as much as we were by China’s mirroring of open source github repos.

The Octonaut@mander.xyz · 8 days ago

Because when dangerous states do things regarded as threatening, you’re supposed to just name the country. Right?

Hey sometimes it even applies to private companies

What’s different about America doing things?

The Octonaut@mander.xyz · 8 days ago

Can we use the established media convention and just say “America fires nuclear arsenal staff”? Was this going to be the Biden administration in exile?

The Octonaut@mander.xyz · 12 days ago

All sounds awful but I’m mostly confused as to why a software project needs a discord

The Octonaut@mander.xyz · 12 days ago

the accepted terminology

No, it isn’t. The OSI specifically requires the training data be available, or at the very least that the source and fee for the data be given so that a user could get the same copy themselves. Because that’s the purpose of something being “open source”. Open source doesn’t just mean free to download and use.

https://opensource.org/ai/open-source-ai-definition

Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.

In particular, this must include: (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies; (2) a listing of all publicly available training data and where to obtain it; and (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.
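The three numbered items quoted above can be read as a checklist. Here is a minimal sketch of that reading; the class, field and function names are hypothetical illustrations, not anything the OSI publishes, and the example release is made up.

```python
from dataclasses import dataclass


@dataclass
class DataInformation:
    # Hypothetical field names, one per numbered item in the quoted text.
    full_description_of_training_data: bool
    public_data_listed_with_sources: bool
    third_party_data_listed_with_sources: bool


# Human-readable labels for each checklist field (wording is illustrative).
LABELS = {
    "full_description_of_training_data": "(1) complete description of all training data",
    "public_data_listed_with_sources": "(2) listing of publicly available data and where to obtain it",
    "third_party_data_listed_with_sources": "(3) listing of third-party data and where to obtain it",
}


def missing_items(info: DataInformation) -> list[str]:
    """Return the labels of every checklist item the release does not satisfy."""
    return [label for field, label in LABELS.items() if not getattr(info, field)]


# A made-up weights-only release that documents nothing about its data.
weights_only = DataInformation(False, False, False)
for item in missing_items(weights_only):
    print("missing:", item)
```

Under this reading, a release that ships only the weights fails all three items, however freely the weights themselves can be downloaded.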

As per their paper, DeepSeek R1 required a very specific training data set because when they tried the same technique with less curated data, they got R1 “Zero”, which basically ran fast and spat out a gibberish salad of English, Chinese and Python.

People are calling DeepSeek open source purely because they called themselves open source, but they seem to just be another free-to-download, black-box model. The best comparison is to Meta’s Llama, which weirdly nobody has decided is going to up-end the tech industry.

In reality “open source” is terrible terminology here: it’s a very loose fit for what is basically trying to say that anyone could recreate or modify the model because they have the exact ‘recipe’.

The Octonaut@mander.xyz · 16 days ago

Only vaguely conscious of this guy from my nice safe seat abroad but IIRC he wasn’t exactly “establishment” in the first place? He can’t have that many friendly colleagues after basically completely changing his policies once in office. Never mind the reputed reason for that happening.

The Octonaut@mander.xyz · 16 days ago

No need to brag

The Octonaut@mander.xyz · 18 days ago

The point is that no branch was ever called a slave branch, just as no audio copy was ever called a slave copy. One does not direct the other in the same way that master and slave implies. Usually quite the opposite.

Oh, and master-slave usually refers to hardware infrastructure, not programming, where, as you mentioned, client-server is the equivalent, or parent and child.

The Octonaut@mander.xyz · 19 days ago

Master in branch meant the same as the master of an audio track or video. We haven’t all stopped saying “remaster” or “masterpiece”.

As it turns out, there are software developers from outside the country where people’s great-great-grandparents were chattel slaves, and they name things without the same baggage. It’s Gulf of America stuff, but for the ‘good guys’.

The Octonaut@mander.xyz · 19 days ago

Sorry, you’ve spent 75 years arming your own government to the point of making this impossible. In the name of “security”. Do you feel secure yet?

The Octonaut@mander.xyz · 26 days ago

It’s certainly better than "Open"AI being completely closed and secretive with their models. But as people have discovered in the last 24 hours, DeepSeek is pretty strongly trained to be protective of the Chinese government policy on, uh, truth. If this was a truly Open Source model, someone could “fork” it and remake it without those limitations. That’s the spirit of “Open Source” even if the actual term “source” is a bit misapplied here.

As it is, without the original training data, an attempt to remake the model would have the issues DeepSeek themselves had with their “zero” release where it would frequently respond in a gibberish mix of English, Mandarin and programming code. They had to supply specific data to make it not do this, which we don’t have access to.

The Octonaut@mander.xyz · 27 days ago

A model isn’t an application. It doesn’t have source code. Any more than an image or a movie has source code to be “open”. That’s why OSI’s definition of an “open source” model is controversial in itself.

The Octonaut@mander.xyz · 27 days ago

I know how LoRA works thanks. You still need the original model to use a LoRA. As mentioned, adding open stuff to closed stuff doesn’t make it open - that’s a principle applicable to pretty much anything software related.
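To illustrate why a LoRA is useless without the original model: the adapter only stores a low-rank update that gets added onto the base weights. A minimal pure-Python sketch, assuming tiny made-up matrices and a hypothetical `apply_lora` helper (real libraries work on tensors, not nested lists):

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(base_weights, lora_b, lora_a, scale=1.0):
    """Return base_weights + scale * (lora_b @ lora_a).

    The adapter (lora_b, lora_a) is only a delta; without base_weights
    there is nothing to add it to.
    """
    if base_weights is None:
        raise ValueError("a LoRA adapter needs the base model weights")
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base_weights, delta)
    ]


# Made-up 2x2 base weights and a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # 2x1
a = [[0.5, 0.5]]     # 1x2
print(apply_lora(base, b, a))  # [[1.5, 0.5], [1.0, 2.0]]
```

The adapter here is three numbers; everything else in the result comes from `base`.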

You could use their training method on another dataset, but you’d be creating your own model at that point. You also wouldn’t get the same results - you can read in their article that their “zero” version would have made this possible but they found that it would often produce a gibberish mix of English, Mandarin and code. For R1 they adapted their pure “we’ll only give it feedback” efficiency training method to starting with a base dataset before feeding it more, a compromise to their plan but necessary and with the right dataset - great! It eliminated the gibberish.

Without that specific dataset - and this is what makes them a company not a research paper - you cannot recreate DeepSeek yourself (which would be open source) and you can’t guarantee that you would get anything near the same results (in which case why even relate it to this model anymore). That’s why those are both important to the OSI, who define Open Source in all regards as the principle of having all the information you need to recreate the software or asset locally from scratch. If it were truly Open Source by the way, that wouldn’t be the disaster you think it would be as then OpenAI could just literally use it themselves. Or not - that’s the difference between Open and Free I alluded to. It’s perfectly possible for something to be Open Source and require a license and a fee.

Anyway, it does sound like an exciting new model and I can’t wait to make it write smut.

The Octonaut@mander.xyz · 27 days ago

I understand it completely insofar as it’s nonsensically irrelevant - the model is what you’re calling open source, and the model is not open source because the data set is not published or recreatable. They can open source any training code they want - I genuinely haven’t even checked - but the model is not open source. Which is my point from about 20 comments ago. Unless you disagree with the OSI’s definition, which is a valid and interesting opinion. If that’s the case you could have just said so. OSI are just a bunch of dudes. They have plenty of critics in the Free/Open communities. Hey, they’re probably American too if you want to throw in some downfall of The West classic hits!

If a troll is “not letting you pretend you have a clue what you’re talking about because you managed to get ollama to run a model locally and think it’s neat”, cool. Owning that. You could also just try owning that you think it’s neat. It is. It’s not an open source model though. You can run Meta’s model with the same level of privacy (offline) and with the same level of ability to adapt or recreate it (you can’t, you don’t have the full data set or steps to recreate it).

The Octonaut@mander.xyz · 27 days ago

I take more than a minute on my replies Autocorrect Disaster. You asked for information and I treat your request as genuine because it just leads to more hilarity like you describing a model as “code”.

The Octonaut@mander.xyz · 27 days ago

I ignored the bit you edited in after I replied? And you’re complaining about ignoring questions in general? Do you disagree with the OSI definition Yogsy? You feel ready for that question yet?

What on earth do you even mean “take a model and train it on thos open crawl to get a fully open model”? This sentence doesn’t even make sense. Never mind that that’s not how training a model works - let’s pretend it is. You understand that adding open source data to closed source data wouldn’t make the closed source data less closed source, right?.. Right?

Thank fuck you’re not paid real money for this Yiggly because they’d be looking for their dollars back

The Octonaut@mander.xyz · 27 days ago

The most recent crawl is from December 15th

https://commoncrawl.org/blog/december-2024-crawl-archive-now-available

You don’t know, and can’t know, when DeepSeek’s dataset is from. Thanks for proving my point.

The Octonaut@mander.xyz · edit-2 27 days ago

Since you’re definitely asking this in good faith and not just downvoting and making nonsense sealion requests in an attempt to make me shut up, sure! Here’s three.

https://commoncrawl.org/

https://github.com/togethercomputer/RedPajama-Data

https://huggingface.co/datasets/legacy-datasets/wikipedia/tree/main/

Oh, and it’s not me demanding. It’s the OSI defining what an open source AI model is. I’m sure once you’ve asked all your questions you’ll circle back around to whether you disagree with their definition or not.