• 0 Posts
  • 42 Comments
Joined 2 years ago
cake
Cake day: July 1st, 2023

help-circle






  • the accepted terminology

    No, it isn’t. The OSI specifically requires the training data be available or at very least that the source and fee for the data be given so that a user could get the same copy themselves. Because that’s the purpose of something being “open source”. Open source doesn’t just mean free to download and use.

    https://opensource.org/ai/open-source-ai-definition

    Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.

    In particular, this must include: (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies; (2) a listing of all publicly available training data and where to obtain it; and (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.

    As per their paper, DeepSeek R1 required a very specific training data set because when they tried the same technique with less curated data, they got R"zero’ which basically ran fast and spat out a gibberish salad of English, Chinese and Python.

    People are calling DeepSeek open source purely because they called themselves open source, but they seem to just be another free to download, black-box model. The best comparison is to Meta’s LlaMa, which weirdly nobody has decided is going to up-end the tech industry.

    In reality “open source” is a terrible terminology for what is a very loose fit when basically trying to say that anyone could recreate or modify the model because they have the exact ‘recipe’.







  • It’s certainly better than "Open"AI being completely closed and secretive with their models. But as people have discovered in the last 24 hours, DeepSeek is pretty strongly trained to be protective of the Chinese government policy on, uh, truth. If this was a truly Open Source model, someone could “fork” it and remake it without those limitations. That’s the spirit of “Open Source” even if the actual term “source” is a bit misapplied here.

    As it is, without the original training data, an attempt to remake the model would have the issues DeepSeek themselves had with their “zero” release where it would frequently respond in a gibberish mix of English, Mandarin and programming code. They had to supply specific data to make it not do this, which we don’t have access to.



  • I know how LoRA works thanks. You still need the original model to use a LoRA. As mentioned, adding open stuff to closed stuff doesn’t make it open - that’s a principle applicable to pretty much anything software related.

    You could use their training method on another dataset, but you’d be creating your own model at that point. You also wouldn’t get the same results - you can read in their article that their “zero” version would have made this possible but they found that it would often produce a gibberish mix of English, Mandarin and code. For R1 they adapted their pure “we’ll only give it feedback” efficiency training method to starting with a base dataset before feeding it more, a compromise to their plan but necessary and with the right dataset - great! It eliminated the gibberish.

    Without that specific dataset - and this is what makes them a company not a research paper - you cannot recreate DeepSeek yourself (which would be open source) and you can’t guarantee that you would get anything near the same results (in which case why even relate it to thid model anymore). That’s why those are both important to the OSI who define Open Source in all regards as the principle of having all the information you need to recreate the software or asset locally from scratch. If it were truly Open Source by the way, that wouldn’t be the disaster you think it would be as then OpenAI could just literally use it themselves. Or not - that’s the difference between Open and Free I alluded to. It’s perfectly possible for something to be Open Source and require a license and a fee.

    Anyway, it does sound like an exciting new model and I can’t wait to make it write smut.


  • I understand it completely in so much that it’s nonsensically irrelevant - the model is what you’re calling open source, and the model is not open source because the data set not published or recreateable. They can open source any training code they want - I genuinely haven’t even checked - but the model is not open source. Which is my point from about 20 comments ago. Unless you disagree with the OSI’s definition which is a valid and interesting opinion. If that’s the case you could have just said so. OSI are just of dudes. They have plenty of critics in the Free/Open communities. Hey they’re probably American too if you want to throw in some downfall of The West classic hits too!

    If a troll is “not letting you pretend you have a clue what you’re talking about because you managed to get ollama to run a model locally and think it’s neat”, cool. Owning that. You could also just try owning that you think its neat. It is. It’s not an open source model though. You can run Meta’s model with the same level of privacy (offline) and with the same level of ability to adapt or recreate it (you can’t, you don’t have the full data set or steps to recreate it).



  • I ignored the bit you edited in after I replied? And you’re complaining about ignoring questions in general? Do you disagree with the OSI definition Yogsy? You feel ready for that question yet?

    What on earth do you even mean “take a model and train it on thos open crawl to get a fully open model”? This sentence doesn’t even make sense. Never mind that that’s not how training a model works - let’s pretend it is. You understand that adding open source data to closed source data wouldn’t make the closed source data less closed source, right?.. Right?

    Thank fuck you’re not paid real money for this Yiggly because they’d be looking for their dollars back