
Akisamb

@ Akisamb @programming.dev

Posts

16
Comments

103
Joined

3 yr. ago

Moderating

Machine Learning programming.dev

2y ago

In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From
Akisamb @programming.dev 2y ago
Don't know why you are downvoted, it's a good question.
As a matter of fact, it almost happened to search engines in France. Newspapers argued that snippets kept people from visiting their ad-infested sites, costing them revenue.
https://techcrunch.com/2020/04/09/frances-competition-watchdog-orders-google-to-pay-for-news-reuse/

2y ago

Midjourney bans all Stability AI employees over alleged data scraping

Akisamb @programming.dev 2y ago

It does seem odd that scraping activity from just two accounts allegedly managed to cause such an extended server outage. The irony of this situation also hasn’t been lost on online creatives, who have extensively criticized both companies (and generative AI systems in general) for training their models on masses of online data scraped from their works without consent. Stable Diffusion and Midjourney have both been targeted with several copyright lawsuits, with the latter being accused of creating an artist database for training purposes in December.

As far as I know, they do not have copyright over the output of their models. Apart from banning the users, they have pretty much no way to stop this. Even if they had copyright, it's still legally unknown whether training generative models constitutes a copyright violation.

In a similar fashion, a lot of the recent chat LLMs have been trained on output from ChatGPT. After all, why pay humans to produce training data when your competitor has already done it for you?

2y ago

White House: Future Software Should Be Memory Safe

2

Akisamb @programming.dev 2y ago

Why would Java have an impact on battery performance? Pretty much all credit cards run Java for their encryption algorithms, and they need pretty much no power to run.

2y ago

Autonomous murderbot incinerated by SF Chinese New Year street partiers

Akisamb @programming.dev 2y ago

I don't agree. Curvy roads are dangerous, but there are many more conflicts in cities. You're not going to have many pedestrians on curvy mountain roads.

That said, you are right that the ideal comparison would be in the same city. But I'm not sure that the data exists, I'll have to look this afternoon.

Even if my data is not perfect, it's much better than taking one accident and saying that self-driving cars are dangerous. They are not going to be magically better than humans, after all driving is a difficult task, but we should at least crunch the numbers before dismissing them.

2y ago

Autonomous murderbot incinerated by SF Chinese New Year street partiers

2

Akisamb @programming.dev 2y ago

You can't take one accident and use that to generalize.

You need to take into account all accidents and see how much worse humans are.

https://arstechnica.com/cars/2023/12/human-drivers-crash-a-lot-more-than-waymos-software-data-shows/

Cars are naturally dangerous. A robot car is going to have deaths no matter what. That does not mean they are bad if they mean a reduction of cars and accidents. Taxis if done properly can help a public transport system.

2y ago

Can I just convert to Judaism tomorrow and get a free vacation to Israel?

Akisamb @programming.dev 2y ago

They gave them a birth control shot without properly informing them of what it was. Still scandalous, but not what you are saying.

2y ago

GPT told me to break my system

Akisamb @programming.dev 2y ago

These models do not see letters but tokens. For the model, violet is probably two symbols: viol and et. Apart from learning by heart the number of letters in each token, it is impossible for the model to know the number of letters in a word.

This is also why the GPT family sucks at addition: their tokenizer has symbols for common numbers like 14. This means that to do 14 + 1 it cannot use the knowledge that 4 + 1 is 5, as it cannot see the link between the token 4 and the token 14. The Llama tokenizer fixes this, and is thus much better at basic arithmetic even with much smaller models.
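To make the token point concrete, here is a minimal sketch with a made-up vocabulary. Everything in it is an assumption for illustration: the `TOY_VOCAB` entries and IDs are invented, and the greedy longest-match rule is a simplification, not how real BPE tokenizers pick their merges.

```python
# Toy vocabulary (hypothetical): real tokenizers have tens of thousands of entries.
TOY_VOCAB = {"viol": 0, "et": 1, "14": 2, "4": 3, "1": 4, "+": 5, " ": 6}


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenization; returns a list of (piece, id) pairs."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                pieces.append((piece, vocab[piece]))
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return pieces


# The model only ever receives the IDs, never the letters.
violet_ids = [token_id for _, token_id in tokenize("violet")]
sum_ids = [token_id for _, token_id in tokenize("14 + 1")]

print(tokenize("violet"))  # two pieces, six letters hidden inside them
print(tokenize("14 + 1"))  # "14" is one ID, unrelated to the ID of "4"
```

In this toy setup the ID for "14" shares nothing with the ID for "4", so any link between the two has to be memorized rather than read off the input.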

2y ago

I'm an organic market gardener on a small plot (AMA)

1

Akisamb @programming.dev 2y ago

Do you do the harvesting alone, or do you hire help?

2y ago

AI girlfriend bots are already flooding OpenAI’s GPT store

Akisamb @programming.dev 2y ago

Yes to your question, but that's not what I was saying.

Here is one of the most popular training datasets: https://pile.eleuther.ai/

If you look at the PDF describing the dataset, you'll find that these documents are fairly short, with a mean length of less than 20 KB (about 20,000 characters) for most of the dataset's components.

You are asking for a model to retain a memory for the whole duration of a discussion, which can be very long. If I chat for one hour, I'll type approximately 8400 words, or around 42 KB. That is longer than most documents in the training set. If I chat for 20 hours, it'll be longer than almost all the documents in the training set. The model needs to learn how to extract information from a long context, and it can't do that well if the documents it was trained on are short.

You are also right that during training the text is cut off. A value I often see is 2k to 8k tokens. This is arbitrary, some models are trained with a cutoff of 200k tokens. You can use models on context lengths longer than what they were trained on (with some caveats), but performance falls off badly.
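The arithmetic behind those figures can be checked with a quick sketch. The 5 characters per word is my assumption (it is what turns 8400 words into about 42 KB), and the 20 KB reference is just the rough mean document length mentioned above.

```python
WORDS_PER_HOUR = 8400        # figure used in the comment (about 140 words per minute)
CHARS_PER_WORD = 5           # assumption: average word length, one byte per character
TYPICAL_DOC_KB = 20          # rough mean document length quoted above


def chat_size_kb(hours):
    """Approximate size of a chat transcript in KB (1 KB = 1000 characters)."""
    return hours * WORDS_PER_HOUR * CHARS_PER_WORD / 1000


for hours in (1, 20):
    size = chat_size_kb(hours)
    print(f"{hours} h of chat ~ {size:.0f} KB, "
          f"{size / TYPICAL_DOC_KB:.1f}x a typical document")
```

So a single hour is already about twice a typical document, and 20 hours is around 840 KB under these assumptions.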

2y ago

AI girlfriend bots are already flooding OpenAI’s GPT store

2

Akisamb @programming.dev 2y ago

There are two issues with large prompts. One is linked to the current language model technology, where the computation time and memory usage scale badly with prompt size. This is being solved by projects such as RWKV or Mamba, but these remain unproven at large sizes (more than 100 billion parameters). Somebody will have to spend some millions to train one.

The other issue will probably be harder to solve. There is less high quality long context training data. Most datasets were created for small context models.
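For the first issue, a back-of-the-envelope sketch of why prompt size hurts. It assumes a naive self-attention implementation that builds the full n x n score matrix per head and per layer, and it treats a recurrent-style model (the RWKV or Mamba family) as doing one state update per token. The function names are made up for illustration and constant factors are ignored.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Entries in the n x n attention score matrix for one head of one layer."""
    return n_tokens * n_tokens


def recurrent_state_updates(n_tokens: int) -> int:
    """State updates for a recurrent-style model: one per token."""
    return n_tokens


for n in (2_000, 8_000, 200_000):
    print(n, attention_score_entries(n), recurrent_state_updates(n))

# Going from a 2k to an 8k prompt: 4x the tokens.
growth_attention = attention_score_entries(8_000) // attention_score_entries(2_000)
growth_recurrent = recurrent_state_updates(8_000) // recurrent_state_updates(2_000)
```

Quadrupling the prompt multiplies the attention work by 16 but the recurrent work only by 4, which is the scaling gap these projects are trying to close.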

2y ago

Transformer-Based Large Language Models Are Not General Learners: A Universal Circuit Perspective

Akisamb @programming.dev 2y ago

For folks who aren’t sure how to interpret this, what we’re looking at here is early work establishing an upper bound on the complexity of a problem that a model can handle based on its size. Research like this is absolutely essential for determining whether these absurdly large models are actually going to achieve the results people have already ascribed to them on any sort of consistent basis. Previous work on monosemanticity and superposition are relevant here, particularly with regards to unpacking where and when these errors will occur.

I'm not sure this work accomplishes that. Sure, it builds on previous work that showed that a transformer can be simulated by a TC⁰ family. However, the limits of this fact are not clear. The paper even admits as much:

Our result on the limitations of T-LLMs as general learners comes from Proposition 1 and Theorem 2. On the one hand, T-LLMs are within the TC⁰ complexity family; on the other hand, general learners require at least as hard as P/ poly-complete. In the field of circuit theory, it is known that TC⁰ is a subset of P/ poly and commonly believed that TC⁰ is a strict subset of P/ poly, though the strictness is still an open problem to be proved.

I believe this is one of the weakest points of the paper, as it bases all of its reasoning on an unproven conjecture. And you can implement many things with a TC⁰ algorithm: addition, multiplication, basic logic, heck, you can even make transformers.

There still is something that bothers me. Why did it define general learning as being at least a universal circuit for the set of all circuits within a polynomial size? Why this restriction? I tried googling general learner and universal circuit and only came up with this paper.

While searching, I found that this paper was rejected; you can find the reviews here: https://openreview.net/forum?id=e5lR6tySR7

If you are looking for a paper on the limits of T-LLMs, the paper "What Algorithms can Transformers Learn? A Study in Length Generalization" may prove more informative: https://arxiv.org/pdf/2310.16028.pdf It explains why transformers are so bad at addition.

Here is the key part of their abstract:

Specifically, we leverage RASP (Weiss et al., 2021)— a programming language designed for the computational model of a Transformer— and introduce the RASP-Generalization Conjecture: Transformers tend to length generalize on a task if the task can be solved by a short RASP program which works for all input lengths.

2y ago

Hydroxychloroquine could have caused 17,000 deaths during COVID, study finds

Akisamb @programming.dev 2y ago

Didier Raoult, for a large part. He was the one who published the paper that really started this whole mess. His shoddy research practices and lack of respect for patients did plenty of harm.

Good thing that they've forced his retirement.

2y ago

Japan determines copyright doesn't apply to LLM/ML training data

1

Akisamb @programming.dev 2y ago

Hard to say from the article alone, but if it is like the status quo in the EU and USA, then only the training data can be illegally obtained. If I have an AI that is able to recite the script of Bee Movie verbatim, I will be sued.

Google Books had a similar issue. They scanned pretty much all the books in existence and indexed them. Small issue: they did not obtain the consent of the copyright holders before doing this. They were sued and won. You can use copyrighted data as long as you do not provide access to it.

2y ago

America's housing shortage explained in one chart

Akisamb @programming.dev 2y ago

To avoid people being homeless?

2y ago

Permanently Deleted

2

Akisamb @programming.dev 2y ago

Anti-corruption associations mainly focus on the government, which is only natural. I think the best thing is for you to judge their work yourself:

(source for their accreditation; I thought I had read the name of a third association, but I may be misremembering)

By the way, Transparency France is in favor of renewing Anticor's accreditation. I couldn't find Sherpa's position.

2y ago

Permanently Deleted

4

Akisamb @programming.dev 2y ago

Three anti-corruption associations still hold this accreditation. What Anticor is criticized for is its political side: they have to pursue people from all parties, and they are not allowed to align themselves with a political organization.

As such, some aspects of their organization are a bit questionable, like the anonymous donation of 80,000 euros.

That said, from what I've seen, the association seems to have made efforts to increase its transparency. Apparently the decision will be made by an administrative court, so we'll get the final word on this.

Faced with this refusal from the government, Anticor will in turn go before the administrative judge. "In a way, we are relieved to finally be able to demonstrate that we do meet all the criteria for accreditation," the president says.

2y ago

Permanently Deleted

1

Akisamb @programming.dev 2y ago

Article from the newspaper La Croix covering the history of this case: https://www.la-croix.com/france/anticor-l-association-anticorruption-perd-son-agrement-20231227

2y ago

Permanently Deleted

1

Akisamb @programming.dev 2y ago

We discovered (…) in the press a number of extremely troubling things, notably that gendarmerie barracks, police stations and who knows what else were allegedly promised and negotiated in exchange for votes in support of this bill

Yuck, clientelism. In the end, it's the places with the most annoying and least cooperative elected officials that end up with all the services.

2y ago

Tech CEO Forced Assistant to Sign 'Sex Slave' Contract, and Tormented Her With Sadomasochistic Bondage, Lawsuit Alleges

Akisamb @programming.dev 2y ago

I think it's healthy to have clear boundaries with coworkers; they are not the same thing as friends.

That said I spend 41 hours a week working, no way I'm not going to socialise with my coworkers. If I don't make any friends after several years of working at a place I feel I have done something wrong.

2y ago

Live: the "immigration" bill rejected by MPs before it was even debated in the chamber

1

Akisamb @programming.dev 2y ago

Yes, though I'm not sure the majority will accept.