/int/ - International


Thread archived.
You cannot reply anymore.




File: Polish.png (780 KB, 869x720)
Polish just got confirmed as the best language in the world.
>>
>>216258095
based
>>
I kneel to polan
>>
>>216258095
Amazing
>>
Polish is so soulless and globohomo that the AI finds it the easiest to learn, since it has zero cultural or historical roots embedded in it
>>
>>216258266
coño
>>
>>216258095
Polish confirmed for NPC language
>>
>>216258095
I'd like to see more details about how precisely this was operationalized.
>>
Compared to English which has way more training material??
>>
>>216258095
nice fake image.
>>
File: 1750312846513113.png (709 KB, 1248x582)
>>216258095
i think it's a matter of them having a better Polish dataset when training these. the whole idea was to compare low-resource languages with high-resource languages (they have 3 times more training data in English than in Dutch, for example)

maybe English isn't the best because the average quality of the data in English is a lot worse
>>
>>216261444
The problem is every retard goes: hmmm, should I use classic literature, OR should I scrape some forums from the 90s, lazily write a script to de-boilerplate the most standard HTML stuff, dump it into a training dataset of 9 trillion text files no one will ever have time to manually check, advertise it based on volume, and publish it?
After all, quantity is the most important thing, so that means I definitely shouldn't ensure any basic quality, right?
That's not poisoning the model, right? haha
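The lazy de-boilerplating being mocked here is easy to picture. A minimal sketch in Python of the kind of regex tag-stripping such pipelines use (the page and its contents are invented for illustration); note how the nav and footer junk survives as "clean text":

```python
import re

def naive_deboilerplate(html: str) -> str:
    # Drop script/style blocks entirely, then strip every remaining tag.
    text = re.sub(r"<(script|style).*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse whitespace. Menus, ads, and footers all pass through untouched.
    return re.sub(r"\s+", " ", text).strip()

page = ("<html><body><nav>Home | Login</nav>"
        "<p>Actual content.</p>"
        "<footer>(c) 1999</footer></body></html>")
print(naive_deboilerplate(page))
# -> Home | Login Actual content. (c) 1999
```

The "Home | Login" and "(c) 1999" boilerplate ends up in the training text alongside the real sentence, which is exactly the quality problem being described.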
>>
>>216258095
the best is actually Chinese
hanzi = concepts, not words
soon it'll be Large Concept Models, not large language models
>>
File: 1747884833732852.png (1.59 MB, 1780x1346)
>>216261483
pic related shows the specific tasks they were doing: 'needle in a haystack' tasks. maybe Polish has a lot of text in the dataset related to such tasks, idk

also yeah, including old texts with archaic language wouldn't 'poison' it, but it would just take up 'space' in the model, allowing less information to be encoded about other shit
>>
>>216258095
proud to be a Pole
>>
>>216261542
No, they very much stand for words, or more accurately morphemes (or, even more accurately, meaningful syllables, which mostly overlaps with morphemes in Chinese but not always, as in 徘徊, 葡萄, 蝴蝶 etc)
>>
>>216261850
No real reason it should perform worse than English on coding-type tasks like that, so maybe Polish is just Simply Better.
>>
>>216261542
>>216261970
That's already what tokens are either way thoughever, and it makes them on the fly.
Years back (I don't have the link to the video, sadly) there was already an AI that generated or chose its own value sliders for its configuration based on its training data. The sliders would change the output; I think it was for something like a mockup videogame character creator, same prompt, just adjusting sliders. Some of the sliders seemed to do almost nothing when the guy pulled on them, but had a big, hard-to-describe effect when used in conjunction with another one or two sliders in different ways. Some of them were really simple and obvious, like a human would make.
All because the AI was just looking for strong correlations in its huge token database.
Back in SD 1.5 I was trying to gen something with very limited source material in the original model, and I actually downloaded and Ctrl+F'd through a list of all of its tokens, which also gave their number of occurrences: words, symbols, even single letters sometimes, but mostly a lot of syllables or combinations of two syllables.
There's a way you can bypass typing text and having it interpreted, and instead use tokens in sequence and put the weights directly on them, OR just syntactically insert token literals into strings of input text.
Funny stuff
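That token-list idea is easy to picture with a toy example. A minimal sketch of greedy longest-match segmentation over a made-up vocabulary; real models use BPE with tens of thousands of entries, and every token and occurrence count below is invented:

```python
# Toy vocabulary mapping token -> occurrence count in the training set,
# in the spirit of the SD 1.5 token list described above (all invented).
VOCAB = {"lan": 712, "gu": 8120, "age": 15031, "pol": 4312, "ish": 9954}

def tokenize(text: str, vocab: dict) -> list:
    """Greedy longest-match segmentation: a crude stand-in for BPE."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens

print(tokenize("polish", VOCAB))    # -> ['pol', 'ish']
print(tokenize("language", VOCAB))  # -> ['lan', 'gu', 'age']
```

This is why the token list is mostly syllables and two-syllable chunks: frequent fragments earn their own entries, and everything else falls back to smaller pieces.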
>>
>>216258095
anglophones will still piss themselves because their „sh” is „sz” in polish
>>
File: 1759651672259410.png (152 KB, 1110x305)
>>216261970
it's not coding, it's literally just finding words and numbers and matching them in a text.

they say they had native speakers of each language translate. it could come down to the question phrasing, or to how Polish is encoded into the tokens these LLMs use

the bigger point is that low-resource languages get rekt while high-resource ones perform better. i don't get why all these models even know all these languages. why not cut Korean and Hindi out to specialize the model in English, and then make a language-specific version for other languages?
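The 'needle in a haystack' setup in the screenshot is simple to reconstruct. A sketch of how such evals are typically built and scored; the filler sentence, the "special magic number" phrasing, and the key name are assumptions modeled on common NIAH benchmarks, not taken from the paper:

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. "

def build_haystack(needle: str, n_filler: int, seed: int = 0) -> str:
    """Bury one needle sentence at a random position inside filler text."""
    rng = random.Random(seed)
    chunks = [FILLER] * n_filler
    chunks.insert(rng.randrange(len(chunks) + 1), needle)
    return "".join(chunks)

needle = "The special magic number for wandering-oak is 7412136. "
haystack = build_haystack(needle, n_filler=200)
question = "What is the special magic number for wandering-oak mentioned in the text?"

def score(model_answer: str, gold: str = "7412136") -> bool:
    """Scoring is usually just: does the gold string appear in the answer?"""
    return gold in model_answer
```

The model gets the haystack plus the question in its context window, and translating only the needle and question per language (as the paper's native speakers did) keeps the task identical across languages.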
>>
>>216261444
there is a reason why Polish-speaking Jews were the smartest ones btw
>>
>>216262155
jews were never Polish-speaking, we only learnt about it from playing the Witcher
>>
>>216262175
your prime minister's father didn't even know Hebrew when he was young
>>
>>216262175
So what did the Jews in Poland speak historically? Sure, some of them spoke Yiddish, but it's not as if none learned Polish.
>>
>>216262175
please come back to Poland
>>
>>216262191
>>216262201
>>216262204
kek i guess i triggered the polish larper
>>
Joseph Conrad said that he only wrote in English because he considered himself a poor writer, and English is a very simple language, so even a mediocre novelist like him could easily excel at it; he didn't want to write in Polish, because Polish is such a complex and sophisticated language that his poor skills would be immediately exposed.
>>
>>216262239
i want you back
why do you reject my love?
>>
Jewish intellectual capacity + Polish language = global domination
>>
>>216262135
That's a coding-type task.
Math and text searches are things you'd normally write a script for, so this is replacing the work the code would be doing.
High-resource languages perform better because they've accumulated ways to express more types of thought.
>why not cut korean and hindi out
Because every time you run the model with an input that isn't Korean or Hindi, it's basically guaranteed to immediately eliminate any thoughts related to them, unless there's something useful beyond the surface level of the tokens in those languages.
AI is reductive, and model inference behaves somewhat like a binary search tree: if finding the thing you're looking for among 1024 things takes about 10 binary choices, then among 2048 things it takes about 11, because each choice halves the possible results.
That's a loose analogy, because the model is including rather than excluding (or more precisely, balancing and processing) based only on your input tokens and its internal instructions or default sense of the model data. But basically the only reason to trim training data is to reduce the final file size somewhat, and not even linearly, which isn't a good tradeoff if the data might be useful for anything, especially if it's a language, since that makes the data's inclusion required for the model to function for monolinguals of that language.
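The halving intuition can be sketched in a few lines (exact counts depend on how you count the comparisons, but the point is that doubling the set adds only one step):

```python
def halvings(n: int) -> int:
    """How many yes/no choices it takes to cut n candidates down to one,
    assuming each choice halves the remaining set."""
    count = 0
    while n > 1:
        n //= 2
        count += 1
    return count

print(halvings(1024), halvings(2048))  # -> 10 11
```

That logarithmic scaling is why doubling the amount of stuff a lookup structure holds barely slows it down, which is the analogy being made here.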
The better way to reduce a model's file size AND its hardware footprint, so it runs on less powerful systems without swapping, is to quantize it, which just means reducing the numeric precision of the finished model by some factor. You can usually go down to Q4 or so without fucking up the practical use of the model too badly; it's worse, but not much worse, and much lighter.
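Quantization itself is easy to sketch. A toy symmetric round-to-nearest scheme in pure Python; real Q4 formats quantize per block of weights with extra scale tricks, and the weight values here are made up:

```python
def quantize(weights, bits=4):
    # One scale for the whole list; signed integer range is +/-(2^(bits-1) - 1).
    levels = 2 ** (bits - 1) - 1              # 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    q = [round(w / scale) for w in weights]   # small ints, stored compactly
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.12, -0.53, 0.91, -0.07, 0.33]          # made-up float weights
q, s = quantize(w, bits=4)
w_hat = dequantize(q, s)
# round-to-nearest keeps each weight within half a quantization step
err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Q4 versus Q8 is the same idea with `bits=4` versus `bits=8`: fewer distinct levels, smaller file, slightly noisier weights.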
>>
in the 1930s Polish was the lingua franca for Israeli politicians and intellectuals who had made aliyah but hadn't learned Hebrew yet. I once read a report from Palestine in a pre-war Polish newspaper, and they wrote that it was a weird reality, because you found yourself in something like a tropical Poland: everyone around speaks Polish, from a shop seller to a minister, except it's +40 C and oranges grow on the trees.

Today Russian has replaced it, I guess.
>>
why do Jews always forget about 1000 years of uninterrupted thriving and happy Jewish life in Poland and only focus on the 5 last years as if 5 > 1000?
>>
>>216262284
A little more explanation: the final model has tokens, which are syllables or symbols or sometimes small words, as I've already mentioned earlier in the thread, but those themselves are just ways to access existing clusters of pre-weighted neural connections within the model. So if I increase the weight of a single token or group of tokens within the prompt, it's basically just amplifying everything produced by all of the neural clusters thus accessed, applying that multiplier blindly across all of their individual weights.
So if the tokens aren't being pinged, those specific neural clusters aren't getting activated. Just like, unless you have some rare neurological disorder or some specific combinatory memory, you don't see a dog and taste grape jelly. The memory of the taste of grape jelly takes neural development and maintenance to preserve on a systems level, but the knowledge of what grape jelly tastes like doesn't "clutter" your brain or make it slower at recognizing dogs. For one thing, it lives in a physically different part of the brain, which AI doesn't have; but even if it didn't, you still wouldn't taste the jelly unless you were eating it, and you wouldn't think about the taste of jelly unless you were remembering it for some reason: seeing jelly, or a picture of jelly, or me writing about jelly, or just being hungry and wanting quick carbs.
So removing languages from a model's input is just not exposing that person to jelly. Remember, most models don't have a strict limit on their file size; they may have an overall parameter limit, but they may just train until they hit the right checkpoint, wherever that winds up for their level of precision. Yeah, someone with a categorical memory of all jellies ever made knows less about everything else, but that's because he used up limited human lifespan and bandwidth to learn all of that and focus on it. The model doesn't even know it knows anything until it's asked.
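The "amplifying everything a token's clusters produce" bit can be sketched as a weighted sum of token embeddings. The 3-d vectors below are invented, and real SD-style prompt weighting scales the conditioning inside cross-attention rather than a plain sum, so treat this as the rough shape of the idea only:

```python
# Invented 3-d embeddings for a couple of tokens
EMB = {
    "dog":   [0.9, 0.1, 0.0],
    "jelly": [0.1, 0.7, 0.4],
}

def prompt_vector(weighted_tokens):
    """Sum of token embeddings, each scaled by its prompt weight."""
    out = [0.0, 0.0, 0.0]
    for tok, weight in weighted_tokens:
        for i, v in enumerate(EMB[tok]):
            out[i] += weight * v
    return out

plain   = prompt_vector([("dog", 1.0), ("jelly", 1.0)])
boosted = prompt_vector([("dog", 1.0), ("jelly", 1.5)])
# boosting "jelly" scales its whole contribution, blindly, in every dimension
```

Upweighting a token doesn't pick out one nuance of it; it multiplies everything that token's vector carries, which is the "blindly across all of their individual weights" point above.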
>>
>>216262379
And since models don't have actual neurons, it's all represented by virtual neurons, which are just nodes with connections to other nodes, the relative strength of each connection expressed as a floating-point number, I'd assume.
So quantizing a model, in a very broad sense, is sort of just reducing the number of significant digits on those connections, giving the hardware less to calculate and also reducing the overall model size, but without losing the neurons.
I think. I don't actually study AI, I've just read a few things explaining some concepts and am filling in based on intuition, so don't take it as gospel.
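The "nodes with float connection strengths" picture is essentially how one dense layer works. A minimal forward-pass sketch with invented weights and no training:

```python
import math

def forward(x, weights, biases):
    """One dense layer: each output node sums its weighted inputs
    plus a bias, then squashes the result through a nonlinearity."""
    out = []
    for w_row, b in zip(weights, biases):
        z = sum(wi * xi for wi, xi in zip(w_row, x)) + b
        out.append(math.tanh(z))
    return out

# 2 inputs feeding 3 "virtual neurons"; all numbers are made up
W = [[0.5, -0.2], [0.1, 0.9], [-0.7, 0.3]]
b = [0.0, -0.1, 0.2]
h = forward([1.0, 0.5], W, b)
```

Quantizing would just mean storing each entry of W at lower precision; the node graph itself stays the same, which matches the "without losing the neurons" point.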
>>
>>216258095
People say these things to be cutesy. Pilots communicate in the King's language. You think fighter pilots are ever going to degrade themselves with zwsyssiky pwwkkissy? If AI is cutesy, it's gonna get hit.
>>
>>216262175
is this what they teach you? god, your fake country is so disgusting on so many levels
>>
>>216261850
In the picture you posted, what is the relation between the word and the "special magic number"?
>>
>>216258095
Source: a Polish source

Damn, man, they don't even have their own AI, zero integration into life.
>>
>>216263325
>they don't even have their own AI
dumbass
>>
>>216262266
Jewish what?
>>
>>216258095
I thought French would be. Would this imply Russian and Lithuanian would both be good contenders too?
>>
>>216263325
the source is this paper by the University of Maryland and Microsoft: https://arxiv.org/pdf/2503.01996
>>
>>216264904
French was second best
Chinese scored poorly for some reason
>>
>>216262201
Not even Trotsky spoke Yiddish; it was considered a Germanic dialect.
>>
This is actually a reason for it being one of the worst languages, not the best

It's a lovely language though, but Poles are self-hating kuks and there's barely any good content, so it's a waste of time
>>
>>216266503
kill yourself


