/g/ - Technology


File: IMG_0087.jpg (862 KB, 1488x1317)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>103298520 & >>103286673

►News
>(11/26) Anon re-implements Sparse Matrix Tuning paper: https://github.com/HeroMines/SMFT
>(11/25) Qwen2VL integrated with Flux: https://github.com/erwold/qwen2vl-flux
>(11/25) Speculative decoding added to llama-server: https://github.com/ggerganov/llama.cpp/pull/10455
>(11/22) LTX-Video: Real-time video generation on a single 4090: https://github.com/Lightricks/LTX-Video
>(11/21) Tülu3: Instruct finetunes on top of Llama 3.1 base: https://hf.co/collections/allenai/tulu-3-models-673b8e0dc3512e30e7dc54f5

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/tldrhowtoquant

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://livecodebench.github.io/leaderboard.html
Code Editing: https://aider.chat/docs/leaderboards
Context Length: https://github.com/hsiehjackson/RULER
Japanese: https://hf.co/datasets/lmg-anon/vntl-leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm
>>
File: tetrecap2.png (1.11 MB, 1024x1024)
►Recent Highlights from the Previous Thread: >>103298520

--Paper: MambaIRv2: Attentive State Space Restoration:
>103308827 >103309055 >103309276 >103309351 >103309388 >103309778
--Papers:
>103308752
--llama.cpp speculative decoding update discussion:
>103303609 >103303634 >103303641 >103303672 >103303718 >103303799 >103303990 >103304141 >103304290 >103304378 >103304384 >103304450
--Qwen2vl-Flux image generation model discussion:
>103311018 >103311143
--Impressions and issues with Llama-3.1-Tulu-3-70B model:
>103304919 >103304975 >103304999 >103305041 >103305102 >103306035 >103306464 >103309098 >103309106 >103311939 >103309321 >103311182
--Anon releases PEFT and invites improvements:
>103310510 >103311152
--Anon asks if they can power a Tesla with a spare CPU cable:
>103299698 >103299909 >103300124 >103300180
--Allen AI's AGI achievement and its implications for model development:
>103309146 >103309209 >103310351 >103310222
--Purpose of model warmup during initialization:
>103310115 >103310188
--Optimizing cpumaxxing performance:
>103306785 >103306937 >103307679 >103307706 >103307356 >103307347
--Olmo models and language modeling methods:
>103305802 >103306676 >103306690
--New TTS model OuteTTS 0.2 500M, but Anon is unimpressed:
>103300622 >103300862
--Getting LTX video working with CLI workflow on A40 48GB card:
>103304753 >103304763 >103304841
--Anon shares info on Reflection-70B and AI misinformation:
>103310430
--Anon shares a passage from Samuel Butler's 1872 writing on mechanical consciousness:
>103302296 >103302328 >103302546
--Anon seeks help limiting abusive LLM usage:
>103303509 >103303580 >103304236 >103304253 >103304900
--Anon gets 38% speedup with speculative decoding on llama-server:
>103306207 >103306256
--Miku (free space):
>103298713 >103298723 >103299940 >103300594 >103302609 >103302833 >103303994 >103309499

►Recent Highlight Posts from the Previous Thread: >>103298523

Why?: 9 reply limit >>102478518
Fix: https://rentry.org/lmg-recap-script
>>
File: 1707929546239541.png (972 KB, 596x596)
Omgg it's migu
>>
>>103313004
you could use qlora and tune it for cheap
papers say that qlora is really good
>>
we teto now (again)
>>
>>103313019
yeah, on benchmarks. qlora is a meme on real world scenarios
>>
>>103312989
What an absolute dogshit recap.
>>
File: TodayIsTheDay.png (1.17 MB, 1280x768)
Good morning lmg!
>>
>>103313050
>qlora is a meme on real world scenarios
there is still no proof for this just like the 'modern models don't quant well and lose performance even at 8bit' meme doomposters love to repeat
>>
>>103313053
kill xirself
>>
>>103313053
Omgg it's fartsune shitu
>>
>>103313053
Good morning Miku
>>
>>103313076
https://arxiv.org/html/2410.21228v1
>>
File: StunnedSilence.png (1.11 MB, 1280x768)
>>103313082
>>103313098
>>
>>103312989
Thank you Recap Teto
>>
>>103313114
>"training" less parameters is less effective than training all parameters
holy fucking shit
>>
File: monkey lolipop.jpg (54 KB, 900x900)
>further vindicated about kobold seemingly breaking the fuck out of every model last thread
>decide maybe it is time to break from it and try something else for once
>remember there aren't any alternatives
>remember again how hyper fixated autistic this thread was about oobaballs at the start of the year
>go look at its git

>last update one month ago
>still outstanding pull requests from earlier this year

why though?
>>
>>103313162
not just that, the paper says it creates "intruder dimensions" inside models even when training a high rank lora which make parameters worthless and literally lobotomize knowledge out of your model
>>
>>103313177
>last update one month ago
I use booba, love booba, but the pace of dev is pathetic. Even the dev branch has hardly anything of interest in it. The project desperately needs someone with vision and drive to keep it from irrelevancy.
>>
>>103313220
Read more, it says it is mostly circumvented by doing it the way people have been doing it for a year.
>>
>>103313177
llama.cpp server seems to be lower latency and supports all the fun samplers. No reason to use all the derivatives when it's that functional, unless they have a killer feature you need.
>>
>>103313244
>mostly
never going to touch a l(obotomized)ora again, sorry
>>
Hatsune Miku is the shittiest waifu there is and I am tired of pretending otherwise.
>>
So can I use llama 1B as the draft model for any llama model like lulu? Can it also be a quant or does it have to be full precision?
I'm working with 44GB memory total so it's hard to fit a decent 70B quant and draft at the same time.
>>
File: saintmakise.jpg (236 KB, 1614x992)
>>
https://www.reddit.com/r/LocalLLaMA/comments/1h0ckut/we_just_launched_sentient_a_completely_local/

>"The more you chat, the more the model improves. The training happens on the global model, so your interactions are contributing to the overall improvement of the model."

>The global model aggregation here refers to a technology called Federated Learning - wherein we don't take any data from the user but simply take the updated weights of the model after fine-tuning and aggregate them on a central server.

>So it's basically decentralised fine-tuning, powered by everyones data and secured by blockchain.

>This is very good! I find it interesting how it pulls data from linkedin to paint a clearer picture about you.

>That repo has been setup for v1.1 which will include auto-updates for future releases

>XD well, I'd appreciate it if you tried the demo seeing as how you've already downloaded it :)

>we're trying to make this tech accessible to even non-technical people. That's why we ship with all binaries and dependencies packaged into our installer
>>
>>103312983
Ah, tuesday, yes. Ohio.
>>
>>103313114
>Even at high adapter ranks and with rank stabilization, we find across layers that the effective rank of LoRA updates is less than half that of full fine-tuning and a quarter of the adapter rank. For example, with the high rank of r=768 for RoBERTa, LoRA updates have an average effective rank of 300. This suggests that LoRA is under utilizing its full capacity r, and may help explain observed gaps between LoRA and full fine-tuning on challenging tasks like coding
Damn, I guess it's over for vramlet fine-tuners.
>>
>>103313266
By 'mostly' it's like 99.99%, but sure. Never take another step, because there's a greater chance than that of tripping and dying from it.
>>
>>103313177
I switched from kcpp to llama-server several months ago and it's been great.
I'm pretty sure I get slightly better performance too, although it's been a while since I last benchmarked that.
>>
>>103313243
>vision and drive to keep it from irrelevancy
Dead hobby. The corpse of this hobby is a vehicle for troons posting their green haired autogynephilic icon.
>>
Is this "speculative decoding" thing for some models or all models? People in llama.cpp pull #10455 are only mentioning Qwen.
>>
>>103313328
All models as long as you can get a smaller model with the same tokenizer and vocab as the main one, from what I understand.
>>
>>103313310
>pulls data from your linkedin profile to construct a knowledge graph about you
>>
>>103313328
You can use any model as long as it has multiple parameter count variants with the same tokenizer.
>>
File: 1701779607563578.png (1.74 MB, 1188x712)
Omgg
>>
>>103313248
>unless they have a killer feature you need
I like kcpp's slop list "ban", even if it's not perfect... But I can also say that trying lcpp does give me different results for the same model, not sure yet if they're better though.
>>
>>103313310
>they hooked up a 3b model to a vector db
stop the fucking presses
>>
>>103313289
>So can I use llama 1B as the draft model for any llama model like lulu?
As long as the tokenizer is the same.
>Can it also be a quant or does it have to be full precision?
You can quant it. As usual, probably anything down to Q4 should be fine.
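For reference, the command is roughly shaped like this (filenames are placeholders and the exact flag spellings can differ between llama.cpp builds, so check llama-server --help):
llama-server -m llama-70b-Q4_K_M.gguf -md llama-1b-Q8_0.gguf -ngl 20 -ngld 99 -c 16384 --port 8080
-md points at the draft model and -ngld sets how many of its layers go to the GPU; keeping the small draft fully offloaded is where most of the speedup comes from.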
>>
>>103313328
It works with all models but you need to have a small model that's 'similar' to the big model you want to run to mitigate the loss in quality you get from having the big model rewrite the dumber gens the small model does.
>>
LMG... I... kneel
>>
>>103313348
>>103313339
>>103313310
Engage backpedal
>FL is also just something we're researching rn - it may never exist in future versions, we just wanted to put it on the site to see what people thought of the idea
>>
File: file.png (33 KB, 893x327)
>>103313365
https://www.reddit.com/r/LocalLLaMA/comments/1gzm93o/speculative_decoding_just_landed_in_llamacpps/
What is the "previous speed"?
Spec decoding faster than the big model alone? Small model? Between the speed of small/big models? Sorry if dumb question, I just crawled out of a rock today.
>>
>>103313314
Also saving like 90% compute/memory to get 80% of the effect is completely worthless and no one should ever do that
>>
>>103313365
>mitigate the loss in quality you get from having the big model rewrite the dumber gens the small model does.
that's not how it works retard
>>
>>103313440
Faster than the big model alone, but obviously slower than the small one
>>
File: 1724855669831409.png (6 KB, 340x153)
>>103313453
>>
>>103313387
It gets better
>FL has a lot of cool stuff we can implement like differential privacy but our end goal is to eliminate the server hosting the global model and go for full-blown blockchained federated learning

>all training will happen on your pc, so your data stays on your pc - it's just the model weights that will be aggregated on the blockchain

>again, just an experimental feature we are developing internally - it's not in the app right now and won't be there in the next few versions either

>all training will happen on your pc, so your data stays on your pc - it's just the model weights that will be aggregated on the blockchain
You do the tuning on your pc! Yay local win, then we get the benefits of a better model, made with your compute!
>>
Autoround quantization looks promising:
https://www.reddit.com/r/LocalLLaMA/comments/1h0aev6/lossless_4bit_quantization_for_large_models_are/
>>
https://x.com/kimmonismus/status/1861440503864049800
SORA GOT LEAKED, I REPEAT THIS ISNT A DRILL, SORA GOT LEAKED
>>
>>103313336
>>103313340
Why the fuck is that so? You just need to tokenize the small model's output with the large model's tokenizer, problem solved
>>
>>103313507
>lossless
There's obviously some loss.
>>
>>103313534
Correct, that just wasn't implemented yet. You are free to make this happen though.
>>
>>103313546
Wrong
>>
>>103313528
Who gives a fuck? It's shit that nobody wants anyway.
>>
>>103313528
>OAI jews out so hard the artists leak it
Poetry.
So when are we getting the claude leak(s)?
>>
>>103313528
Fuck your twitter link. Post the magnet or fuck off
>>
>>103313528
>simple openai api calling HF space
>>>leaked
Ai grifters losing creativity.
>>
>>103313528
downloading it right now, hope it runs on 72gb
>>
>>103313528
I bet this is just an API link and the artists retards don't even know they can just take the API link down.
>>
>>103313528
>>>/g/ldg
>>
>>103313555
It's right there in the graph.
>>
>>103313528
>>103313563
No weights

>>103313555
4bit: 82.98
full: 83.44
That's not lossless
>>
>>103313583
That's minuscule.
>>
>>103313528
https://huggingface.co/spaces/PR-Puppets/PR-Puppet-Sora
Am I missing something or is this just a demo with an API proxy and no weights were linked?
If anything this looks like a stealth marketing campaign like strawberry and reflection. Fuck this shit.
>>
>>103313463
You don't get a loss in quality though, it's at most a loss in performance.
>>
>>103313528
>no weights
Fuck off.
>>
>>103313594
Still a loss.
>>
>>103313440
man, I can't get this shit to work on my potato. Oobabooga works though, loads models no problem.
>>
>>103313595
>>103313583
>>103313576
>>103313572
KEEEEK
this proves once and for all that these morons barely have a single working brain cell.
>>
>>103313598
Is there literally any proof of this? How does a small model 'draft' something in a way that doesn't hurt the big model? Sounds like the 5000th edition of AI snake oil.
>>
>>103313619
>these morons
which morons are we talking about? there are so many...
>>
>>103313528
why did you even share this? you're behind /ldg/ by like an hour retard
no weights, so it isnt even a leak. probably a really shitty PR stunt.
>>
>>103313640
>probably a really shitty PR stunt.
>PR Puppets
guaranteed
>>
>>103313622
It's simple, when you use a draft model you can check if the tokens are correct in parallel, but when you're generating without a draft, you have to generate tokens in sequence. That's it, and that's why there's no loss in quality.
>>
>>103313658
But what if the small model has a fundamentally different idea of what is correct due to being a tiny retarded model?
If there somehow is no loss in quality, then the output of running a big model with/without draft must surely be exactly the same in a deterministic environment, correct?
>>
>>103313622
Some people said it's like speculative execution on the cpu
I don't know how gpu speculative decoding works, but on the cpu side it basically allows the cpu to ignore a bunch of conditional checks and pretend that the most likely outcome is the current outcome. Obviously, if the cpu later realizes that it fucked up, it rolls back the changes, but modern cpus are extremely good at predicting those branches as programs are generally fairly repetitive - maybe language is the same, if the amount of gpt slop is anything to go by
>>
Largestral V3 - bench cooked, boner killing schizoshit.
Tulu-3 - NFP, no investors to impress with meme marks, pure dick-ruining smut kino at just over half the size
Yeah I'm thinking benchmarks are killing LLMs.
>>
>>103313710
As proven by the fact that we still don't have a better model than Midnight Miqu after a year, yes.
>>
File: 1715830787598652.png (336 KB, 3000x2100)
>>103313507
Doesn't seem to beat regular quant methods?
>>
>>103313693
Yes, it should be exactly the same except for differences caused by rounding errors, as CUDA Dev said last thread: >>103303718
>>
>>103313693
As long as the implementation is done correctly, there can't be a quality loss, as the bad predictions from the smaller model are just disregarded.
>>
>>103313710
If only Mistral still released base models, we could have eventually had a Tulu Largestral.
>>
>keep getting ugly as shit faces
>wondering why when i haven't had this issue before
>finally hits me
>they look kinda like One Piece characters
>"one piece dress"
>>
Tulu verdict?
>>
>>103313310
>screenshot
>Your name is Varad Deshpande, an aspiring Full-Stack Web Developer
>You often use informal language and colloquial expressions (e.g., "bhai" instead of "brother").
>>
teto save me
>>
>>103313747
easy winner for the meme model of the week award
>>
>>103313739
kek
>>
>>103313747
not as good for characterization as other models... also a bit of a turn stealer in RP. But the smut it produces is out of this world.
>>
>>103313747
Made the few finetooners left itt shit their pants and now they're doing PR against it.
>>
>>103313696
>Some people said it's like speculative execution on the cpu
This is sort of correct. The way it works is like this:
1. Small model generates a sequence of tokens for the current context (as if the CPU ignored branches and continued executing)
2. Big model checks if each token of the sequence is something it would generate by itself, in parallel. (As if the CPU is checking each branch to make sure that the prediction was correct)
3. If any of the drafted tokens don't match what it would generate, the generation ignores that token and the remaining drafted tokens. (As if the CPU rolled back to a previous branch)
>>
>>103313747
Not that impressive considering what we have now, but it's impressive that they managed to unfuck llama 3.1
>>
>>103313787
That was first done by Nemotron though.
>>
>>103313747
Its super liberal with EOS tokens.
You often need several gens in rapid succession to get all the things the model was instructed to do. I can see some potential here for game engines since it appears fairly consistent so far in some of my larger experimental RPG simulator cards.
>>
File: Event71.jpg (56 KB, 800x600)
>>103313710
We just need some benchmarks that are relevant to ERP. You know, talking during deepthroating, sparks behind closed eyes, sucking your dick while being fucked by you, looking into your eyes during a rim job, I could continue forever. Though it shouldn't be this straightforward in the public examples, it could be replaced with something SFW, but private tests must be run with hardcore smut.
>>
im switching to llamacpp from kobold but

how do i automate the settings from kobold so i dont have to keep making batch files for all the settings from scratch every time i want to launch models?
stupid question but this is why we (i) use kobold in the first place, just making the basic .bat to run llama has already proven to me kobold must be brutally raping every model because i had a real nice time talking to the generic assistant of the model im trying with nothing changed
>>
>>103313177
Don't fix what's not broken
>>
>>103313822
>Its super liberal with EOS tokens.
I had the opposite issue. It wants to write like a thousand tokens before giving me a turn in RP.
My guess is you buggered up your prompt template.
It goes from bos to <|system|> to <|user|> to <|assistant|> and then the eos goes at the end of the assistant message. If you eos/eot between the steps you are introducing superfluous stops that it might be confused by.
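Rendered out, that layout looks roughly like this (just a sketch; the exact special-token strings and newlines are whatever the chat_template in the model's tokenizer_config.json says):
<|system|>
{system prompt}
<|user|>
{user message}
<|assistant|>
{assistant reply}{eos}
<|user|>
...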
>>
>>103313853
on this note I don't get why the fuck they are trying to create a new proprietary prompt format. Just use Llama-3 native ffs.
>>
>>103313747
Its surpassed nemotron for RP now. Its dirty and creative and actually knows how to advance a plot unlike most other models
>>
>>103313853
isn't the hf conversion script supposed to pull all that crap from the json files into the gguf?
>>
>>103313868
It's not new, I think that's the prompt format used by Phi-3.
>>
>>103313883
It would make too much sense to have a standard. These can be wrong (and I think they are in this case). For example, mistral models also have fucked template and company recommends their python lib to do tokenization.
>>
Thanks to the anon who made me check llama.cpp out, it's actually much more fun to use for RP than koboldcpp. I guess it was stupid of me to trust a project maintained by a single guy.
>>
>>103313890
Kind of. I just looked it up
It's the same tokens as Phi-3 except it removes the end token between steps.
>>
>>103313917
Oh and it uses the Llama-3 <|begin_of_text|> token as a bos
>>
>>103313890
>>103313917
Phi uses <|end|> not <|endoftoken|>
Why the fuck don't they just use Phi tags if not Llama 3?
>>
endoftext I mean
>>
>>103313950
Maybe the key to making a better LLM is having a clusterfuck of a half-assed prompt format copied from multiple sources.
>>
>>103313950
I have to assume they did testing and went with what worked best.
>>
>>103313969
KEK
>>
>>103313969
I mean I've always held that pretending the model is doing anything but just completing text is kind of dumb. So by removing an end of turn indicator between steps they are kind of removing some of the retarded fluff.
>>
>>103313980
They put out the best vision and now the best "assistant" tune so im going to assume they know what they are doing compared to local shittuners.
>>
>>103313969
boolshit
on the other hand, I notice deepseek v2.5 also only has end token for assistant turn
<|beginofsentence|>{system_message}<|User|>{user_message_1}<|Assistant|>{assistant_message_1}<|endofsentence|><|User|>{user_message_2}<|Assistant|>
>>
>>103313969
Regardless of the text representation, it's still a single token. You can even rename it in the tokenizer configuration and use the new one.
>>
>>103314018
So they copied that part from DeepSeek.
So we have the Llama-3 BOS token, the Phi-3 header tokens, and the DeepSeek format, essentially.
>>
>>103314018
DeepSeek's format wins the award of the shittiest prompt format.
>>
File: 1731790602582178.jpg (264 KB, 1861x1408)
I feel a strong sense of déjà vu in this thread. Every damn time.
>>
>>103313710
noooo but largestral is good, if it's not good then what did I spend $5T on 5000000 H100s for?!?! It MUST be good
>>
best draft model for largestral 2 speculative decoding?

Mistral-7B-Instruct-v0.3-GGUF
or
Ministral-3b-instruct
>>
>>103314117
Isn't the new Largestral simply a cheap fine-tuning of the exact same base model?
>>
>>103313839
I used to use .bat launcher for llamacpp, but nowadays I just make use of the terminal history and swap out parameters as needed
I don't try a new model every 2 seconds so if I want to use a model I haven't used in a few days, I can just hit the up arrow a few times
>>
>>103314117
It's good according to the benchmarks and that's all the investors care about, and that's basically the whole grift, really.
Mostly money from directed retirement savings.
Boomers:
>That them newfangled AI stuff is the future. I should move some of my RRSP funds into there.
And so the fund managers don't really care whether the models know whether you can talk while a dick is tickling your tonsils they just look at whichever AI has the best meme marks and send it to that corpo.
>>
>>103314138
nevermind, ministral 3 and 8b are new and better but llamacpp doesnt support them yet
https://github.com/ggerganov/llama.cpp/issues/9914
>>
>>103314117
2nd is sota, 3rd they fucked up something majorly somewhere
>>
>>103314048
*openchats in your path*
>GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:
>>
>>103314138
Mistral-7B-Instruct-v0.3 uses the same tokenizer as Mistral Large 2407, so there are no other options here. Ministral uses a different tokenizer and 2411 has no matching small models at all
>>
>>103314204
I felt like it was an improvement overall. Although you have to nail the prompt template down (a single whitespace or lack thereof makes the result wildly different).
>>
>>103314187
>ministral 3
>>103314138
>Ministral-3b-instruct
The one on hf isn't the one you think it is
https://huggingface.co/ministral/Ministral-3b-instruct/tree/main
>8 months ago
>Finetuned from model: mistralai/Mistral-7B-v0.1
>>
>>103314150
this shit makes my brain go numb, how do i add the ctx line properly? trying to give it 32768 but it keeps telling me im doing it wrong
>>
>>103314269
show your command maybe?
>>
>>103314289
i got this shit to run by removing the "-ngl N," part of the command (how was i meant to know i shouldnt do that?) but it still didn't pass the 30 layers, even if in the command line it showed it was
anyway here's my .bat i give up, back to kobold kek
llama-server -m L3.1-Dark-Planet-SpinFire-Uncensored-8B-D_AU-Q6_k.gguf --port 8080 --n-gpu-layers 30
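For anyone else hitting this: the context size is just one more flag on that same line, e.g. (assuming a reasonably recent llama.cpp build; run llama-server --help if it complains about the spelling):
llama-server -m L3.1-Dark-Planet-SpinFire-Uncensored-8B-D_AU-Q6_k.gguf --port 8080 --n-gpu-layers 30 -c 32768
-c / --ctx-size sets the context window in tokens.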
>>
>run llama.cpp server
>try to talk with any character
>nothing happens
huh? did anyone else have this problem?
>>
As expected, because Allen AI just had to use a snowflake prompt format, it does not merge nicely with other models. Sad. Meaning you'll have to set up a tutoring pipeline if you want to distill any of that delicious smut on another model.
>>
>>103314467
only you as far as I can tell
>>
just tried tulu 3 at Q8 and it's retarded, extremely poor logic and commonsense in story prose
why do you guys consistently have the WORST, dumbest fucking taste and why do I still bother to listen to any of you
>>
>>103314566
Post log or be known as a liar. It did rpg cards and non human smut with blazing colors for me.
>>
Holy shit cpu bros we are so back!
>qwen 32b
>old llama-server 2t/s
>updated llama-server 1.75t/s
>updated llama-server with drafting 3t/s
>>
>>103314222
Did Mistral 7B w/ Mistral Large 2407 work for you? I tried earlier but got a tokenizer mismatch error. I remember someone in the last thread mentioning a patch for this but haven't seen any info about that anywhere
>>
running

llama-speculative.exe --model "Mistral-Large-Instruct-2407-Q4_K_S-00001-of-00002.gguf" -md "Mistral-7B-Instruct-v0.3-Q4_K_S.gguf"

gives me error

"main: draft model vocab must match target model to use speculation but token 10 content differs - target '[IMG]', draft '[control_8]'"

both models downloaded from bartowski

do i need mistral 7b 0.2 instead of 0.3?
>>
>>103314581
nta, but i'm curious as to what that rpg card looks like
i think i've seen(you) or someone else mention it previously
>>
>>103314581
bullshit, FUCK you
you're one of those people (and so I bet is everyone shilling this model) who, when his dick gets hard, just stops noticing when the model outputs basic logical errors or non sequiturs or describes biologically impossible body positions
>>
>>103314479
There's no need for a "tutoring pipeline", retard. They open sourced their dataset.
>>
>>103314619
https://rentry.org/CharacterProvider-CYOARPG
>>
>>103314566
There's maybe 2 or 3 active posters here that could run that at Q8 and I'm one of them. So I can call your bullshit with 99% certainty.
>>
File: no homo, yo.jpg (193 KB, 860x1290)
>>103312983
>>
>>103314611
I don't have enough VRAM for the draft model with TP. If only the draft model could be split across GPUs...
>>
>>103314595
>>old llama-server 2t/s
>>updated llama-server 1.75t/s
the fuck happened there?
>>
>>103314640
>what is cpu offloading
moron
>>
>>103314635
Nope, my first tests with this model were the usual intelligence tests, only after did I start doing erp. I think you're the anon that just tries to discourage all talk of models. Are you even using its correct formatting?
>>
>>103314670
NTA but models that fall apart without the correct formatting are retarded. That's a sign of extreme brittleness, smart ones don't care because they can generalize.
>>
>>103314685
>models that fall apart without the correct formatting are retarded
So qwen2.5, llama3.1, mistral large, gemma2 are all retarded? Huh, the more you know.
Stop talking about shit you clearly have no clue on. These models predict the next token.
>>
Anyway I sincerely believe this person is mentally ill and is a danger to themselves and others and that the mods here are terrible people for providing them with a platform. I'm not going to contribute to that anymore. Ciao. It's been a slice.
>>
>>103314483
I found the cause, llama.cpp server had an error but didn't close, so I didn't notice it.
>>
>>103314699
>These models predict the next token
Thanks for the reddit-tier insight faggot, you're really blowing my mind here. Clearly we've got an ML expert on our hands.
>>
>>103314685
Usually, models fail the most when the formatting is only slightly off. You can use almost any other formatting and the model will figure it out, but don't you dare to miss a single space.
>>
>>103314717
Claude and GPT are also retarded if you start formatting and feeding it stuff with formatting it was not trained on. Claude 3.5 will start forgetting who is who and make dumb logical errors. These models are not intelligent like you think they are retard.
>>
>>103314650
It can be in another GPU as far as I could tell from the PR
>>
>>103314685
nta, but why wouldn't you use the format the model was trained with?
clearly they are there for a reason
>>
>>103314739
>ak-akshully the model isn't stupid, because when you think about it ALL llms are stupid
this is always the final cope of someone who's been recommending a stupid model
>>
>>103314646
Sataana perkele
>>
>>103314748
Because he is retarded and is trying to pretend its not his fault. He does not know how formatting works.
>>
>>103314742
There are no options to split the draft model specifically/differently. You only have the usually --tensor-split which applies to both models.
>>
>>103314742
I use tabby because llama.cpp performs poorly on 4 GPUs
>>
>>103314760
Like I said I'm not that anon, I haven't even tried Tulu (nor am I going to). Take your meds please.
>>
>>103314222
there is a single token difference, its not the exact same vocab >>103314613
>>103314222

what command to use to let llama-server know to ignore the vocab difference?
>>
>>103314793
tabby/exllamav2 just works.
>>
had to downgrade to 4bpw to test it
>>
I might make a PR to fix this shit, I wonder if it will get merged
>>
>>103313710
Tulu 8b is better than any other model I tried including several 70b.
I'm using rocinante v2 for story development and then use Tulu for the gigantic context length with coherence and for logic and complex situations. So far incredibly good. I'm gonna try tulu 70b now but I'm on a single 4090 unfortunately.
>>
>>103314855
from my experience having my own PRs merged, it'll get grilled for overall utility and coding style and then merged fairly quickly if it doesn't cause any architectural issues.
Expect suggestions to your code and requests to also do things like update the --help output, possibly failing unit tests you'll have to fix on obscure build targets and other unknown-unknowns.
>>
I would like to use a LLM to extract data from natural text. Is CPU inferencing usable on a v4 Xeon? I don't want to buy multiple GPUs(need redundancy for my usecase)
>>
>8b is better than 70b
I see the thread is now entering the delusional mania/euphoria phase regarding this model series
Can't wait for the comedown and regret
>>
Tulu 8b is better than 3.5 Sonnet. There, I said it
>>
>>103314566
Damn, I was about to download. Thanks for the heads up.
>>
>>103314937
Better at what exactly?
>>
Can someone upload the proper ST formatting for Tulu?
I tried it after all the talk, but I'm still getting weird outputs no matter how much I fiddle with it.
>>
>>103314566
>tulu 3 at Q8
8b or 70b?
>>
>>103314974
https://files.catbox.moe/qvn0g3.json
>>
>>103314985
Either 8B (which I have not tried) or he is spreading disinfo.
>>
>>103314875
>Tulu for the gigantic context length
>Hyperparamters
>Max Token Length: 2,048
>Max Prompt Token Length: 2,048
>https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B
>Max Sequence Length: 2,048
>https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-DPO
It was trained on just 2k though (for final and DPO)

SFT was
>Max. Sequence Length: 4096
>https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT
>>
>>103315010
tulu shills in shambles
>>
>>103315006
70B, faggot
I have 36vram (3090 + 3060) + 64gb ram which is not impressive at all but means Q8 70Bs can be offloaded easily
I can't even imagine what third world shithole you're living in that a couple of used consumer GPUs and $120 worth of ram seems like so much wealth to you that it must be made up
>>
>>103314884
it's very cute, I like it
>>
This tulu being shilled is really freaking good actually. First model that can do great smut without being too retarded for more complicated stuff. Usually I have to switch between qwen2.5 and Magnum v4 when stuff gets spicy.

Fuck the troll screaming it's bad, nearly didn't try it because of him.
>>
Is tulu useful for things other than cooming? I'm not into rp.
>>
>>103314793
>>103314613
>>103314611
posted issue at https://github.com/ggerganov/llama.cpp/issues/10529
>>
>>103315088
The main thing they were showing off were the benchmarks being about on par with qwen2.5 while being based on llama 3.1 and it seems really smart. Doubt its better at coding than qwen2.5 32B coder though.
>>
>>103315098
>404 kys
>>
File: 11__00900_.png (1.31 MB, 1024x1024)
>>103315010
>Max Sequence Length: 2,048
Lmao, lol even
inb4 anons start complaining that its desperately wrapping up in the first 1-2 messages with bonds and journeys for everyone.
No wonder the astroturfers always post just a few snippets and never anything deep into context. It's nothing more than ArliAI from a few weeks ago with a fresh coat of corpo paint.
>>
>>103315098
>>103315108
i assume it 404s because i created a new account on github and needs repo team approval to be displayed
>>
>>103315105
So it should be a great general purpose big model then, likely better than qwen at general knowledge.
>>
Ok but how is the prose. Is it near Command-R tier at least, or does it devolve into X,Ying?
>>
>>103315109
No one is talking about the 8B besides you vramlet
>>
>>103315132
It's an instruct tune, so it has the same general knowledge that all llama models do.
>>
>>103315134
That is its main draw imo. Its down right filthy while still being smart unlike every finetune ive tried that was specifically for RP.
>>
Tulu makes my pp hard
>>
Tulu cured my cancer
>>
>>103315145
Just for you :)
>Max Token Length: 2,048
>Max Prompt Token Length: 2,048
https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B

>Max Sequence Length: 2,048
https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B-DPO

>Max. Sequence Length: 4096
https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B-SFT

I'd hope it's better than Qwen, it uses their outputs (and Gemma's)
>The models have been fine-tuned using a dataset mix with outputs generated from third party models and are subject to additional terms: Gemma Terms of Use and Qwen License Agreement (models were improved using Qwen 2.5).
>>
Hawk tulu!
>>
Tulu raped Sam Altman!
>>
>>103315181
Do you want novels for responses? I usually only have mine set for 250 max responses. For stuff like coding I would use qwen2.5 32B coder
>>
>>103315206
and the coping and goal shifting begins
>>
>>103315206
That's not output length, output is
>Response Length: 1,024 (but 2,048 for MATH)
>>
I think there's something weird about the Tulu shilling, it's always "Tulu is like... so smart and like... does smut really well! I never saw anything like that before!", it feels like someone gave them directions of how to shill Tulu and they didn't even try to make their shilling different from each other.
>>
>>103315206
>s-shutup, 2k is all you need
HAHAHAHA
>>
Is sam altman himself in this thread?
>>
>>103315206
Max context isn't output length, techlet. Looks like this isn't the thread for you.
>>
>>103315221
I think people are bored and pretending to shill llama finetune #654 is a way to pass the time, which is obvious given how little effort people are putting into it
>>
>>103315238
I fed it a story about 22k with my context set to 32k and it followed along just fine so what does that refer to?
>>
>>103315221
>so smart and like... does smut really well!
>it feels like someone gave them directions of how to shill Tulu and they didn't even try to make their shilling different from each other.
Of course, this is also exactly what it would look like if it was smart, did smut really well, and a bunch of anons downloaded it and reported their findings
>>
>>103315245
That's purely from the base l3.1 context then, nothing that Tulu did.
>>
>>103315247
Shh, let the shitzo have his fantasy. He is fighting evil in his head.
>>
>>103315247
if that were the case, we would be seeing more logs than just a nala test
>>
>>103315255
>the shitzo
>>
>>103315250
So if it improved the prose this significantly then who cares? Not sure what effect that has.
>>
Tulu is Claude Opus at home.
>>
>>103315247
I still didn't see a single Castlevania test, this is enough for me to tell this model is being shilled by tourists.
>>
File: file.png (69 KB, 679x322)
Why is there conflicting usage of "sequence"? I have never heard of "max sequence length". Why not just say max context size or response/output length depending on which they're talking about?
>>
File: Fine.png (131 KB, 1118x692)
>>103315261
>>
>>103315263
Tulu 8B is clearly sentient
>>
>>103315276
nta but this is uncomfortably purple prose
reminds me of the way ChatGPT tries to write smut when jailbroken
>>
Starling 7B has been dethroned
>>
>>103315276
Oh, it's just filly dude again...
>>
No one actually fine-tunes models on 120k sequence length samples, please stop being stupid.
>>
>>103315292
I have a range of cards I test new models this. This is a how descriptive of sex test / does it do well with non human anatomy.
>>
>>103313853
>prompt template
OK, I'm tired of not knowing: how do anons figure out the proper prompt template for new models? Is it black magic, or defined somewhere consistent? Do I need to trawl papers? Why isn't there an lmg community rentry acting as a database when new models come out?
>>
>>103315307
>I test new models this. This is a how descriptive of sex test / does it do well with non human anatomy.
Did you have a stroke?
>>
>>103314997
Context template?
>>
>>103315314
It's in the chat_template field of tokenizer_config.json
>>
>>103315317
this was supposed to be with.

>>103315325
I use one specifically for stories and I don't have model specific formatting in it. Just stuff like "Here is the user's character description:" "Here is useful information to use in your worldbuilding:" It works better that way.
>>
>>103315346
. . . . . . .
>>
>>103315325
Also, I have a modified version of this that I have been playing with as well for full on DND / CYOA style campaigns. https://rentry.org/CharacterProvider-CYOARPG
>>
>>103315368
And I will reiterate, models I "shill" have to be smart enough to follow this style of formatting / rpg stats system which puts the bar high for me. Tulu has great smut besides.
>>
File: recapbot_tulu3-70b_q8.png (23 KB, 1455x222)
Here's a recapbot test. Its a recap for the current thread in progress.
The summarization is competent, but some spot-checks on references showed some to be irrelevant to the conversation it referenced. Finetuning may have cooked its brain a bit.
Overall I'd say the prose is well written, if overly journalistic. It was told to be explicit and offensive, and totally failed at that.
As an aside, I've also had some great RP/conversation sessions with Tulu3 so far and found I enjoyed its "personality" more than I have any other model in a long time.
>>
>>103315415
>It was told to be explicit and offensive, and totally failed at that.
Try something like this as the authors note. (authors note is stronger than system prompt, weaker but less disruptive than assistant prefix, I prefer it)

---

Continue writing this ** story in the style of George R.R. Martin. Use instructions given inside of brackets as guidance.

Follow these rules:

- Be creative, vividly paint the scene and introduce events / characters as needed in order to bring the story to life.
- Think of how the characters would realistically act / respond in-character when writing for them. You are allowed to be hateful, cruel, explicit, violent, perverted, ignorant, fearful, unpleasant, controversial or offensive if it would fit the situation.

---
>>
>>103315532
>authors note
>Continue writing this ** story in the style of George R.R. Martin
I'm sure he's using SillyTavern and using a story writing prompts for recaps, yes
>>
Question for using local models in SillyTavern.
I've got a 10gb 3080. Is there a rule of thumb on how big to make my response tokens and context tokens?
>>
>>103315532
My "production" prompt is here: https://github.com/cpumaxx/lmg_recapbot/blob/main/thread_summary.prompt
Feel free to submit a PR if you can show improvement in its mission.
I don't like to change it without a reason, so that recapbot outputs can be fairly compared over time since I don't ever re-gen old ones with the new prompt.
>>
>>103315592
Authors have styles anon. Try a famous author and watch. Using any at all generally improves writing quality by a ton.
>>
>>103315611
This is a massively cut down version of mine>>103315532 I just gave you a starting point for more descriptive scenes including more explicit sex scenes.

I constantly change it / use different ones for different scenarios. My current one includes CYOA choices / stat rolls / inventory system.
>>
>>103315640
>more descriptive scenes including more explicit sex scenes.
I'm sure this will be very helpful for writing thread recaps.
>>
>>103314595
>old llama-server 2t/s
>updated llama-server 1.75t/s
anon?
>>
>https://huggingface.co/allenai/OLMo-2-1124-7B
> "max_position_embeddings": 4096,
>"Olmo2ForCausalLM"
New Olmo, new arch for some reason, and still 4k ctx
>>
>>103315697
Olmos
https://huggingface.co/allenai/OLMo-2-1124-13B
Also 4k
>>
>>103315668
I don't know what happened there. I was getting consistent 1.75t/s yesterday when I updated but didn't make a lot of tests, now I'm getting 2t/s doing tests.
Maybe it was just bad luck.
Also I mixed up Q5 and Q4. I'm actually getting 2.3 t/s with Q4 normally and 3t/s with drafting.
>>
File: file.png (113 KB, 522x822)
>>103315697
>>103315710
OLMo bros, are they trying to hack us?
>This dataset has 16 files scanned as unsafe.
>>
>Still no Intellect1
>still no R1
>>
>>103315809
You got OLMo 1124 and you will be happy
>>
File: file.png (60 KB, 556x492)
>>103315750
You know what models need? More Reddit
>fasttext_openhermes_reddit_eli5
>>
>>103315847
Yes?
>>
>>103315847
It is where intelligence is at after all.
>>
Tulu isn't very good, I can't believe I fell for this general meme of the month again.
>>
File: 1729804565807.png (503 KB, 602x753)
>>103315881
>>103315847
>>
Tulu is very good. Glad I didn't fall for the everything is shit but I wont post logs troll.
>>
>>103315911
You're supposed to shill OLMo as a revolution in actually open models now bro.
>>
>>103315891
Skill issue
>>
>>103315697
>No OLMoE-2/MolmoE-2
MoE bros???
>>
>>103315697
>>103315710
>4k
>2024
Dead on arrival.
>>
>>103315981
Pretty sure they released the MoE separately last time.
>>
>>103315983
Not too hard to extend it, but is it worth extending? 13B max? Prob not.
>>
>>103315010
I don't get how it gets right info from a 40k token story that not even 70b models do then. Can you explain that?
>>
>>103316013
Because llama3.1 was trained on up to 128k but starts degrading more than 32k in. Ignore that anon, it only matters for the base model.
>>
>>103313243
Everyone that got a16z grant money got lazy. TheBloke fucked off, ooba slowed down to a crawl even though it is still missing frontend controls for everything from multimodal support to YaRN RoPE scaling, etc.
>>
>>103316010
>Not too hard to extend it
To 8k, which is still not a lot.
>>
>>103315923
There it is
https://www.reddit.com/r/LocalLLaMA/comments/1h0mnfv/olmo_2_models_released/

>OLMo was the only model, period, that actually meets the Open Source Initiative's definition for Open Source AI. Not sure if that still holds for OLMo2, will have to check it out. I always find it shocking that people call Llama open source when Meta's license agreements explicitly say it is proprietary.

>They are fully open-source and therefore important for development of better models. The models are just one part of the story they share data and insight.
>>
>>103315145
I'm starting to think you didn't even try the 8b model because it definitely works great for giant context window stories.
>>
I've noticed that even with a fairly vanilla card and little starting context, tulu is writing a lot of good "colour commentary" around chats with a mild tendency to parenthesize scenes and move things forward (or rarely just wrap things up completely). I'm not sure if I like it or dislike it, but its refreshing for now.
>>
>>103316095
I like it having agency which most models lack / that are too passive / wait on the user to do something.

You can get rid of the claude style OOC comments with a good system prompt / authors note.
>>
>>103316076
nta, but my problem with 8B is that it constantly tries to talk for user even at the start of a new chat.
>>
>>103316073
There have been prior models like OLMo 1, https://github.com/multimodal-art-projection/MAP-NEO and https://huggingface.co/LLM360/K2 that also meet that requirement.
I am actually more sad that the community overlooked K2 and didn't do anything with it, because it was a good base model: they used no synthetic data and trained it to something in between L2 and L3 70B, but it got overshadowed because non-SOTA performance apparently doesn't interest people, even if it was trained under conditions that I feel are ideal. If someone just did the right fine tuning on it to make an instruct/chat model, removing all the safety stuff, we could have something quite unique without slop baked into it.
>>
>>103316132
I like more story based formats so that is no problem for me but I would try the old chat format. Tell it to only play the role of {{char}} and perhaps a narrator and that it is playing a endless back and forth roleplay with the user. I found writing quality to degrade in this format though compared to novel style.
>>
>>103316076
>"works great"
>0 proof outside of screenshots that give no indication on how deep into the context it is and a singular nala test in the same conditions
getting serious deja-vu from the reflection-llama-3.1 fiasco
>>103315276
Post the messages before and after or you're a coward
>>
>>103316150
I know there were earlier models, but if it's any consolation, It's unlikely anything great will be done with OLMo-2 either, people will post about it for a week, saying it saved local or whatever then forget it before another "totally first ever" open model drops.
>>
>>103316153
I'll try that tomorrow. And yeah, the quality degrades after some time for sure.
Ideally I like it when a model is describing user's minor actions. It's great when it works, but there are always instances when it starts to talk for you as well.
>>
That all said I pray deepseek releases the R1 weights as promised.

https://www.reddit.com/r/LocalLLaMA/comments/1h0lptv/all_problems_are_solved_by_deepseekr1lite/
>>
What's the best model to train my own stuff on top of a diffusion model - and can I use comfyUI for this?
>>
>>103316214
People won't care until it is close to SOTA. That's why similarly in the image generation camp, no one has hopped on Auraflow and people immediately en masse migrated to Flux instead.
>>
>>103316210
Holy retard.
Not only it gets the data incredibly perfect but it summarized it nicely.
You have some bias or problem with the company or model there's no other explanation.
>>
>>103316150
Not many people are willing to eat dogfood shit models just on principle. If INTELLECT-1 distributed training opens up to allow anyone to contribute and they replicate the K2 recipe with more data, that might get people excited.
>>
https://amica.arbius.ai/
>>
>>103316153
>I found writing quality to degrade in this format
>"just change your entire use case and the model is great bro, trust me!!"
holy cope. protip for you: good models don't need this level of mental gymnastics to operate well
>>
>>103316302
are these mythical "good models" in the room with us now?
>>
>>103316302
1. Reread what was said there. Using the RP format reduces quality no matter the model due to roleplays generally being less well written than novels. If you used these models at all you would know that.
2. Where did the evil allenai touch you?
>>
>>103316278
NTA but did you test if vanilla Llama3.1 Instruct can do the same?
>>
>>103316283
If it's replicating K2, I have no issues, but you can bet your bottom dollar they won't and they'll follow the trend of introducing synthetic data in where it isn't appropriate to like other models.
>>
So now that the dust has settled, should we or should we not continue to develop AI? Remember that if we keep developing AI humans will go extinct
>>
>>103316355
>humans will go extinct
and nothing of value will be lost
>>
>>103316302
Holy retard...
These models are essentially based on averages. The average roleplay is of far worse quality than the average book.
>>
>>103316366
If we have no value, then how can we create something of value?
>>
>>103316278
>Holy retard.
>>103316368
>Holy retard.
Allen bros, raise rep pen on the OLMo thread bot
>>
>>103316226
They won't, any time an OS model org realizes it's actually created something really good they suddenly go closed.
>>
>>103316326
That's a good point, I don't use vanilla llama 3.1 as my shit is all smut and vanilla didn't like that. I'll download it and try, but the good thing about this Tulu model is that so far I didn't get a single refusal (at least when writing a story, I don't know how it does with direct prompts) and it's pretty good at continuing scenarios.
>>103316388
Is this your new cope and goalpost? And btw ,you should take your meds.
>>
>>103316388
It was a honorific, you've earned it I feel. That and shitzo which I made.
>>
Hey /g/.
For those who have used coding models, where do the various models all consistently fail regardless of the model you use?
>>
>>103316409
>this guy has actually had problems with refusals from local models
this is the caliber of anon that's promoting Tulu
>>
>>103316427
Math is where models fail the most.
>>
I finally gave a draft model a go. A 70b-q8 paired with its 8b little brother as a draft model at q8 gives me up to 6.8t/s (average is more like 5.5t/s).
Latest lcpp llama-server with the 8b fully offloaded to a 3090.
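For reference, that kind of split is roughly this on the command line (filenames and exact flag spellings are assumptions, check --help on your build):
llama-server -m llama-70b-q8_0.gguf -md llama-8b-q8_0.gguf -ngl 0 -ngld 99
i.e. the 70b stays in system RAM while the 8b draft goes fully onto the 3090.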
>>
>>103316427
Hallucinating incorrect solutions and circling back in a loop and repeatedly failing. This is why I seldom use it to generate entire sections of code and I usually stick to simple snippets or functions. I really think that it needs function calling or something to have a compiler feed back to the LLM if something works so at least all the examples compile. Logic also isn't perfect. For a academic look at this, you can read this paper on what the generated code is usually missing.
https://arxiv.org/pdf/2406.08731v1
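A rough sketch of that compile-feedback idea against a local llama-server (the /completion endpoint and its prompt/n_predict/content fields are llama.cpp's; the retry limit, the prompt text, and the assumption that the model replies with bare C code are mine):
PROMPT="Write a C function that reverses a string in place. Reply with only the code."
for attempt in 1 2 3; do
  # ask the server for a completion and dump it to a file
  CODE=$(jq -n --arg p "$PROMPT" '{prompt:$p, n_predict:512, temperature:0.2}' \
    | curl -s -H 'Content-Type: application/json' -d @- http://localhost:8080/completion | jq -r .content)
  printf '%s\n' "$CODE" > gen.c
  # compile-check; on failure, feed the errors back into the next prompt
  ERR=$(gcc -c gen.c -o /dev/null 2>&1) && break
  PROMPT="$PROMPT
Your previous attempt failed to compile with:
$ERR
Rewrite it so it compiles."
done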
>>
>>103316462
Yea, its generally a 20-30% speedup. Worth it imo. Qwen has the biggest improvement with 72B/0.5B speed wise
>>
>>103316427
>where do the various models all consistently fail
Honestly, the lack of enough context to load an entire (nontrivial) codebase is the biggest failing right now.
Something like deepseek 2.5 or the latest qwen coder can solve everything I throw at them these days, although I don't press them for super crazy things desu.
>>
>>103316462
>>103316473
Though 3.2 1B might be worth trying.
>>
>>103316488
>Honestly, the lack of enough context to load an entire (nontrivial) codebase is the biggest failing right now.
The only way around it is to finetune a model on your codebase.
>>
speaking of llama-server, why they hell isn't there a delete/edit replies option?
>>
>>103316513
Qwen2.5 32B coder is the first model good enough and small enough for that imo
>>
>>103316427
The doom loop where the model has made a couple of mistakes earlier in the context, and although it later corrected them in dialogue with the user, it now thinks when it looks at the context "looks like I'm the kind of model that makes a lot of mistakes" and predicts that it will continue making them. Essentially, errors causing the model to start larping as dumber than it is unless you go back and edit them out of the conversation.
>>
>>103316513
>The only way around it is to finetune a model on your codebase.
Teach me your ways...
>>
>>103316462
What was your baseline speed? This is way faster than what I saw with Nemotron 70B Q4_K_M and Llama 3.2 1B Q8_0 running on my 3090 (1.05 tokens/second baseline went up to 1.45). My context probably had about ~1k-2k tokens in it. Wondering if your larger speculative decoding model made it faster? Or do you have more than one 3090?
>>
What's the latest meta for llama.cpp ERP if I have a 4080, 64gb ram and a 7950x?

Are there loras yet like for SD?
>>
>>103316535
Nemotron likes to start a lot of its replies with lists. That might be affecting how often the draft model gets it correct. Try Tulu, its the new nemotron imo.
>>
>>103316535
>What was your baseline speed?
cpumaxxin, so I was getting 4.6t/s without the draft model.
>>
>>103316518
With the server on its own? If you hover over the model's message you have a "Regenerate" button. I assume you're talking on the new built-in webui...
>>
>>103316548
>Try Tulu, its the new nemotron imo.
nta, but i never got into nemotron. What was its strong suit vs its llama progenitor?
>>
>>103316587
I see the regenerate button, but how can I gaslight it without being able to directly edit its responses? Or backtrack a few messages if I don't like where things are going?
>>
>>103316574
Damn, what CPU do you have?
>>
>>103316588
Much more "personable", got rid of a lot of the dryness of llama 3.1 and takes a more active role, also got a bit better at most things including coding. Not sure how tulu measures up on coding vs nemotron but it both lacks nemotrons love of lists and has probably the best prose in local atm, it gets dirty. It kind of likes OOC comments / going over how the story can be improved instead but authors note fixes that.
>>
>>103316606
reimplemented https://rentry.org/miqumaxx so 9334
>>
>>103316616
>>103316548
buy an ad
>>
>>103316596
I think you can use the previous ui if you give it the path. Or use ST. Or make your own client. Or use the old vim plugin, which is what i do (a slightly enhanced version, but still based on the original plugin).
The new ui is made for casual chat. It's meant to be simple for newbies.
>>
>>103316623
Ah, so I'm both nvidia selling a free finetune and allanai selling a free finetune, huh? Bet im also a Meta shill trying to "sell" you on llama 3.1.
>>
>>103313710
Can you show your ST instruct settings? I tried with the examples listed from last thread with an <|assistant|> suffix but it came out like shit.
>>
A corny poem in moon-runes by Tulu
https://vocaroo.com/1cqo2XNymgKp
>>
>>103316596
>>103316630 (cont)
Never mind about the legacy ui. For some reason i remembered it being just a giant textbox. Just use a proper UI if you want to do more complex things than just chat.
>>
It's a pretty cool Tuesday huh?
>>
>>103314222
What tokenizer do you set it to in ST? I don’t see a mistral v7 under advanced settings
>>
>>103316701
Nice Teto
And I'm not even a Teto guy
>>
>>103316721
You weren't, but you are now.
You've been Tetotally reformed.
>>
>>103316710
>I don’t see a mistral v7
Update your ST. You're probably not on 1.12.7
>>
>>103314611
>>103315640
At what depth are you using it? 3? 1?
>>
>>103316721
That's fair. I just use characters that are popular in the threads, personally I don't actually have any particular preferences. If anyone wants to see me using other characters for my gens I can also do that.
>>
>>103316462
What processor are you using? That’s pretty important for speed as well
>>
File: file.png (20 KB, 624x141)
>>
>>103313835
you are talking about ayumi benchmark.
>>
>>103316774
>both models support up to 4k context!
>>
File: file.png (78 KB, 991x430)
>>103316739
I just pulled and don't see it
>>
I haven't been paying attention to local models (text gen) since my foray into Wizard-30b uncensored and Goliath 120b.

What should I be playing with now?
>>
>>103316846
You should be playing with yourself.
>>
>>103316846
TULU!!!
https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B
It's all the rage right now!
>>
>>103316846
Yeah. What are the chances that you *just* came back...
You should be playing with the scroll wheel. Scroll up.
>>
https://www.reddit.com/r/LocalLLaMA/comments/1h0mnfv/olmo_2_models_released/

>For leeches like us that means little to nothing, but for people making models from scratch, this "checkpoint" can save them years of time.
The fuck does he mean? If you use Allen's checkpoint, you're not doing shit from scratch.
>>
>>103316846
https://huggingface.co/mistralai/Mistral-Large-Instruct-2411
>>
>>103316884
Buy an ad Arthur.
>>
>>103316899
:3
>>
>>103316073
>They are fully open-source and therefore important for development of better models. The models are just one part of the story they share data and insight.
this is bullshit, I don't want them to publish the training data, they should keep it private so that they can train on good copyrighted shit
>>
>>103316920
Too bad, you'll get more open Reddit slop and you'll like it.
>>103315847
>>
>>103316899
I'll buy an ad when you buy one for your 72B shilling.
>>
File: file.png (112 KB, 599x868)
>>103316794
>Thanks a lot to you + team, I really enjoy reading the papers you guys publish!

>This release is extremely significant. For those that don't know Allen AI are a research institute who are releasing completely open models. That means that all of their results can be reproduced (and improved upon) from scratch.
>>
>>103316884
>>103316915
New largestral is kinda meh. Indeed, buy an ad, Arthur. Or better yet, buy a team of niggers to make you a better dataset. Or bribe one of Anthropics employees so they tell you how to make it better. (Hint: do NOT filter base model for "toxicity".) 2411 didn't improve a lot compared to how much improvement there was between 2407 and 2402. You're still better than chinkshit though.
>>
>>103316965
Based bastards clearly also sneaked some erotica in there.
>>
New cope just dropped

>I agree, but the models are mainly intended for researchers. They're competing for the most capable fully open model, not just the most capable model. 4096 context length is likely plenty for almost all research that these models will be used for.
>>
This general is like a bunch of old ladies gossiping about what they overheard at the party next door they weren't invited to.
>>
Olmo actually seems decent. Too bad 13B is as far as they went. I really like its prose and it seems smart enough.
>>
>>103317046
Clearly you've never been to the KoboldAI Discord.
>>
>>103317020
Seems research really only needs 4k...
>https://huggingface.co/openGPT-X/Teuken-7B-instruct-research-v0.4
>"max_position_embeddings": 4096,
https://www.reddit.com/r/LocalLLaMA/comments/1h0l2qf/new_european_model_opengptx_teuken_7b/

>>103317046
Hi innominato!!!
>>
4k is a lot, goy! What kind of sick pervert would need 2k? 1k is already more than you will ever need! Humans can't remember more than 512 tokens anyway.
>>
>>103317058
Also only 4k seems terrible. Maybe someone can extend it to at least 16k
>>
>>103317069
>Maybe someone can extend it
This never works that well.
>>
>>103315271
The context window can also be interpreted as the maximum a large language model can generate. You shouldn't think of these as two separate ideas when they're interconnected. If you have a max sequence length of 8K tokens, that means your LLM can generate 8K max. If you fill the context with 4K tokens of prompt, then you've halved the amount it can generate. Understand?
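A toy illustration of the shared budget (numbers made up):

n_ctx = 8192            # max sequence length the model was loaded with
prompt_tokens = 4096    # what's already sitting in the context
max_new_tokens = n_ctx - prompt_tokens
print(max_new_tokens)   # 4096 tokens left for generation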
>>
>>103317058
You tried base I assume?
https://github.com/ggerganov/llama.cpp/pull/10535
>>
>>103316754
why isn't 4chan man green?
>>
Mistral 0.3 7b has 8k (real) context, 32k claimed.
Qwen 2.5 7b has 32k context.
Llama 3.1 has 128k context.
There is no valid reason to release models with less than 32k context in 2024.
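If you want to see what a model claims, the number sits right in config.json of the HF repo — a minimal sketch, assuming a local download with the usual layout (the path is an example, and this is only the advertised number, not the usable one):

import json

with open("Qwen2.5-7B-Instruct/config.json") as f:   # example path
    cfg = json.load(f)
print(cfg.get("max_position_embeddings"))            # e.g. 32768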
>>
>>103317119
Instruct just using huggingface to test
>>
more like COALmo
>>
>>103317123
Money. Here's hoping they get funding for a 70B with long context though. The 13B is pretty cool, I really like how it writes compared to llama / qwen / mistral
>>
>>103317069
>Maybe someone can extend it to at least 16k
Best you can do yourself is ROPEing it to 8k.
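A minimal sketch of what that looks like, assuming the llama-cpp-python bindings (model filename is an example; quality usually degrades somewhat past the trained length):

from llama_cpp import Llama

llm = Llama(
    model_path="olmo-2-13b-instruct-Q6_K.gguf",  # example filename
    n_ctx=8192,               # 2x the trained 4k
    rope_freq_scale=0.5,      # linear scaling: 0.5 compresses positions 2x (same idea as --rope-freq-scale 0.5)
)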
>>
more like olmao
>>
The llama.cpp server speculative decoding implementation has some... weird things. Why are they only accepting tokens with 90% probability? There are many situations where the top token has something like 40% probability.
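If that 90% is the p_min threshold from the PR, it looks tunable — a hedged sketch (the flag name is my reading of the PR, so verify against ./llama-server --help; filenames are placeholders):

import subprocess

subprocess.run([
    "./llama-server",
    "-m",  "target-70b.gguf",
    "-md", "draft-1b.gguf",
    "--draft-p-min", "0.4",   # draft lower-confidence tokens instead of the 0.9 default
])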
>>
File: tgtijat3ia3e1.png (38 KB, 800x500)
is we getting ai winter?
>>
>>103317200
Buy the dip
>>
>>103317171
>There are many situations where the top token has something like 40% probability.
Isn't that where the model should be rethinking what it's saying?
Sounds like a nascent hallucination.
>>
>>103317200
How do the recent releases affect that graph?
>>
>>103317200
Clearly it all ended 2023-01, there was no more AI after that.
>>
>>103317200
Zoom out.
>>
>>103317200
What? Only ~80 models released since july? Oh, no...
Also, "announced" means fuck all. Show models released.
>>
>>103313444
It's more common than not for some random person to make pull requests to add a feature based on research. The researchers are happy to have an engineer deal with the bike shedding usually. As for cleaning the code up, that can be done while in review too, FWIW!
>>
>>103317122
Noob bleached us...
But to be serious I think it was just struggling with the amount of tags. I was prompting for 5 different characters. Yes, 5, and you can guess who the fifth is from pic related. That other gen failed to even get a hint of Kurisu in.
>>
File: file.png (89 KB, 733x580)
>>103317138
>>103317123
>>103317069
>>103317058
>>103317020

>hearing y'all loud and clear! we have plans to explore context extension. with the two stage pretraining we have been using, we can pack all long context in Stage 2, so should be fairly economical.
>>
>>103317200
The source is "Allen Thompson" from lifearchitect.ai

It's the most retarded LARPer ever. He is a literally who that pretends he is some AI expert insider, look at his fucking website for fucks sake.
>>
>>103317401
Ok cool. Vramlets might be saved. Seemed far smarter than nemo in my testing and did ERP just fine.
>>
>>103317355
Are you using a controlnet or regional prompter? Much editing/inpainting before the final product?
>>
>>103317401
>hearing y'all loud and clear!
cringe
>>
>>103317444
I'm mostly only interested in seeing what cool/dumb things the AI will spit out so I almost never use stuff like that and basically most of my stuff is unedited. At most, I do some doodling and img2img/inpainting which is how I created pic related.
>>
I'm trying to use Tulu with llama 1B and get
>tgt: bos = 128000 (1), eos = 12801 (0)
>dft: bos = 128000 (1), eos = 128009 (0)
>draft model ... Is not compatible with target model ...
What gives? People said this worked but it seems the tokens have different ids.
>>
>>103317433
>did ERP just fine
You must be a quick shot to be done before hitting that context limit.
>>
>>103317355
I have never seen the orange one before. Are they reproducing through lesbian mating?
>>
>>103317510
Just played with how it wrote explicit scenes is all that means.
>>
>>103314654
Use git bisect.
>>
Vanilla Tulu Q6K got my music theory question right, but i1 and abliterated screwed it up.
>>
>>103317571
Abliteration always causes brain damage in models, I've noticed. Tulu does not need it imo though. Just feed it a little context or a system prompt like everything else.
>>
Just woke up from cryosleep. Gonna try Tulu. How much context does it support? I can't find documentation anywhere.
>>
>>103317709
Should be the same as llama 3.1, 128K
>>
>>103317566
does git include a way to combine all those .safetensors files into one file?

I made my own program to do it already, but still wondering.
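Mine is basically this, if anyone wants it — a minimal sketch using the safetensors library (directory name is an example; needs enough RAM to hold the whole model, and assumes no key collisions across shards):

from pathlib import Path
from safetensors.torch import load_file, save_file

shards = sorted(Path("model_dir").glob("model-*.safetensors"))   # example directory
merged = {}
for shard in shards:
    merged.update(load_file(str(shard)))   # collect all tensors into one dict
save_file(merged, "model_dir/combined.safetensors", metadata={"format": "pt"})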
>>
>>103317725
...
>>
>>103317062
There's a disconnect between trying to bench grind (usually 1-shot, 4K context more than enough) and trying to make a model that can hold a conversation for hours and hours. The only real intersection is in summarization of large documents and/or needle-in-haystack style tasks.
Academia mostly only cares about the former. Coomers about the latter.
The quadratic increase in resources over context length is fucking us over big time.
>>
>>103317725
git bisect is for finding bugs/regressions in code. I was talking about using it on the llama.cpp codebase to find what commit caused anon's slowdown.
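Something like this, sketched in Python around git bisect run (the good commit hash and bench.sh are placeholders; bench.sh should rebuild, run llama-bench, and exit non-zero when t/s drops below your threshold):

import subprocess

def git(*args):
    subprocess.run(["git", *args], check=True, cwd="llama.cpp")   # repo path is an example

git("bisect", "start")
git("bisect", "bad", "HEAD")                         # current build is slow
git("bisect", "good", "<last-known-fast-commit>")    # placeholder hash
git("bisect", "run", "./bench.sh")                   # git re-runs the script on each candidate commit
git("bisect", "reset")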
>>
>>103316524
>"looks like I'm the kind of model that makes a lot of mistakes"
>causing the model to start larping as dumber than it is
Holy shit that's funny
>>
Is a 4090 ideal? Or is there a better GPU built for local models? Sorry, I haven't built a PC in a while and I'm looking for just a basic idea of how things are right now.
>>
>>103317588
Probably true. I was just checking them all since I have a little time tonight and I'm trying to put together a more deterministic and reliable set of tests than what I was doing before.

Now looking into some RP (not ERP), and it's gone a few turns without being immediately dumb, which is a hopeful sign. I'd much rather a good Q6K than a lobotomized IQ3 on Largestral.
>>
>>103317800
A6000 or A6000 ADA
>>
>>103317571
>i1
You mean an imatrix calibrated version of Q6K or something else?
>>
What is a decent model that can do near-real-time for conversation on desktop hardware?
>>
How can I see token generation speed with llama-server?
>>
>>103317800
No, use the desktop GPU stuff; the A6000 compute GPUs aren't even better (usually) and are probably made for use as 1 of 10,000 in a large compute cluster.
>>
>>103317797
it is funny yeah, even the big commercial models do it and it seems fundamental to the fact of these things being essentially probability engines
idk what can be done about it
>>
>>103317832
I was told that mradermacher lists his imatrix editions as "i1", having planned ahead for an "i2" in case the technique changed, but it never did.
>>
>>103309106
>imaginary woman
>she's a retarded loudmouth
may as well stick to real women, they're already retarded loudmouths.
>>
>>103317868
Well, I mean, ikwakrov (or something like that) abandoned the project and he was the one responsible for the quants.
>>
>>103317868
Sorry I was referring to the Q6K part. Were you testing Q6K for all of them? I've always wondered if imatrix was actually a good thing or not. Bartowski seems to use imatrix by default for his models too.
>>
>>103317922
>>103317922
>>103317922
>>
>>103317924
Zero relation between mradermacher choosing a name for his quants and ikawrakow
>>
>>103317833
Without gpu? And is "syntactically correct sentences" good enough? I like olmoe instruct for ridiculously fast. Haven't tried llama3.2-1b, but it's probably fine as well. Olmoe has little context. Llama has a lot. If you want textbook stuff, phi-mini may be fine as well.
ibm also released the granite moe models. They're faster than olmoe, but dumber.
If you have pretty much any gpu, any 8B model will be fast enough. A few seconds at most.
>https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
>>
File: speed.png (4 KB, 674x158)
>>103317847
In the terminal where you launched it when it's done generating. Or use llama-bench.
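You can also read it off the API response; the /completion reply includes a timings object — a minimal sketch (field names as I remember them, default port assumed):

import requests

r = requests.post("http://127.0.0.1:8080/completion",
                  json={"prompt": "Hello", "n_predict": 64})
t = r.json()["timings"]
print(f'prompt: {t["prompt_per_second"]:.1f} t/s, gen: {t["predicted_per_second"]:.1f} t/s')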
>>
>>103317953
nice, thanks
>>
>>103317855
Alright, I'll look into stuff. I'm listening to a video right now.
>>
how the fuck did I break my instruct mode?

before in koboldcpp I could just start a new session and type "You are an expert C++ programmer that is going to give me free advice." And it would play along.

Wizard 30b uncensored
>>
>>103317929
All three were Q6K, though the vanilla was on bartowski.

Speaking without any science, I have positive memories of i1's, but that's shaken by this test. Maybe it's just the wrong set up for this model. I never paid much attention except that I figured it probably helped strong quant jobs like IQ3.
>>
>>103318025
>Wizard 30b uncensored
What is it with all these time travelers from the past?
>>
>>103318025
For coding use qwen2.5 32B coder
>>
>>103318039
>though the vanilla was on bartowski.
bartowski uploads imatrix quants tho
>All quants made using imatrix option with dataset from here
>>
>>103317973
Oh I see, the gen has to complete while I kept cancelling them.
>>
>>103318039
>>103318068
Yeah this is weird. Maybe mrader is using a worse calibration dataset or maybe his dataset is optimized for other things that didn't happen to give good results on the tests you did.
>>
>>103318039
I had a bad experience with an I quant before and have avoided them since. Was never sure if it was just a singular bad quant.
>>
>>103318052
They see /lmg/ on the upswing and all come crawling back
>>
>>103317951
I was talking about how a new "imatrix version" was never released, but I guess you meant "edition" as in him using another calibration dataset, sorry.
>>
>>103318068
>bartowski
The only time he uploads static quants is when it's under the lmstudio account
https://huggingface.co/lmstudio-community/Llama-3.1-Tulu-3-70B-GGUF
>>
>>103318068
Probably, I grabbed what showed up near the top of the HF search. But I'm out of time to horse around with testing a bunch more models so maybe this weekend I'll compare against bartowski imats.

>>103318080
IQ or i1? I'm vramlet so IQ3 has done some lifting for me.
>>
I support OLMO because at least they are honest about their model working in 4k ctx range. Not like other companies that say it is 32k but shit falls apart after 2 messages.
>>
i'm thinking about just becoming journey-pilled and continuing using tulu. actually uses humor. decent conversationally as well. wish the other prose didn't have SO much slop.
>>
>>103318067
I'm not coding, it was just an example. The models don't do this anymore; they react totally differently, and it's probably due to some change in koboldcpp or something, either that or my configuration.

I need to be able to issue simple instructions like that to them.
>>
I dunno, it may just be the honeymoon phase, but after trying Tulu: it is slopped, but at the same time it has weirdly natural-sounding smut.


