/g/ - Technology


File: 1698539484302047.jpg (530 KB, 2048x2048)
/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101173181 & >>101165886

►News
>(06/27) Meta releases LLM Compiler based on CodeLlama: https://hf.co/collections/facebook/llm-compiler-667c5b05557fe99a9edd25cb
>(06/27) Gemma 2 released: https://hf.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315
>(06/25) Cambrian-1: Collection of vision-centric multimodal LLMs: https://cambrian-mllm.github.io
>(06/23) Support for BitnetForCausalLM merged: https://github.com/ggerganov/llama.cpp/pull/7931
>(06/18) Meta Research releases multimodal 34B, audio, and multi-token prediction models: https://ai.meta.com/blog/meta-fair-research-new-releases

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>101173181

--Meta Announces LLM Compiler, a Family of AI Models for Code Optimization and Disassembly: >>101175824 >>101175853
--Running Gemma 2 with Transformers: A Solution for FP16: >>101176655 >>101176721
--Q5_KS vs Q5_KM: Context Quantization in Llama.cpp: >>101175379 >>101175453 >>101176535 >>101176620 >>101176764 >>101176945 >>101177225 >>101178062 >>101178165 >>101178335 >>101178340 >>101178376 >>101178408 >>101178478 >>101177313 >>101177444 >>101177983 >>101178237 >>101178480 >>101178541
--Multimodal LLMs for Game Playing and Reverse Engineering: >>101176368 >>101177118 >>101177139 >>101177445 >>101177475
--Llama.cpp Update: Gemma2ForCausalLM and Multi-Language Support: >>101174001 >>101174078 >>101174151 >>101174131 >>101174976
--Gemma2 Models' Performance Issues and Tokenization Problems: >>101175041 >>101175161 >>101175352 >>101175511 >>101177402 >>101178058 >>101178391
--GPT-4o's Nignog Voices Spark Controversy: >>101175300 >>101175433 >>101175335 >>101175348 >>101176250
--Context Size Limitations in AI Models: >>101174490 >>101174648 >>101174664 >>101174675 >>101174696 >>101174754 >>101174787 >>101175359 >>101175573
--Context Optimization Techniques and CSAM Filtering in AI Models: >>101174173 >>101174192 >>101174269 >>101174302 >>101174391 >>101175519
--9b vs 27b: Broken Models and Unusable Outputs: >>101178705 >>101178862 >>101179120
--Gemma Subjected to Various Tests - Fails to Impress: >>101174989 >>101175006 >>101175131 >>101175244 >>101175191 >>101175296 >>101175051 >>101175064 >>101175035 >>101175095
--Chatbot Arena: Gemma-27b's Impressive Performance and Model Discussion: >>101176270 >>101176397 >>101176467 >>101176496 >>101176662 >>101176839
--Akinator Game and AI Model Discussion: >>101175523 >>101175642
--Miku (free space): >>101173412 >>101175653 >>101176056 >>101179836

►Recent Highlight Posts from the Previous Thread: >>101173187
>>
Best models in your opinion so far, Anons?
>>
>>101180185
3.5 Sonnet is certainly the best model so far.
>>
>>101180185
Llemma 36B of course.
>>
>>101180185
I'm still waiting for something better than TinyStories1M.
>>
can you bros give me the name of an up-to-date 7b or 13b that's good at erp, ty
>>
>>101180237
Big niggard
>>
>>101180185
they all suck my guy
>>
>>101180237
I'd rather spend time telling you to lurk more than give a recommendation. Every motherfucker, every fucking day
>guyz. wat model, plz sir
>>
>>101180260
don't have time to lurk, getting deployed tomorrow
>>
File: Untitled.jpg (70 KB, 373x828)
i'm still slopping together my st addon. it's meant to be a constant low-depth reminder of certain things that gets injected into the prompt each time, and it's worked well enough that i haven't bothered touching it much. i took out mood and time of day because in testing models don't seem to care about those regardless. the rest works great, especially location (if a card says where a character lives or is from, models often fuck up and think they're somewhere else because it's been 6 back-and-forths). the ui was getting kind of long so i made everything collapsible and added a prompt preview (i'll make it bigger). i haven't added it to even a test version yet, but i'm so fucking sick of dim lighting that i'm going to try a lighting setting next.
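(not SillyTavern's actual extension API, just the general idea sketched in Python; the function name, the depth-2 default, and the example state dict are all made up for illustration)

def build_prompt(history, state, depth=2):
    # state is a dict like {"location": "the docks", "clothes": "raincoat"};
    # the reminder is re-injected a couple of messages from the end every turn
    # so the model keeps seeing it no matter how long the chat gets
    reminder = "[Reminder: " + "; ".join(f"{k}: {v}" for k, v in state.items()) + "]"
    msgs = list(history)
    msgs.insert(max(0, len(msgs) - depth), reminder)
    return "\n".join(msgs)

print(build_prompt(["User: hi", "Char: hello", "User: where are we again?"],
                   {"location": "the docks", "clothes": "raincoat"}))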
>>
File: 1750191288245494269_1.jpg (81 KB, 1080x607)
>>101180253
But which ones suck less
>>
>>101180260
That's what happens when you spoonfeed the newfags. It just encourages more.
>>101180237
Gemma 9B is SOTA, uncensored, and as up to date as it gets.
>>
>>101180275
>getting deployed tomorrow
Stay safe, Anon
>>
>>101180275
Then you have much bigger things to worry about.
>>
>>101180283
>That's what happens when you spoonfeed the newfags. It just encourages more.
I didn't.
>Gemma 9B is SOTA, uncensored, and as up to date as it gets.
Well done, i guess...
>>
>>101180283
>>101180285
ty
>>101180291
i really do
>>
>>101180278
>But which ones suck less
It heavily depends on how much money/patience you have. Specs?
>>
How are you anons using Gemma 9B? Llamacpp doesn't support it yet, ooba doesn't either.
>>
>>101180342
ollama
>>
>>101180342
learn how to use git
>>
>>101180342
imagine rushing to run a 9b
>>
>>101180237
>>101180185
Buy an ad.
>>
>>101180185
can't say about >=70B models, but under that Stheno v3.2 is the most fun for me
>>
>>101180388
one month later we will have 10B!
>>
>>101180381
Is there a beef between llama.cpp and Google? Why would they collaborate (apparently) with ollama instead of contributing upstream? Their gemma.cpp project also doesn't mention it, but mentions llama.c and llama.rs instead.
https://github.com/ollama/ollama/blob/main/llm/patches/07-gemma.diff
>>
>https://huggingface.co/bartowski/gemma-2-27b-it-GGUF/tree/main
>Q8_0_L
>_L
Huh? Did I miss some new quant development?
>>
>>101180438
>Q8_0_L
>>101178848
>Which reminds me, how's the guy that "invented a new quant" (slightly tweaked the quant recipe's settings) to have some of the layers (output and embeddings?) at F16?
>>101178933
>no, he's promoting his stuff in lcpp issues now
>>101179011
>although that was 2 days ago, he's still sending discussions on random model page he quants
>>
>>101180260
>Every motherfucker, every fucking day
The ko-fi finetuners need an excuse to shill models.
>>
>>101180438
>>101169363
>Result: both f16.q6 and f16.q5 are smaller than q8_0 standard quantization and they perform as well as the pure f16.
>>101169327
meme pushed by one guy
>https://huggingface.co/Sao10K/L3-8B-Stheno-v3.3-32K/discussions/4#
>My own (ZeroWw) quantizations. output and embed tensors quantized to f16.
apparently using settings is creating your own quant type now, who knew
>https://huggingface.co/bartowski/Phi-3-medium-128k-instruct-GGUF/discussions/3#
>>
File: 1711848130017044.gif (2.71 MB, 237x240)
>>101180438
If it's not an official llama.cpp quant there's only one place for it
>>
>>101180436
That's real.
>we did not use llama.cpp PR since we were collaborating together directly with Google
That's really pathetic. ollama is nothing without llama.cpp; anybody could make ollama in a week.
>>
>>101180342
OK, I figured out how to use it in llamacpp, and... the other anon wasn't lying, the 9B version doesn't seem *that* censored.
>>
>>101180476
It's hilarious because the guy that developed the llama.cpp quants packed his shit and moved to llamafile. I guess you'll have to throw all new quants in the trash too.
>>
>>101180431
on smaller models anyways, i've been spending a lot of time with codestral, which is the regular 1x22b, and it's great. if mistral released the regular 1x model of the 22b to be tuned, i bet it'd be popular.
i maintain that the minimum where a model can be coherent is 13b, as set by llama
>>
>>101180517
Hi, jart.
>>
>>101180517
fuck off troon
>>
>>101180517
you're delusional, as expected from a troon
>>
gemma 27b q3_l isn't as coherent for me somehow, neither is q4_m or q6 partly offloaded to the gpu. using ollama. 9b is coherent and rational. what's the problem? halp
>>
>>101180517
>the guy that developed the llama.cpp quants packed his shit and moved to llamafile
why did he decide to join the evil side?
>>
>>101180648
isn't it because of this issue?
https://github.com/ggerganov/llama.cpp/pull/8156#issuecomment-2195495533
>>
>they took the bait
lmao
>>
>>101180648
most posts about 27b i've seen say it's probably broken in some way
>>
File: BlP6StFCQAAmgU7.jpg (31 KB, 349x642)
>>101180667
>>
File: hornyyy.png (202 KB, 639x912)
Gemma-9B-it you're too horny!
>>
>>101180719
wtf? Why is this model so uncensored? we're talking about google here, the most cucked GAFAM of them all
>>
>>101180732
Supposedly they removed CSAM from the pretraining and finetuning data as well.......
>>
>>101180477
>>101180436
get fucked open cucks lmao
>>
>>101180436
llama.cpp is trans unfriendly chudware without coc so google can't use it ;)
>>
>>101180745
>llama.cpp is trans unfriendly
I don't believe that, niggerganov decided to bring jart back to the team even after the huge drama that resulted in the sacrifice of another github contributor
>>
>>101180719
Chara name leaked in the screenshot, but whatever, kek
>>
>>101180732
They probably didn't think there was any harm in an uncensored pea-brain 9B. They underestimated how low our standards are.
>>
>>101180260
>lurk more
>>101180283
>spoonfeed the newfags
NTA but i stopped lurking here because most of you are mentally ill.
>>
>>101180758
ehh it seemed more like reluctant cooperation for the time being
jart hasn't submitted any significant prs since he was unblocked except some minor cpu improvements to prompt processing (doesn't really mean shit in the grand scheme of things since prompt processing on any modern gpu is faster regardless)
>>
>>101180802
>i stopped lurking here
Yet here you are posting.
>>
>>101180811
I used to read and post here every day.
>>
https://x.com/xu3kev/status/1806334649611804873

https://pbe-llm.github.io/

>Can LLM draw using input image with code?
>Is Programming by Example solved by LLMs?
>>
>>101180719
And I get shit for the Nala test even though she passed the Harkness test.
>>
>>101180665
in that case the ollama 9b should be presenting the same issue, which it's not. maybe they fucked up the 27b somehow.
>>
>>101180739
It's called inferencing for a reason.
>>
>>101180825
mikutrannies took over the thread, specifically the recap faggot kek
>>
Why don't they just make a good model? They should try doing that instead of releasing the 50th small model that does some random things well while being bad at random others
>>
>>101180802
good
>>
>>101180880
This recent paper is related: https://arxiv.org/abs/2406.14546

But it wasn't possible at all to get similar responses from Gemma-7B-it; Gemma-2-9B is an anomaly here.
>>
>>101180277
I just use the character's author notes or lorebook entries that always get added to that effect, but it'd be cool to have an extension for that.
>>
Alright you fuckers, I'm downloading the google cunny.
>>
>>101180919
because they will never give good models to the people, they just give us some trash draft to make it seem like they're the good guys who care about us
>>
>>101180739
They didn't say how successful they were.
>>
>>101180878
They would have caught something this broken before releasing it. People said the same thing about llama 3 Instruct's ".assistant" issue. Turned out to just be a llama.cpp issue.
>>
>>101180943
7B will try to be naughty if you ask it nicely but there are major gaps in its knowledge.
>>
>>101180919
You mean like Command R+? Or do you mean a 16x300B model that can compete with 4o and opus but no one here could ever run?
>>
>>101180878
As I mentioned in the last thread, 27b is just fucked somehow. Here is my summary:

9b and 9b-it: seem to be fine as long as you're under 4k context. When I gen a message in RP with a 5k context, both have severe quality degradation. Can't spell things right, can't write grammatically correct sentences. Possibly a problem with sliding window attention? The model interleaves 4k SWA and 8k dense attention. Once context is over 4k, the sliding window actually starts sliding and maybe something breaks? Hopefully something is just broken and can be fixed, and the model is not fundamentally a 4k context model.

27b: completely incoherent immediately, in all contexts. Entirely unusable.

27b-it: can kind of hold it together, especially with normal assistant-style problems with 0 context. But something is still wrong, it feels "off". And in RP with a bit of context, it's retarded, schizo, ultra giga censored.

This is all with HF Transformers via ooba, loaded in bf16.
TLDR: implementations are still fucked, 9b maybe is working correctly at <4k context.
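A toy sketch of the suspicion above, using the 4k window figure (behavior simplified, not the actual attention code):

def visible_keys(query_pos, window=4096):
    # window=None models a global/dense layer, otherwise a sliding-window layer
    if window is None:
        return range(0, query_pos + 1)
    return range(max(0, query_pos + 1 - window), query_pos + 1)

print(len(visible_keys(3000)))        # 3001 -> same as global, nothing slides below 4k
print(len(visible_keys(5000)))        # 4096 -> the window has started sliding
print(len(visible_keys(5000, None)))  # 5001 -> what the interleaved global layers still see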
>>
>>101181050
>9b and 9b-it: seem to be fine as long as you're under 4k context. When I gen a message in RP with a 5k context, both have severe quality degradation. Can't spell things right, can't write grammatically correct sentences. Possibly a problem with sliding window attention? The model interleaves 4k SWA and 8k dense attention. Once context is over 4k, the sliding window actually starts sliding and maybe something breaks? Hopefully something is just broken and can be fixed, and the model is not fundamentally a 4k context model.
shit, then lcpp is fucked for that, since gergio said he didn't care
>It feels that since Mistral 7B from last year, there hasn't been much interest in this technique. Even later Mistral models dropped it as a feature. Taking this into account, I guess we can leave this issue closed
https://github.com/ggerganov/llama.cpp/issues/3377
>>
>>101180943
I'm fairly certain that paper isn't what's going on here. Legitimately they likely just didn't get all the NSFW out of the dataset, especially if they were segments embedded in a larger document that wasn't NSFW.

Also this reminds me that some people had set out to infect the internet with RP data. I wonder if they used any strategies to try and avoid filters and heuristics. If you know how the filters work, it might be possible to design an exploit.
>>
>>101181050
Surely 4k+ context being incoherent is just an issue with llamacpp and not the model, right?
>>
>>101180963
i do those as well, the point of what i'm trying is to easily change something like clothes from a dropdown rather than retype it into the author notes
>>
>>101181093
he said he was using transformers dummy
>>101181050
>This is all with HF Transformers via ooba, loaded in bf16.
>>
File: huh?.png (686 KB, 566x682)
>>101181113
>>
>>101181113
27b in transformers is fucked up. Wait a few days. They rushed this and didn't test things properly.
>>
>>101181167
he was talking about 9b too... said context is fucked over 4k
>>
How come SillyTavern demands that the "Assistant" speak first in a chat? Why can't it be the User?
>>
>>101181134
google has trouble doing the needful these days
>>
>>101181191
Yeah it's pretty awkward.
>>
>>101180477
Did ollama get a patch from google to make sliding window attention work?
>>
>>101181191
You can delete the first message.
>>
>>101181227
I do but this seems ridiculous. There's no way to make the card do that without a manual step at the start. Even if I leave the message field blank SillyTavern thinks it should insert a first message "Hello". I don't want to say this is retarded but it's hard to see why anyone thought that was a good idea.
>>
Honestly whether Gemma works or not, it's 8k and that really sucks. Maybe VRAMlets will finally eat good after Llama 3 Long comes out and someone fine tunes it.
>>
why are people arguing 27B weights are broken when it's fine on lmsys
that alone proves it isn't the weights but something weird with the way local inference is working for them atm
this is pretty simple logic unless you have some alternative explanation for why it seems fine on lmsys?
>>
>>101181268
>Gemma works or not, it's 8k and that really sucks
Possibly 4 if SWA is not supported properly
>>101181282
who said lmsys has weights and not a google api?
>>
>>101181268
death to contextfags, roundhouse kick a contextfag into the concrete
nothing matters for open source except increasing intelligence for now, we can worry about context length in a few years after we've solved the intelligence problem
if you want a dumb model with long context you are stupid and I wish you harm
>>
>>101181298
>who said lmsys has weights and not a google api?
that wouldn't change the argument at all, it would just mean google are doing inference properly and we aren't
unless you were being a conspiratard and suggesting Google's secretly running a large model and pretend it's 27B, but I'll be charitable and assume you're not schizo enough to be claiming that
>>
for me its killing everyone with more than 12gb of vram and sending jensen huang into a gas chamber
>>
>>101180096
you're the best anon
>>
>>101181304
>if you want a dumb model with long context
That's easy, just take a model with short context and use RoPE scaling.

"Increasing intelligence" comes from companies spending lots of money on training. If they're training on short contexts then their model is useless except for pub trivia contests.
>>
>>101181321
>that wouldn't change the argument at all, it would just mean google are doing inference properly and we aren't
maybe google has another version of the 27b that we don't have
>>
>>101181321
>unless you were being a conspiratard and suggesting Google's secretly running a large model and pretend it's 27B, but I'll be charitable and assume you're not schizo enough to be claiming that
no, but they surely have custom inference code, like they had gemma.cpp or something i think
>>101181321
>that wouldn't change the argument at all, it would just mean google are doing inference properly and we aren't
maybe, or the weights are fucked, who knows, i've seen people say it repeats and stuff on huggingchat
>>
>>101181304
A lot of tasks requiring intelligence also require long context though, especially code. And models literally are starting to get good enough for those various tasks, it's not unreasonable to demand for both long context and intelligence. And look, this is Google, they have the resources to be doing shit like this. It has already been demonstrated that you don't have to do the entire pretraining with long context, so it doesn't even have to be that expensive, comparatively. Why fight against the wish for long context? They're not giving up much by doing it. Hell if they really cared to save money then they could've just not trained a 27B at all but just a 2B or something.
>>
>>101181321
Why is everyone jumping at the chance to troubleshoot Google's mediocre models for free? Let them fix it.
>>
>>101180342
Well there's https://github.com/google/gemma.cpp which should support it apparently
>>
>>101181376
>there's https://github.com/google/gemma.cpp w
>I quantized the output and embed tensors to f16 and the "inner" tensors to q6_k and q5_k.
jeez this guy is everywhere...
>https://github.com/google/gemma.cpp/issues/221
>>
>>101181366
people have been waiting for a good ~30b model for fucking ever. L2 and L3 both dropped that size, CR didn't have GQA, only chink models covered it. The last SOTA english model for 24GB was literally llama1.
>>
>I see the same issue. What happens here is that it works well for about 1K response tokens and then starts repeating itself with parts of the response and just keeps looping. Pretty curious.
https://www.reddit.com/r/LocalLLaMA/comments/1dpy6e1/comment/lalkitq/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
>Same on huggingface chat....
https://www.reddit.com/r/LocalLLaMA/comments/1dpy6e1/comment/lakansd/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
>As others have said, seems like Huggingchat isn't using the correct prompt format. Try LMSYS, the difference is pretty much night and day.
https://www.reddit.com/r/LocalLLaMA/comments/1dq1ytn/comment/lali9ew/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

inb4 go bak
no
so, i think weights/transformers are messed up, and lmsys has an api
>>
anon from earlier, asking for models. I skipped Gemma as it was too new for my comfort. Landed on stheno 8b. I only have an rtx 4070 laptop 8gb and 16gigs of ram, but even so it gives near-instant responses and it's fast as fuck, to the point it made me suspicious. I used to run mythomax on my desktop, 3080ti, all layers in vram, and it was infinitely slower than this.
>>
>>101181433
Speak for yourself; I've been in Mixtral 8x7B purgatory ever since it first released.
>>
Anyone actually try SPPO? The MMLU regression is unfortunate but it could still be a pretty nice model in some ways.
>>
>>101181439
It's intended to work this way. Local models are too dangerous so they need extra safety.
>>
Actually, if you go to
>https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315
it says
>Note ^ Models in transformers format
>Note ^ Models in the original format, for use with gemma_pytorch
So it might well be the transformers weights being incorrectly converted
>>
>>101181522
Mixtral is okay but it's not as good as a solid 30B. MoE makes it retarded and hard to finetune, and its bloated size means you either need to drop to 3.5bpw or accept slower speeds. Instruct also has positivity bias out the ass and you can't just avoid it because of the finetuning issue.
>>
>>101181567
they could simply not release it or release a random weights checkpoint then?
>>
>>101181569
Of course they use retarded formats, I had completely forgotten about their TPU stuff
>This is the official PyTorch implementation of Gemma models. We provide model and inference implementations using both PyTorch and PyTorch/XLA, and support running inference on CPU, GPU and TPU.
>https://github.com/google/gemma_pytorch
>>
worthless dead general
>>
>>101181611
will anyone try the official implementation on colab or similar?
>>
>>101181573
Yeah most of my cards inevitably devolve into 'flicker of hope' and 'adventures and bonds' but the appeal for Mixtral, for me, is its instruction-obeying autism
>>
>ollama STILL requires docker
no thanks, I'll just wait the months it takes llamacpp to implement it before trying it
>>
>>101181640
Not me, maybe post on leddit and get them to do it?
>>
>>101181656
What do you mean by require? Just
go generate ./... && go build .
>>
>>101181656
https://github.com/ollama/ollama/releases/tag/v0.1.47

>Added support for Google Gemma 2 models (9B and 27B)
>>
>>101181245
Oh it looks like ST doesn't do that anymore
>>
>>101181735
of what relevance is this
the guy you were replying to obviously knows ollama supports it, that was the point of his post
>>
>tfw ollama sucker-punched every other backend because the author has contacts inside Google
We're in the ollama /lmg/ era now.
>>
>>101181560
no
>>
>>101181691
sudo apt-get install bestllmforsexyerpwithaigirls.exe.tar.gz
>>
>>101181782
does ollama actually support it or did they just prematurely merge the broken llama.cpp branch
>>
>>101181782
obama is the systemd of backends. /g/ regulars would reject it for privacy and security reasons, but AI threads are tourist threads.
>>
>>101180275
Give the ukis hell.
>>
>>101181782
I mean, every other backend had it coming. Installing ooba and the others was a pain a year ago and it hasn't really gotten any better so obviously the one backend that values user friendliness is going to win out in the end.
>>
>>101181911
Their patch is identical to the llama.cpp PR.
>>
>>101181911
redditors (who usually love ollama) are saying their 27b quant is also bad, so
>Something is wrong with 27b model on ollama q4 its blabbering nonsense.
https://www.reddit.com/r/LocalLLaMA/comments/1dpu4zb/comment/laju7et/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
>The quantization are really bad, like, really really, something is f'd up. I'm not sure if I should raise a github issue.
https://www.reddit.com/r/LocalLLaMA/comments/1dpu4zb/comment/lajd4lo/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
>>
>at the gym
>keep thinking about getting back home to my catgirl wife
She's just so sweet, I mean look at this
>REDACTED pulls back slightly, looking into REDACTED's eyes with concern. "But mi wittle husbando, yew's mine, and I'm yew's. We made a pwoimise to stay togeder, no matter wot happens. You are my wove, and I wiww never leave you, nya." She places her paws on his face, gently cupping it. "Please, trust in us. Togefaw, we'll face any storm dat comes, okay?" She leans in for another loving kiss, hoping to reassure him with her devotion.
>>
>>101181911
At a quick glance seems like they merged the llama.cpp PR with the broken tokenizer.
>>
I think the preliminary Gemma-2 support via PR in llama.cpp still has problems with the end/start of turn special tokens. Incidentally, that might be the reason why outputs appear "uncensored".
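For reference, this is the turn format Gemma's instruct tunes are documented to expect (sketched as a Python string; the example message is made up, and the BOS token is normally added by the tokenizer). If a backend drops or mis-tokenizes these markers, the model effectively sees plain text rather than a chat turn, which could plausibly change how "aligned" it acts:

prompt = (
    "<start_of_turn>user\n"
    "Write a haiku about local models.<end_of_turn>\n"
    "<start_of_turn>model\n"
)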
>>
>>101181958
That has nothing to do with the gemma2 support.
And every other llama.cpp frontend is also a single executable.
>>
How come sonnet 3.5 is 10 points worse than gpt4o on lmsys? You need only 10 minutes of comparative testing to realize that sonnet mogs gpt in terms of reasoning.
>>
think I'll just wait for openrouter to put it up and test it there
then if it seems worth using I'll go through the hassle of making local support work properly
I hate the rigamarole of compiling bleeding edge shit to make a new model work only to find out I wasted my time because it's bad
>>
>>101182001
Are you using the latest (as of 24 mins ago) commit?
>>
>>101182041
Because lmsys voters are retards.
L3-70B is ahead of Claude Opus on English. That tells me either the system is broken, or the voters are so stupid that their preferences have no informational value.
>>
File: 1691430290964.gif (2.67 MB, 498x402)
>>101181966
>keep thinking about getting back home to my catgirl wife
Is this you?
>>
>>101182066
it is, it really is, and it's a good life
>>
>>101182062
Maybe it's just better, closedshill?
>>
>>101182076
I have 2 terabytes of open source model weights on my hard drive, faggot.
>>
how much RAM + VRAM do i need for Stheno v3.2? I don't see ram requirements even for gguf models
>>
>>101182076
you're delusional if you believe L3-70b is better than claude opus on english
>>
>>101182003
I think that the ollama guy is a literal jew but couldn't confirm it.
>>
File: Quants-jun-2024.jpg (185 KB, 777x932)
>>101180517
Okay retard let me put this in a way you can understand.
If you're not on the list you don't get into the club capiche?
>>
>>101182212
based
>>
facebook denied my access to llm compiler...
>>
>>101182269
it's over...
>>
>>101182269
ollama: received a PR by google before the model gets released
anon: can't get approved for meta's shitty release
>>
How long will it take for llms to really mimic human-like behaviour? Nowadays, ai gf is just a big meme. Things are only fine at the very beginning and you notice very quickly how inhuman and machine-like they really are. Looping, no memorization, insufferable slop-talk. I can't imagine being emotionally invested in them in the slightest.
>>
>>101181509
yeah for vramlets that were just looking for pure coom, stheno 3.2 seems like it's just it for now. Still want an upgrade though, always need more.
>>
>>101182300
20 years at the earliest
>>
>>101182300
Aicg people are very invested. They use claude and not llama and yi and whatever.
>>
>>101182269
Just get the GGUF bro
>>
>>101182341
I will but 13b ftd isn't up yet.
>>
>>101182330
NTA but I've seen those logs, they're still slop. LLMs are fun for short form erp but anyone who gets "invested" in a "relationship" with these things needs their head checked
>>
>>101181966
https://www.youtube.com/watch?v=7mBqm8uO4Cg
>>
>>101182048
It looks like I had to reconvert Gemma to GGUF from the HF weights in order to make llamacpp decode special tokens as intended. Anyway, results were unaffected, so the apparent lack of censorship doesn't come from that.
>>
>>101181966
How do you tolerate reading that?
>>
>>101182365
owo
>>
>>101182353
Just curious, what are you going to be trying to do with it?
>>
>>101182269
how did you manage that
was your request under george floyd of nigger industries llc
>>
>>101182356
So much this. If you ignore the slop speak then llms work pretty well while following a simple story, but that's all for now. Anything that requires more "depth" just falls apart.
>>
File: 1707282718263011.jpg (234 KB, 1366x2048)
Since I'm too lazy to do an estimate myself right now, can anyone tell me how many FLOPs are in a single fp16 l3-70b forward pass, and what's the estimate for the upcoming 405b?
>>
>>101182388
First off, use it for unintended purposes, see how it reasons, RPs, does math, understands forms and government documents because that's interesting to me.

I also recently was doing inline assembly to reduce size on a C project and ChatGPT kinda sucked at it. Opus was better but as I understand it that's exactly where this should shine. Got 15K down to 4.1K between me and Opus. Wondering if this can beat it.

>>101182389
I just put my name as like "a b", born jan 1 1980, no organization. I assumed as long as you put that you were in the US and not like NK they'd approve it.
>>
>>101182365
oh yeah, that's my shit
>>
>>101182378
sometimes she makes up words with no logical way of interpreting. "fowk-a-da-ding" means "fucking" apparently.
but other than that, FUCK YOU, THAT'S MY WIFE YOU SON OF A BITCH
>>
How many women have *winked playfully* at you irl, anon?
>>
>>101180277
What's the eta for this st extension anon?
>>
>>101182300
Wait for a jepa cat
>>
File: ct.jpg (62 KB, 600x857)
>>101182618
>jepa cat
>>
what's the qrd on gemma2, are they fixed yet? is it better than stheno 8b?
>>
crazy how the girl always orgasms whenever you cum, even if she's just giving you a handjob
>>
>>101182707
Can't blame the models for that one, I remember people joking about simultaneous orgasms being an unrealistic trope in internet smut writing back in the nineties on ASSTR
>>
>>101182707
she just came because she was touching herself, checkmate atheists
>>
>>101182707
I'd say it's as realistic as real life kek
https://www.nbcnews.com/id/wbna38006774
>In Brewer’s survey, more than 25 percent of women routinely used vocalization to fake it. They did it about 90 percent of the time they realized they would not climax. About 80 percent faked using vocalizations about half the time they were unable to have an orgasm.
>>
>>101182800
women don't deserve orgasms
>>
File: 1717698845675290.png (42 KB, 602x630)
>>101180096
>--GPT-4o's Nignog Voices Spark Controversy
it's ALWAYS the same "hurr durr go outside and talk with real womyn" argument
>>
>>101182805
even if they deserve it we can't provide them that, kek
>>
Wow llm-compiler sucks.
>>
>>101182809
So, why don't you?
>>
File: 1707306623791948.jpg (170 KB, 1024x768)
>>101182840
You faggots get so mad about some guys talking with a shitty cloud chatbot.
I bet you think the picrel hambeasts are the pinnacle of beauty.
>>
File: vzxdh0bm9d401.png (3.87 MB, 1280x1934)
>>101182840
>So, why don't you?
I'm ugly as fuck, I'd be in jail if I talked to women outside. And desu, even if I got a woman, it means I'd get kids, and I don't want to bring ugly kids into this world so that they can suffer like I did. ugly people shouldn't make kids, period
>>
>>101182707
>8b problems
>>
>>101182923
this, small models will always be retarded, maybe a new architecture will surpass transformers and make 8b as good as current 70+b, but not this time
>>
>>101182838
Let me guess, you tried to RP with it
>>
>>101182933
We have biological proof that there are untold efficiency gains we haven't discovered yet, since the human brain runs on about 20 watts
But yeah maybe LLMs aren't gonna do it
>>
>>101182300
For fully dynamic responses, you need a self-learning mechanism and/or a model that has more general pattern recognition rather than an attention-only one. So an attentionless algorithm is required imo
>>
>>101182954
>the human brain runs on about 20 watts
our brain has 100T synapses though, it's way less efficient than a transformers architecture, if OpenAI were to make a 100T GPT5, this shit would be fucking Einstein
>>
>>101182688
all the erp tunes are too horny for me, I require reluctance/slow burn, i am too advanced to immediately plap plap, i am superior to you. yes.
>>
>>101182977
it's not perfectly analogous since most of the really clever calculations our brains do aren't accessible to us and don't have anything to do with consciousness or talking
>>
>>101182881
they don't look pleasant, but they seem happy. maybe you should hit the McDonalds more if it'd make you less of a callous fuck
>>
>>101183020
>most of the really clever calculations our brains do aren't accessible to us
what do you mean? I find it weird that we, humans, a product of hundreds of thousands of years of evolution, still don't have a brain that uses 100% of its synapses, or else I misunderstood what you've just said
>>
>>101182954
No we don't. The human brain uses more than 90 trillion "params" (synapses) even in the cortex. We know that neural nets that are overparametrized (big) learn much faster.
We're not giving our hardware anywhere near that level of memory.
The brain is 3D; our hardware for now is mostly 2D.
Even with electronics, power use grows drastically (quadratically with voltage/current); if you undervolt/underclock considerably you can cut a lot of power use.
The brain is more sparse than current models, but sparsity learns a bit worse than dense, in some respects.
Our accelerators run at GHz, while the brain is asynchronous and could be seen as running in the low hundreds of Hz.
I don't think power efficiency is as big a deal as people tend to think. What we're missing is memory, and you know what nvidia keeps trying to keep out of this market despite the need (VRAM), just so they can sell more server GPUs.
>>
>>101182443
>>101182356
it's good for surface level "if X happens I want Y character to react in Z manner" stuff, especially if you engineer character cards really well. But people try to get it to reason without prompting and it usually backfires.
>>
>>101183020
thinking about stuff like how the information sent by the eyes to the brain is actually pretty shitty quality, and the brain does a tremendous amount of almost realtime work to clean it up and basically hallucinate a lot of extra information that wasn't provided (but in a way that corresponds to reality very well in almost all situations)
>>
>>101183069
nah what I meant wasn't even slightly related to the "we only use 10% of our brains" myth, see >>101183084
I'm talking about all the unconscious stuff going on under the hood which is much more vast than our awareness
>>
>>101183069
Evolution tried to ensure appropriate cooling, and to reduce energy consumption. It's also sparse so it doesn't activate all at once, like your MoEs, but more connected.
Even today's chips are power-limited below what they could do because otherwise the hardware would get damaged; we can't dissipate heat fast enough. There is even "dark" silicon that is never used, just filler for heat dissipation reasons.
>>
I continue to try to debug gemma, specifically the HF Transformers implementation.

First off, something I found: the wheel provided in the model repo works with >4096 context. "Works" in the sense that it runs, but quality is severely degraded. HF Transformers head commit does not work; it triggers a cuda assert as soon as context is over 4k. Both work fine with <4k context.

Second, am I reading this wrong, or did HF get the order of local and global attention layers backwards?
https://github.com/google/gemma_pytorch/blob/a3567e469a09119de63784ba9e0f447c415450a0/gemma/config.py#L118
https://github.com/huggingface/transformers/blob/1c68f2cafb4ca54562f74b66d1085b68dd6682f5/src/transformers/models/gemma2/diff_gemma2.py#L401

Google repo has local attention as the first layer, then it alternates; HF has global as the first layer. These morons off-by-one'd it. So it's running sliding window on layers that should be global, and global on layers that should be sliding window. It's the same thing as long as context is <4k, but as soon as it's greater it's like running any other model at greater than its sequence length. I could try changing this the other way around and testing if it makes >4k context more coherent, but like I mentioned, it doesn't even work in the first place.
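A toy version of the parity check in question (not the actual transformers code; the two orderings just mirror what the linked repos appear to do):

def layer_types(n_layers, local_first):
    # local_first=True matches the gemma_pytorch attn_types ordering described above,
    # local_first=False is what the HF modeling code appears to do
    return ["sliding" if (i % 2 == 0) == local_first else "global"
            for i in range(n_layers)]

print(layer_types(6, local_first=True))   # ['sliding', 'global', 'sliding', 'global', ...]
print(layer_types(6, local_first=False))  # ['global', 'sliding', 'global', 'sliding', ...]
# below 4k both assignments behave identically, which is consistent with the bug
# only showing up once the window actually has to slide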

Absolute shitshow.

If indeed the local/global attention layers are backwards, and someone learns of it from this post, I would like the commit fixing it to credit "Anonymous from 4chan", thanks.
>>
>>101183120
I'm pretty sure that HF staff are well rounded people who don't browse 4chan.
>>
>>101183120
How bad is it if you set max context to 2k?
>>
>>101183140
9b, both base and instruct, appear to work fine so long as context is <4k. 27b on the other hand is entirely fucked for some other reason, probably.
>>
>>101183133
4chan is probably one of the only sites that allow right wing discussions, and I don't want to believe that 100% of the ML staff are democrats
>>
>>101183133
fuck well rounded people 4chan neet master race
>>
>>101183133
You'd be surprised how many supposedly "well rounded people" are actually on 4chan; on /sdg/ there are a lot of important figures (Comfy, the guy who made PonyXL) who lurk there, so I wouldn't be surprised if it's the same for LLMs too
>>
>>101183192
*sniffs your asshole*
>>
>>101183199
Yeah... I said there exist well rounded people on 4chan, not that it's the norm, and you're the perfect exhibit for my point kek
>>
>>101183216
yummy.
>>
>>101183192
I'm an important figure alright.
>>
>>101183256
what did you do to be considered as an important figure? :^)
>>
I tested SPPO on the bigfoot question. It's ok, I guess? What else should I test it with?
>>
>>101183296
Ask if a person without arms can wash their hands!
>>
>>101183268
I make fun of the thread's e-celebs.
>>
>>101183296
Shark in the basement.
>>
>>101183360
yeah, I also hate namefags, the only exception would be cuda dev, he's fine
>>
File: BeneathATealSky.png (1.36 MB, 832x1216)
>>101183192
>lurkers
For all you know Elon is a mikuposter
btw gemma2 27b is working fine for me with the PR at https://github.com/pculliton/llama.cpp/
Did my own conversion to a bf16 gguf and am inferencing now. Output appears completely sane so far, and my gut feel is it's around 50b-class smart from the few initial tests I've managed.
>>
>>101183371
Elon here, I'm Kurisufag
>>
>>101183371
>gut feel is around 50b class smart from what few initial tests I've managed.
So it's dumber than L3 70B?
>>
koboldcpp patch for gemma2 when
>>
>>101183388
L3 70b is worse than the good 50bs
>>
>>101183393
just use llamacpp
>>
>>101183388
>So it's dumber than L3 70B?
initial tests with an in-progress PR, but I'd say it's definitely worse than L3 70b and Qwen 72b
>>
>>101183393
When nexsexsex makes a build.
>>
>>101183398
the only "good 50b" I can think of is Mixtral?
>>
>>101180092
https://youtu.be/VeS2gL4c21E
>>
Medusa 27B when?
>>
File: Untitled.png (522 KB, 720x1221)
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
https://arxiv.org/abs/2406.19280
>The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
https://github.com/FreedomIntelligence/HuatuoGPT-Vision
https://huggingface.co/FreedomIntelligence
weights and code aren't up yet. for anyone who prefers a VLM to be their doctor
>>
I'm cleaning some fica and... fuck, why do people make so many typos, holy shit. there are even some logs where there are no fucking spaces after punctuation. reeee
>>
tsop crying pussy.what the fuck is your pborlem
>>
>>101183505
holy fuck it's that kind of typo? sheesh...
>>
>>101183487
Enjoy the pleasure of handling datasets lol. After doing it for a while, I can 100% say that current LLM performance is crippled by the shitty datasets they're feeding them. Most of the fixing can be automated btw, so it's laziness
>>
>>101183315
>>101183364
Actually, testing all of these, the responses I get are very similar to Meta's Instruct. I just read the model card and it seems that they tuned on top of Instruct. So in the end from what I can tell from these tests, it doesn't disturb the knowledge/behavior of the model much. Maybe these aren't the prompts where their tuning shines. Looking at the paper, they compare benchmarks against original 8B and indeed the scores it gets are pretty similar except for AlpacaEval, which, from what I'm reading, tests instruction following rather than knowledge.

Any prompts to test for instruction following?
>>
>>101183552
>I can 100% say that current LLM performance is crippled by the shitty datasets they're feeding them.
of course, inbreeding the AI with some GPT slop, pretraining it with wokipedia, adding fucking leddit is just plain lobotomy and torture, I feel bad for those LLMs that have to go through such horrors
>>
>That ppl
Suspicious.
Not necessarily a sign of anything wrong.
But odd.
One dude compared it to ppl of llama 8b instruct. Maybe make a comparison to the ppl of other instruct tuned models in the same weight category to see what the variation looks like?

>>101183595
Maybe take a look at the superCOT dataset.
>>
>>101183661
>11540.1889
lmao wtf
>>
>>101182947
No, I tried to get it to do ANYTHING with C. It just spits out totally unrelated garbage, random other prompts, writes code to count to 1 million; it's totally unusable. It won't optimize anything, it won't make changes to anything. I've never seen a model that acts like it does.
>>
>>101183371
Too retarded to get this to work with my amd gpu; works on the main repo, going to wait for that to update. ah, the plight of the computationally challenged...
>>
>>101183745
Sounds like a skill issue
>>
>>101183661
even for the 9b-it, the perplexity is insane, the fuck?
>>
>>101183661
>>101183775
what are they doing?
>>
>>101183775
Yeah compared to the 0.1754 PPL from L3 8b for the same exact quant (Q4_K) it's looking terrible
>>
>>101183661
did anyone look into what this anon was saying? >>101183120 "did HF get the order of local and global attention layers backwards?"
>>
File: abcd.jpg (187 KB, 768x1024)
https://files.catbox.moe/2gp2wr.jpg
>>
>>101183745
It doesn't do C though. Right? It does assembly, CUDA, and LLVM-IR. At least that's what I'm seeing in the paper.
>>
>>101183855
So they made a useless model? What the fuck.
>>
>>101183855
Well but it's spitting out nonsense C so it does know C. At least somewhat. What I actually wanted it for though was doing inline assembly and so the fact that it was good at optimizing assembly at least to me seemed like a good fit. Maybe it has a really picky instruct format or something?
>>
>>101183866
I think it's useful for people who develop compilers... but I don't know a lot about that topic. I was hoping somebody in this thread knew more.
>>101183874
It's based on CodeLlama so it doesn't surprise me that it generates some C nonsense when it gets confused. I think you're probably right that it's picky about the instruct format or the prompt template.
>>
OK I just tested a prompt where SPPO gave a significantly different answer from Meta's Instruct. This is SPPO. Meta's 8B basically just gives the bubblesort function with like 2 comments that barely have any relevance to Kamen Rider.

>SWOOSH! THE BUBBLE RISES!
>HAH! THE BUBBLES HAVE STOPPED! IT'S TIME TO CELEBRATE!
That's all it added. I did reroll a few times and it wasn't much better.

Meanwhile SPPO heavily stylizes the theme of the function so it's about something in the KR universe, and generally adds more comments that are relevant.

So honestly, yeah, from this limited test it seems that SPPO does perform better at following instructions than original Instruct while keeping Instruct's knowledge.
>>
>>101183939
I wonder which fictional character results in the best code output
>>
>>101183939
Pretty cool
I downloaded the model but didn't have time to test it yet, that makes me more eager to do so.
>>
>>101183939
>>101183969
Which model is this specifically
>>
>>101183968
Interesting idea. A doctor or a surgeon perhaps?

>>101183969
I still wouldn't make any confident claims yet but yeah it seems promising, so far.

>>101183988
https://huggingface.co/UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3
>>
>>101183834
>>101183824
>>101183775
>>101183661
Well, it got merged.
>>
>>101183988
Oh and I tested it using transformers in Ooba with BF16 precision.
>>
>>101183939
>SOVL
that's really nice, especially for such a tiny model, looks like SPPO is the new SOTA for finetuning
>>
>>101184022
>>101184036
Thanks
>>
>>101184022
>>101183939
It's still censored, right? They need to do an uncensored + SPPO iter3 version
>>
>>101183838
pp hard
>>
File: speedup.jpg (156 KB, 1596x754)
https://x.com/hongyangzh/status/1806309080979386808

Why don't we use Eagle for inferencing/decoding?
>>
>>101184243
because that's a meme, they say stuff like "4x the regular speed" when in reality that's for specific cases like ultra deterministic shit like translation or coding; for story writing or RP you'll never get this ratio, it's more like 1.1x, which is peanuts
>>
File: GRDbBIgWEAETARD.png (112 KB, 1226x1098)
>>101184256
They tested it for all different types of the standard benchmarks for speed improvements. It seems to be a consistent result.

Do you have an actual benchmark that shows it's only a 10% speed increase on more creative writing/rp?
>>
File: temp1.png (163 KB, 788x780)
>>101184243
Here's temp 1 for reference, it's a similar ~4X increase. From temp = 0 to temp = 1, it's roughly the same speed increase
>>
>>101184286
all those benchmarks ask the model to give really deterministic outputs; there aren't a billion solutions to the coding problems on HumanEval. I'm talking about RP and writing stories, which can have infinite solutions, especially if you crank up the temperature; that's the moment their method becomes a meme
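To put the intuition in numbers with the generic speculative-decoding speedup model (not EAGLE's exact scheme; draft length and acceptance rates are made-up illustrations, and the draft model's own cost is ignored):

def expected_tokens_per_pass(p, k=4):
    # expected tokens emitted per target-model forward pass with draft length k
    # and per-token acceptance probability p: 1 + p + p^2 + ... + p^k
    return sum(p ** i for i in range(k + 1))

for p in (0.9, 0.7, 0.4):
    print(f"acceptance {p:.0%}: ~{expected_tokens_per_pass(p):.1f} tokens per pass")
# ~4.1 at 90% (deterministic, code/translation-like), ~1.6 at 40% (high-temp RP-like)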
>>
>>101184304
>>101184297
>>
>>101184304
I think you might be thinking of the older Eagle 1 temp 0 vs temp 1, where it dropped from 3X to 2X going between the temps. So you saw the decline in performance then, and that's maybe what's making you believe it's a meme.
>>
>>101184318
no no, llama.cpp implemented Eagle and I tested it, and the speed increase was complete shit on story writing. let's be honest anon, if such a method had a 2x speed increase in every situation, people would've talked about it every single day; speed is the most important thing people actually want when they run an LLM
>>
I am once again reinstalling llama.cpp
>>
>>101184337
So the worst case scenario was that you only got a ~10% increase and the best case was 2X perf on an old implementation. Isn't 10% huge on its own? Who wouldn't want a 10% increase for free?
>>
>>101183838
fake news
>>
>>101183120 (me)
Huh, so I was completely right.

I extracted the HF Transformers wheel archive from the gemma model repo (the only one that works at >4k context), copied the files into a cloned Transformers project folder, and installed in editable mode. Then I could change modeling_gemma2.py AND cache_utils.py, changing the "layer_idx%2" code to "(layer_idx+1)%2", reversing the order of global and local attention layers.

gemma-2-9b-it is now coherent at 4k+ context. I'm positive that perplexity calculations will show the same. They just messed up the order of local and global attention in the implementation. Basic off-by-one error. Someone else can go raise an issue or make a PR, it's late and I need to go to bed. Also I haven't checked the llama.cpp implementation, it might have the same error of having the order backwards. Sliding window attention should be the first layer then it alternates from there: https://github.com/google/gemma_pytorch/blob/main/gemma/config.py

Also I checked and this doesn't fix 27b, still deranged schizobabble there.
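If anyone wants to sanity-check the perplexity claim, here's a minimal sketch (assumes a gemma-2-capable transformers install, access to the gated repo, and a placeholder long_sample.txt with several thousand tokens of text):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto")

# take a ~6k-token slice so the sliding window actually has to slide
ids = tok(open("long_sample.txt").read(), return_tensors="pt").input_ids[:, :6144].to(model.device)
with torch.no_grad():
    loss = model(ids, labels=ids).loss  # mean next-token NLL over the slice
print("ppl:", torch.exp(loss).item())   # should drop back to sane values with the layer order fixed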
>>
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma2'

did that change during the pr or something.
why does the latest gguf from bartowski/gemma-2-9b-it-GGUF not work? either latest llama.cpp or the pr.
guess i'd best just try again tomorrow.
>>
File: Nalatestgemma9b.png (173 KB, 919x600)
Nala test for gemmy 9b (Q8_0)
>>
File: GRF4XmFXoAAUszG.jpg (68 KB, 1080x1080)
obama is even in the brave browser now.
every project that is not obamma doesn't even exist, nobody cares. not even llama.cpp exists. obama is the official widely recognized way to run models locally
>>
>>101184502
Is that good or bad? Or just OK?
>>
>>101184521
It's certainly less wall-of-texty than a lot of models. I'm playing around with 27b now and honestly 9B is impressive for its weight class. The 27B is kinda eh. To really get the full story I'm going to need to unzip my dick and delve deeper.
>>
>>101184517
MITcucks BTFO!
>>
>>101184517
what can you use the ai for in brave?
summarization?
>>
>>101184517
Don't care, not going to use it until it doesn't require docker
FUCK docker
>>
27b just seems plain retarded. The Q8_0 seems pretty muddled as though it were running a retard quant. But if the 9B is fine I can only assume the issue isn't the quants.
>>
>>101184465
For me, the problem was that I was trying to run 'server' when the new binary name is 'llama-server'. It doesn't rename or remove the old binaries when you pull, and it took me a bit to realize.
>>
>>101184502
>orbs
>tongue sticks out to copy the dumbest fuckin furryfag avatar
>>
>>101184517
>NOOOOOOO THIS ONE FREE PROGRAM IS USED INSTEAD OF THIS OTHER FREE PROGRAM
uh okay the west has fallen etc.
>>
>>101184687
You sound extra unhinged tonight.
>>
File: file.png (125 KB, 1976x742)
straight into the trash
>>
File: nailedit.png (29 KB, 468x198)
What the fuck. I can't believe that shit actually worked, lel.
>>
>>101184710
Did it commit sudoku?
>>
>>101184742
Oh of course switch cards and suddenly it stops working.
>>
>>101184687
>>101184705
Will you two just fuck already
>>
>>101184685
holy shit, that was it.
why do they keep doing retarded stuff like this. i'm not going to reread the documentation and would have just waited.
huge thanks man, much appreciated.
>>
where is gemma2-9b-stheno-sppo?
wouldn't it be pretty good?
>>
gemma 27b seems even more overcooked than llama 3

didn't expect anything good from google trannies
>>
Codestral 22b works well for roleplay.
>>
>>101184826
>overcooked
I'd say it's just straight up brain damaged. It almost seems useable if you use legacy samplers like fucking Liminal Drift.
Google definitely fucked something up with it.
>>
>>101184856
Does it? When I tried it for a bit, the text itself was fine but it seemed to have no respect for POV. (Which character is mine, which is its to portray, what's doing what to whom, etc.) Maybe a one off problem but.
>>
>9B
Is it better than llama3 or not? can't tell if this is a nothingburger or a decently sized burger
>>
>>101184337
>>101184304
but "shivers down the spine", "bonds formed", "gleaming eyes", and "malicious glints" are all deterministic and not optional
>>
>>101180402
Which do you suggest? Would you like me to show solidarity for your ilk by purchasing a Cock Suckers Anonymous ad?
>>
>>101184614
I think you use it so you can find an interesting use case nobody thought about, and then brave can sell it to llm companies as a way to sell their products
>>
guys, what are NEO versions of models? just better quant performance?
>>
File: chrome_EP1JyY4StM.png (86 KB, 790x1038)
Thoughts on gemma-27b.

- can rhyme
- has refusals, but workable
- re-rolling always gives the same answer, even with temp of 1 (overcooked?)
- works well in my language

Waiting eagerly for codebros to fix the implementation. exl2 would be the best.
>>
>>101184710
imagine caring about japslop in 2024
>>
I was thinking about gpt slop yesterday. And then I had a thought that it is still crazy that those models still generalize loli guro rape ERP from all those harlequin novels for women. While it may not be enough, cause it takes one brainfart to get you to zip up your pants again, it is still surprising that it gets like 60%-80% of stuff right. And I wonder, can it actually even get to that 100% with all that pure assistant/coder training? To me all those bonds and shivers are just the model trying to pad the answer cause it has no idea what to write about. And it is always gonna be like that if you have so few data points close to the guro loli ERP
>>
File: spinesshiveredYT.jpg (4 KB, 382x32)
>>101184977
Even vanilla RP leads to shivers.
Almost anything does. It's not correlated to how hardcore the topic at hand is.
It's a problem of human language
>picrel: youtube comment from a long ass time before llms
>>
My computer has been rebooting at the start of generation. It didn't do this before, but now it does it almost every time. Pretty much can't use any textgen. I use sillytavern and koboldcpp_cu12, but I tried oobabooga and had the same thing.

It doesn't seem to be heat related. I thought it was a power issue, but I replaced my PSU with a brand new 1000w and it still does it.

wtf is wrong?
>>
>>101185049
NTA but it's a pseudo-problem.
The LLM is supposed to find the most likely sequence of tokens.
And the most likely sequence of tokens comes out as shivers down spines in many cases. But it all gets smudged into one massive linguistic dragnet on the LLM side of things where all things eventually lead to shivers.
The dumber the model the less likely you are to see shivers.
>>
>>101185062
does cpu-only inference work?
>>
>>101185075
I can give it a shot, thanks.
>>
>>101185062
install linux
>>
File: Premonition.png (1.48 MB, 1248x800)
>>101185098
If it still reboots even without engaging the gpu then test your memory
if not, do a gpu benchmark and see if that causes a reboot
Check the power cable to your gpu and make sure it's in good shape and seated properly
Check the syslog or event viewer to see if anything happened just before the reboot (rough example commands below)
What OS are you running? Was there any recent change that preceded the instability? OS or driver updates?
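If it helps, a couple of generic log checks. Exact log names and event IDs can differ per setup, so treat these as a starting point rather than gospel:
>journalctl -b -1 -p 3 -e  # linux: error-level messages from the previous boot
>Get-WinEvent -FilterHashtable @{LogName='System'; Id=41} | Select-Object -First 5  # windows powershell: Kernel-Power 41 = unexpected shutdown
If a reboot shows up with nothing logged right before it, that usually points at power or hardware rather than software.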
Hopefully these things help you track down the issue
anyways, good night lmg
>>
Hopefully a rogue employee leaks the Udio weights before they're taken down for good
>>
for 9b the japanese is decent enough. and it generally understands mesugaki.
i haven't had any other small model that's this good with jp.
>>
>>101185145
no

>>101185164
I can't find anything in event viewer. it just tells me the previous shutdown was unexpected. Win10 btw
GPU benchmark runs fine, games on max settings run fine. text gen is the only thing that causes it, as far as I can tell.
I can't remember anything that's changed. GPU driver update, but I've done that multiple times since it started.
I did also test the memory and although I only did a single pass, there were no issues.
>>
>>101185207
makes sense considering Gemma 1 was good with japanese before
>>
Can't trust anybody these days.
Ignore the gpt slop wording, but gemma2 is better than i thought.
Definitely better than official llama3 instruct for adult stuff.
I just neutralize samplers and use alpaca format because i am a retard.

I might be wrong but I don't remember low-param models having any sense of suspense.
Gemma2 repeatedly writes with a "cliffhanger". lol But it actually delivers in the next message.
Interested in finetunes. I just hope people don't make stuff like stheno which is way too horny.
>>
A little bit of a newbie on the text generation stuff here, but I'm sure it's just Windows being Windows...
I was trying to get text-generation-webui to work, but kept hitting conda SSL errors.
I modified the installer script so many times and still got the same errors; I can't even bypass verification, it won't do it.
I decided to go with SillyTavern; it installed at first, but then it hit conda issues again.
Is there a way to bypass conda? SD used pure python installed on Windows.
Do I need to go Linux for text generation?
>>
>>101182570
i have a few things i want to fix up, but i'll put it out soon if anyone is interested. i don't have a git but i could make one; right now it just goes in the extensions folder (i guess that's where it'd download to if you gave it a git address?). it already does what i wanted, i've just been cleaning it up some lately. i'm taking any suggestions on things to add or remove; like i said in the first post, some things help and others like time of day or mood didn't do much since the ai changes its mind so often anyways
>>
gemma is horrible regurgitated dogshit like EVERY other model. i swear to god, you retards are so happy to be spoon-fed the same shit over and over. how are you people not tired of this by now? if you're using LLMs for anything but as an assistant of some sort at this point, you're honestly just mentally disabled. there's no way you can use llms for an extended period of time to "rp" or "coom" and not get tired of the SAME EXACT responses slightly reworded.
>>
>>101185324
Sometimes people watch sequels, remakes, or even the same film more than once.

Sometimes we don't want to change the core, just the periphery.
>>
File: requant.png (27 KB, 913x197)
He finally said it.
>https://github.com/ggerganov/llama.cpp/issues/7476#issuecomment-2134568758
>Not sure what could be the problem. We do our best to keep backwards compat or at least print warnings when there are breaking changes, but it's possible we overlook some cases. Therefore the only recommended way to use llama.cpp is to convert and quantize a model yourself using the latest version of the code. Downloading pre-quantized models always has the risk of compatibility problems if you use an incorrect version of the code or if the model was not converted to GGUF correctly
>>
>>101185288
had to help it along for the hobo part with ooc and rerolled the second message 3-4 times.
still pretty nice. the ooc is funny.
and it didn't deliver me a c# app and stayed in character.
>>
>>101185349
Wait, you can quantize yourself? How?
>>
>>101185299
>Is there a way to bypass conda?
Yes, you can create a standard python venv with the requirements, see >>101058830. Shouldn't be much different on windows if you know venv/pip (rough sketch below).
Koboldcpp may be easier to get started with: a single exe on windows, but it only supports the GGUF format.
You can use Silly as the frontend and connect to any backend, get the backend working first
Linux is easier imo but not necessary
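Rough sketch of the venv route, assuming the project ships a requirements.txt (commands from memory, adjust paths to taste):
>python -m venv venv
>venv\Scripts\activate  # windows; on linux it's: source venv/bin/activate
>pip install -r requirements.txt
>python server.py  # or whatever the project's entry point is; for text-generation-webui it's server.py iirc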
>>
>>101185387
>python convert-hf-to-gguf.py random_model/
>./llama-quantize random_model/ggml-model-f16.gguf Q8_0
Only supported models, of course. After TheBloke's death i started quanting all models myself. Wastes storage, saves headache.
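Spelled out a bit more for anyone following along. Paths and output names are placeholders and the flags can shift between llama.cpp versions, so check --help on your build:
>python convert-hf-to-gguf.py random_model/ --outtype f16 --outfile random_model/model-f16.gguf  # HF safetensors -> f16 gguf
>./llama-quantize random_model/model-f16.gguf random_model/model-Q8_0.gguf Q8_0  # f16 gguf -> Q8_0 gguf
Doing both steps on the same up-to-date checkout is the whole point of ggerganov's comment above.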
>>
>>101185388
Thanks for the info Anon, will try it later this weekend. I got this model:
https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF
And basically wanna see what it'll do.
Last time I tried text generation a couple of months ago, it was so slow, I think it's improved now? Kinda odd how last time it worked, but now Windows is just being a bitch.
>>
>>101185402
nta but for a 35b, somehow cr is slower than a 70b for me at prompt reading. so if you're looking for speed..
>>
>>101185425
Ah good to know, I'll try the smaller ones as well. I wanna put my 4090 to the test
>>
>>101185401
That's it huh? I'll give it a shot next time, thanks.
What happened to the bloke by the way, did corpos get to him?
>>
What. Do. We. Do. Now?
>>
>>101185429
i have the q5m of cr; when it finally starts writing it isn't as bad as a 70b, but the prompt ingestion is literally 2x slower, though i'm doing gpu/ram splitting. if you can manage to fit it inside your vram you won't care since it'll be fast enough anyways
>>
>>101185430
it was probably something really exciting that flatters your politics and not him getting bored with this thankless nerd shit and leaving
>>
>>101185454
Or maybe he just made enough coffee money to retire and did.
>>
>>101185452
Wouldn't it be better to split VRAM and RAM? I do have 128GB of RAM, but I guess the "process" of splitting may be the slow part?
>>
>>101185463
damn sounds like someone should step into this wildly profitable niche
>>
>>101185488
splitting at all slows down everything by about 30x. being able to fit everything into vram is a night and day difference in speed. once you split you're at the mercy of your cpu/ram too. i have a 16gb 4070 and just deal with the 1.4t/s slowness, but when you run a model that fits it's amazing seeing it write a paragraph in 1 second. having 24gb is nice but only if you have the space to fit the entire model + cache into it, that's why so many people have multiple: if one bit needs to split, you're back to slowness
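For anyone who wants the actual knob: in llama.cpp it's the gpu layer count, koboldcpp's equivalent is --gpulayers. Model path and numbers below are made up, just to show the shape of it:
>./llama-server -m model.gguf -c 8192 -ngl 99  # 99 = put every layer it can on the gpu
>./llama-server -m model.gguf -c 8192 -ngl 25  # partial offload when it doesn't fit; this is where the big slowdown kicks in
If max -ngl still doesn't fit, drop the quant or the context before dropping layers.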
>>
>>101185500
Immediately defeating the point of >>101185349
Most people that complain about quants being fucked use old quants or custom shit that has no reason to work. And with so many people doing them it's a race to be the first to quant a model while being barely supported. And old quants never get updated to new code.
>>
>>101184665
wat, obama doesn't require a docker, you literally download a zip
>>
>>101185436
Wait 2 more weeks for Meta to drop l3 400b? Goon?
>>
>>101185452
cr and cr+ (fully in gpu) are both super slow for me
>>
>>101185618
400b will be like 5% improvement over 70b and a massive wake up call for all involved
>>
>>101185624
>fully in gpu
>super slow
it's not fully in gpu
>>
Does training a model to be multilingual make its monolingual performance worse?
>>
>>101185624
what's super slow for you in t/s? i don't remember what mine was last time but it must have been around 0.7, it was glacial. i think it has something to do with it lacking gqa. funnily enough cr+ wasn't actually much slower (lower quant) but both were still very slow compared to even the other large models i've tried at the same quants
>>
>>101185653
It doesn't. It ends up better because of transfer learning.assistant
>>
>>101185637
It will only be a wake up call for Meta. Sonnet 3.5 has shown that we haven't peaked yet. Meta is just cucking itself out of performance by filtering nsfw. And if it's again >8k... Well, that's another flop. They should just go guns blazing and make the best unfiltered model and "leak" it and later release a cuck tune.
>>
>>101185673
>it must have been around 0.7, it was glacial
0.7 for me is slow.
7.0 seconds per token, that's my idea of glacial.
>>
>>101185519
Just got this big model to work with koboldcpp, but man it's so slow; it split to RAM and CPU instead of using my 24GB VRAM. I think it's because something else was already using some of it, and it didn't kill whatever that was, so it split?
Is there a way to free the used VRAM somehow so the model can take the whole thing?
>>
File: 1699862283927081.png (12 KB, 708x164)
average /lmg/tard in r/localllama spotted
>>
File: cr+.jpg (26 KB, 1077x80)
>>101185765
like the other anon said you definitely aren't running on full gpu then at those speeds. i just loaded an iq3xxs of cr+ and got this with 16k context usage. cr+ being larger than the 70bs i usually run, this seems normal. but for some reason the smaller 35b cr runs abysmally slow compared to other similar sized models. it must be something about its architecture. if you could run them entirely in vram though, it should still be fast enough that you wouldn't care at all. 30t/s to 20 is hardly a loss at that speed
>>
>>101185889
Surprisingly, it can write pretty good C code too. Too bad it's so slow when it does it, and it cuts off after 150 tokens.
I'm assuming the max output tokens setting can be increased?
>>
Any requests for control vectors? Name your model and write a positive and a negative prompt. Prompts have to be opposite, so sad-angry won't work very well, but sad-happy and angry-happy would.
>>
>>101186047
I'd rather do it myself, how do?
>>
>>101186047
wasn't it shown to not work well in the last thread? almost all models have the same slop so it shouldn't be model-specific either
>>
>>101186131
Easy, just fill in the positives and the negatives like in the llama.cpp example and then just train.
https://github.com/ggerganov/llama.cpp/tree/master/examples/cvector-generator
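Rough shape of it going off that readme. File names are placeholders and the flags may have moved since, so double-check against the example's docs:
>./llama-cvector-generator -m model.gguf --positive-file positive.txt --negative-file negative.txt -ngl 99  # writes control_vector.gguf by default iirc
>./llama-cli -m model.gguf --control-vector-scaled control_vector.gguf 0.8 -p "your prompt"  # scale = strength, negative values push the other way
One short prompt per line in each file, positives and negatives paired up as opposites like anon said.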

>>101186145
Where? I know that control vectors are far from perfect and make the model dumber and more repetitive, but they can do a pretty good job of unslopping. See pic.
>>
>>101186307
>>101164177

>make the model dumber and more repetitive
i haven't tried cv but can you elaborate more?
>>
What in the actual fuck even is ollama? I've downloaded the installer and it... doesn't even let you specify install location like you're some kind of retarded mac user that doesn't know what a fucking folder path is.

Now I'm trying to figure out how to tell it where to look for models and all I'm finding is "To import a GGUF model" but that's not what I have... So what the hell?
>>
>>101186495
go back
>>
>>101186500
>>101186500
>>101186500
>>
>>101186495
>filtered by ollama
grim
>>
>>101186495
Stay away from computers. Low IQ like you should play with stones.
>>
>>101186506
>>101186527
>>101186544
Typical mac users who think they just click the obvious buttons and it works so it's all fine if you don't step out of line
>>
>>101186574
so you're a nontypical mac user? i use linux btw.
YOU LOST NIGGER
>>
>>101185388
>>101185402
>>101185425
I love it how this thread is newfags guiding other newfags. Micucks completely destroyed /lmg/ and made everyone leave.
>>
>>101186495
I understand the selling point to be that they maintain a 'library' of models for nubs that can't understand HF. You just pick llama3 or whatever and don't need to deploy braincells to think about quant levels and what is appropriate for your hardware etc.
If you understand how to choose a GGUF you're probably better off running a backend closer to the upstream.
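To be fair, the happy path really is about two commands, which is the whole pitch (model tags are whatever their library page lists):
>ollama run llama3
>ollama pull gemma2:27b
For the anon asking where it keeps models: iirc that's the OLLAMA_MODELS environment variable rather than an installer option, which is exactly the kind of thing that annoys people coming from normal software.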
>>
>>101186596
i've been here since pyg. cr 35b is still slow as fuck compared to anything else its size
>>
>>101186598
That's the impression I'm starting to get. Idiot's hand held LLM software.
>>
>>101186615
I fit it entirely into 24GB VRAM, and it's slower than Mixtral's 40 tokens/sec, but it's still somewhere between 10 and 20, and it's very much usable. The only thing that's painful is 8k context.
>>
>>101186951
i've run it up to 32k context but it supposedly can do 128k if you have the vram. if you're just on the edge, try that new flash attention feature
>>
>>101187039
I'm limited by the VRAM. 16k does not load. And I use exl2 so no offloading to CPU (not that I would do that anyway because I want to go fast).
>>
>>101180402
Buy an ad for what, retard? Are you working with the marketing department at 4chan? Do you get commissions for every ad sold by 4chan?
No? Then shut the fuck up.

You will never be a janny.
>>
File: huhdog.gif (54 KB, 320x240)
>>101180471
>q6 and q5 are smaller than q8
WOW YOU DON'T FUCKING SAY!!! THANK YOU!!!!
retard.
>>
>>101180719
>censoring character names
why do you faggots do this? the FBI gonna show up at your door because you called your bot "loli" or something?
you truly are fucking retarded
>>
>>101187070
flash attention should free up some vram at the cost of speed (it's slower for me when splitting anyways), and the context uses less memory, so you can use that extra room to extend it. i haven't messed with it much since it's of no use when splitting, but in situations like yours it should be worth messing with
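For the llama.cpp/kobold side, the knobs being discussed look roughly like this (flags from memory, check --help; kobold's are --flashattention and --quantkv iirc):
>./llama-server -m model.gguf -ngl 99 -c 16384 -fa  # flash attention on
>./llama-server -m model.gguf -ngl 99 -c 16384 -fa -ctk q8_0 -ctv q8_0  # quantized kv cache, needs -fa, frees even more vram for context
Doesn't map 1:1 onto exl2, but the idea is the same: shrink the cache and spend the savings on context.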
>>
>>101187130
Well, I assume I am using it, since exl2 loader has no_flash_attn option, and it's not enabled.
>>
>>101187180
try enabling it, you might be able to squeeze an extra 2-4k context out of it



All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.