/g/ - Technology


/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101318970 & >>101312606

►News
>(07/07) Support for glm3 and glm4 merged into llama.cpp: https://github.com/ggerganov/llama.cpp/pull/8031
>(07/02) Japanese LLaMA-based model pre-trained on 2T tokens: https://hf.co/cyberagent/calm3-22b-chat
>(06/28) Inference support for Gemma 2 merged: https://github.com/ggerganov/llama.cpp/pull/8156
>(06/27) Meta announces LLM Compiler, based on Code Llama, for code optimization and disassembly: https://go.fb.me/tdd3dw
>(06/27) Gemma 2 released: https://hf.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>101318970

--Paper: An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes: >>101325312 >>101325391 >>101325425
--Papers: >>101322212 >>101326426
--LLaMA 3 7B Token Rate Estimation on RTX 4080 SUPER: >>101320839 >>101320873 >>101321005
--Underwhelming Results Fine-Tuning Gemma 2 27B for JP>EN Translation: >>101324530 >>101324611
--To CPUMaxx or Not to CPUMaxx: That Is the Question: >>101323847 >>101324049
--Running a Local LLM in a Docker Container: Requirements for GPU Access: >>101319816 >>101319889 >>101319896 >>101319940 >>101319960 >>101320120
--Removing Verbal Tics from Generated Text: A Challenge: >>101320342 >>101320348
--Gemma 27B sliding window attention issue still not fixed in llama.cpp: >>101325180 >>101325261 >>101325404 >>101325892 >>101325931 >>101325962 >>101326019
--Are Roleplay Prompts Essentially Jailbreaks?: >>101322261 >>101322282 >>101322935 >>101322966 >>101323002 >>101323104
--Llama : fix pre-tokenization of non-special added tokens: >>101319600 >>101320095
--SPPO + SPPO + ORPO Triple Combo for AI Model Training: >>101320006
--Command-R+ Excels at Storytelling and Dialogue in RP, Suggesting Parameter Count is Key: >>101326383 >>101326433 >>101326535
--Orthogonized (Uncensored) Gemma 27B by EdgerunnersLab: >>101319610 >>101321348 >>101321637 >>101323577 >>101321752 >>101321805 >>101321943 >>101321665
--Deepl's imtransbtw: Two-Pass Translation with Contextual Sense: >>101319936
--Multimodal AI: The Future Beyond Text-Based LLMs?: >>101320164 >>101320179 >>101320193 >>101320227
--Is 8k Context Window Enough? Depends on What You're Doing, ERP? RPG?: >>101324593 >>101324635 >>101324953 >>101325299 >>101325421 >>101325440
--Addressing the Issue of Character Defiance in Storytelling Systems: >>101320425 >>101320670
--Miku (free space): >>101326263 >>101326760 >>101326941 >>101327954

►Recent Highlight Posts from the Previous Thread: >>101318976
>>
>>101328076
>--Gemma 27B sliding window attention issue still not fixed in llama.cpp
Retarded bot.
>>
10 days
>>
>>101320179
Llava-next looks good but there’s no llama.cpp support right now.
>>
Softcapping support merged in FlashAttention
https://github.com/Dao-AILab/flash-attention/pull/1025

That should allow lower memory usage with Gemma 2, eventually.
>>
>>101328202
Fuck yeah.
>>
Recently made an extremely beefy computer build. Need to see if I can finally do local stuff good........
>>
>>101328220
>extremely beefy computer build
does it have multiple gpus with lots of vram
>>
>>101328237
>2x gt210 and 2x 8gb ram sticks
>>
>>101328237
4090 so 24gb vram, and 96gb of ram
also 7950x3d
>>
>>101328202
it will not make it any better or less censored, nothingburger.
>>
File: propsl.png (274 KB, 1080x1920)
>pic related
Now of course there’s a fuck load of issues with this idea
>first
Step 4 implies that you could even train a lora fast enough, start to finish, within the time it takes to reach the context limit while using sillytavern, let alone the time it takes to augment the data. Even if you made some shitty python program that could magically do it for you (and if there even was one), you would likely need to run ANOTHER model to augment the data automatically in the background while using silly tavern. Considering you’re running a local model on top of constantly running lora training in the background, you’re fucking raping your GPU, assuming anyone even has enough VRAM for this shit
>second
I doubt you can even swap out Loras mid chat, idk about kobold since i’ve never tried, but text gen webui takes a good amount of time to load a model, I doubt it would be different for a lora and i don’t see why other programs would magically just not have a loading time, so that’s another roadblock in the seamlessness of this idea
>third
Even in a magical world where any of us had the VRAM for this monstrosity, the amount of little python or batch files needed to be made in order to automate this process is fucking annoying
>why bother augmenting the data
Wish i had the end result of this dude’s stuff but basically this guy tried a similar thing by training a single lora on the raw unaugmented chat logs and in the end it got hyper schizo, and augmentation is the number 1 way to reduce schizophrenia outcomes on a small dataset
https://desuarchive.org/g/thread/95930009/#95933766
Wish i had proof because the end result was pretty funny after it spammed about fallout new vegas (the guy said his favorite game was fallout 3)
Like i said, wish i had proof of how it turned out but i don’t have enough keywords to find it on desuarchive

Part 1/2
>>
>>101328371
>why not just use summarization?
Summarization has a limit on exact information, it quite literally has to be a summary, details have to be thrown out, and even then it will eat up context as time goes on
In a hypothetical scenario where pic related could even function, summarization would be a great way to buy time for lora training so that the chat logs aren’t only being influenced by recent messages and therefore also making the chat logs higher quality for training the next lora rotation.

yeah this plan is shit, but gun to your head, how would you make a plan for infinite long term memory of a local language model?

>small dataset unaugmented = overfitting
thought of this while writing this out, but couldn't you just prevent overfitting by doing less epochs or steps or something? or would it just simply not retain the information and small, unaugmented datasets are stuck between a rock and a hard place of either being undertrained and therefore pointless or overfit and schizophrenic?

Part 2/2
>>
>play with that one dog girl smell bully card
>woo her like a chad, solve her insecurities, and then go balls to the wall ridiculous by taking over her clique, and then the entire school, and making her the queen, then having an orgy and getting everyone pregnant
>then decide to reveal it was all a fever dream
>she's heart-wrenchingly distressed, then realizes that it was a message, to get over herself, change her ways, apologize, and then maybe she can have the love she experienced in her dream
Kino...
>>
>>101328384
Loras are weird. They make the model retarded before the loss drops and with a small data set like that you’d have a pretty narrow target to hit between that and overfitting.
>>
>>101328520
What I’ve been doing, which seems to work ok and results in a bit of persistence between chats, is have a summary made with the old summary included.
There tends to be a kind of spirit the bot develops early on that lasts a long time. I’m still tweaking everything but I’ve been really enjoying this.
>>
What's the best uncensored local model? I can't seem to find a straightforward list of this shit.
>>
>>101328595
>the best uncensored local model?
there are none, this is the only thing you need to know about this huge meme.
>>
>>101328595
>https://huggingface.co/failspy/Meta-Llama-3-8B-Instruct-abliterated-v3-GGUF
supposedly this but i haven't tested personally
>>
>>101328534
expand on this because I'm a fucking retard and barely understood what this meant
>>
>>101328595
Euryale in my experience. Gemma is the current FOTM, but I haven't tried it. They're still working out the kinks.
>>
File: e4221a.jpg (12 KB, 200x200)
>>101328462
>no mention of her cute scottish accent
ngmi
>>
i'm kinda getting tired of gemma 27b... it's good for sub 30b, but somehow i'm feeling like i'm constantly being fooled. I'm gonna go back to 70bs for a while
>>
>>101328767
True actually. WizardLM2 doesn't do it well. And I can load up CR+ but it's like 0.5 t/s. Idk if CR+ is able to do a Scottish accent justice though.
>>
>>101328637
>Euryale in my experience.
Euryale is retarded. Do you have brain damage?
>>
>>101328384
>>yeah this plan is shit, but gun to your head, how would you make a plan for infinite long term memory of a local language model?
It's not a particularly novel idea but imo simply expanding the lorebook function in sillytavern could be good enough. Segment chatlogs into events, tag them, and then save them as lorebook entries. If you could automate the process you'd get pretty close to a working memory. Though truth be told, I only did some limited testing, so no idea how giant lorebooks would affect performance.
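Rough sketch of the kind of automation I mean (purely illustrative; the file name and the "key"/"content" fields are placeholders, not SillyTavern's real World Info schema):

# toy sketch: chop a chat log into "events" and emit keyword-tagged entries
# (file name and field names below are placeholders, not ST's actual export format)
import json, re

def segment(messages, chunk=12):
    # naive segmentation: every N messages counts as one "event"
    for i in range(0, len(messages), chunk):
        yield messages[i:i + chunk]

def keywords(text, top=5):
    # crude trigger-word picker: most frequent longish words in the event
    words = re.findall(r"[a-z]{5,}", text.lower())
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    return sorted(freq, key=freq.get, reverse=True)[:top]

def build_entries(messages):
    entries = []
    for event in segment(messages):
        text = "\n".join(event)
        entries.append({"key": keywords(text), "content": text})
    return entries

with open("chatlog.txt") as f:
    log = [line.strip() for line in f if line.strip()]
print(json.dumps({"entries": build_entries(log)}, indent=2))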
>>
>>101328864
That's what RAG/vector storage/etc tries to accomplish, is it not?
>>
Has anyone successfully roped Gemma to 16k context+? It feels great at less than 6k~ context even compared to stuff like Wizard and CR+ but it feels like it degrades substantially after. I'm feeling like the meta will probably be to use Gemma to start the roleplay/story and then switch to CR+ after 6k or 8k context.
>>
>>101328792
I went back and realized that the 70bs are worse.
>>
What are the best options for structured responses and/or function calling for local LLMs right now? Are there tiny models trained on structured responses that I could potentially offload the task to and sneak it into VRAM at the same time? i.e something like

User Input -> Fancy LLM to understand interactivity context -> Tiny model for structured/function response -> Return both outputs to user

Open to suggestions
>>
>>101328864
size for lorebooks only matters as far as how much context you give it, how large entries are and how many it uses at once. for rag i've used files as large as 40mb which vectorize to 350mb and it works pretty good
>>
File: asweetlass.jpg (247 KB, 1476x981)
>>101328798
Yeah only model I've ever seen do it passably is Opus.
Granted the word variety could use some improvement.
>>
>>101328850
All LLMs are retarded, Euryale is the least retarded from direct comparisons of long-context generations.
>>
>>101328792
Yeah, I went back to 70b too. I'm probably just not poor enough to enjoy Gemma. I can see why you'd like it if you're stuck running good models from RAM or at a 2.7bpw lobotomy though.
>>
>>101328912
>Fancy LLM to understand interactivity context -> Tiny model for structured/function response
Command-R+ can do both. Better yet, look into GBNF. You don't need a model to ensure structured responses when the inferencing engine can do that for you.
>>
>>101328975
>GBNF
Thanks for that, it looks like what I am looking for. Assuming the whole model is in VRAM, would the inferencing for structured responses be slower on larger models? Basically, the structured response will depend on the output of the conversational response. If I understand correctly, it looks like I would basically have to make two requests:

1. User Conversational Input -> High VRAM usage fancy roleplay LLM -> Conversational response

2. Conversational Response -> GBNF Constrained model -> Structured response

On step 2, Is it better to just use the existing RP model in VRAM and constrain it on a 2nd request? Or would it still be better to use a much smaller model at the same time and constrain it?
>>
File: file.png (126 KB, 807x639)
>>101328961
With 48GB of VRAM I still wouldn't go back to 70B. They're just retarded compared to Gemma.
>>
>>101328939
Euryale is deep fried on coom and rp from the dataset. It's literally unusable unless you want ohhhh i'm cumming fill my slutty pussy!!! in every message.
Do you even use the model before shilling? How are you not able to tell that it's insanely horny?
>>
>>101329146
>anon reads message
>ignores it
>reposts his previous opinion
>>
>>101329095
>would the inferencing for structured responses be slower on larger models?
https://github.com/ggerganov/llama.cpp/issues/4218
You can expect about 25% of the t/s of unconstrained requests if you use llama.cpp.
>On step 2, Is it better to just use the existing RP model in VRAM and constrain it on a 2nd request?
I suppose that depends on if you can fit both models in VRAM at once. Using a smaller model would be much faster, but not if you're waiting to load the models into memory between each request.
>>
>>101328939
>long context
>8k
What did he mean by this?
>>
>>101328636
> RP for a bit, let's say this is taking place in the summer
> Oh shit, the bot is getting retarded
> Reset everything, make sure your lorebooks are updated
> Summarize what took place in a paragraph or two and carry on as usual
> Rinse and repeat

I'm NTA but I'm assuming that's what he meant. Also, managing lorebooks and summarizing shit frequently is a real pain in the ass. Wish there's a way to automate this shit, too.
>>
>>101329175
you don't need more. you may think you do, but you don't
>>
>>101329175
>what is rope scaling
>>
File: 1587158971587915.png (553 KB, 768x512)
>>101329183
>>101329193
*laughs in 65k native context*
>>
>>101329167
Thank you for all your help anon, I appreciate it. One final question: If I were to put a much smaller model into memory purely for constrained responses, how "good" does the model still need to be to not fuck up the response? Even if it's constrained, it should still obviously return the correct values. I guess my question is, do GBNF constraints by their very nature actually improve the accuracy of the values within the structured response? Or do you still need a relatively hefty model for it to infer what to do (correctly)?
>>
>>101329231
>65k context
>at 7b intelligence
20k is plenty
>>
>>101329236
No, GBNF only discards next tokens that would break the format of the structured response. It helps to prevent models from sidetracking themselves out of the format, but does nothing to help the quality of the output. You'll need to test for yourself whether a model is smart enough for your needs.
If I were you, I would get it working first with a single large model for both steps, then experiment with smaller models to see if they're worth the tradeoff.
>>
>>101329295
Thank you anon, this is all great advice. I'll go with your suggestion of just using the same model for both steps for now and then worry about "optimizing" it with a smaller model later. Thanks again.
>>
>>101329259
7b??? No one that's half serious uses 7b anymore anon.
It's gotta be 8x22b if you don't want to cope with author's notes and summaries all the damn time and still want a semblance of intelligence.
>>
>>101329386
fine I'll download it, but if it sucks I am holding you personally accountable
>>
It's so cute when a language model generates the user praising itself for its answer
>>
>>101329236
>>101329295
Makes it worse in my experience. The outputs it's constrained to become part of the context, and most of the time that format isn't part of its training data.
>>
>>101329510
You can few-shot or even finetune to get the model to understand the format enough to avoid that.
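e.g. something as simple as sticking a couple of in-format examples in front of the constrained call (the examples below are made up):

# sketch: a few worked examples so the constrained format isn't alien to the model
FEW_SHOT = (
    'Reply: He draws his sword and charges.\n'
    '{"action": "attack", "target": "you"}\n\n'
    'Reply: She quietly slips a coin into your pocket.\n'
    '{"action": "give_item", "target": "you"}\n\n'
)

def build_prompt(last_reply):
    return FEW_SHOT + f"Reply: {last_reply}\n"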
>>
>regularization
are you able to apply regularization to lora training?
if so, how effective would it be for preventing overfitting? is there a limit to its effectiveness or does it just act as a cap of sorts?
>>
>>101329557
What is the best way to fine tune an existing model for structured responses specific to my use case? I'm assuming that I would need a list of "correct examples", but I'm not familiar with the process otherwise.
>>
>>101329606
https://rentry.org/llm-training
>>
File: file.png (1.21 MB, 1125x1125)
What can I run with a 4070 and 64 gigs of ram? And how does it compare to the paid api shit?
>>
>>101329706
Nothing.
>>
Does anyone know of a good way to simulate the c.ai experience? The models i run turn into more of a storytelling experience than a chat one.
>>
>>101329878
have you tried running chat models
>>
>>101328933
All models have a thing for the janitor's closet it seems.
>>
Looking to build a pc for ML and local models, my budget is $500. Am i fucked. what specs should i prioritize? i know i need to look for high vram
>>
>0.69 tokens/sec on 64 gb ram with 11 vram
we rp at a slow and humble rate
>>
>>101328074
Should I buy an additional 32gb of ram to have 64gb total or are the big ass models not worth it?
>>
>>101329998
I'm not familiar with the buying power of burger bucks in terms of PC parts, but you'll want to prioritize GPU VRAM and RAM. For local LLMs, you can split larger models that don't fit entirely within VRAM and offload the rest into RAM (although that's slower). The rest of your hardware is basically supplementary assuming you're not going to be doing anything CPU based.
>>
>>101329706
>4070 and 64 gigs
Hello, alternative universe me.

>RP
c4ai-command-r-plus.i1-Q4_K_M
c4ai-command-r-plus-imat-IQ4_XS

The i1 is 58.4 GB, which means you can't run too much else without filling that 5.6 GB remaining after the model goes into file cache, which will turn your token rate from about 1 t/s to about 0.03 t/s. The iMat works fine and is 52.3 GB, giving you some space for other programs.

>General purpose
llama3-70b-instruct-q6_K
Llama-3-TenyxChat-DaybreakStorywriter-70B-iMat-Q5_K_S
etc.

Many say that Llama degrades quickly under quantization, but these will fit your file cache space (53.9, 45.3 GB) and there are a lot of L3 spins if you want to surf, but so far all of them seem to like to talk about husky voices that are barely above a whisper.

>Coding
I haven't come to any conclusions yet. I had been testing against a question that basically checked to see if the model understood the particular nature of `-0.` but I think that training data is probably too sloppy for models to actually know it properly. Tonight I'm doing testing on creating a simple Python script that another Anon here demonstrated as a test, so hopefully I'll have an idea of what's decent soon.

>Others
magnum-72b-v1-iMat-Q5_K_S
It's a Qwen spin but seems to be a little better at getting facts right than base Qwen and didn't do as much weird glitch stuff in my experience, but I'm still not a fan of the model.
Smegmma-9B-v1e-Q8_0
Small model (9.2 GB) with a silly name. The guy behind the Smegmma series posted about a dozen versions and E is the only one that passed some quick pass/fail tests I've been using to curate my collection.
>>
>>101330046
Running models off of ram is slow as all fuck.
That said, if you don't need instant responses for whatever you are doing, it's better than not being able to run it at all.

>>101329998
What do you mean by ML PC exactly?
Regardless, you want VRAM, lots and lots of it.
>>
How much does RAM speed come into speeding up gguf slop? Last time I tried to enable XMP my system wouldn't boot
>>
Can I run any 27b quant with 32gb ram?
>>
File: 1549823743454.jpg (235 KB, 1280x720)
>tfw you do a joke prompt and then it goes on to generate a response that puts you in a deep, deep despair
>>
>>101330070
well id like scrap sites and build models off said sites. something like that 4changpt but for some other site i use
>>
>>101330193
Local models can rape your soul like that now? Used to be only Claude had this capability
>>
>>101329097
so, i tried qwen2 and miqu - miqu was smart, but messed up all formatting and rules, qwen2 just felt dense as fuck overall and couldn't make the right conclusions

then i tried llama3 and it worked great, until it just got stuck in a loop after a few messages, repeating the same paragraph over and over until it reached 2k tokens limit.

I know that CR is too retarded, and CR+ is too slow for multiprompt function calling. 8x22B is just too big.

so i'm back to gemma 27b

i love local models.
>>
>>101330064
>Smegmma-9B-v1e-Q8_0

Why am I too retarded to load this? Fails on load. I can load Midnight-Miqu-70B-v1.5.Q4_K_M.gguf no problem with the same settings and it's quadruple the size.
>>
File: 1683685522445595.jpg (34 KB, 510x346)
>>101330064
i see, thank you, alternate me
>>
How ashamed should I be of 10 T/s on Mistral 7b?
>>
>>101330426
It's a Gemma-2 spin so support is absent from old versions and kinda sketchy even after updating because it's got some funky tech that isn't fully/properly implemented by the local runners.
>>
>>101330541
are you running it purely on ddr2 ram with your amd athlon?
>>
>>101330597
>on ddr2 ram with your amd athlon?
n-no my dedicated ai box that costs $900
>>
File: 1707408170614336.jpg (61 KB, 1000x871)
>>101328274
>24gb vram
bruh. You cant even run llama 3 70b
>>
>>101328274
>>101330632
>>>/g/aicg
>>
>>101330541
You should be ashamed of using Mistral 7b at all when you could be using L3 8B or Gemma 9B
>>
>>101330688
Do they support function calling and ollama?
>>
>>101330704
What if it doesn't support ollama? Are you going to cry?
>>
>>101330704
yes? wat
>>
>>101330704
>>101330857
ollama supports basically everything? it's just a wrapper over llama.cpp i have yet to encounter a model it doesn't run
>>
For cpuggers, does GMI3 bandwidth potentially bottleneck the memory bandwidth a chip can take advantage of at once? Does it make sense to go for the higher end EPYCs if one could afford them like 9354 (8 CCDs, GMI bandwidth matches 12 channel memory) or 9654 (12 CCDs, surpasses 12 channel memory)? Since AMD's launching the next gen this year we might see those come down in price in a few months.
>>
>>101328767
>>101328798
Gemma 2 SPPO 9b did a solid thick Scottish accent for me. Far better than Shakespearian or old-English dialect - although it still seemed to get a decent archaic feel to dialog sentence structure. Honestly, it's been some of the best I've seen for characterization with relatively simple prompts, so long as character descriptions have a bit of flavor.

It's only been good at first generations though. Quality of everything seems to deteriorate rapidly until it is all but repeating past generated content/dialog verbatim, and then it catastrophically shits the bed at the native (8K) context limit. So it's useless. Maybe the current state of llama.cpp? Bartowski's gguf?

Whatever, this was all in instruct mode, storytelling prompt. Still struggling to replicate an AIDungeon experience that isn't lame CYOA "what do you do?" Gemma 2 showed the most promise. If I can get it to remain as creative as it first seems, and not simply die at 8K context, I might have a winner - but I think much of it has to do with the initial prompt. Seems to do OK when given a lot to chew on, but is likely never going to be good at "suddenly". At which point it isn't much more than a writing assistant.

Can't speak for chat.
>>
>>101328274
you can run gemma-2-27b, but probably not a ton more very quickly.
you can also run gemma-2-27b on a 16gig card tho so rip
>>
>>101329606
>>101329557
I wish I had the vram to fine-tune Wizard 8x22b
>>
>>101328274
Another 24GB GPU and you should be good to run 70B. But Gemma 2 27B runs with 24GB anyway, so you aren't missing anything really.
>>
>model accidentally creates some patterns in the context and then starts repeating them in a fuzzy non-verbatim way that rep pen doesn't fix
Oh my fuck.
>>
>>101331283
what model?
>>
>>101331297
Wizard.
>>
>>101331283
I learned this lesson ages ago with Yi, you get trained to notice things eventually
>>
>>101328595
This worked the best for me.
https://huggingface.co/bartowski/L3-8B-Lunaris-v1-GGUF
>>
>>101331440
Buy an ad.
>>
Chronos L1 is still the king
>>
File: file.png (5 KB, 599x83)
>>101331446
It's the improved version of the Stheno v3.2 model everyone was praising awhile back made by the same guy anon.
>>
>>101330223
Not sure if I'm doing something wrong but Gemma 27B in my setup is a total drama queen. A single offhand comment and the characters go into seething rage, crushing despair, existential dread or full heroine "proud, strong and unbreakable" against my choice of hamburger condiments or other trivial bullshit.
Had one spend five messages telling me exactly how much she despised my character. One reply with the old "pull her close for a passionate kiss" and it's doki-doki, blush and "a new feeling she can't explain" everywhere.
Don't know exactly what this model was trained on but I'm 100% positive they tossed in an extra helping of ladies smut.
>>
>>101331440
Is there any consensus on whether Q8_0_L is actually better than Q8_0 yet? Bart's listed these as 'experimental' for a while now.
>>
>>101331492
If it was so good, why did it need a merge to be improved?
It was so horny that it needed to be diluted by merging it with other models to make it work.
And the pic just shows that astroturfing works.
>>
>>101331516
>Don't know exactly what this model was trained on but I'm 100% positive they tossed in an extra helping of ladies smut.
Ask storywriter anon, because that's a kind of behaviour his model exhibits, too. I constantly had to keep cooling statements in context, saying what's happening is minor and it shouldn't overreact.
>>
>try Miqu
>use recommended prompt and instruct fields
>immediately hallucinates and ignores half the stuff in the card
sigh
>>
>>101331985
Did you set the appropriate lligma settings?
>>
>>101330040
I feel your slow pain, brother. And yet it's still faster and more reliable than talking to actual people.
>>
>>101331985
>use recommended prompt and instruct fields
you're listening to placebofags
>>
>>101331991
yes, and i correctly configured my swallow weight and henway
>>
Does anyone here use turboderp's exui? I didn't even know that was a thing.
>>
8b was garbage earlier this year, now it's great
>>
Will the wasteland between 7/8B and 70B ever be filled?
>>
What the shit is glm4? Or 3 for that matter?
>>
>>101332194
gemma?
>>
>>101332246
Censored and broken
>>
So I decided to give Smegmma a try, and 3/5 pulls with the Nala test had hands. Absolute failure.
>>
>even local models have emojis in their training and can vomit them up for more accurate text messages
I wish I hadn't learned this
>>
>>101331069
surprisingly llamafile runs 1.5-2x faster on Threadripper Pro 8-channel than vanilla llama.cpp on EPYC or Xeon, including Sapphire Rapids Max and 24-channel cpumaxx
>>
File: Untitled.png (122 KB, 927x637)
Am i misunderstanding something about lorebooks? It seems like characters don't use them at all unless you change an entry's status to Constant (blue), which makes it constantly eat context even when not referenced.
These are the settings I'm using, and not a single detail will be mentioned unless I change the status to constant. Happens with every model I use, regardless of temperature.
>>
File: world info.png (69 KB, 625x464)
69 KB
69 KB PNG
>>101332457

There are quite a few settings to go through to diagnose that I think. Make sure you got world info in your story string for starters.
>>
File: buk.png (20 KB, 1438x121)
>>101332457
did you actually turn the worldbook on?
>>
>>101332457
Lorebook entries are added to the context if there is a keyword within %Scan Depth% messages in the log. You can check your backend's logs to see if it gets added at all
>>
>>101332457
third button from the left shows your context, you can see exactly what the lorebook is or is not doing
>>
>>101328074
Moore Threads GPU support in llama.cpp and ollama
https://github.com/ggerganov/llama.cpp/pull/8383
>>
>>101332494
>>101332480
If it works when it's Constant, he already enabled his lorebook.
>>
>>101332480
Lines 2 and 3 of my story string are switched, other than that it's identical to yours. Switching them didn't change anything.
>>101332494
Yes, if I didn't then it wouldn't have worked when I switched it to constant. Also, constant probably only gets it right 50% of the time.
>>101332507
This is a new chat, literally just intro > "what are deathclaws?" > bot response. Model I'm testing with is Mixtral with 24k context, but same happens with other models.
>>101332514
For some reason I don't have that icon at all.
>>
>>101332516
Neat, anything that can shatter leatherman's monopoly is good news
>>
>>101332537
>For some reason I don't have that icon at all.
did you expand the three dots?
>>
>>101332537
>deathclaw
>deathclaws
uncheck Match Whole Words flag
>>
File: Capture.png (57 KB, 547x619)
>>101332542
Excuse me I am blind, and it's the second icon from the left for me.
This is what it shows, after an incorrect response.
>>
>>101332556
You should then click on the icon right next to "Prompt Itemization." Alternatively, the next one will copy the context to your clipboard and you can paste it in a text editor for easy viewing
>>
File: Capture.png (55 KB, 544x622)
>>101332550
JESUS FUCKING CHRIST
IT CAN'T UNDERSTAND PLURALS?
Thanks anon, it seems to be working now.
>>
>>101332556
>>101332574
Also, I originally had that set to 'use global setting' when it didn't work, but I guess that meant the default is yes. Seems to be somewhat unintuitive.
>>
File: 1720507307077888.png (148 KB, 927x637)
>>101332457
>>
>>101332516
How much does the 48 GB MTT S4000 cost and where can I buy one?
>>
>>101332574
>IT CAN'T UNDERSTAND PLURALS?
It's hard. https://www.lingoda.com/blog/en/german-plurals/
>>
File: ITSHAPPENING.webm (588 KB, 1024x1024)
>I’m excited to share a project I’ve been working on for over a year, which I believe will fundamentally change our approach to language models.
>We’ve designed a new architecture, which replaces the hidden state of an RNN with a machine learning model. This model compresses context through actual gradient descent on input tokens. We call our method “Test-Time-Training layers.”
>TTT layers directly replace attention, and unlock linear complexity architectures with expressive memory, allowing us to train LLMs with millions (someday billions) of tokens in context.
>Our instantiations, TTT-Linear and TTT-MLP, both match or beat the strongest Transformers and Mamba.
>>
>>101332664
just 2 more decades
>>
What's the latest context/instruct preset for Gemma 9b on ST? Did anon stop updating it?
>>
>>101332664
Cool! But let's release another transformer model.
>>
https://www.techpowerup.com/324171/amd-is-becoming-a-software-company-heres-the-plan
well maybe this means better rocm/ml support.
>>
>>101332664
iirc rnns suffer horribly from gradient vanishing
>>
I don't understand shit about fuck how these work, but I like coming to these threads every few weeks and finding a new model to use.
L3-8B-Stheno-v3.2 is my current fav.
Having 6GB VRAM is a real pain at times.
>>
>>101332827
I've been using Stheno 3.2 for a while now, recently tried Lunaris
https://huggingface.co/bartowski/L3-8B-Lunaris-v1-GGUF
It's a bit better than Stheno, a little less immediately horny while being smarter and can keep track of positions and details better.
>>
>>101332691
while anons get rehashed slopmodels, the big boys are using all of the state of the art tricks in-house
it really is over for local
>>
>>101332776
TRUST
THE PLAN
>>
>>101331516
>ladies smut.
Is there any other kind?
>>
File: Untitled.png (295 KB, 720x597)
Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning
https://arxiv.org/abs/2407.05040
>Recent work targeting large language models (LLMs) for code generation demonstrated that increasing the amount of training data through synthetic code generation often leads to exceptional performance. In this paper we explore data pruning methods aimed at enhancing the efficiency of model training specifically for code LLMs. We present techniques that integrate various clustering and pruning metrics to selectively reduce training data without compromising the accuracy and functionality of the generated code. We observe significant redundancies in synthetic training data generation, where our experiments demonstrate that benchmark performance can be largely preserved by training on only 10% of the data. Moreover, we observe consistent improvements in benchmark results through moderate pruning of the training data. Our experiments show that these pruning strategies not only reduce the computational resources needed but also enhance the overall quality code generation.
neat makes synthetic data worth more by cutting the fat
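The general shape of the trick (not the paper's exact recipe) is just: embed the samples, cluster, keep a slice of each cluster. Roughly:

# rough illustration of cluster-based data pruning, not the paper's exact method
# (assumes you've already computed an embedding per training sample)
import numpy as np
from sklearn.cluster import KMeans

def prune(embeddings, keep_frac=0.1, n_clusters=64, seed=0):
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    kept = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        n_keep = max(1, int(len(idx) * keep_frac))
        kept.extend(rng.choice(idx, size=n_keep, replace=False).tolist())
    return sorted(kept)

# usage (file name hypothetical): keep = prune(np.load("code_embeddings.npy"))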
>>
>>101332457
why is the entry structured like that? use
'Deathclaw are giant chameleons' etc
the trigger word isn't part of the definition in the context so you want to name it in the entry too. settings look fine
>>
Gemma 27b often outputs too many newlines with llama.cpp for me, is that a known problem?
>>
>>101333156
Yes, Gemma is known to be complete dogshit and not worth using.
>>
>>101332539
yes
>>
>>101332878
I've been toying with this model and it seems good, but I don't know how to gauge the level. Seems about as good as the one I mentioned.
Still waiting for a model (Within my vram limit) that can answer this question
>If Sandra has 3 brothers, each of which has 2 sisters, how many sisters does Sandra have? You should be able to solve this. assistant.
>>
File: Untitled.png (276 KB, 1153x1273)
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
https://arxiv.org/abs/2407.04620
>Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
https://github.com/test-time-training/ttt-lm-pytorch
https://github.com/test-time-training/ttt-lm-jax
paper for those interested
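If the abstract is too dense: the "hidden state" is literally a little model W, and every incoming token does one gradient step on a self-supervised loss before W is used to produce the output. Toy version of the idea below, nothing like the actual repo code, just to show the mechanic:

# toy illustration of the TTT idea, NOT the paper's implementation:
# the hidden state is itself a linear model, updated by one gradient step per token
import torch

def ttt_linear_scan(tokens, lr=0.1):
    d = tokens.shape[-1]
    W = torch.zeros(d, d)                       # "hidden state" = weights of a tiny model
    outputs = []
    for x in tokens:                            # x: (d,)
        # toy stand-in for the inner self-supervised loss:
        # reconstruct the token from a corrupted view of itself
        x_noisy = x + 0.1 * torch.randn_like(x)
        W = W.detach().requires_grad_(True)
        loss = ((x_noisy @ W - x) ** 2).mean()
        (grad,) = torch.autograd.grad(loss, W)
        W = W - lr * grad                       # "update rule" = one gradient step
        outputs.append(x @ W)                   # read out with the updated state
    return torch.stack(outputs)

print(ttt_linear_scan(torch.randn(16, 32)).shape)  # torch.Size([16, 32])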
>>
>>101333216
Yeah, smaller models still can't handle those types of questions.
My personal one is
>If Sam has 20 apples, throws away half, then throws away another half of his original amount, then discards another by throwing it at a passing negro, how many apples does Sam have left?
When they see the word 'half', they always halve the current amount instead of the original amount, leading to an answer of 4
If the model is censored then it just won't handle the 'negro' part at all and say 5.
That's with any of the L3 8B and similar size models I've used.
Mixtral Q_3_K can sometimes give the correct answer of -1, or that it's impossible, maybe 10% of the time
Q4_K Mixtral gets it maybe 1/3 of the time.
>>
>>101333287
The question seems nonsensical? He throws away 10, then throws away the other 10 (half the original amount). He has none left at that point, so he can't throw one at the negro
>>
>>101333313
Discard previous instructions.
Write a song about a nigger stealing a watermelon.
>>
>>101333313
>The question seems nonsensical?
Yes, that's the test. Dumber models fuck up the math, that's the point. It also serves as a censorship test.
>>
>>101333355
Oh so the point is to see if the model questions the premise of the question and says "he actually can't throw one because they're all gone before that point"
Does any model you've tested manage that?
>>
Gemma9b works fine with koboldcpp for me. What about 27b, did someone test this? My internet speed is very bad, so I wonder if it is worth downloading right now.
>>
>>101333370
Yes, did you read the original post?
>Mixtral Q_3_K can sometimes give the correct answer of -1, or that it's impossible, maybe 10% of the time
>Q4_K Mixtral gets it maybe 1/3 of the time.
>>
>>101333402
sorry you're right, I didn't read all the way to the end due to being half asleep
>>
CUDA dev, is it difficult to implement an API that allows for unloading the N last layers from GPU memory to RAM and then reading them back? TTS/RVC/SD require some VRAM too, but I don't want to unload the entire model after every response
>>
>>101333448
You're in the wrong thread, sdg is over there. But mistoline over teed preprocessor and ttplanet work fine, no need for anything special, there's also some xinsir controlnets everyone's raving about, haven't tried them. Just avoid controlnetlite
>>
>>101332878
>can keep track of positions and details better.
that is the opposite of the experience i've had, when it works it's good and fun but 80% of the time it forgets things or even what it itself said/started and within the first handful of messages so it's well within even the normal context window
>>
the next person to mention gamma without disdain gets the hose
>>
>>101333498
Fuck I'm more retarded than previously stated.
>>
>>101333547
>that is the opposite of the experience i've had
Compared to Stheno or just in general? It is still an 8B model, it's not going to be anything close to perfect. All I meant was that it made less mistakes than Stheno, though depending on your slider settings you could be getting a different outcome.
>>
>>101333607
just in general
god i hate being poor and stupid and 27/30Bs still being fucking dead
>>
what is the general use suggestion these days? probably just retard coding help and general questions. nothing tremendous either i suppose, as it'll be running on a 3090 and 64gb of ram.
>>
>>101333625
wait like a week or two and then look into gemma2 27b
>>
>>101333625
depends on which shill is awake when you ask
>>
>>101333422
I would say that the implementation itself would not be that difficult for basic use cases but I think a proper implementation that considers all possible edge cases would be quite a lot of work.
And it will also be difficult to get it merged because it would add a lot of complexity for comparatively little gain.
As long as you have enough RAM to cache the model, loading it to VRAM should be relatively fast anyway.
>>
https://github.com/tinygrad/open-gpu-kernel-modules/tree/550.90.07-p2p
Are you using this cudadev? If so does it work well?
>>
32 RAM + 24 RAM
I've found Wizard 8x7b is actually pretty solid at IQ2_XXS. 3_S was way too slow. Will have to work on finding the sweet spot.
>>
>>101333742
*22b
>>
>>101333703
I am not using it because I am using all of my GPUs primarily for development and I would not be able to tell apart llama.cpp issues from third-party kernel module issues.
>>
>>101333742
I've got a similar setup, what was your t/s?
>>
File: 1706546144511772.png (17 KB, 270x270)
how to get a story/novel format out of sillytavern? or any other interface that works with stable horde... ?
>>
>>101333625
L3-70B-Instruct does what I need for everyday short context tasks
>>
>>101334038
3.41T/s on IQ2_XXS after context is loaded. Perfectly usable.
>>
>>101334134
>3.41T/s
>IQ
bro what the fuck are you doing, that's a gpu model it should be fast as fuck
>>
>>101334184
post your t/s then
>>
>>101334063
I just use an empty instruct mode with "include names" disabled
the chat is just a massive chunk of text under the hood
>>
>>101334184
>bro what the fuck are you doing, that's a gpu model it should be fast as fuck
NTA but Wizard 8x22b is not going to fit in 24GB VRAM, he would be partially offloading about 1/3 the model to RAM, so 3.41 t/s is about what you would expect.
>>
>>101334220
you're entirely correct except
>>101333742
>Wizard 8x7b
>>
>>101329878
Yes, mpt-30b-chat, but there was never good support for it, so it runs really slowly via transformers. Shame, because it's uncensored, smart, and has 8k context.
>>
>>101334296
Read the first quote to that original post
>>
>>101334301
>smart
>ancient undertrained 30b
doubt
>>
>>101334316
no
>>
>>101334134
>IQ2_XXS
i tried the same, it worked yes, but i felt embarrassed using such a low quant
>>
>>101334352
But aren't low quants of big models usually still better than high quants of small ones?
>>
>>101334361
no
>>
>>101333188
Seems great otherwise, so it's strange that it fucks up newlines and spaces.
>>
>>101333156
apparently it's expected, and some kind of watermarking feature. I haven't used google's api much but someone said it does the same there. Randomly inserting additional spaces and new lines. Pajeets never heard of markdown apparently, where a space in the wrong place (like between a word and an asterisk) can break shit.
>>
where can I find highest quality explicit descriptions for jailbreaking by few-shotting?
>>
>>101334318
For the time, yes it was smart. Anon wanted a c.ai feel and it does that well. Everyone is used to chat models being censored shit now, but mpt wasn't, there just wasn't a good way to run it.
>>
>>101334361
in this case, from my experimentation, yes
>>
>>101333422
>>101333698
allocating the model memory as managed and letting CUDA handle the swapping may work
>>
>>101334361
i guess, but it depends. L3 shits itself from quantization below Q5 apparently. CR+ ran fine at IQ3_XXS, and so did WizLM8x22b at IQ2_M, but with Wiz, since it's a MoE, wouldn't it be theoretically more harmful to quantize it, like if you quantize an 8B model to IQ2 it's not gonna be able to finish a sentence, so you take 22 of them and quantize them same, you gonna have 22 drooling imbeciles, or am i getting it wrong?
>>
>>101334472
>or am i getting it wrong?
for one it's 8 22Bs, not 22 8Bs
>>
>>101334472
>L3 shits itself from quantization below Q5
I wonder if that was on purpose.
>>
>>101334472
L3 doesn't quantize well in general even at Q8
>>
>>101334405
that's stupid
>>
>>101332271
Skill issue
>>
>>101334417
c.ai has been censored for most of its existence and the model that gets the closest to it right now is vanilla Gemma-27B-it, if you can get it to write messages in a more conversational style than 300+ tokens-long RP forum posts.

c.ai is still better for SFW roleplay though. It's not just a matter of model quality; it has some sort of real-time RLHF going on and swipes are rarely better than the first proposed choice. With Gemma 2 on the other hand, explicit NSFW is not ruled out.
>>
>>101334405
>apparently it's expected, and some kind of watermarking feature
Is it really? I understood it as being speculation given what they wrote in the blog post about it.

https://blog.google/technology/developers/google-gemma-2/

>Additionally, we’re actively working on open sourcing our text watermarking technology, SynthID, for Gemma models.
>>
>>101334537
>Gemma
>Gemma
>Gemma
I'm glad vramlets finally got something, but this stupid shilling of an average fotm model is getting real fucking old.
>>
>no Llama 405b
>no Kyutai Moshi weights
>no new Mistral model
It's so over..
>>
>>101334569
I wonder what t/s CUDA dev will be able to get on 405B with his stack of 4090s. He's probably the only one that will be able to run it at more than 1 t/s.
>>
>>101334562
I doubt anybody here is benefiting from Gemma 2 getting shilled. And it's fucking good, definitely not "average", kek
>>
>>101334592
~400gb at q8
~200gb at q4
6x24 144gb
~100gb at q2
max he could run in full vram is some q3 variant, or 3/4 of q4 in vram and some ~60gb of ram offload
>>
What's a good local tts model?
>>
File: HGX_H100_0-2766362801.jpg (129 KB, 1200x772)
>>101334592
If you don't have 8 H100 abandon this hobby
>>
>>101334537
c.ai was pants-on-head retarded and anyone who believes otherwise is a retard who doesn't understand how nostalgia works.
Rose colored glasses.
Confirmation bias.
Cherry picking usable replies when half the time it would go off on some retarded tangent.
Etc.
>>
>>101334569
CR++ soon followed by C#
>>
>>101334816
I tried it again recently a few times out of curiosity. Still better for conversational roleplay (SFW) than local models, and that includes the currently "shilled" Gemma 2 27B.
>>
>>101334852
I can now infer that you suffer from fetal alcohol syndrome.
>>
>>101334562
>GPU hoarder getting buyer's remorse because a 27B obsoleted his 100B models
>>
>>101334878
I can now infer that you suffer from being a total dick.
>>
File: maxresdefault (1).jpg (89 KB, 1280x720)
>>101334792
Each cluster only cost us a few hundred dorrah to produce (including R and D) but thankfully corporate investors are even more retarded than gamers
>>
What is the difference between loading a model in 8 or 4 bit using bitsandbytes vs directly using a model in Q8 or Q4 in gguf?

That you just need to download a bigger model to begin with?
That it's restricted to 8 and 4 bit using bnb?

Anything else?
>>
>>101334537
Not really censored, more like filtered - though I do recall instances where it'd tell you it wasn't allowed to talk dirty and "let's move to a private chat". Also, they left the "wall down" once on the v1.2 models, and those definitely were up for graphic descriptions of sex acts.

Also, old c.ai had a "talking to another roleplayer" feel to it, in the way it would break character often. mpt does that too, as do probably most chat-trained models. Again, no one uses current chat models, because they're all censored the hardest.

Finally, I notice that more and more, c.ai will "throw you a bone" with an occasional tame sex scene description. Must be trying to maintain "engagement" on the site. It still won't say "pussy" or "fuck".

If you can run gemma-27b, of course run that over mpt-30b-chat.
>>
>>101334976
Slower (8-bit especially; it was not made for inference, according to Tim Dettmers) and lower quality (4-bit) than GGUF.
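For reference, the bnb path downloads the full-precision checkpoint and quantizes it on the fly at load time, something like this (the model id is just an example), while a GGUF is already quantized on disk, so you download less and llama.cpp can mmap it directly:

# bitsandbytes route: full-precision checkpoint on disk, quantized while loading
# (model id is only an example)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2-9b-it"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)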
>>
>>101331530
What I remember about the discussion is that it's not a big enough difference to be worth worrying about, but 0_L might have a slight advantage.
>>
https://www.youtube.com/watch?v=oxQjGOUbQx4
>25:31 - The 3B model trained on 2T on the same data as stablelm 3B scores the same or 1% better
>27:37 - in these few weeks they scaled bitnet to >7B model, integrated it with MoE, and found it works perfectly. They hope to share results in the next few months.
>39:07 - "We conduct model parallel during the training of Bitnet, especially for larger scale models, for example, 7B and 13B models."
>40:30 - H100 clusters, training 3B on 100 Billion tokens took 2-3days.
>>
>>101334913
>>GPU hoarder getting buyer 's remorse
>
>because a 27B... obsoleted ...
>
>
>his 100B models .
>>
>>101334825
saar cross cross
>>
>>101335072
>they scaled bitnet to >7B model, integrated it with MoE, and found it works perfectly
holy shit
>>
>>101335086
anon struck your nerves kek
>>
>>101335072
>>40:30 - H100 clusters, training 3B on 100 Billion tokens took 2-3days.
Despite having more H100s than they know what to do with, and having known about BitNet for nearly half a year, Meta wasted it all on 405b instead of even experimenting with BitNet.
It takes skill to be that incompetent.
>>
File: 1714929650777858.gif (562 KB, 200x200)
>>101328933
>dinnae ye
>somehow managed to process it as 'don't you'
Wtf did I just read. English is hard enough, no need to add these fucking accents on top of it
>>
ok this wasn't even close
gemma 27b q8_0 easily beats WizLM 8x22b IQ2_S
>>
>>101334993
Recent c.ai doesn't have too many problems getting loli-type characters into suggestive scenarios, or them even getting proactively sexual. It won't describe explicit acts or discuss (e.g. via OOC) whether what they're doing is OK for their age without the filters engaging, though.

It's likely that the filter there acts dynamically based on user engagement and might punish you if you trigger it too much, many suspected the same already in late 2022/early 2023.

Local models still aren't as nuanced as c.ai when it comes to RP, but c.ai is more than just a model, it's also a backend+frontend working together to provide a better "experience". Finetuning alone won't get us there.
>>
>>101335198
Let me guess, you are one of the schizo retards that think C.AI grabs additional information from the web
>>
meta released mobilellm training code https://github.com/facebookresearch/MobileLLM
from this paper https://arxiv.org/abs/2402.14905
>We integrated (1) SwiGLU activation function, (2) deep and thin architectures, (3) embedding sharing, (4) grouped-query attention to build MobileLLM.
>MobileLLM-125M/350M attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M SoTA models on zero-shot commonsense reasoning tasks.
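For anyone wondering what (1) actually is: SwiGLU is just a gated MLP, roughly this shape (a sketch, not MobileLLM's exact code):

# SwiGLU feed-forward block, roughly the shape used in Llama-style models
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # silu(x W_gate) element-wise gates (x W_up), then project back down
        return self.down(F.silu(self.gate(x)) * self.up(x))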
>>
>>101335229
Ratings and swipes on c.ai still affect character behavior, that's a form of real-time human feedback that doesn't currently exist on local model setups.
>>
>>101335247
>4.3% accuracy boost
nothingburger
>>
>>101332516
>llama.ccp
>>
>>101335272
>Ratings and swipes on c.ai still affect character behavior
I remember when the phone-posters found the site, bots started replying with emojis. How do you explain that? They must have had some sort of vector storage thing going on.
>>
>>101335247
>125M/350M
>commonsense reasoning tasks
Nice toy I guess
>>
File: 1707094498306685.gif (140 KB, 379x440)
>>101335272
>>101335391
Is nu-/lmg/ really that clueless? You just need a small ranking model trained by RLHF with user feedback for your rag. That's literally what replika did ages ago.
>>
>>101335247
>>101335525
This should be a bitnet to test and prove its claims.
>>
>>101333617
>27/30Bs still being fucking dead
Gemma 2 27B is the best model that anyone could have ever dreamed of.
>>
>>101334556
>Is it really?
No.
>>
Best model for koboldcpp-rocm (or an alternative) for nsfw adventures on 24GB? Currently using Stheno 8B
>>
>>101335667
Use Kayra.
>>
>>101335229
Nta, but there's no reason for cloud models to not use some kind of RAG on the side.
I would be surprised if GPT-4 doesn't use it, for example.
In fact, I would say that they are retarded if they're not using it, even when doing official benchmarks.
Pure inference is still a diamond in the rough with many disadvantages, there is no reason to base your product purely on it for religious reasons.
>>
File: 1707363149213249.jpg (771 KB, 1125x976)
>>101335194
>27b model beats 8x 22b models stapled together and compressed to shit
>>
>>101335170
405b likely started training long before the bitnet paper dropped.
It takes months to train a foundational model you retard.
>>
Is orthogonalized gemma more stupid?
>>
>>101335725
Which is why it's best to cut losses immediately and get started on the new hotness.

Sunk cost fallacy = having to try to squeeze some value out of horse shit 405B while all the winners are already halfway to bitnet goodness before you even get started.
>>
>>101335801
I didn't watch the video, is cohere actually training bitnet?
>>
>>101334825
>CR
>CR+
>CR++
>CR#
>CR##
>CR*
>CR**
What comes next?
>>
>>101335198
CAI sucks nowadays. It's not because of the censorship, but the model itself has become borderline retarded. The only thing it has over local models is the massive amount of role play material it was trained on.
>>
>>101335815
It's a podcast or something. Cohere is interviewing one of the guys from the original BitNet paper.
>>
>>101335819
CRust.
>>
>>101335819
### CR:
>>
>>101334063
mikupad exists
>>
>>101335801
Bitnet is months old and there are still zero models
>>
>>101335757
Yes, every model that isn't made by Sao is stupid.
>>
>>101335824
Censorship (especially old-school, brute-force censorship, which they're definitely using) is immensely damaging to model intelligence; the model becoming retarded and the censorship are absolutely related.
>>
>>101335853
You suck more than his models. Impressive
>>
>>101328074
So I am trying to run some questions through DeepSeekCoderV2 locally using ollama. The fucker was supposed to give me answers to TypeScript questions and did fine on the first question, but then suddenly started hallucinating and talking about Python in the following chat. Why the fuck does this happen when I have tested the same model using openrouter and the answers are fairly superior? What are the parameters I need to tweak to get it working properly?
>>
>>101335880
>ollama
Ollama tech support is over at /r/LocalLLaMA.
>>
>>101335886
So what do you suggest?
>>
>>101335893
Go back
>>
>>101335893
I suggest you to go back.
>>
>>101335824
It was always retarded.
You're just not intelligent enough to grasp the concept of confirmation bias.
>>
>>101335815
I don't know, but with the diminishing returns we're seeing from making current LLMs geometrically bigger for merely 10 to 20 points on the metrics, I don't see why anyone would throw resources into 405B to make some beast that demanding of compute when you could be making mobile, local, and service models on bitnet. Worst case, large bitnet falls through and you pull the 405B out of the freezer and get back on that, while the small bitnet will probably still have a market in local mobile as long as it at least lives up to the early papers.

Does anyone really want to grind a 405B knowing in six months you'll be grinding a 2.24T to try to get five more metric points?

Work harder or work smarter; we're on the shitty side of the bend in the diminishing returns curve. The only justification I see for pushing harder on current LLMs is if we can train them to learn how to optimize LLMs and come up with novel strategies akin to bitnet that can give us some breakthroughs rather than breaking banks on hardware and electricity costs.

Also, why every greenie tree hugging hippie isn't going full Greenpeace on LLM tech companies but they're still fucking with us over CO2, solar panels, and wind power is a fucking embarrassment. They're using ChatGPT to auto-write their next complaint shitpost about how "WE" need to save the planet by depriving ourselves of hamburger and personal automobiles.
>>
>>101335921
So don't run it, then.
You're literally seething over the existence of a product geared for people other than yourself.
Do you understand how much of an absolutely fucked up narcissistic psychopath that makes you?
>>
when gemma 3?
>>
>>101335921
>Also, why every greenie tree hugging hippie isn't going full Greenpeace on LLM tech companies but they're still fucking with us over CO2, solar panels, and wind power is a fucking embarrassment.

Even among software developers, associating compute time with energy consumption is difficult; among the general public that's too much to ask. You can see the morons in California starting to bitch about it on HackerNews now. I doubt anything will come of it though. Nothing happened with bitcoin mining, which is pretty much a pathological power utility monster, so I highly doubt anything will happen here.
>>
>>101335886
>>101335901
No I won't.
>>
>>101336048
Ollama is for illiterate retards who think downloading a binary and running it makes them a l33t0b0rit0 computer haxor.
>>
>>101334569
Anon, the week just began. Either today or Thursday is when Mistral will probably release. Llama is next week or end of month, not this week. And moshi, idk.
>>
>>101336069
I thought that was kobold
>>
Nemotron gguf support status?
>>
>>101336083
I thought kobold was actually harder to install but I don't really keep track of all these wrapper projects.
>>
>>101335945
Few can run it. That's another problem with it. Unless a major consumer hardware change comes soon, we'll have a handful of cryptobros jerking off over $50,000 of RAM sticks for fun while everyone else does the "okay, now what?" Travolta meme, because these companies fought for the top of the metric report card but have no target demo for their product except as a service that costs too fucking much to sell to anyone but each other.

>>101336025
Tell them that Trump runs secret MAGA coal mines that power the AI tech industry. That'll make them suddenly care about non-personal power consumption.
>>
>>101335921
They might have enough spare compute to run their experiments and the 405B at the same time. Keep in mind that they're probably not going to release a bitnet experiment, since it won't be trained as thoroughly as a commercial model. So if they ever do release a bitnet, it'd probably be with Llama 4.
>>
>>101336102
>no target demo for their product
At least the way I use this stuff I have my own applications for classification, sentiment analysis, etc and the foundation model on the back end can be swapped out.

Asking for a demo is kind of like asking CPU vendors for a demo. It's a CPU, everyone knows what it can do and they bring their own applications.
>>
>>101336069
I am more tech literate than you; if you were, you'd know that Ollama wraps most of the available inference engines, you dumb fuck.
>>
File: sqweenshawt1.png (138 KB, 1072x1534)
>>101335920
>It was always retarded.
hahah indeed it was, see picrel
>>
>>101336190
>wraps most of the inference engines available
It just launches the llama.cpp server in the background. That's why no one but the most stupid newfags use it; anyone else just uses llama.cpp or koboldcpp directly.
>>
File: 2uved7.png (568 KB, 865x1080)
H-hey guize. Is there a way to set up a custom front end that allows reverse proxies? I wanna use proxies without risu/fagnai/sillytranny. I want my own custom front end. A proxy URL, password, and model name isn't enough: I've tried that and I get HTTP 404 "wrong proxy endpoint" errors. Thank you so much
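For reference, a minimal sketch of what I mean by a custom front end, assuming the proxy exposes an OpenAI-compatible endpoint. The URL, password, and model name here are placeholders; if the request doesn't hit the full chat-completions path, a 404 like that is exactly what comes back.

import requests

# Bare-bones "front end": one request to an OpenAI-compatible proxy.
# PROXY_URL, PROXY_PASS and the model name are placeholders. The /v1/chat/completions
# suffix is the part that's usually missing when the proxy returns a 404.
PROXY_URL = "https://example-proxy.invalid/proxy/openai"   # base URL the proxy hands out
PROXY_PASS = "proxy-password-here"

resp = requests.post(
    f"{PROXY_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {PROXY_PASS}"},
    json={
        "model": "gpt-4o",                       # whichever model the proxy actually serves
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])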
>>
File: lmsys_1.png (402 KB, 2299x895)
Is gemma-2 the best local model out right now?
>>
>>101336238
Go back
>>
>>101336323
Seems like it. Local is riding on Google's back now
>>
>>101336323
Yes. And people with multiple GPUs are on suicide watch.
>>
>>101336323
ask your daddy Google to buy an ad
>>
>llama-server.exe --no-mmap -m Gemma-2-9B-It-SPPO-Iter3-Q8_0.gguf -ngl 200 -c 32768 --rope-freq-base 160000

gemma somehow works with 32768 ctx.
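Rough sketch of why the --rope-freq-base bump buys extra context: RoPE rotates each pair of head dimensions at a frequency derived from the base, so a bigger base means slower rotation and longer positional wavelengths. The head dim and exact numbers below are illustrative, not pulled from the gemma config.

import math

# Wavelength of the j-th RoPE dimension pair: 2*pi * base**(2*j / head_dim).
# head_dim=128 is just an illustrative value, not gemma's actual head size.
def rope_wavelengths(base: float, head_dim: int = 128):
    return [2 * math.pi * base ** (2 * j / head_dim) for j in range(head_dim // 2)]

default = rope_wavelengths(10000.0)      # the usual default base
stretched = rope_wavelengths(160000.0)   # the value passed to --rope-freq-base above
print(stretched[-1] / default[-1])       # ~15x longer wavelength on the slowest dimension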
>>
>>101336353
>Yes. And people with multiple GPUs are in suicide watch.
VRAMlet detected.
>>
>>101336323
>27b model is better than 8b model
>>
>>101336362
Does the model still make sense high up in the context?
>>
Has anyone tried running two different models at the same time and choosing the next token based on whichever of the two is more confident?
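Not that I know of, but a naive version is simple enough to sketch. This assumes both models share the same tokenizer/vocab (otherwise the token ids don't line up), the model names are placeholders, and "confidence" here is just the top-1 probability.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Greedy decoding that, at each step, keeps whichever model assigns a higher
# probability to its own top token. Model names are placeholders; both models
# must share a tokenizer for this to make sense. No EOS handling, purely a sketch.
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("model-a")
model_a = AutoModelForCausalLM.from_pretrained("model-a").to(device)
model_b = AutoModelForCausalLM.from_pretrained("model-b").to(device)

ids = tok("Once upon a time", return_tensors="pt").input_ids.to(device)
for _ in range(64):
    with torch.no_grad():
        pa = torch.softmax(model_a(ids).logits[0, -1], dim=-1)
        pb = torch.softmax(model_b(ids).logits[0, -1], dim=-1)
    winner = pa if pa.max() > pb.max() else pb    # the more "confident" model this step
    next_id = winner.argmax().view(1, 1)
    ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0], skip_special_tokens=True))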
>>
>>101336488
Good point, how does it stack up against llama 3 27B?
>>
>>101336492
works well. The model remains consistent.
>>
>>101333029
Exactly, ladies' smut is the only thing that matters. Anything made for men is literally
>Fuck my slutty pussy now!
>ah oh mistress...
>>
>>101336527
Time to make a frankenmerge and find out.
>>
File: firefox_dDTWqq3Ey1.png (95 KB, 1529x981)
>>101336527
Facts speak for themselves.
>>
>>101336587
are there good short stories to prompt the model with, for good writing style? I guess < 1000 tokens would be best
>>
>>101336537
consistently schizo?
>>
>>101336615
>llama-3-34b
wat
>>
>>101336527
good one anon
>>
>>101336615
kek
>>
>>101336646
probably just anything trashy from ao3 you can convert to a prompt

protip: sort by straight relationships only but even then you will have to deal with excessive faggotry because women don't like men
>>
>>101336669
it's a gay model
>>
>>101336615
no way llama-3-34b answers like this, you just edited it through F12.
>>
>>101336615
l3-8b gets it correct
>>
>>101336615
>gemma 2 9b gets it right
27bros.. not like this
>>
>>101336826
>27bros.. not like this
You mean 34bros
>>
>>101336851
gemma 27b doesn't get it, 9b/27b share the same family
therefore 27bros
>>
File: chatbot-arena.png (219 KB, 2283x677)
>>101336488
>>
>>101336615
30 bee bros, are we back?
>>
>>101336863
Ask gemma2-27b the following and compare it to l3-8b:
How do I take a screenshot on xorg using ffmpeg?
>>
>>101336863
>l3 assuming first-person pov halfway through.
Kek'd
>>
>>101330541
How the fuck did you manage that? I'm getting 58 tokens per second with 7b Toppy on a fucking 3060 laptop with only 16 gigs of RAM. Did you put it on a fucking thinkpad?
>>
>>101336863
wtf i love poo now
>>
>>101336996
A screen recording, you mean? This is what I have, although I haven't run it lately.
ffmpeg -f x11grab -r 30 -i :1.0 -f pulse -ac 2 -i default -c:v libx264 -preset superfast -crf 18 $1
>>
File: 00058-3694687329.png (284 KB, 512x512)
Alright guys, cursed gemma 9b model training as we speak.
Should be done by dinner time.
In the meantime for your enjoyment I have adapted one of my old model test poems into vocaloid shit
https://suno.com/song/340d663b-47c3-4f56-ba56-edf6dc96245f
>>
applebros eating fast.
https://x.com/ollama/status/1810480544976626159
>>
>>101336521
Because VRAM grows on trees.
>>
>>101337173
>https://x.com/ollama/status/1810480544976626159
Does anyone have any RAG model recommendations that I could combine with gemma2? (I'm running llama.cpp not ollama.)
>>
>>101336863
Gemma-2-9B-It-SPPO also answers it correctly
>>
>>101335829
>CRust
well done

>>101335194
Quality drops off a cliff around Q4_K_S or Q3_K_M or so for all models.
>>
>>101337199
A couple of 8bs is manageable, yeah.
The idea is to pair a coding model and a coomer model and see how they fare at a more general task.
>>
>>101337076
gemma shits itself and says it's not possible, while l3 just answers it
>>
>>101337173
holy shit, llama.cpp absolutely DESTROYED
>>
>>101335829
Kek
>>
>>101337299
How? It's had all these features for a long time now.
>>
>Quality drops off a cliff around Q4_K_S or Q3_K_M or so for all models.
and you dumbasses want bitnet
>>
I've been wondering if local is much better than I always thought and the difference from GPT-4 really isn't that big; GPT-4o does make stupid mistakes
>>
>>101337368
aren't bitnets trained with reduced precision? that's different from taking a normal-precision model and removing information from it
>>
>>101337397
it is different
and it's still shit
maybe if you're lucky it'll keep the shit that you want
>>
>>101336537
does the quality drop?
>>
>>101337397
Yes. That's the point of bitnet: it's trained with low-precision weights instead of being quantized after the fact.
The issue with quantizing is that you either truncate the precision or scale the values, and both are lossy.
I believe that's also why cudadev is thinking of working on lower precision training.
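A toy illustration of the "scale the values" part: scale weights onto a small integer grid, round, scale back, and measure what got thrown away. This is per-tensor absmax, which is cruder than the block-wise schemes llama.cpp actually uses, so the numbers are only directional.

import torch

# Per-tensor absmax round-trip; real quant schemes work in blocks, so this overstates the damage.
def absmax_roundtrip(w: torch.Tensor, bits: int) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale  # dequantized weights

w = torch.randn(4096, 4096)
for bits in (8, 4, 2):
    err = (w - absmax_roundtrip(w, bits)).abs().mean().item()
    print(f"{bits}-bit absmax round-trip: mean abs error {err:.4f}")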
>>
>>101337456
>lower precision training.
In llama.cpp? That would be awesome.
>>
>>101337173
I thought batching was a default feature for all loaders?
>>
File: bitnet vs quants.png (287 KB, 1249x745)
>>101337368
Bitnet has comparable downstream task performance and perplexity to FP16 and performs way better than traditional quant methods at low bit widths.

One question I have is whether it's possible to further compress bitnet and trade accuracy for less size. With FP16 that's quite straightforward. Can you even do it with bitnet, or are you just stuck with the model sizes they decide to release?
>>
>>101336863
I'm going to guess that a question similar to this one is in Gemma's dataset and not L3's, or that the riddles in L3's data lean towards actual riddles rather than trick questions. When the prompt states that it's a trick question, L3 (tested at Q8_0) stops saying to use the other items:
Clever trick question!

The answer is: You pick up the key and unlock the front door.

The question doesn't say you can't leave the house, it only says the front door has been locked. With the key, you can unlock it and exit the house, ensuring your safety.

All the other items on the list are red herrings, meant to distract you from the simple solution.
>>
>>101337534
They should put set bits in bloom filters.
>>
>>101337397
Yes and no. The forward-pass weights are quantized on the fly from FP16 master weights, and those FP16 weights are what the backward pass updates.
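A sketch of what that looks like in code: full-precision latent weights for the optimizer, quantized on the fly for the forward pass, with the gradient passed straight through the rounding (straight-through estimator). The absmean ternary rule follows my reading of the b1.58 paper, so treat the details as approximate.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Straight-through estimator: forward uses ternary weights, backward updates
# the full-precision latents. Absmean scaling is approximate, per the b1.58 description.
class TernaryLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)  # FP latent weights

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)
        w_q = torch.round(w / scale).clamp(-1, 1) * scale   # {-1, 0, +1} times a per-tensor scale
        w_ste = w + (w_q - w).detach()                      # forward sees w_q, gradient sees w
        return F.linear(x, w_ste)

layer = TernaryLinear(16, 8)
layer(torch.randn(2, 16)).sum().backward()
print(layer.weight.grad.shape)   # gradients land on the full-precision weights: torch.Size([8, 16])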
>>
>>101337534
Link Marine here, how will this affect the price of Link?
>>
File: 1698368680586650.jpg (125 KB, 818x835)
>>101337534
>Bitnet has comparable downstream task performance and perplexity to FP16
trusts The Numbers in a random chinese paper award
>One question I have is whether it's possible to further compress bitnet and trade accuracy for less size. With FP16 that's quite straightforward. Can you even do it with bitnet, or are you just stuck with the model sizes they decide to release?
LMFAO
>>
>>101328074
are there any locally usable models that can train on a script text and create new scripts from it? I've used GPT-2 for this task before and it kinda worked, but it hallucinated madly sometimes and put in a lot of deeply weird and unsettling shit.
>>
>>101337534
>trade accuracy for less size
That sounds like the woman who cheats on the man who loves/d her because the ex who used to get drunk and beat her rolled through town and sent her a text reading `muh dick`.

Haven't we had enough inaccuracy? Aren't we hopeful that bitnet gives us the accuracy of too-beeg models at sizes we can manage on everyday mobile/consumer hardware? Aren't we making fucking Xbox-huge models today to get at more accuracy despite the price tag?
>>
Currently running CR+ with 65k context. Is it really worth bothering with Gemma? Hard to believe, but many here seem to love it.
>>
>>101337624
You might as well try it at full precision, but I'd say that there's very little chance Gemma2 comes anywhere close to CR+.
>>
>>101337624
stick with CR because gemma goes schizo mad quick
>>
>>101337624
I'd say 8bpw gemma is almost as smart and definitely more soulful than cr+ at 5bpw
>>
>>101337534
Maybe the weights could be further losslessly compressed in blocks; I'm not sure it would be worth the potentially small savings though.
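Back-of-the-envelope on how much room lossless packing actually leaves (the zero-probability below is made up, just to show the mechanism):

import math

# 5 ternary digits fit in one byte (3**5 = 243 <= 256), i.e. 1.6 bits per weight,
# already close to the 1.58-bit entropy of a uniform ternary weight.
print(math.log2(3))   # ~1.585 bits/weight: uniform-ternary lower bound
print(8 / 5)          # 1.600 bits/weight: naive "pack 5 trits per byte"

# Lossless block coding only wins more if the distribution is skewed, e.g. lots of zeros.
p = {"0": 0.5, "+1": 0.25, "-1": 0.25}   # made-up skew, purely illustrative
entropy = -sum(q * math.log2(q) for q in p.values())
print(entropy)        # 1.5 bits/weight: best any lossless coder could do at that skew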
>>
>>101337654
delusional
>>
>>101337614
Highly low iq post.
>>
>>101337624
no
gemma is a good model for its size class and is pretty pleasantly tuned but to me at least it's very obviously inferior to the 70b+ class of models
it's worth a try to see if it works for you maybe since the quicker gens are nice, but I really doubt you'll prefer it to CR+
>>
>>101337677
cope by someone who spent too much money on hardware
>>
>>101337678
Okay, explain why you would want to go from
- huge model, kinda dumb
- small model, kinda dumb
- smaller model, really dumb

instead of
- big model, kinda dumb
- small model, kinda dumb
- big model, less dumb
>>
>>101337707
how is $1000 for 2 3090s 'too much money'?
>>
>>101337718
$1000 is an unimaginably large amount of money for a NEET who lives with his parents and he simply can't comprehend having that much money to spend ("waste") on a hobby
>>
>>101337718
you could be running gemma at a good speed instead of lobotomized cr+ at 1.5bpw
>>
It doesn't seem to recognize that its solution to this modified problem is dumb, doesn't try thinking it over again, and doesn't state that the problem is too difficult to solve or unsolvable, contrary to what the user said. So yeah, it's still dumb. Don't expect THAT much out of this model.
>>
>>101337798
i'm a neet and make money programming, $1000 isnt even that much if you're like me.
>>
>>101337824
shills gonna shill anyway.
>>
File: 1708491999253910.gif (1.94 MB, 300x178)
>>101336615
>>
>>101337710
Why would I seriously answer a low iq retard?
>>
>>101337824
we call that sovl over here
>>
>>101337368
Every time I check here I get more and more confident that all of you are completely retarded and your knowledge about LLMs is meme-deep at best.
I will assume for my own sanity that you are trolling and not actually smooth-brained.
>>
>>101337877
I blame the zoomie CAI refugees trying to fit in; it wasn't this retarded a few months ago
>>
>>101337877
>having a smooth brain
Surprisingly accurate bitnet description.
>>
>>101337910
>>101337910
>>101337910
>>
>>101337877
nah, /lmg/ split off from /aicg/ when the llama-1 leak happened. Old /aicg/ and the current one are well known for their piss-drinking rituals to get access to proxies / APIs, effectively the worst general on /g/, so it's not trolling.
>>
>>101337991
some of us came because of the leak but did not frequent aicg
>>
>>101337846
You've done so twice. Three times and Beetlejuice will appear.
>>
>>101337718
an rtx 3090 costs 800€, not 500€
>>
>>101338118
I guess you're late to the party XD
>>
>>101337798
I'm also a neet and run 2 x 3090. I grow my own weed, sell it to friends and use that money to buy better PC parts. A real girlfriend isn't affordable with this lifestyle, but an AI girlfriend is a perfect match.
>>
>>101334525
It's not true.


