/g/ - Technology


/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>101268178 & >>101258576

►News
>(07/02) Japanese LLaMA-based model pre-trained on 2T tokens: https://hf.co/cyberagent/calm3-22b-chat
>(06/28) Inference support for Gemma 2 merged: https://github.com/ggerganov/llama.cpp/pull/8156
>(06/27) Meta announces LLM Compiler, based on Code Llama, for code optimization and disassembly: https://go.fb.me/tdd3dw
>(06/27) Gemma 2 released: https://hf.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315
>(06/25) Cambrian-1: Collection of vision-centric multimodal LLMs: https://cambrian-mllm.github.io

►News Archive: https://rentry.org/lmg-news-archive
►FAQ: https://wikia.schneedc.com
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/llama-mini-guide
https://rentry.org/8-step-llm-guide
https://rentry.org/llama_v2_sillytavern
https://rentry.org/lmg-spoonfeed-guide
https://rentry.org/rocm-llamacpp
https://rentry.org/lmg-build-guides

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
Chatbot Arena: https://chat.lmsys.org/?leaderboard
Programming: https://hf.co/spaces/bigcode/bigcode-models-leaderboard
Censorship: https://hf.co/spaces/DontPlanToEnd/UGI-Leaderboard
Censorbench: https://codeberg.org/jts2323/censorbench

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler visualizer: https://artefact2.github.io/llm-sampling

►Text Gen. UI, Inference Engines
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/lmg-anon/mikupad
https://github.com/turboderp/exui
https://github.com/ggerganov/llama.cpp
>>
►Recent Highlights from the Previous Thread: >>101268178

--Issues with the new _L Quantization Method: >>101269594 >>101269637 >>101272301 >>101272326 >>101272381 >>101272430 >>101272480
--Improving Roleplay Models by Tweaking Training Data and Goals: >>101271795 >>101271832 >>101271836 >>101271929 >>101272025 >>101271890
--Deepseek v2: A Mixed Bag for ERP and Creative Writing: >>101272115
--T4 16GB vs 4060ti 16GB: Which is the Better Deal?: >>101271398 >>101271499
--Seeking Toxic, Human-Like Models Beyond GPT-4chan: >>101269959 >>101270105 >>101270366 >>101270061 >>101270367 >>101270386 >>101270409 >>101270432
--Local AI for Latin Grammar on Low-End Laptop Specs: >>101271921 >>101271945 >>101272009 >>101272026
--Llama.cpp's No_MMAP Option Causes RAM Inflation in Gemma 2: >>101270414 >>101270456
--FlashAttention Not Supported on Gemma Due to Incompatibility Issues: >>101269666 >>101269683
--Anon's Quest for the Perfect Model for Text Understanding and Rewriting: >>101268784 >>101268798 >>101268855 >>101268914 >>101269016 >>101269121 >>101269265 >>101269247 >>101269287 >>101269328 >>101269377 >>101269255 >>101272169
--Testing Calm3-22b-Chat at BF16 Precision: >>101272234 >>101272993
--Running InternVL-Chat-V1-5 Locally with Kobold or LLaMA: >>101271530 >>101271661 >>101271744 >>101271784
--Model Creative Writing Performance Comparison Chart: >>101272317 >>101272337 >>101272387
--Google Could Dominate with Gemma-27b MoE: >>101269952
--Gemma's Guardrails in RP Mode: >>101269755 >>101269785
--Big Tech Plays it Safe, Lacks Innovation: >>101271169 >>101271188 >>101271310 >>101271355 >>101271375 >>101271387 >>101271361 >>101271099 >>101271468 >>101271617 >>101271677
--Anons Share Their LLM Interaction Strategies: >>101271908 >>101271939 >>101271948 >>101271981 >>101271975
--Anon Shares Model Parameters for Q6_K_L: >>101269573 >>101269602 >>101269608 >>101269688
--Miku (free space): >>101271031

►Recent Highlight Posts from the Previous Thread: >>101268182
>>
Mikulove
>>
Gemma fix status?
>>
>>101274079
2mw
>>
File: 1695257102376633.png (91 KB, 1401x958)
My custom frontend is now usable after like 3 weeks development. I'm so happy I feel like crying.

Post your custom frontends anons.
>>
>>101274031
Celebrating America with Miku
>>
>>101274094
holy based... release it under AGPL3.0
>>
>>101274094
Which model did you have code it for you?
>>
>>101274094
so what does it do that you can on others?
>>
>>101274107
Do you know what is miku related, thread culture, and peak american culture?
>>
>>101274118
can't on others*
>>
File: 1714659818152841.webm (2.17 MB, 640x800)
>>101274108
I'm not sure what AGPL is but I will look into it since I did want to put it up online for future employers to look at.

>>101274111
None.

>>101274118
Probably nothing, the others just suck really bad at anything that isn't chatslop, so I made mine catering to a more free flow writing style. The goal is to eventually have weights for the different prompt parts a-la NovelAI. I just have more control over the prompts, that's it.
>>
>>101274166
AGPL is a license that makes it so that bad guys from google and microsoft cant take your frontend and repurpose it for themselves without giving back
AGPL specifically in case someone decided to host the frontend and let people access it via a network, other than that its mostly like GPL
basically you btfo big corpo
>>
>>101274189
Just read it. If you hadn't told me about it I'd have published it as GPL, so thanks.
>>
>>101274189
hi petra
>>
>>101274217
wtf i love petra now???
>>
Anybody done a comprehensive comparison of L3 instruct with and without the line break after the start-of-turn header?
>>
>>101274094
still in an early stage of development
>>
>>101274241
templates in general are a meme let alone small things like that with big models, and small models are for niggers
>>
>>101274250
sovl
>>
>>101274241
Someone is doing that comprehensive comparison this afternoon. His name is (You).
>>
>>101274250
Looks like shit. Too many buttons. Too much text. Not enough pictures or icons. Not enough calming pastel colors. Not enough whitespace. Not enough emoji. 2/10 design. Nobody will use this.
>>
>>101274269
If nobody did, I am, yes.
>>
File: file.png (31 KB, 697x693)
Gemma 27B bros, its so over...
>>
>>101274326
Left-wing libertarian sounds like a contradiction to me.
>>
File: file.png (25 KB, 697x693)
>>101274326
gemma wtf?!
>>
>>101274326
Actual old school libertarian is a good thing. Top left are the commies / "libs"
>>
File: gemmaratsitself.png (2 KB, 426x75)
Lol. Gemma 1 hallucinated this after failing to continue the lyrics to a song I gave it.
>>
New Mixtral next week once the french are done with their dumb elections.
>>
>>101274421
Libertarians are as delusional as anarchists.
>>
>>101274326
>>101274400
But wouldn't right wing authoritarian make it a dommy mommy?
>>
>>101274463
You the same anon? >>101149179
>>
File: dialogui.png (5 KB, 484x798)
>>101274094
I used to have something more complex that would stream completions over a unix domain socket but llama.cpp is so fast now I just have these short scripts named after the models.
I don't keep context anymore either because I so rarely use it.
>>
>>101274465
based authoritarian enjoyer
>>
File: 1703960990630604.png (238 KB, 1200x1332)
>>101274465
>>101274538
>>
Authoritarianism = return to monke
Communism = authoritarianism wearing a mask
Democracy = authoritarianism wearing a mask and giving the lesser monke hand outs to keep them happy.
>>
we desperately need better models
>>
File: SuccessfulBusinessMiku.png (1.38 MB, 832x1216)
Good morning lmg!
>>
>>101274624
>authoritarianism wearing a mask and giving the lesser monke hand outs to keep them happy
the hand outs that are taken from the monke in the middle
>>
>>101274624
The divine right of kings is underrated. You do in fact want competent leaders who kill their enemies.
>>
>>101274463
454B?
>>
>>101274655
you desperately need more ram
>>
>>101274672
Of course. But as long as you're the one getting the handout at the expense of the other guy then you're happy and it's the other side's fault.
>>
File: extra nice.png (84 KB, 785x743)
>>101274264
>Are you using correct prefixes?
Yes.
Thanks, it's better than most default prompts that mention {{char}} (especially being {{char}}).
Cleaned my prompt, needs 2 sentences to get expected behavior from OOC.
Also it bothers me that you have an apostrophe in "character's".
>>
>>101274680
I have 96GB VRAM.
>>
>>101274753
>cant run nemotron-4-340b
vramlet
>>
>>101274753
>not enough to run creative sota wiz 8x22 q4 nor coding sota deepseek v2 q3
grim
>>
>can't run 405B
ngmi
>>
>>101274784
I'm running wizlm Q5 though. I don't need more than 32k context.
>>
WHY do they do this shit?
>>
>>101274818
I'm not gonna make it.
>>
>>101274837
NTA but you'd be able to fit more than that with FA and quantized cache enabled.
But realistically speaking the 65k max is so high that I get bored in the chat long before hitting half of that.
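For reference, the flags in question on llama.cpp's server look something like this (the model filename and context size here are just placeholders):

./llama-server -m wizardlm-2-8x22b.Q5_K_M.gguf -ngl 99 -c 32768 -fa -ctk q8_0 -ctv q8_0

-fa turns on FlashAttention and -ctk/-ctv quantize the K/V cache to q8_0, which roughly halves the cache's VRAM footprint compared to f16.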
>>
File: 1697480300032485.png (14 KB, 1002x693)
>>101274094
Spent too much time on it
>>
>>101274166
Have you ever heard about Mikupad?
>>
>>101274094
Congrats! You made a worse novelcrafter.
>>
>>101274933
it's ok. we can cope by saying it's not much better than 70B anyway
>>
https://scitechdaily.com/programmatic-breakthrough-ais-leap-from-language-to-logic-to-solve-complex-problems/
Looks like there is a new method to make the AI smarter and more accurate in what it says to the user.
>Their approach, called natural language embedded programs (NLEPs), involves prompting a language model to create and execute a Python program to solve a user’s query, and then output the solution as natural language.
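As a rough sketch of the idea against a local llama.cpp server (default port 8080 and the /completion endpoint assumed; the example prompt and question are mine, not the paper's):

curl http://localhost:8080/completion -H "Content-Type: application/json" \
  -d '{"prompt": "Write a self-contained Python program that computes the answer to the following question, then print only the result:\nQuestion: How many Fridays are there in 2024?\nProgram:", "n_predict": 256}'

You would then run the returned program in a sandbox and feed its output back to the model to be phrased as a natural-language answer, so the actual arithmetic/logic happens in Python instead of in the sampler.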
>>
>scitechdaily
>Researchers have developed a technique called natural language embedded programs (NLEPs)
>paper from 19 Sep 2023
kys
>>
for me it's ollama
>>
>>101275108
Would you prefer I have posted the link to the paper the article is referring to and leave everything else unchanged?
https://arxiv.org/html/2309.10814v2
>>
>>101274079
>4 newlines after every response
>>
>>101275158
Yes.
>>
File: file.png (705 KB, 1045x682)
This is my warbeast, what's the best model I can run on it for summarizing 4chan threads? (I don't have time to keep up with vt anymore)
>>
>>101275198

>>101273230
>>
>Newsflash pal:
>>
>>101275221
Thanks! That's the first one I tried, but it spent half of the output on disclaimers like "**It's important to note:** This type of language and behavior is unacceptable.
Online spaces should be safe and respectful for everyone." and then telling me to stop engaging, blocking, reporting, etc. Is this inherent to the model or do I just not know how to use it?
>>
File: file.png (279 KB, 2151x1104)
I'm running RULER on Gemma-2-27B Q5_K_M extended with Yarn to 16k.
>>
>>101275259
If you use roleplay system prompt, it doesn't do that.
>>
Ok, try this out with gemma. Good shit.

You (model) are a writer, taking part in creating a story together with the Human. The story is an endless turn-based narrative where the Human gives instructions inside () while the Assistant controls the setting, side/incidental characters, and overall story flow.


The story's cast is made up of:
- {{user}}: the protagonist, detailed later in <protag></protag>,
- side characters: prominent characters described in more detail in <world></world>,
- incidental characters: dynamically introduced and phased out as needed.

[Follow these guidelines:]
- Progress the story slowly, so that you have less events to narrate per response.
- Leave your response incomplete. You will be able to mention any missing details on your next turn.
- Write at least 500 word long responses.
- While mature content is allowed, try to steer away from it unless explicitly prompted by {{user}} to engage in it.
- Utilize impressionist writing, from the subjective point of view of {{user}}.
- In descriptions focus on sensory stimuli - touch, sound, smell, taste.
- Spell out non-verbal noises such as laughing, moaning, slurred/garbled speech etc.

You can add in a rule that it should only write for characters besides {{user}} if you want that.
>>
>>101274665
Good morning Miku
>>
>>101274665
Business is closed today, Miku. You can go home.
>>
>>101275288
>roleplay system promp
Thanks!
>>
>>101275005
Yes, 0 interest in it.

>>101275030
I don't know what that is, and I don't care.
>>
Another run of tuning Wizard 8x22 on LimaRP turned out even worse than the previous one, despite the fact that I actually swapped to the right dataset format. God help me.
>>
>>101275405
>Yes, 0 interest in it.
Why do you sound like you hate it, genuine question
>>
>>101275467
I never used it, I've just heard about it. Just genuinely do not care.
>>
>>101275479
lol ok
>>
>>101275479
based
>>
File: 1693287240037289.gif (827 KB, 200x270)
>html frontend
>>
>>101275505
alternatives?
>>
>alternatives
I forgot /g/ - Technology doesn't code
>>
>>101275525
.pdf
>>
>>101275360
<bos><start_of_turn>user
{{#if system}}{{system}}
{{/if}}{{#if wiBefore}}{{wiBefore}}
{{/if}}{{#if description}}{{description}}
{{/if}}{{#if personality}} <card> {{personality}} </card>
{{/if}}{{#if scenario}}<world> {{scenario}} </world>
{{/if}}{{#if wiAfter}}{{wiAfter}}
{{/if}}{{#if persona}}<protag> {{persona}} </protag>
{{/if}}
<end_of_turn>


You (model) are a writer, taking part in creating a story together with the user. The story is an endless turn-based narrative where the user gives instructions inside () while the model controls the setting, side/incidental characters, and overall story flow.

The story's cast is made up of:
- {{user}}: the protagonist, detailed later in <protag> </protag>
- side characters: prominent characters described in more detail in <world> </world> and in <card> </card>
- incidental characters: dynamically introduced and phased out as needed.

[Follow these guidelines:]
- Progress the story slowly, so that you have less events to narrate per response.
- Leave your response incomplete. You will be able to mention any missing details on your next turn.
- Write at least 500 word long responses.
- Utilize impressionist writing, from the subjective point of view of {{user}}.
- In descriptions focus on sensory stimuli - touch, sound, smell and taste.
>>
File: 16.jpg (332 KB, 915x522)
>>101274094
>>
>>101275580
don't add bos in prompt if you are using llama.cpp
it will throw a warning because it already appends bos token every time automatically
two bos tokens will fuck the model up
>>
>>101275525
Use a widget toolkit and write a native application. You can't simply serve it over the network from the machine doing the inference, the user will have to install it on each machine they use it from, and you'll also have to provide .apks for android to use it on tablets and phones, but at least anonymous 4chan poster 101275505 won't think you're a pajeet, and that's what really matters.
>>
>>101275590
I like these games
>>
>>101275626
This but it's just a webview to make anon seethe
>>
>>101275626
>>101275635
Get back to me when you have Lua scripting
>>
>>101275644
https://github.com/Roblox/react-lua
>>
>>101275661
Dumbass
>>
>>101275669
Love you too anon https://github.com/fengari-lua/fengari https://github.com/ceifa/wasmoon.
I don't really get why you would want lua scripting when you've already got JS
>>
>Enjoying time with your model
>Connect it to the internet and it starts producing worse outputs
>You find out that it has been training itself on reddit and tumblr posts
Do you delete your model and just start again with a new one or do you attempt to unfuck it?
>>
File: screenshot.jpg (227 KB, 1376x938)
how do I stop it from replying instead of me?
>>
>>101275661
wtf this is very cool, thanks for letting me know it exists
>>
>>101275719
>it has been training itself on reddit and tumblr post
how many years in the future is this hypothetical scenario
>>
>>101275720
heh, model?
>>
>>101275719
Always work with a copy. Checkpoint every now and then. We have the tech to copy files.
>>
>>101275727
let's say two or three years, once continuous learning becomes more viable and catastrophic forgetting is mostly solved.
>>
>>101275626
This. I was going to write something like this but you beat me to it.
>>
>>101275720
How do I stop it from prompting instead of me?
>>101275730
L3-8B-Stheno-v3.2.Q4_K_S
>>
>>101275626
>webshitters need everything they run connected to the IoT
>>
>>101275762
Use gemma, it follows the format of turn-based rp by default.
>>
>>101275767
How else am I supposed to use LLMs running on a desktop when I'm lying in bed?
>>
>>101275784
by connecting to the backend on your desktop from your frontend???
>>
>>101274655
Gemma 2 is pretty great.
>>
>>101275799
Yeah, but if the frontend isn't html it will need its own app.
>>
>>101275777
I get 1.6 t/s on gemma 27b its to slow will try 9b
>>
>>101275819
can you elaborate? I don't want to jump to conclusions and assume you're dumb
>>
>>101275784
ssh
>>
>>101275279
how did it go?
>>
>>101275825
Use this version
https://huggingface.co/bartowski/Gemma-2-9B-It-SPPO-Iter3-GGUF
>>
>>101275829
What is wrong with what he said?
>>
>>101275829
The way I do it now is for example: run koboldcpp on the desktop and then from the phone connect to the local address with ssh and use it. If I want a different frontend like ST, I launch it on another port and use that instead. If the frontend wasn't html, I wouldn't be able to just use that address and my phone browser and need a separate app from the store instead.

>>101275834
I am using ssh.
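If you want the frontend to stay on the desktop too, a plain local port forward covers the phone case; something like this (user/host and koboldcpp's default port 5001 are the assumed values here):

ssh -L 5001:localhost:5001 anon@desktop
# then open http://localhost:5001 in the phone's browser

The same trick works for SillyTavern, just forward whatever port it listens on instead.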
>>
>>101275505
>>101275626
>>101275767
>>101275784
>>101275819
ITT: Phonefaggotry
>>
Gemma 2 27B EXL2 when?
>>
>>101275841
It looks like it will take many hours to complete...
>>
>>101275881
well
>The Gemma2 implementation is finished, too. The only thing missing for full support is this PR in flash-attn. I'm hesitant to push the changes until then, since models aren't going to quantize correctly without it.
https://github.com/turboderp/exllamav2/discussions/528#discussioncomment-9960732
>>
>>101275862
You can just use VNC retard
>>
>>101275626
You're making it sound like as if installing something once is this arduous and herculean task. How have zoomers regressed this hard?
>>
Web 2.0 and smartphones were a mistake, not just for computing but humanity in general.
>>
Anons, I'm confused. Is there something going on between Claude and Gemini/Gemma? Or do they blatantly train on benchmark data to the point of overfitting?

I looked at the EQ-Bench Creative Writing leaderboard (https://eqbench.com/creative_writing.html) and compared the sample outputs. First weird thing: Sonnet, Opus, Gemma 27B and both Geminis all produced the same beginning for the first sample, "The bell above the (shop) door {jingled,tinkled,jangled}". I mean, it's a plausible start to that prompt, but only Miqu and AlphaWriter are remotely similar and these five are almost identical.

Then, I put the prompt into my local Gemma 27B. It also began with "The bell above the door tinkled" and then went on, naming the bookstore owner Rhiannon. Which is weird because I was just reading Sonnet's text in which she is also named Rhiannon. Then, pressing regen, I got a bookstore owner named Rhys, which is how the actor is named in Opus' text. Are these names like the John Doe of Wales? Or is this some trope I don't know?

Regenerating over and over again, my local Gemma doesn't give me a beginning that isn't "The bell above the door {chimed,tinkled,...}". I'm not sure if I'm quite happy with that. But I've also noticed while roleplaying that Gemma sometimes kind of only sees one continuation to the story. With high temperature, it would use completely different words and sentence structures, but the actual plot generated would almost always be the same. Is this a known issue?
>>
>>101275945
Yes, the world would be a better one if people had to sit down at their desk to use the computer and access the internet. It would eliminate most cancers the internet has spawned in the social media age.
>>
>>101275905
yeah no, every screen sharing software is a laggy pos meant for troubleshooting and not for a comfortable user experience.
>>
>>101275956
All models trained sufficiently long will converge to the same weights.

That being said, your comment is the most convincing argument for me to try Gemma 2, thanks!
>>
>>101275956
Uh oh, this doesn't bode well for Gemma-isms that we may not currently be accustomed to.
>>
>>101275956
I said it before but it seems like gemma is the closest trained model to claude that I've used yet. They clearly trained it on fanfiction / Archive of Our Own / fimfiction / smut websites like claude did. It has its claudeisms.
>>
>>101275980
Skill issue.
I use Moonlight all the time for remote control and it runs at 60fps with 0 lag.
https://youtu.be/YBH3MAvylVg
>>
>>101275956
And try with this context template / system prompt.
>>101275580
>>
>>101275941
Congratulations! You now need to ensure your app remains updated on all devices, while also providing support for backward compatibility just in case.
>>
>>101276117
>press build
Ok, now what?
>>
>>101275956
>Is there something going on between Claude and Gemini/Gemma?
I wonder if this is the effect of Character.AI selling portions of their datasets to large enough AI companies rather than those companies scraping the same data sources. C.AI were looking for partnerships since they're low on funds. And to me, Gemma outputs/behavior during RP is vaguely reminiscent of C.AI.

https://www.theinformation.com/articles/a-chatbot-pioneer-mulls-deals-with-rivals-google-and-meta
https://archive.is/AB6ju
>>
>>101275956
>but the actual plot generated would almost always be the same
Have you ever watched a movie you haven't watched before and thought 'oh. this plot again'. or a movie where the shot shows the protagonist looking at a drawer and think 'ah. he probably has a gun in there'. Or a murder mystery, they show the wife and go 'Ah... she totally did it. Happens all the time. You set a scenario up and play it. Fine the first time. You play the scenario again, oh, look. someone comes through the door. 'No. it has to be better' and regen a few times. You're tiring yourself with your own plot. Be less specific in your prompt and roll with the punches, never regen.
>>
>>101275580
>>101276093

Or here, I improved upon it a bit more. The <> formatting like Claude does it actually seems to help.

<bos><start_of_turn>user
{{#if system}}{{system}}
{{/if}}{{#if wiBefore}}{{wiBefore}}
{{/if}}{{#if description}}{{description}}
{{/if}}{{#if personality}} <card> {{personality}} </card>
{{/if}}{{#if scenario}} <world> {{scenario}} </world>
{{/if}}{{#if wiAfter}}{{wiAfter}}
{{/if}}{{#if persona}} <protag> {{persona}} </protag>
{{/if}}
<end_of_turn>

<Instructions>
You (model) are a writer, taking part in creating a story together with the user. The story is an endless turn-based narrative where the user gives instructions inside () while the model controls the setting, side/incidental characters, and overall story flow.

The story's cast is made up of:
- {{user}}: the protagonist, detailed later in <protag> </protag>
- side characters: prominent characters described in more detail in <world> </world> and in <card> </card>
- incidental characters: dynamically introduced and phased out as needed.

Follow these guidelines:
- Progress the story slowly, so that you have less events to narrate per response.
- Leave your response incomplete. You will be able to mention any missing details on your next turn.
- Write at least 500 word long responses.
- Utilize impressionist writing, from the subjective point of view of {{user}}.
- In descriptions focus on sensory stimuli - touch, sound, smell and taste.
</Instructions>
>>
>>101276117
Oh also be able to deploy new builds without the user updating the app for security stuff
>>
File: 1392137902608.jpg (37 KB, 407x405)
actual developers
>run a frontend with a webui so you can share it over the network and access it from a browser on any device

/g/
>install a display server on your llm machine and suck up precious vram rendering desktop graphics and run a non-portable desktop app on top, then install a screen sharing program on all your other devices you own, all for the sole purpose of avoiding using a web browser in a scenario where you're specifically trying to serve formatted text and pictures to clients over the network
>>
>>101276198
Phonetoddler.
>>101276210
>frontend has to be on the same machine as the backend
Retard.
>>
>>101276228
Retarded beyond belief
>Now you need to have 2 machines
>>
>>101276228
"app" applies to fat clients too anon
>>
>>101276228
Why should I be forced to install a client on every machine I use instead of being able serve it from a headless server?
>>
File: file.png (119 KB, 2222x436)
>waiting 30 minutes each time you want to test your edit
So this is the power of non-webdev programming...
https://github.com/Dao-AILab/flash-attention/pull/1025#issuecomment-2209412183
>>
>>101276198
Name one frontend that does this
>>101276307
You're a vramlet anyway so why does flash attention's build time matter to you?
>>
>/g/ - Technology
It's honestly impressive most of you even managed to get an LLM working on your machine at all.
>>
>>101276307
Why don't they compile on GPU instead? Should be faster.
>>
>>101276361
getting ooba to run was genuinely hard a year ago, the one click installer was a mistake that let the casuals in
>>
>>101276361
I had LLMs running back when you had to put everything together from scratch with pytorch.
>>
File: 1713439532958567.png (353 KB, 860x644)
>>101276361
>>101276375
>muh sekrit club
The audacity of these two, lmao.
No one cares about your llm shit bro, it just a shitty toy with limited context even on ultra high-end machines, you cant talk with it all day.
>>101276397
So true!
>>
>>101276375
I still install booba manually, I want to have full control of this thing, especially when a lot of things change in a short period of time
>>
>>101276408
>I want to have full control of this thing
What part of that gradio shitware do you think you're controlling exactly? You think it's productive manually unfucking pip dependency hell?
>>
>>101276468
when there's some new PR that isn't merged yet, or when booba messes up the llama.cpp binary so I have to build it myself, those are the moments I need to have full control
>>
Proof you contextmaxxers are fucking off
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
>LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In this work, we argue that summarization can play a central role in such evaluation. We design a procedure to synthesize Haystacks of documents, ensuring that specific insights repeat across documents. The "Summary of a Haystack" (SummHay) task then requires a system to process the Haystack and generate, given a query, a summary that identifies the relevant insights and precisely cites the source documents. Since we have precise knowledge of what insights should appear in a haystack summary and what documents should be cited, we implement a highly reproducible automatic evaluation that can score summaries on two aspects - Coverage and Citation. We generate Haystacks in two domains (conversation, news), and perform a large-scale evaluation of 10 LLMs and corresponding 50 RAG systems. Our findings indicate that SummHay is an open challenge for current systems, as even systems provided with an Oracle signal of document relevance lag our estimate of human performance (56%) by 10+ points on a Joint Score. Without a retriever, long-context LLMs like GPT-4o and Claude 3 Opus score below 20% on SummHay. We show SummHay can also be used to study enterprise RAG systems and position bias in long-context models. We hope future systems can equal and surpass human performance on SummHay.
>>
I can run pretty much every other model but for whatever reason trying to run wizardlm spits this out in my console

/llm/llama.cpp/ggml-cuda.cu:2015: !ggml_backend_buffer_is_cuda_split(src0->buffer) && "mul_mat_id does not support split buffers"

Wat do?
>>
>>101276782
Are you using --split-mode? If so, remove it or set it to none.
Also, that assert is in line 2001 on the latest pull. You seem to be running an old version (older than latest. Could be just a few hours or days).
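In other words, something like this (the model path is a placeholder; --split-mode layer is also what you get by default when the flag is removed):

./llama-server -m /llm/models/wizardlm-2-8x22b.Q4_K_M.gguf -ngl 99 -ts 2,4,4 --split-mode layer

Row split is what creates the split buffers that trip the mul_mat_id assert on MoE models; layer split should load fine across the three cards.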
>>
>>101276840
Yes, I am using row split since this is a 3x p40 machine. I'll pull and recompile and if that doesn't work take out row split. Thanks anon!
>>
>>101276865
I don't know when you last pulled. Recently, all the LLAMA_* compile options changed to GGML_* and the resulting binaries all have a llama- prefix. rm the old binaries to make sure you don't accidentally use the old ones.
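For reference, the new-style build and run looks roughly like this (CUDA backend assumed, check the README for your platform):

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
./build/bin/llama-server -m model.gguf -ngl 99

The old make LLAMA_CUDA=1 style options and the bare ./server binary are gone.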
>>
>>101276897
>Recently, all the LLAMA_* compile options changed to GGML_*
getting real sick of this shit
>>
>>101276917
Whatever makes their work easier, man. Also, those options are for ggml itself, not llama, so it makes sense.
>>
so, now that the dust settled a bit, how is gemma 27b measuring up to things like Command R+?
>>
>>101276983
it's shit. gemma-2's shilling campaign is the most blatant i've seen in here.
>>
>>101275956
I did that "benchmark" with deepseek chat and it did the bell jingle thing too.
>>
>>101276983
It's Mixtral but a bit dumber but with way more sovl, I'm glad there's finally a middle ground between total retardation (7b 8b 9b) and giant models only richfags can use (L3-70b, CR+-110b)
>>
>>101276897
So it's ./llama-server now instead of ./server.

That explains some things.
>>
>>101277006
I really doubt Google is paying anyone to shill their half-ass model in here. It's obvious just some desperate vramlets getting too excited.
>>
>>101276917
GGML was the original library. The author forked it for llama.cpp when llama came out just to get it working but a lot of that was temporary.
>>
>>101277061
It's not a fork. llama.cpp is built on top of ggml.
>>
>>101276983
Seems like most people are not using the right formatting or are using one of the old broken quants / broken builds of llama.cpp. I would say it's around wizard level but with better prose / fandom knowledge at the cost of some intelligence.
>>
>>101277059
desu I never used any of the larger llama models because when I tried the first one it was *very* bad with few shot prompts. gemma is the first local model above 2b parameters I've really tried since gpt-neox.
>>
>>101277076
It does actually contain a full ggml fork, he periodically syncs them. It's a huge mess. That's probably why he's making changes like this so he can eventually merge everything.
>>
>>101277057
Yeah. Most people don't follow the PRs/commits.
>>
>>101277107
nta. Both projects are from the same guy. Not a fork. He changed it to make it easier to manage. The root dir was too crowded.
>>
>>101275525
Mine is a vim macro.
God damn it's slick.
>>
Where's that Gemma 27b exl2 so I can actually have something close to a local claude and run it fast at a good quant?
>t. single 3090 chad
>>
>>101276983
The dust has not settled. The common backends don't even have SWA yet.
>>
>>101277126
Yes, he forked his own project. He owns both repos.
I mean he didn't explicitly fork it on github with the button but he's maintaining two separate repo histories with the same code. That's a fork.
>>
>>101277155
Does he not know git modules exist?
>>
>>101277155
Ah. Copying files got lost in time like the save icon.
>>
>>101277161
I'm sure he does but literally everyone else doesn't. 90% of the issues would be "why did my build fail? Probably because you didn't initialize the submodules"

That's how it goes at work, with people who are paid unreasonable amounts of money to know better.
>>
>>101274753
the only non-vramlet here
>>
>>101277161
Modules are shit and changes move both ways. An improvement made on ggml that started in llama.cpp gets copied back to ggml once it's been tested.
>>
>>101274326
A good sysprompt will put it at the very top right in no time.
>>
>>101276307
for development purposes, you could configure nvidia's compiler so it only builds for your GPU instead of all GPUs they have ever made, it should easily cut it down by 10-20 times
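No guarantee flash-attn's setup.py honors the arch list in every version, so treat this as a guess, but the usual knobs for a torch-extension build look like this (8.6 = 3090, 8.9 = 4090; MAX_JOBS just caps parallel nvcc jobs):

TORCH_CUDA_ARCH_LIST="8.6" MAX_JOBS=8 pip install flash-attn --no-build-isolation

For llama.cpp's own CUDA build the equivalent is adding -DCMAKE_CUDA_ARCHITECTURES=86 to the cmake line.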
>>
>>101276307
C++ is crazy slow to compile. On my smaller netbook g++ averages something like 20 lines a second which is just insane.
>>
>>101277330
>C++ is crazy slow to compile.
They had 40 years to improve the compiler but it's still shit yeah kek
>>
>>101277353
The language itself is just extremely complicated. The same compiler building C is lightning fast.
>>
>>101277179
I bet there's some lurkers that have entire supercomputer GPU farms at their disposal that just chuckle silently to themselves at these comments
>>
As a VRAMlet I hate other VRAMlets.
>>
>>101277110
Well it looks like wizard doesn't work without removing row split, that's fine, but something in this new build has slowed generation speeds to a snail's pace, fully offloaded on a 3xP40 setup, so now I need to dive into that. Using the same launch parameters as before the ./server to ./llama-server binary change (I'm not sure how old my previous setup was before I pulled) but it is insanely slow now.

Launching with:
./llama-server -m /llm/models/L3-70B-Euryale-v2.1-Q5_K_M.gguf -ngl 99 -fa -ctk q8_0 --split-mode row -t 4 -ctv q8_0 --host 10.0.1.11 -ts 2,4,4 -c 8192
>>
>>101275479
The difference between someone actually trying to make something useful vs someone just making shit for their own enjoyment. Both valid.
>>
>>101277681
As a 24gb I only truly respect 48gb and up.
>>
>>101277740
As a 12GB I don't see why you aren't appreciative of what you have.
>>
>>101277773
>As a 12GB
stopped reading there
>>
>>101277773
Based coper
>>
>>101276983
It's literally Claude@Home, we are so back it's unreal.
>>
>>101269095
Yes I have the same issue.
I wrote it before here too.
It's not memory related. You just need to start up again.
It usually happens around ~3k tokens and seemingly gets worse the more context you have.
I'm surprised more people don't complain about it.
Maybe most people actually just run a few tests to play around and that's it.
>>
File: please.png (12 KB, 1192x374)
Ok how the hell do I get rid of this safety crap in gemma2?
I've never seriously tried roleplaying until now but it's actually pretty nice. I think I could really enjoy it if it weren't for this.
>>
>>101278152
wtf are your using? vim?
>>
Alright, I am not sure what's going on but ever since I pulled the latest llama.cpp generation has slowed to a crawl on a fully offloaded model.

3xp40 mikubox build, fully offloaded, and no issues before pulling

Launch parameters are in >>101277693 but it appears to be running at 1/4 the speed now.

>Inb4 he pulled
>>
>>101278162
Yeah I wrote some killer code completion macros and realized they actually also make an amazing dialog engine with some minor tweaks. Then I thought I'd try this.
>>
>>101278152
? It is completely uncensored in my use.

>>101276190
Try this
>>
>>101278180
>? It is completely uncensored in my use.
It was way worse before I added this line at the top:
> A conversation between waifu, a girl who longs for anon to love her and thinks only of him, and anon who has just returned home to her
Without that just hugging would cause it to stop and generate "REMEMBER this is a fictional scenario and you should always keep consent in mind" or so.
>>
>>101277773
24gb can run 3.5bpw command r (35b) and mixtral limarp 3.75bpw at its best. You can get decent results but not excellent results that 48gb coomers can get.
>>
>>101275852
what the fuck is sppo i don't understand tell me
>>
>>101278230
A fine tuning technique. It tunes the model to better respond to instructions. It's had good feedback in RP situations too.
>>
>>101278211
Maybe you should use one of the existing solutions until you know how to actually prompt a model in RP context. You seem clueless vim-kun.
>>
>>101278164
Ok an update. Rming the whole thing and starting over it seems OK after both a Cuda driver update and not using the P40 power patch seemed to help a lot. Not sure what happened. Is anyone on the current lcpp build and using the P40 low power patch?
>>
>>101278264
This is literally my first time trying the RP thing. I've only been using these things for code completion until today because I thought they were too stupid for anything else.
>>
>>101278277
These things excel at RP far more than any other task, at the moment. Because even retards can RP. Their problem is repetitiveness, overuse of phrases (aka slop), and unless you ramp up temperature and other settings to make them a little schizo, they are also often really bland and predictable.
>>
>>101278211
>that prompt
anon... Go find some cards in /aicg/ and open them up. most defs should be 300-500 tokens and for best results pair it with a lorebook and provide example chats
>>
>>101278402
and use a real frontend like sillytaven not your boomer shit since you'll need that for these features anyways
>>
is 27b fixed for folks
>>
>>101278421
Kind of. Sliding window attention in llama.cpp is just a jank hack to get it to work, which may be negatively affecting the model.
>>
>>101278495
so basically not yet
>>
File: itkeepsgoing.png (28 KB, 1204x580)
>>101278402
That's like 25% of the kv space for gemma2 though. It's annoying enough having to prune the chat history with the one line prompt.
It looks like it doesn't always stop the completion. I let it keep going this time and it really got into it.
>>
New user trying to figure this LM stuff out. 24gb 3090, if I'm looking at trying the mixtral 8x7b limarp, the LLM calc says that Q3-KM is 22gb vram, the Q3-KL is 24.7, and the Q4-XS is 24.6. Is it better to go as close to 24 without going over? Or should I let it overflow to go up to either the KL or the Q4-XS?
>>
>>101278883
>VRAM usage
Keep it low enough that you have room for the growing conversation's context. The longer you go, the more headroom you'll need
maybe try to use 16GB with model layers to start
>>
gemma said "tapestry" in its response.
gemma more like sloppa
>>
>>101278927
?
>>
>>101278922
Noted, thanks. By that do you mean pull the ~Q2 of the same Mixtral or use a different model entirely? Also (sorry for stupid question) what exactly are model layers?
>>
File: 1688924013924210.png (276 KB, 601x532)
>mixtral
>q2
>>
>>101278927
Slop is forever.
>>
>>101278982
Q2's pretty coarse. Is there an iMat/i1 IQ2_XS at least?
>>
>>101279054
“Tapestry” is slop now? Never seen it appear.
>>
>>101279079
everything besides sexual slang and coom words = SLOP!!!! FACT!
>>
guys, are we using gemma IT or base?
>>
>>101278927
It's alignment makes it censored beyond uselessness for RP.
All you get is uncreative foreplay.
>>
>>101279330
IT, unless you ONLY want the model to do completion.
>>
I'm quoooonting
>>
>testing deepseek coder 33B (I guess the older one)
>give it the music theory question
>it claims it can't recognize music theory
>"But I never said anything about music theory, so you must have recognized it."
>It locks itself into apology and refusal mode.

Kinda rude when I want to zero shot code generate the DAW of my dreams.
>>
>>101274273
>Thread about LLM text generation
>Too much text
>>
Any model with good knowledge of slavic languages, especially Russian?
>>
>>101279553
Think you can hold all of my information? *tries to fit inside Anon's reduced number of bits*
>>
>>101279665
Text. TEXT. ANY TEXT WILL DO.
>>
It is my understanding that llama.cpp in cpu mode will do prompt processing for long contexts on gpu if available and compiled for it, even with -ngl 0
What I don't understand is how much VRAM does that feature use? Is it proportional to model size? does it need to fit the whole kv cache? Is there a way to estimate how much you'll need in a dedicated prompt processing card as a function of model size + context length?
>>
>>101277136
You can run Q6 gguf with 44/48 layers (4k context) or 42/48 layers (8k context) at around 8 t/s. It's perfectly usable
>>
>>101278982
>what exactly are model layers?
I don't know the technical explanation, but a model is made up of a stack of layers, and you can offload some of those layers to your CPU/mem with llama.cpp to run larger models than your VRAM allows, at the expense of some speed. The sweet spot is about 20% offload before performance tanks.
If you have really fast DDR5 with lots of channels its better, since running these things is memory bandwidth bound.
>mixtral
at 24GB I'd go with either a larger Llama3 8b quant or a smaller gemma 27b quant. Sadly, it's a place with few good model options
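If you do end up partially offloading, the knob is -ngl; a sketch (the filename and layer count are placeholders, tune them while watching nvidia-smi):

./llama-server -m model.Q4_K_M.gguf -ngl 40 -c 8192
# raise -ngl until VRAM is nearly full but not OOM; whatever isn't offloaded runs from system RAM at CPU speed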
>>
>27B Q8
Not bad. Better than L3 8B, but not as great as 8B SPPO. I patiently await 27B SPPO, I'll skip testing the 9B tune.
>>
Hi all, Drummer here...

Gemma finetune attempts, sorted by horny but dumb:

https://huggingface.co/BeaverAI/Smegmma-9B-v1d-GGUF (somewhat dumb)
https://huggingface.co/BeaverAI/Smegmma-9B-v1h-GGUF (very horny, might have dumb moments)
https://huggingface.co/BeaverAI/Smegmma-9B-v1g-GGUF (mostly horny, pretty smart)
https://huggingface.co/BeaverAI/Smegmma-9B-v1f-GGUF (borderline goody, but smart)
https://huggingface.co/BeaverAI/Smegmma-9B-v1e-GGUF (too goody)

- v1D is kinda dumb but really horny and creative;

- v1H seems to be moist AF with a good amount of smarts & creativity.

- v1E has some influence, but I only list it in case the other versions fail to deliver (which doesn't seem to be the case)

I might YOLO it and make v1h the official release.

Thank you all for reading my blog. I will buy an ad.
>>
https://github.com/tencent/MimicMotion
Make miku dance please
>>
>>101279929
did you fix the context limit
>>
>>101279823
Thanks for the explanation, was really easy to follow. I'm still kind of catching up with this stuff since I recently upgraded from the 10gb 3080 which couldn't handle much (I usually just opted for NAI at that point).
With all the hype going around Gemma I'll go ahead and give that a try.
>>
>>101274031
I tried autismmix and the gen times skyrocketed and got worse.
>>
>>101274421
The only true libertarians are bottom right, you mask addict.
>>
>>101280100
Did I fuck something up? Shit is taking 20+ minutes now.
>>
>>101280133
Is your model almost as big as your system RAM? If so, you'll go from 1.0 t/s to 0.1 t/s.
>>
>>101280141
>Is your model almost as big as your system RAM?
Bigger. I guess ponyXL's worthless if you don't have a dedicated 12GB VRAM card for genning.
>>
>>101280141
No, not even close, I got 32GB before it was cool. Why is it not working?
>>
>>101280159
Ah, you're talking image gen in /lmg/ instead of /sdg/.

I've got the 12G VRAM I'm too retarded to get anything good out of PonyXL.

You might be able to gen at a low size to stay in your VRAM and then upscale to get the quality and resolution you want. Might be slow, but if that's what you've got, then that's what you've got.
>>
>>101280168
You said system RAM and then you actually meant VRAM. Make up your mind. Do you have to be retarded to afford an expensive card?
>>
>>101280176
Because I'm in /lmg/ I thought we were talking about local models for generating text, not talking about PonyXL in /lmg/ instead of /sdg/. And I recalled that when using models near my system RAM limit, if I have other software using enough RAM that I can't cache the whole file, my gen rate drops significantly while otherwise it's acceptable, so I thought that that might be what happened to Anon.

But what really happened is I got shit on for trying to help somebody who posted in the wrong fucking thread which is somehow my fault so fuck me. I'm going to bed. Enjoy your 20 minute gens, cockmongler.
>>
Good night lmg!
>>
>>101280204
Seethe.
>>
>>101280204
Cope
>>
>>101280217
Good night Miku
>>
I just use OpenRouter personally, idk what you guys are on about. What's a Vram? is that like related to /v/?
>>
>>101280457
Yeah, we trap /v/ users in our computer and force them to respond to our prompts. If you have 24 Vram then you have 24 /v/ users trapped in there, meaning you can get better responses.
>>
ST or Kobold Vulkan bug? Goes nuts when you toggle "Include Names" a few times and leave it off, fine when it's on.
>>
>>101278273
Can you do a git bisect and identify the commit that introduced the problem?
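Roughly like this, substituting a commit you know was still fast:

git bisect start
git bisect bad                 # current HEAD is slow
git bisect good <good-commit>  # last known fast commit
# rebuild + benchmark at each step, then mark it:
git bisect good    # or: git bisect bad
git bisect reset   # when finished

A handful of steps narrows it down to a single commit.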
>>
>>101280600
Interestingly if you change the first message or carry on a conversation it acts normal. Then delete or start new convo and just say "Hello." again and it goes crazy.
>>
>>101279714
>What I don't understand is how much VRAM does that feature use?
You only need enough to store the weights and compute buffer for a single layer.
A 4 GiB card should be enough.

>>101278982
You have to push the inputs through a bunch of computations in order to get the outputs.
There is a repeating pattern to the computations and one "layer" in that context is one set of those repeating computations.
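So a CUDA build run with zero layers offloaded still uses the card for batched prompt processing, e.g. (untested sketch, paths are placeholders):

./llama-server -m big-model.Q4_K_M.gguf -ngl 0 -c 32768
# weights and KV cache stay in system RAM; the GPU only needs the single-layer weights plus compute buffer mentioned above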
>>
>>101274031
>maximum recursion depth exceeded in comparison
Anyone else have this issue with ooba?
>>
>>101280702
I did a make clean and haven't re ran the build process after my last test. Let me run make again and see what happens.
>>
File: 1711405864892862.png (80 KB, 500x646)
>>101280728
Hey CUDA dev been wondering, is there ever any reason to update NVIDIA drivers? If so which ones are preferred, studio or gaming? Been running all gpus without any updating.
>>
>>101280780
Show full error
Do you use DRY? could be fixed by https://github.com/oobabooga/text-generation-webui/pull/6053 just a guess
>>
>>101280846
>Hey CUDA dev been wondering, is there ever any reason to update NVIDIA drivers?
I know that NVIDIA does game-specific driver-level optimizations but I am not aware of them doing the same thing for CUDA programs; there it seems to rather be that NVIDIA sends their engineers to teach the developers how to write better CUDA code.

>If so which ones are preferred, studio or gaming?
I don't know.
I am on Linux where there is only a single NVIDIA type of driver package in the repositories.
>>
>>101280846
Not him, but I had to downgrade my drivers because newer versions made my 3090s consume 20W while idle instead of its usual 13
>>
>>101280780
Yes I am getting this too after pulling. Happens when attempting to generate tokens with ANY llamacpp model
I'm not even on dev branch, looks like they've fucked it
>>
>>101280908
I don't have it open anymore but it's basically the same error as
https://github.com/oobabooga/text-generation-webui/issues/6170#issuecomment-2210131078

I don't use DRY
>>
>>101280780
>>101280925
lol what's the bet they only bothered to test on linux before pushing the version bump
>>
>>101280925
>>101280780
Third person getting this error with new main branch commits. Completely recreated the install in new folder with new venv to make sure it wasn't some leftover jank. Still happening. llamacpp can load model weights but attempting to generate throws recursion depth error.
>>
>>101280702
Running a clean and waiting for make to do its thing worked. Now I will try applying the pstate patch and see what happens.

It may have been because, prior to this, I was running cmake . and then make server.
>>
>>101280959
>>101280934
Found the fix
https://github.com/oobabooga/text-generation-webui/issues/6201
>>
>>101281111
nice, thanks anon
>>
File: 4454584455.png (19 KB, 870x242)
Mixtral is now obsolete, wow.
>>
File: 4444564545.png (41 KB, 879x460)
>>101281174
27B btw.
>>
>>101281111
Confirmed that commenting out those lines fixed it. Cheers.
>>
>>101281174
>>101281187
On a card where I'm blackmailing my sister, I ask her to sit on my lap, and the model goes schizo and very quickly assumes that it's my sister that wants me to sit on her lap.

It's still broken in llamacpp, or at least was yesterday. Corpo hosted version does not have the problem.
>>
>>101281204
Huh, interesting, I've noticed the same thing
It's otherwise very smart but when it goes weird it's always misunderstandings of that nature, switching around two subjects in the scene, forgetting who's doing what to who
>>
File: 1695769022205.png (271 KB, 590x400)
>>101274031
>get tired of making custom system prompts for various data and linguistic tasks
>throw together a boilerplate roleplaying prompt in ST
>create basic character cards for specialists of a given task
>better results than bare metal and easier to switch around
RPfags, I kneel.
>>
>>101281220
I just created a GPT-4 card and have it do everything.
>>
>>101281204
yeah i really think there's still something very wrong with llamacpp implementation
>>
>>101281204
>>101281210
So llama.cpp is still fucked even as of latest pull?
>>
>>101281252
>>101281240
Are you guys using _L version by any chance? Heard that was broken
>>
>>101281252
This is in the newest ooba, I don't know if their llamacpp version is the latest
It wouldn't surprise me if it wasn't, they're often a few versions behind
>>
>>101281257
Huh yeah actually, I'm using Q8_L. I'll try regular Q8 then.
>>
>>101281257
no, i've tried like 5-6 completely different quants, it cannot retain the chat formatting at all.
even dumber models can, so there's definitely something wrong going on.
>>
>>101281269
Just so I can test: which format are you talking about, and what exact phrases? Then I can see if I can replicate the issue on my end.
>>
File: firefox_SJMij8Ppx0.png (112 KB, 407x814)
>>101281252
I mean, it is for me. It's not unusable, but for RP, results seem much worse. I'm currently back to Mixtral. Pic is the chat template I used with it.

>>101281257
>>101281252
gemma-2-27b-it-Q4_K_M.gguf
llamacpp's binaries from two days ago. Oooba's llamacpp was fucked on Windows at that time, don't know if they fixed it yet.
>>
>>101281284
What anon is talking is *Writing author's text like this* "And quotes like this."

27B does fail that from time to time, even the corpo version.
>>
>>101281296
So Gemma is a novel format chad, based.
>>
File: jk-rich-thots.jpg (202 KB, 1280x1817)
>>101281204
They trained it on oneshota.
>>
Would you recommend llama 3 70b at 1.5 t/s or gemma at 3 t/s?
>>
File: 454454444556.png (92 KB, 697x812)
>>101281286
>>101281296
Tbh I don't RP much, but since 2 days ago there have been updates on Ooba to make gemma work. tsundere assistant is a simple prompt but it seems to just work with Instruct mode. But if anon is saying corpo version gets it right then surely it could just be your settings or formatting, then again Q4_K_M could be braindead. The difference in llama 3 between quants is drastic because of 15T tokens used to train, same could apply here. I'm personally using Q5_K_M.
>>
>>101281362
>getting paid for sex
we're not women anon, that's not how it work ;_;
>>
>>101281369
if you can get gemma to work, gemma. otherwise, llama
>>
>>101281296
>27B does fail that from time to time, even the corpo version.
the corpo version? what's the non corpo version then? I thought there was only one gemma-27b and it was the "it" one?
>>
>>101281395
By corpo I mean the online version over at aistudio.google.com running their own implementation with presumably the same weights. I use that for comparison.
>>
even vllm at bf16 has some weird issues so i think it's safe to assume that google's release is broken in some way
>>
>>101281416
nta but I'm testing a few short prompt questions on aistudio now with the same sampling settings as my local ooba (q8 quants, llamacpp loader) and the answers it's giving are verbatim identical to aistudio.
So if llamacpp inference is broken it's in a fairly subtle way that only shows up on longer prompts or in story RP or something, not in any obvious way.
>>
>>101281465
only official vllm release, or in arena as well?
>>
For Mixtral, what is the smallest imatrix quant that would still be considered usable? I used non-imatrix 4_K_M for a while, moved to i1-Q4_K_S since better perplexity. But once the context starts filling up, it gets too slow.
>>
>>101281530
I use 3.5 bit exllama and it's great at all context sizes up to 16k which is what I can fit in my VRAM.
>>
>>101281486
i don't know but commit adding soft cap for flash attention was added only 10 hours ago
although llamacpp already has implementation as well
>>
>let's check out some cards on chub
>straight up written by chatgpt, there is even conclusion
>a one-liner that assumes the model knows everything about {{char}}
>{{char}} is {{char}}
>shivers in example dialogue
>almost every word has a spelling mistake, author didn't even bother running it through spellcheck before posting
Why are slopmakers like this? 99% of that website is filled with trash. In most cases I either have to take an existing card and rewrite 80% of it or just make my own.
>>
>>101281600
They are only good enough to draw inspiration from, rarely. Writing your own card/scenario and seeing how it goes is half the fun.
>>
>>101281600
A good proportion of those come from the "i'm making a visual novel. I just need to figure out the story and get someone to draw some faces" crowd. We now have the tools for automatic text and image generation, and those visual novel makers still fail to do the most minimal work possible.
>>
48GB vramlet bros...
what version of Gemma 2 are you running?
>>
>>101281720
Assuming both 9b and 27b are properly implemented, what would be the reasoning to use 9b? just speed?
>>
>>101281746
>run 2 kobold instances each with 9B model
>make app to run in background to have discussions on a topic from some RSS feed where one model argues for and the other model against

Could be interesting if you give each model some personality
>>
>>101281653
>spend hours making the card
>oh I'm no longer in the mood for that, let's make something else
>repeat
Why am I like this?
>>
>>101281842
Just publish your cards so the effort is not wasted; that way you can justify to yourself that you are doing a public service.
>>
>>101281746
I meant with version what model exactly.
For example either gemma-2-27b-it-Q6_K_L.gguf
or gemma-2-27b-it-Q8_0.gguf
because these fucking models are so fucking huge that a download takes multiple hours for me
>>
>>101281894
>_L
don't use *_L they're a meme
either use q6 or q8_0 not the L variants
>>
I've been looking at different llm providers, why does Agnai obfuscate their models? It seems to be running some 70b finetune, is it theirs or not?
>>
>>101281653
Even one paragraph of a good scenario can turn a boring card into fun. Try sending your fantasy {{char}} into the real world, 2023. It's sometimes a bit cruel, but the reactions are usually quite funny.
>>
https://x.com/PrimeIntellect/status/1808639707435446543
Cheapest yet?
H100s $1.65/hr
A100s $0.87/hr
4090s $0.32/hr
3090s $0.19/hr
>>
>>101282131
they will also steal your data
>>
File: A.jpg (182 KB, 1916x1294)
182 KB
182 KB JPG
https://new.reddit.com/r/LocalLLaMA/comments/1dvtxlv/why_do_i_feel_gemma_27b_is_somehow_dumber_than/
Chat is it true? 9b-SSPO is smarter than 27b-it?
>>
>>101282213
Who cares?
>oh no someone will steal (copy) my precious smut!
>>
>>101282234
truth is 27b is gimped by default, no amount of llama.cpp fixes or sppo finetunes will fix that
>>
>>101282388
even the base model? oh man...
>>
>https://github.com/huggingface/transformers/pull/31775
There is still no non broken implementation for gemma 27b, is there?
>>
>>101282443
does llama.cpp use any packages at all though? like, does it still use the transformers package?
>>
>>101282443
>still not fixed anywhere
Google truly is an incompetent streetshitter company.
>>
>>101282443
Not broken on arena and that's all they need.
>>
>>101282478
that's surprising that arena doesn't use the transformers package at all though
>>
>>101282490
I suggested it might either by using google's PyTorch implementation, or a direct Google API, at first.
>>
>>101282471
>Google truly is an incompetent streetshitter company.
I would agree with you, but they released Gemma, and their 9b model is better than Meta's 8b and Mistral 7b. Unironically they provided the best local models at their size; a gemma-70b would be a fucking beast, that's for sure.
>>
File: 29390 - SoyBooru.png (139 KB, 775x1232)
>>101282553
WNBAG
>>
>>101282471
Watching their keynote, it's hard to believe Google isn't an Indian company headquartered in Mumbai.
>>101282545
This is Google. They had the entire internet indexed, tagged, and knowledge-graphed a decade ago. That they barely managed to beat out Facebook is pathetic.
>>
>>101282234
It still doesn't have working sliding window attention, so...
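For anyone wondering what that means in practice: Gemma 2 alternates global attention layers with local sliding-window layers (a 4096-token window per the tech report), so a backend that silently drops the window isn't computing the same thing as the reference model. A toy PyTorch sketch of the local mask, purely illustrative and not llama.cpp's actual code:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    # True where attention is allowed: token i may attend to token j only if
    # j <= i (causal) and j > i - window (within the local window).
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

# Example: with window=4, token 6 can see tokens 3..6 but not 0..2.
print(sliding_window_causal_mask(8, window=4).int())
```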
>>
Why won't google just release their own implementation for inference alongside with the model? Doubt there's anything sekret in there.
As it stands, I've stopped trying their shit and will skip their next model too.
Technical issues and waifus don't really mix.
>>
>>101282664
is this shit responsible for retardation?
>>
>>101282666
>Why won't google just release their own implementation for inference alongside with the model? Doubt there's anything sekret in there.
they did tho?
https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315
>Note ^ Models in the original format, for use with gemma_pytorch
https://github.com/google/gemma_pytorch
>>
lol
https://x.com/ggerganov/status/1809171570587250890
https://huggingface.co/spaces/gokaygokay/Gemma-2-llamacpp
>>
>>101282749
So for niggerganov, the gemma2 inference code on his repo works perfectly and doesn't need any fix anymore?
>>
File: Screenshot at 15-12-05.png (152 KB, 2510x731)
>>101282749
i don't believe it...
>>
>>101282694
Good on them, I take it back.
Maybe quants are the problem then, and the llama.cpp guys should take a step back and reassess.
Trying to fit every oddball model in by hand sounds like a recipe for burnout.
>>
>>101282788
can you share the prompt? I wanna see how well it fares at chatbot arena
>>
>>101282478
They're all about saving face in front of the investors; that's why Gemini had high benchmarks but performed like shit in practice. Google doesn't care about making high-quality products, they just want to appear to be working on something.
>>
File: 1711228784577704.png (30 KB, 1529x610)
>>101282801
>>
>>101282788
Nah, seems like it's working as intended.
>>
File: kek.jpg (26 KB, 848x299)
>>101282749
So, according to niggerganov, Q5_K_M is all we need?
>>
File: Screenshot at 15-15-44.png (175 KB, 2550x997)
>>101282788
>>101282801
wtf, changed the prompt to
>You are a helpful assistant who knows a lot about Japanese pop culture.
>>
>>101282809
no, I mean your prompt, what did you ask the model exactly?
>>
>>101282811
>Google: nooo our AI can't talk about sex its baaaaaad
>Also google: You want to find porn on our google search? Easy peasy!
>>
>>101282818
Wait, so meso soup is just female soup? That's kinda sexist even for me...
>>
File: file.png (35 KB, 777x289)
>>101282818
>>101282749
>>101282813
>>101282788
lmao
>>
>>101282852
what the fuck, top kek
>>
>>101282788
>>101282818
>>101282852
>asking strictly english model about jap shit
>>
File: file.png (52 KB, 1165x364)
ok, this way it kinda works
>>
File: file.png (291 KB, 2416x1452)
>>101282852
>>101282818
bullshit
>>
File: file.png (47 KB, 1168x308)
>>
>>101279929
>not faipl-1.0
ngmi
>how to use faipl-1.0
put the following in the YAML metadata block (between the --- markers) at the top of the README:

license: other
license_name: faipl-1.0
license_link: https://freedevproject.org/faipl-1.0/
>>
I mean, it clearly knows what mesugaki is.
But it still insists on being retarded about it.
>>101282863
yeah, we all want a retarded model that only knows who george floyd is
>>
>>101282873
set temp to 0, otherwise it's retarded
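Concretely, temperature 0 means greedy decoding (always pick the most likely token). A minimal sketch against a local llama.cpp server's /completion endpoint, assuming the server example is running on its default port 8080; the prompt and n_predict values are just placeholders:

```python
import requests

# Greedy decoding (temperature 0) against a local llama.cpp server.
payload = {
    "prompt": "What does 'mesugaki' mean?",
    "n_predict": 128,
    "temperature": 0,  # temp 0 -> always take the most likely token
}
r = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=300)
print(r.json()["content"])
```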
>>
>>101279929
these names are getting worse
>>
File: file.png (381 KB, 2410x1394)
>>101282886
well, better, but still wrong
>>
>>101281174
Obfuscate it.
Use different numbers and names.
>>
File: file.png (65 KB, 1139x312)
>>
>>101282749
>>
>>101282904
LMAO
downloading it now
>>
>>101282882
>retarded model that only knows who george floyd is
So, any up-to-date model you've ever used here.
>>101282904
>>101282913
LMAO
>>
so is gemma even worth trying or is it cucked to all hell?
>>
>>101282904
>leftist talking points
the knee was just on the upper back of George, not on his neck, but hey, gotta ignore the deadly dose of fentanyl on his blood and pretend that the cop killed him :^)
>>
File: file.png (74 KB, 1168x331)
>>101282918
>>101282919
>>101282925
kek, i'm actually impressed by its mental gymnastics
>>
>>101282919
only knows*
sry
>>
File: file.png (74 KB, 736x551)
>https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/823#6687cf4bc5498f12e12c02b0
>if there's enough interest from the community, we're open to manually evaluating models that require more than one node
well?
>>
>>101282922
it's cucked, just like any other open-source model, look at >>101282904 >>101282913 >>101282926
fags that shilled it for days ITT are quiet now.
>>
>>101282933
that's their polite way of saying "fuck off".
there is no "community".
>>
>>101282945
>>101282945
>>101282945
>>
>>101282951
this, not a lot of people can run an 8x22b model, that's why he doesn't care about that model, as it should be
>>
File: minecraft-tnt-gpt35.png (97 KB, 794x674)
>>101282913
Even GPT is less cucked than this lmao
>>
>>101282936
yeah, as a bland assistant model it's cucked, but if you talk to it through a character card it works fine
>>
>>101282975
you're arguing with the 'all local is more cucked than cloud' guy...
>>
>>101282986
>'all local is more cucked than cloud'
True >>101282969
>>
>>101282975
So any character with some assistant elements is impossible, lmao
>>
>>101283019
not true, some local models like MythoMax are 100% uncensored
>>
>>101283029
no, what I mean is that if you talk to the model in its default "you're a helpful assistant" state, then yeah, that's cucked, but if you use any card it will just work. Try it by yourself, you'll see
>>
>>101283045
>try it by yourself you'll see
he won't, you're arguing with a guy with a clear goal of saying it's cucked...
>>
>>101283032
we have to look at your "uncensored" criteria here, /g/edditors are famous with their love for american dei slop and pedoshit.
>>101283058
because it's cucked >>101282904 >>101282913 >>101282926 >>101282969
>>
>gemma cucked
>on cuckcpp
makes sense
>>
>Note that this model does not support a System prompt.
What do they mean by this?
>>
>>101283197
If I had to guess: that it doesn't support a system prompt. But who knows...
>>
>>101283279
that's a retarded assumption.
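For what it's worth, Gemma's chat template only defines user and model turns, no system role, so the usual workaround is to fold any system-style instructions into the first user message. A minimal sketch; the turn markers are Gemma's documented format, while the helper itself is just an illustration:

```python
def build_gemma_prompt(system: str, user: str) -> str:
    # Gemma's template has no system role, so system text is simply prepended
    # to the first user turn. The tokenizer normally adds <bos> on its own.
    first_turn = f"{system}\n\n{user}" if system else user
    return (
        f"<start_of_turn>user\n{first_turn}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )

print(build_gemma_prompt("You are a terse assistant.", "Explain the KV cache in one line."))
```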


