/sci/ - Science & Math

File: 39rXD.jpg (103 KB, 1024x706)
previous thread >>16072199

If you love stats, weird numbers and counterintuitive science, this is your general. One of the things about statistics is that nothing is ever quite what it seems to show you at first glance. Doesn't matter if you are a seasoned professional, NEET or some disgruntled grad student. All are welcome.

Some people may not like it if you try to make them do your homework, others won't care and will just help you. Let's discuss theories together, ask questions and try to meme a little about this field.

in the previous thread we discussed why Julia has promise but is not delivering. How some people still use Matlab but hate it.

So, grab your favorite statistical software, dust off your textbooks, and join me in this exciting journey through the world of /psg/ - Probability and Statistics General! Let's embark on this adventure together and unravel the mysteries of data one statistical concept at a time.
>>
>>16113091
>in the previous thread we discussed why Julia has promise but is not delivering. How some people still use Matlab but hate it.
i don't think the data supports this conclusion
>>
why is random so scawy?
>>
can somebody tell me how to interpret the johansen test results?
Im going fucking insane trying to understand this

when I look it up I get conflicting answers, for instance:
this reply (https://stats.stackexchange.com/questions/600918/johansen-cointegration-test-in-python-statsmodels) to a stack exchange question says the OPs test fails to reject the null hypothesis when OPs test statistic is greater than the critical values
while this blog post (https://www.quantstart.com/articles/Johansen-Test-for-Cointegrating-Time-Series-Analysis-in-R/) says they can reject the null hypothesis because the test stat is much higher than the critical values
THEY ARE LITERALLY SAYING THE EXACT OPPOSITE THING
I think the latter is correct after looking at other sources, but its confusing the hell out of me

I implemented my own johansen test using the python statstools module and ran a test on something that *looks* cointegrated and got picrel results
can someone tell me if I have a cointegrating relationship?

explain it to me like Im a fucking retarded infant:
do I reject the null hypothesis (ie that theres no cointegration) when the test stat is greater or lesser than the crit val?

and secondly, what do these values mean in the context of r, r*, and k as they're defined in the wikipedia article for the johansen test (https://en.wikipedia.org/wiki/Johansen_test)?
is r* the test statistic and k the critical value?

Ive been trying to wrap my head around this for weeks and I think I might be retarded
related request: textbook reccs for understanding VAR, VECM, prerequisites for this kind of thing, etc
>>
File: IMG_8662.jpg (789 KB, 3843x665)
>>16113390
forgot to attach picrel
here is my test result
>>
File: E5v3YvJh.jpg (442 KB, 2416x3266)
>>
>subtle black hate statistics
Based way to start thread.
What is the most comprehensive book which has stats and explains it using code, eg python? I can do all of stats myself, but I don't actually understand anything. It feels like a set of rules that I follow and the answers are correct, but now I want to know the reason behind those rules
>>
>>16113097
I think you concluded correctly. The SAAAR who did the needful in reviving the thread from the archive did fine though.
>>
>>16113390
I have checked it a little and I think you are trying to see whether nonstationary data is cointegrated? The thing is though that the trace test checks one thing, and the eigenvalue test checks the other. I have honestly not used this in Python so have no idea if the package is any good or if you should use it.

I think you have to go past the documentation of the code and delve deeper into the maths of it all. What time series courses have you taken?
>>
>>16113665
>subtle black hate statistics
This thread is more racist than /pol/ but it flies under the radar because everything is backed by data and anyone complaining either has to show faults in the data (hah, good luck), methodology (no member of the bignosed tribe is in this thread).

>What is the most comprehensive book which has stats and explains it using code eg python.

I would recommend R because it is more accurate than the packages in Python. Which is important if you are betting money on anything you are doing.

What level of stats are you starting at when trying to code your way to understanding? This is a pretty novel approach to statistics, which I have only seen in one mass-market textbook so far.
>>
I think we need to update the opening text in the /psg/.
>>
>>16113749
doctoral. I use statistics daily but as a retarded robot. I know what to use and when but not why.
>>
I want to tell someone "this collection of data does not actually depict a function. It's just a sequential array of data, and reading some underlying function into it that generates it (and thus allows you to extrapolate into the future) is wrong".
What is the formal way to refer to this succinctly? Can I just say something is "non-functional"?
>>
>>16113944
You must be in some field adjacent to statistics because at that level you should know the assumptions and when a model holds. Have you checked if your model actually holds under the prereqs?

Also if you are doing research, why are you using Python and not R?
>>
>>16113971
Is the retard you are trying to flex on going the p-hacking route?
>>
>>16113745
no thats not correct
the johansen test is meant to be used on nonstationary data of any integration order above 0, provided that all your data is of the same integration order

the trace test and eigenvalue test both test for the same thing (trace test is just a slightly modified eigenvalue test)
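
If it helps, here's a minimal sketch of how I'd read the statsmodels output (the data is made up to be cointegrated by construction, and I'm assuming the coint_johansen function that lives in statsmodels.tsa.vector_ar.vecm, so double-check against your version). The decision rule is the quantstart one: reject the null of "at most r cointegrating relations" when the test statistic EXCEEDS the critical value.

import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

rng = np.random.default_rng(0)
common = rng.normal(size=500).cumsum()                       # shared I(1) trend
data = np.column_stack([common + rng.normal(size=500),
                        2 * common + rng.normal(size=500)])  # cointegrated pair by construction

res = coint_johansen(data, det_order=0, k_ar_diff=1)
# res.lr1 holds the trace statistics, res.cvt the trace critical values (90/95/99%), one row per r
for r, (stat, crit) in enumerate(zip(res.lr1, res.cvt[:, 1])):
    print(f"r<={r}: trace={stat:.2f}, 95% crit={crit:.2f}, reject null: {stat > crit}")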
>>
ngl notation in stats fucking sucks and it's only an intro class. i am scared of what horrors await in future courses
>[math]\overline S_{\hat \chi_{\nu_2}}^2[/math]
yeah bro just add more subscripts and accents, that'll make it more clear what the fuck it's referring to.
>>
>>16113751
I agree, an edition name here, important resources and wit there...
>>
>>16113091
Toll paid.
>>
>>16114263
No that's pretty much it. You can mix in some calc, some PDEs, some SDEs and so on, but the notation will be more in line with the general mathematics field and not the garbaggio that intro stats has.
>>
>>16113749
>I would recommend R because it is more accurate than the packages in Python
wut

>Which is important if you are betting money on anything you are doing.
everyone uses python in finance not R
R is for academics and nerds
>>
>>16114362
If you worked in a place where speed was of the essence, you would have done your calculations in C++ since that is fast. Python is for toy models and static models, not something that go bing ding bing on your traders screen.
>>
>>16113091
Wasn’t this study all self reported?
>>
>>16114439
yeah no fucking shit
research to develop strategies and do time series analysis is all done in python then the work is passed along to the code slaves to implement it in c++
>>
this is the most retarded statistics forum Ive ever seen
>>
>>
>>16114647
But if you work in model development, how come you have no idea what the model output is, the assumptions that go into the model's validity, and whether it is of any use for the thing you are looking for? Are you some kind of nepobaby hire on wallstreet?
>>
>>16114651
Could be. But I've learnt somethings here and there in just the few generals that have been up and running. I quite enjoy the mix of racism and math.
>>
5% of the population is responsible for 42% of the abortion. Guess what, it ain't the white bitches...
>>
>>16114708
what are you even talking about
youre replying to the wrong guy
Im not this anon >>16113944
>>
>>16114665
Any data from the past 2 decades instead?
>>
Is there any generalization of correlation that would make 0 correlation imply independence?
>>
>>16115179
It's redacted because it's too spicy and will uphold the worldview of anyone who is not a libshit.
>>
>>16113665
Study the three axioms of probability, and then the theory behind a hypothesis test on a normal distribution. That should give you an intuitive sense of knowing statistics. Coding the rules in python won’t help you for this specific goal
>>
>>16115276
Honestly it's a new approach on getting younger adults to understand statistics. It's very focused on the output of what you put into the model and working with larger datasets than is possible if you do the calculations by hand. I think you need to be able to do both, but let's face it most of the things you do in intro courses will be relegated to computers and programming anyways, so it's not like there isn't a basis for this kind of approach, even if it is more common in economics, accounting and applied stats programmes or courses.
>>
>>16115186
Independence implies zero correlation, so any acceptable answer would have to be something that is logically equivalent to independence, like factorizability of the characteristic functions or something.
>>
>>16115186
KL divergence or mutual information. These measures are ugly for continuous variables though, when it comes to estimation, if you do not know the pdf
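
For the discrete case at least, the "zero iff independent" property is easy to check by hand; a toy numpy calculation (the joint pmf here is made up purely for illustration):

import numpy as np

joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])                 # joint pmf of two dependent binary variables
px = joint.sum(axis=1, keepdims=True)
py = joint.sum(axis=0, keepdims=True)
mi = np.sum(joint * np.log(joint / (px * py)))
print(mi)   # > 0 here; mutual information is 0 exactly when the variables are independent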
>>
File: Capture.png (115 KB, 896x957)
There is a certain trial that can either succeed or fail. If it succeeds, one to three measurements (always greater than zero if success) are logged.
The measured values depend on one independent variable, some per-day randomness, some per-trial randomness. The trial is repeated with an increasing IV, typically until failure but sometimes not.
The success is certain until "big" IV (whatever that means, let's say ~170), then quickly decreases. The three measurements decrease linearly with respect to IV.

I have three estimators for where the last success will be.
The chart on the top shows daily estimates for the highest value of IV that will result in success.
To provide an example of what one day looks like, the chart on the bottom shows the non-adjusted measurements - I can't easily access the logged data now, just imagine that the linear fits are shifted down some amount.

Now, with all that being said:
- dip-mid tends to give more accurate and more precise predictions than the other two. Does it make sense to keep all three estimators anyway?
- If so, how should they be combined (average weighted by r^2, weighted by numbers pulled out of my ass that seem reasonable, ...)?
- There are also a few measurements for ID between 60 and 140, but they are very noisy, and roughly constant until ID is between 120 and 140. Should I keep them anyway and account for them somehow?
>>
>>16115460
Can you show the full equation? Because this smells like you need to use Adj R2 and not just the R2 out of the model. Furthermore, what is the error of the model?
>>
File: Capture 2.png (95 KB, 1475x617)
>>16115711
It's literally just simple linear regression, with the measured values offset by some guesstimated constant.
>>
>>
>>16116258
If you don't want to show what it is or how you set it up, fine. But if it's just data and then a linear MS approach, then it is what it is. I thought you actually had data, variables and an equation and some kind of approach to the linear regression.

For instance, let's say I have an area in Arizona or something, and I want a linear model for housing prices with regard to the number of rooms.

It could look something like Housing prices (Arizona) = beta0 + beta1*Rooms

and then depending on what kind of data you have, most programmes will find something, so you can develop the model further, hopefully without overfitting.

Before some nerd says my model is dumb, it's just a toy model, and not a real regression on housing prices in that area, because honestly those models can become pretty advanced.
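
If anyone wants to poke at the toy model, a minimal sketch with statsmodels (the rooms and prices arrays are fabricated on the spot, this just shows the setup):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
rooms = rng.integers(1, 6, size=200)
prices = 150_000 + 40_000 * rooms + rng.normal(scale=25_000, size=200)   # fake "Arizona" prices

X = sm.add_constant(rooms)              # adds the beta0 column
fit = sm.OLS(prices, X).fit()
print(fit.params)                       # [beta0, beta1]
print(fit.rsquared, fit.rsquared_adj)   # plain and adjusted R^2, as mentioned upthread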
>>
>>16116262
What is the cause of this?
>>
>>16116434
venereal diseases being racist again
>>
Anybody know a book about calculating different probabilities for gambling websites?
>>
>>16116495
Casino economics?
>>
>>16113091
not to even mention, white girls/woman with a black child are one of the least attracting things worldwide. no one wants a disgusting ape child. they will struggle to find a new man for the rest of their lives. best is to still abort the baby.

once you go black you will either psychically abused, your ape husband will certainly vanish when a kid is born and you're attractiveness a a white woman will be forever damaged. dating black is like a reversed status symbol, it will haunt your for the rest of your life. black guys will date anything, doesn't matter if it's a ugly whale.
>>
The CLT states that the sample mean/total of iid samples from any distribution converges to a normal distribution as the number of samples goes to infinity.
Why wouldn't it converge to the actual underlying distribution? this is really counterintuitive for very "abnormal" distributions like exponential
>>
>>16117604
>Why wouldn't it converge to the actual underlying distribution? this is really counterintuitive for very "abnormal" distributions like exponential
I suppose you want an intuitive answer rather than an abstract mathematical proof.
In which case, consider a fair die roll. This distribution is discrete, with a 100% chance of taking one of the values {1,2,3,4,5,6}, and 0% chance of taking any other value like 3.5. So why should the sample mean converge to it?

(By the way, the raw sample total does not converge at all: if the population mean is nonzero its expected value drifts to infinity with the number of samples, and even with mean zero you have to rescale the total before it settles into the normal limit.)
>>
>>16117604
Honestly most things just converge to a normal distribution regardless of what you had underlying, given large enough numbers. It's a very powerful theorem even if it's not exactly the simplest to really understand.

However, it has its downsides as well. In many fields where this is applied, maybe people should consider the possibility of fat tails and what that will mean for the treatment you are trying to do and information you are trying to gather.

In the previous thread there was some discussion on this and how it applies to finance, honestly quite fascinating stuff.
>>
>>16117604
It has to do with repeated convolution of integrable functions.

Most measure theoretic probability books spend some time going over the different strong and weak laws of large numbers. Takes a lot of measure theoretic infrastructure to get there.
>>
>>16113091
Here's an interesting problem, and I don't know if there is a closed-form solution:
Suppose someone draws from a uniform distribution on [0,1]. You write down a pair (x,y) in [0,1] x [0,1], and hand it to them.
First they read out the first element, x. If x is higher than whatever they draw, you get a payout of $1 minus x. If you guess wrong, they look at your second guess, y, and do the same. If their number is higher than the second guess, you get nothing.
How would you choose the numbers (x,y) to optimise your expected profit?
>>
>>16118487
isn't this a dynamic programming thing?
they normally are tricky the first time you come across them but this one should be ok, or maybe I'm misunderstanding
>>
File: 32EF1O7nEhgI.jpg (31 KB, 522x386)
>>
>>16118487
1) Write down the cumulative distribution function (CDF) for the uniform distribution on [0,1].
2) Use (1) to find, for any x, the probability that a draw from a uniform distribution on [0,1] is less than x.
3) Use (2) to draw a probability tree diagram (aka "game tree"), with the corresponding payouts on each branch as a function of x and y.
4) Sum up the payouts on each branch to get the expected payoff (as a function of x and y), and use calculus to maximize it subject to the constraints that 0<=x,y<=1.
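
Carrying out step (4) symbolically, assuming I've set the payoffs up right (the second guess only matters on the 1-x branch, and the (y-x) factor implicitly assumes y >= x, which the optimum respects):

import sympy as sp

x, y = sp.symbols('x y')
# E[profit] = P(draw < x)*(1 - x) + P(x <= draw < y)*(1 - y)
expected = x*(1 - x) + (y - x)*(1 - y)
crit = sp.solve([sp.diff(expected, x), sp.diff(expected, y)], [x, y], dict=True)
print(crit)                        # [{x: 1/3, y: 2/3}]
print(expected.subs(crit[0]))      # expected profit 1/3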
>>
>>16118130
>>16118282
neither of these address or even understand his question at all, you utter morons
>>
>>16117604
I don't think I was very helpful in this answer >>16118476

Now that it's not really early in the morning for me, let me try and build some intuition for you.

Take an (x,y) plane and draw a rectangle that is from (-1,1) and 1/2 unit tall. This is your sample distribution. You could also make it from (0,1), it doesn't really matter.

Now take another one of those and "pass it over" the first one, with a third curve given by the overlap between them as you slide the second one from the left to the right (for the actual convolution, the overlap length gets weighted by the product of the two heights, i.e. 1/4).

You'll get a triangle which touches the x axis at (-2,0) and (2,0) and peaks at (0, 1/2). This third curve is the convolution of your first and second rectangles. It describes the probability density function for the sum of two samples from your uniform between (-1,1).

Now, do it a third time by sliding the rectangular function over the triangle. You'll get a piecewise quadratic that goes from -3 to 3 with a peak in the middle.

If you repeat this sequence infinitely, your "support" for your final density will stretch out to cover the entire real line, and (after rescaling each sum back to zero mean and unit variance) you'll get that classic "bell curve" shape.

This example with the uniform rectangle for your density is the easiest to visualize for most people, but it works out the same way with any sampling distribution with a finite mean and variance. The only exceptions are distributions which do not have a finite expected value (e.g., Cauchy distribution), in which case you should not expect a convergent density or distribution from repeatedly sampling.
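
If you'd rather see it numerically, here's a rough numpy sketch of the same repeated-convolution idea (grid spacing and number of passes picked arbitrarily):

import numpy as np
import matplotlib.pyplot as plt

dx = 0.01
x = np.arange(-1, 1 + dx, dx)
rect = np.full_like(x, 0.5)                      # uniform density on (-1, 1), height 1/2

density = rect.copy()
for n in range(1, 5):                            # n = number of samples summed so far
    support = np.linspace(-n, n, density.size)
    plt.plot(support, density, label=f"n={n}")
    density = np.convolve(density, rect) * dx    # next convolution; dx approximates the integral
plt.legend()
plt.show()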
>>
>>16118769
kino. I love this thread. Now do it visually.
>>
>>16113091
She deserves to be alone for the rest of her life. Filthy, race traitor whore.
>>
File: 2fe.jpg (119 KB, 973x1200)
Statistics is just the pattern recognition of racism dressed in scientific garbs.
>>
>>16119999
Based, also checked.
Literal fucking pencil pushers getting handed numbers from the bean counters and then think it lets them know about the world
>>
>>16120014
Now imagine the statistical probability of someone getting those numbers while saying something that based? Makes you thingk.
>>
>>16119999
checked and keked
>>
File: 1710716732304164.jpg (181 KB, 1023x1024)
>>16113091
>"dating is exhausting"

Yeah, thanks to retarded roastie cunts like you.
>>
>>16120681
Don't bash the roasties too much. Statistically they are going to be fine mothers... for Tyrones spawn.
>>
>>16119999
baysed
>>
>>16118130
>>16118282
>>16118476
>>16118769
thanks for answer. i had a vague understanding of the proof with mgfs but this is more intuitive.
i guess the other question is about the results of clt. wouldn't the sample mean as n goes to infinity be simply outright wrong if the underlying distribution is something like exponential? the mean and variance would be correct but other than that it would be inaccurate.
what's the use of clt in those cases?
>>
File: peperead.png (14 KB, 134x134)
>>16121914
Hmmm, suspicious.
>>
>>16122231
indeed
>>
Actually in field employed people
How threatened by AI is your job? 5 year prognosis? 10?
>>
>>16123071
not at all
>>
>>16123071
My job is far more threatened by my online shit-posting and my rapidly declining mental stability than AI.

Though, ChatGPT is pretty good at schizoposting after "randomly interpolating" a few wiki articles so it might take away one of my favorite hobbies.
>>
Related question, how reliable is AI for data preprocessing right now?
To give a simple but concrete use case, suppose I have 1M social media profiles (unstructured text data), can I throw a batch of them into a (preferably open-source) LLM and extract the top 10 keywords/keyphrases?
>>
>>16123347
I wouldn't bet the house on it. But why would you do that if you have the data in a sort of structured data set? I would rather do it myself so I can trust the data. I have tested LLMs quite a bit since we now have at least five that are semicompetent. They do hallucinate a lot desu or make shit up just to give you an answer.
>>
Consider Bode's law, which provides a simple numerical rule for the distances of the planets in the Solar System from the Sun. Once the rule has been derived, through the trial and error matching of various rules with the observed data (exploratory data analysis), there are not enough planets remaining for a rigorous and independent test of the hypothesis (confirmatory data analysis). We have exhausted the natural phenomena. The agreement between data and the numerical rule should be no surprise, as we have deliberately chosen the rule to match the data. If we are concerned about what Bode's law tells us about the cause system of planetary distribution, then we demand confirmation that will not be available until better information about other planetary systems becomes available.
>>
>>16123269
New data on homocides based on race and socioeconomical status. Basically one of the races has the same amount of homocides as the lowest rung of a certain race.
>>
>>16124049
baysed
>>
>>16124688
No do this with a rolling mean of 15-30 years.
>>
How to into ARIMA?

everything I've looked at so far seems vague and short
>>
>>16126095
Download some financial time series and use a proper programming language.
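
If Python counts, the statsmodels interface is only a few lines (the CSV name and the (1, 1, 1) order below are placeholders; in practice you'd pick p, d, q from ACF/PACF plots or information criteria):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

series = pd.read_csv("prices.csv", index_col=0, parse_dates=True).squeeze()   # hypothetical series
fit = ARIMA(series, order=(1, 1, 1)).fit()      # AR(1) on the first difference with an MA(1) term
print(fit.summary())
print(fit.forecast(steps=5))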
>>
>>16113091
Well deserved for fucking a monkey.
>>
File: diaspora.jpg (63 KB, 1080x1144)
>>16116262
>>16116434
>the races that have the least sex get the fewest STDs
>>
>>16126718
What you smoking pal. Azns have sex like rabbits.
>>
>>16114263
greek = population parameter
latin = sample statistic
z=standard normal distribution
t = student's t distribution

that's basically it
having 3 separate definitions for alpha indicates you don't understand what hypothesis testing even is
>>
>>16113098
No I won't check if the code is broken, it was never broken, you're just going extra extra dry for your dragon full helm :)
trust the science
guess god just HATES you
maybe you'll NEVER get it maybe you'll spend your LIFE killing brutal dragons and never get your helmet LOL THOUGHEVER
t. programmer
>>
>>16127432
this
>>
>>16127617
>>16126910
I understand the memes are funny and it's frustrating to live in a world where you are told you can't believe what is directly in front of your face. With that said, do you double-niggers really have nothing better to do than shitpost about race-crime statistics? There's so much more that probability theory and statistics have to offer than a more educated way to hate black people for behaving as they always have.
>>
>>16128159
Only a triple-nigger can hem and haw so much and still say nothing.
>>
>>16128159
I think it's funny. Who cares if you are bored by it brownoid.
>>
>>16128371
Where are the brownoids? Show them to me.
>>
File: 1696343868635296.jpg (390 KB, 1819x847)
>>16113091
Is shit like this accurate?
>>
>>16128701
So oiur pitbull model has to be understood as stoachastic terrorism, less like a nigger and more like a Ted Kaczynski.
This makes perfect sense when you think about it. This is why pitbull and school shooters have the same target demographics.
Could pitbulls be based?
>>
>>16128371
I'm not brown, just old. I've seen these jokes 1000 times before and it has gotten stale. It's not 2015 anymore.
>>
>>16128741
Your soul is statistically significant within the brownoid area. Stop complaining like some old fart and either write about what you want to see or shut the fuck up.
>>
>>16128755
lol lmao even
>>
File: 1709163730493398.jpg (154 KB, 1002x834)
>>16128719
>Stochastic terrorism
At this point are the feds even trying to make coherent sentences and words? Anyone with a brain will just laugh at weird made up shit like this.
>>
>>16113665
I recommend R over python for statistics.
Check out this free book for elementary statistics
https://moderndive.com/
and https://www.statlearning.com/
for predictive models, it has python too if you like
>>
>>16128969
Moderndive seems really cool. Thanks for suggesting it.
>>
Why do people use significance levels like 0.05 or 0.01 etc? It seems very arbitrary for maths. We accept that 5% of the results are false positives? So for every 20 samples I take, one of them could be wrong? That just seems high
>>
>>16129031
What is the alternative? In general we want to know what is the probability that the results we got are coming from just getting lucky with our random samples rather than the process, and the P_fa metric gives us a way to quantify this notion.

When you are looking at something like Monte Carlo trials to test an algorithm or estimate some population mean, you're often looking at thousands of samples per trial. A 5% P_fa means that if the null is actually true and you run the trial 20 times (with all 1000+ samples each time), roughly 19 of those 20 should correctly fail to reject the null and about 1 will be a false alarm (your alternative hypothesis isn't true, you just got lucky with the samples). You can, however, make this arbitrarily restrictive by taking more samples (though you are in general bound by a ROC curve dependent on whatever particular threshold choice you set for your hypothesis test).
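
To make the 1-in-20 point concrete, a quick simulation where the null is true by construction (one-sample t-test on standard normal draws; sample and trial counts arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, trials, rejections = 0.05, 10_000, 0
for _ in range(trials):
    sample = rng.normal(size=100)                 # null is true: the mean really is 0
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    rejections += p < alpha
print(rejections / trials)                        # hovers around 0.05, the advertised false-positive rate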
>>
>>16129031
It's arbitrary. That's what we did in the early literature so people have been copying it ever since.

You can choose a better level if you want.
>>
>>16129069
It's fine to have alphas at 0.01 and 0.05, I don't get what the other anon is bitching about.
>>
>>16130021
He is lowkey brown and mad that people with high IQ have the same kind of humour about brownoids as people with low IQ.
>>
>>16130977
Brownoids, AMIRITE?
>>
Anyone have good books for bayesian techniques or stats?
>>
>>16132447
The classic is Bayesian Data Analysis by Gelman. It assumes you already have a relatively strong fundamental understanding of mathematical statistics but is really not that bad.

Bayesian Smoothing and Filtering is a fantastic textbook for Bayesian time series filtering/state estimation. It's developed a lot around the sort of state estimators you'd see for target tracking, but really you can use it for any time series data where you've got a linear (either exactly or approximately via linearization) forward state propagation and a quadratic objective/error function.
>>
Is the OR anon still in this thread?

If so, do you (or anyone else) have good recommendations for multi-objective optimization/multicriteria optimization textbooks? Looking to apply Pareto front/Pareto optimality and related notions to a regularization parameter search.
>>
>>16113665
>subtle
???.jpg
>>
I got hired as a data analyst, but my only quantitative education is one stupid data science bootcamp. It would probably be healthy for me to be thrown in the Tableau/Power BI mines for a while, but for better or for worse I have basically been given free rein to work on predictive analytics for my company, which currently has none. This means any ML model I train and use for predictive or explanatory purposes will be entirely up to me to figure out. Problem is, I'm dumb. And this is the point of this rambling post: what statistics should I know to perform predictive analytics on real world data? Am I gonna be okay with just knowing p value < 0.05 good, p value > 0.05 bad?

You'll probably read my post and wonder why the fuck I'm talking about predictive analytics like it's the same thing as statistical tests. I don't know man.
>>
>>16133056
what kind of data do you have access to? and what kind of questions are you looking to answer?
>>
>>16133056
How much are you getting paid? It sounds like I could do your job better than you can
>>
>>16133262
Student data. Grades, major, classes taken, periods of enrollment, limited demographics, and a butt load of esoteric school fields. I would like to start very simple and ask simple questions like what are the best predictors of graduation. And eventually start fine tuning questions related to that: if a student got an F in their first term, what does that do to their probability of graduation, etc.

The above kinds of questions are different than mere descriptive statistics, like what are the graduation rates for each major. But it strikes me as wrong to refer to either kind of question as a statistical question, because in neither case am I formulating a hypothesis or making an inference about a population based on a sample.

I've heard it said that ML is applied statistics. But I'm not sure what I need statistics for if I'm just calling .fit().predict() for a random forest classifier.
>>
>>16133277
Peanuts and yes you could. I don't know anything.
>>
>>16133285
Just need to explore the data some more, figure out a few relationships with some OLS and logistic regressions. Formulating hypotheses is easy:

>Null,
There is no relationship between X and Y.

>Alternate,
There is a relationship.

Since it doesn't seem like you've been given any direction, ask around "What kind of stats and information would help you with your job?".

Also, you should get data on the staff and teachers and the costs and revenues.
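
Something like this is all it takes to get started (the file name and column names here are invented, swap in whatever your student table actually has):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")                          # hypothetical extract, one row per student
df["failed_first_term"] = (df["first_term_gpa"] < 1.0).astype(int)

# logit: does a bad first term predict graduation (0/1), controlling for course load?
fit = smf.logit("graduated ~ failed_first_term + credits_attempted", data=df).fit()
print(fit.summary())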
>>
>>16113749
Machines are well known for their perfect accuracy, I don't know what you're talking about

>I would recommend R because it is more accurate than the packages in Python
>>
>>16113091
Any good textbooks on decision theory? I've finished Tao's Analysis 1 and am taking an introductory course on group theory
>>
What the fuck are "moments"? What unholy field of mathematics do I have to delve into in order to understand what they are?
>>
>>16133667
>Machines are well known for their perfect accuracy
I can't tell if this is a joke I'm too autistic to detect or if you truly believe this
>>
File: I9I.jpg (61 KB, 474x756)
>>
>>16133736
Instead of averaging the values directly, you take the square, cube, or fourth power first, and then average that.
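
In numpy terms (the exponential here is just an arbitrary example distribution):

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)

raw     = [np.mean(x**k) for k in (1, 2, 3, 4)]             # raw moments E[X^k]
central = [np.mean((x - x.mean())**k) for k in (2, 3, 4)]   # central moments E[(X - mu)^k]
print(raw)
print(central)    # the first central one is the variance; the higher ones feed skewness and kurtosis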
>>
>>16133813
Are moments the same as cumulants? When would you want to generate one rather than the other?
>>
>>16133851
Moments are where cumulants come from. Your cumulant generating function is the natural logarithm of your moment generating function.
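
For the first few orders you can expand that logarithm directly and write the cumulants in terms of the raw moments [math]\mu_k = E[X^k][/math]:
[eqn]\kappa_1 = \mu_1, \qquad \kappa_2 = \mu_2 - \mu_1^2, \qquad \kappa_3 = \mu_3 - 3\mu_1\mu_2 + 2\mu_1^3[/eqn]
so the second cumulant is the variance and the third is the third central moment.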
>>
>stats
>business (accounting)
>finance
>cs
>econ
What degree would you recommend to someone wanting to enter data or pricing analytics who has plans to start an ecommerce business?
>>
File: Untitled.png (163 KB, 1012x1030)
What's the biggest Dunning-Kruger check in probability?
>>
>>16133905
>Your cumulants generating function is the natural logarithm of your moment generating function.
Yeah, that's how I came to be aware of cumulants (though the version I saw used the log of the characteristic function instead). But whoever invented this must have been dissatisfied enough with moments to want to look at cumulants instead, so what is that reason? Or to put it another way, what can cumulants do that moments can't?
All I vaguely recall is a "law of total cumulance" that involves summing over partitions, which leads me to believe that cumulant statistics must have some kind of combinatorial interest. But I have no idea what this could be.
>>
>>16129031
Best practice is to just state the p-value exactly instead of reporting if it's over or under some arbitrary threshold
>>16133674
Idk about textbooks but Stanford Encyclopedia of Philosophy has good articles on it
>>
>>16134187
Honestly, if you want to fully understand this stuff you'll need to have a fairly strong math foundation and some patience to go through graduate level probability theory material (either measure theoretic or the higher level of the multivariate calc stuff).

One of the main things I've seen cumulants used for is for large deviations theory (finding the probability of rare outlier events). In the large deviations principle, the rate of outlier events is determined by the Fenchel-Legendre transform/convex conjugate of the cumulant-generating function of the distribution. The roots of the cumulant generating function also give you information about boundary crossing probabilities for random walks (related to the above example) and in general the cumulants tend to be used when you are "accumulating" samples (usually via summation).
>>
>>16134388
>Honestly, if you want to fully understand this stuff you'll need to have a fairly strong math foundation and some patience to go through graduate level probability theory material (either measure theoretic or the higher level of the multivariate calc stuff).
I did take a measure theory course, but for now I'm content with my partial understanding after bothering to read the Wikipedia pages for Edgeworth series (which provides the use case of approximating deviations from normality, with the polynomial truncation enabling root analysis) and Bell polynomials (which explains why the law of total cumulance looks the way it does).
>Fenchel-Legendre transform/convex conjugate of the cumulant-generating function
Huh, I did not expect the CGF to be convex, but apparently it is. The usual inequality (Jensen) doesn't work, but Holder's inequality (which I found after digging through my measure theory notes) enables a direct proof from the definition of convexity. I guess that's why it's a viable thing to approximate.
>>
File: rms.png (82 KB, 1600x844)
This may seem like a silly question so bear with me mathchads. what's so special about root mean square? why is it employed when it comes to computing standard deviation? why not employ root mean n (where n is even)? what about using absolute mean deviation?
>>
>>16134585
L^2, euclidean norm, geometry, pythagorean theorem, bilinearity, gaussians, orthogonal matrices, it all works together
>>
>>16134606
eli5 to someone who knows nothing about the stuff u mentioned except pythagorean theorem. What would happen if the limit is taken to infinity?
>>
>>16134614
>What would happen if the limit is taken to infinity?
root mean n, n->inf
>>
File: pythagoras in 3d.jpg (15 KB, 602x245)
>>16134585
>what's so special about root mean square?
Root and square appear in the formula for calculating distances, in some coordinate system (see pic).
Mean appears when these coordinates are interpreted as deviations from some expected value. For example, when an astronomer makes multiple observations of a celestial object, his recorded measurements will cluster around an average value, with a spread depending on his instrument's precision. The "law of errors", which is the historical name for what is now called the central limit theorem, says that this average (mean) and spread (standard deviation) can be fitted to a Gaussian distribution (or bell curve), allowing scientists to quantify the uncertainty of their measurements.
>>
>>16134627
It's safe to say you didn't manage this.
>>
>>16134627
>Mean appears when these coordinates are interpreted as deviations from some expected value
avg distances??
don't know what u mean with ur 3rd point
>>
>>16134585
That's not a silly question at all.
From a practical point of view, you probably should use the mean absolute deviation, for two reasons:
- It's what most people (this includes practitioners, sadly) intuitively think standard deviation measures, namely the `average distance from the mean'.
- It's more robust to outliers.

Some objections to this are that
- Squares are easier to work with, algebraically (you shouldn't care since you'd use a computer to do your calculations).
- It's what everyone else seems to use (this is mainly because of inertia).
- The SD is more efficient for normally distributed populations.
This latter point is worth examining more closely.

Relative efficiency is defined as the ratio of the variances of two estimators (this uses the standard deviation / variance to argue *why* the standard deviation is worth using, which seems circular, but we need something to use as a comparison measure).
Indeed, the SD of mean deviations is about 14% greater than the SD of standard deviations, but this is *only* when sampling from a normal distribution.
If you have very few outliers, even only, say, 2 every thousand, this efficiency reverses (because of non-robustness of SD).
The presence of outliers (true ones, not measurement error) is so very common in so many real situations that you really should think twice about interpreting SDs, or even using them at all in statistical analysis.
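
The efficiency claim is easy to reproduce with a quick simulation under normality (both estimators rescaled so they target sigma; sample size and trial count arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 20_000
samples = rng.normal(size=(trials, n))                 # sigma = 1, no outliers

sd  = samples.std(axis=1, ddof=1)
mad = np.abs(samples - samples.mean(axis=1, keepdims=True)).mean(axis=1) * np.sqrt(np.pi / 2)
print(sd.std(), mad.std())   # the mean-deviation estimator is noticeably noisier when the data really is normal

Salt a couple of samples per thousand with fat-tailed junk and the ordering flips, as described above.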
>>
>>16133786
The 2020 election is what made me interested in stats. There were autists and scholars who scraped the election data live from the website and then made some pretty fucked up analysis of it.
>>
>>16134513
nobody knows
>>
>>16134513
>How and why does Benford's law work?
The lead digits for the numbers 1-99 are evenly distributed, but the lead digits for 1-100 favor 1s, and the lead digits for 1-98 disfavor 9s. Wherever the range happens to cut off, the low lead digits are never underrepresented and are usually overrepresented, so quantities that range over several orders of magnitude end up favoring low leading digits.
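
A quick way to see it in action is any sequence that spreads over many orders of magnitude, e.g. powers of 2 (the choice of sequence is just for illustration):

import numpy as np
from collections import Counter

leads = Counter(int(str(2**n)[0]) for n in range(1, 10_001))
total = sum(leads.values())
for d in range(1, 10):
    print(d, round(leads[d] / total, 3), round(np.log10(1 + 1/d), 3))   # observed share vs Benford's log10(1 + 1/d)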
>>
>>16134935
>From a practical point of view, you probably should use the mean absolute deviation
What if I'm a Bayesian who wants to have a precision parameter in my priors? Using (SD)^-2 for precision gives me a wide selection of conjugate priors, but without an equivalent for MAD (some kind of kernel?) I can't articulate how much a new piece of information has reduced my uncertainty (or increased it, if the new data point is an outlier).
>>
>>16134935
I like MAE/MAD as robust estimators of dispersion, but they also kind of suck.

There's (as far as I'm aware) no equivalent to the Rao-Blackwell Theorem/CRLB for MAE/MAD, meaning that it is practically very difficult to determine if there's any "better estimator" for the quantity you are trying to determine from your samples.

MAE/MAD are also very inefficient in terms of sequential/recursive estimators, so when you are observing a process that you are incrementally getting new samples, it becomes essentially useless as a metric of estimator quality. You can have a recursive estimator that is doing fairly well in terms of MAD but is super tailed meaning you essentially have to throw out half of your recursive estimates anyways else you end up in no man's land.

Whereas conditional expectation (and as a result MMSE based estimators) have a proper analytical foundation regardless of whether you are incorporating priors, there's really nothing equivalent for "minimum median absolute error" or "minimum median absolute deviation" estimators (as far as I'm aware).
>>
>>16135229
You can't really do anything recursive with MAD/MAE. This is why they really are only useful for "robust estimators" on systems where you are pretty certain an MMSE estimator (Bayesian or otherwise) will be fucked.
>>
>>16135245
Anon who suggested kernels here. In theory, the difference between using MAD/MAE and MSE on your dataset boils down to choosing whether to plot your errors in L1 or L2 space, and by norm-equivalence this shouldn't affect the asymptotic properties of your estimators. (In practice, where samples are finite, this makes all the difference.)

This suggests that just as you can transform a dataset like {(x,y):y=exp(<b,x>+e)} so that nonlinear relations become linear and OLS becomes applicable (so as long as the transformation can be justified), you should be able to "transform your statistical analysis" so that absolute distances become root-mean-square, and the CRLB becomes valid. In the ML literature, this is known as the "kernel trick", although in practice the underlying mathematics is obfuscated by a bunch of optimizations, kind of like how the geometry of OLS is obfuscated by matrix calculations.
>>
>>16134706
Your RMS formula >>16134585 uses [math]\hat{y}_i[/math] instead of the more common [math]\bar{y}[/math]. Mathematically speaking, this makes it a conditional expectation instead of an ordinary expectation. In terms of the subsequent example of the astronomer, think of each [math]y_i[/math] as the measurement of a star's position at time [math]i[/math], and [math]\hat{y}_i[/math] as where the astronomer's scientific theories predicts it should be. The difference [math]y_i-\hat{y}_i[/math] can then be chalked up to measurement error of his telescope, which is distinct from [math]y_i-\bar{y}[/math], the difference of the i-th observation from the simple (ordinary) average of all his observations.

Unfortunately intuition can only take you so far on this, because the motivation for the formula comes from a body of theorems from mathematical analysis, so understanding where it comes from will require at minimum some level of understanding of multivariable calculus, and that's better taught in a university or textbook than a 4chan post. Don't give up on your curiosity though, it's an excellent motivator to learn the stuff.
>>
>>16135379
This might be a dumb question, but when you are writing MAE/MAD do you mean "mean absolute error" or "median absolute error?" I was wracking my brain trying to think about how you got L1 vs. L2 out of median absolute error/deviation and thinking you were talking about doing some rad-nik change of measures shenanigans, but I think we might be talking about different things when we are saying MAE.

Median absolute deviation is a super robust measure of dispersion that works super well when you have low sample density and really tailed distributions. It's especially good when you are doing non-linear estimation where you sometimes have non-convergent objective function minimization where you might have like 1 or 2 outliers in the 10^20 (or whatever other garbage meaningless floating point number your optimization routine finally calls it quits on).

Median absolute error/deviation are absolutely terrible metrics to tune any kind of estimator around because quantile estimation is super jank. You essentially need to do a full sort and batch calculation every single time you add a new sample, and that really doesn't lend itself well to recursive/iterative solutions.
>>
>>16135468
>This might be a dumb question, but when you are writing MAE/MAD do you mean "mean absolute error" or "median absolute error?" I was wracking my brain trying to think about how you got L1 vs. L2 out of median absolute error/deviation and thinking you were talking about doing some rad-nik change of measures shenanigans, but I think we might be talking about different things when we are saying MAE.
I think we are, the entire reply chain has been talking about means throughout.
As you rightly point out, medians have undesirable large-sample properties because they're maximally trimmed means, and would be better judged by recognizing them as quantile estimators (I know very little of this so can't comment further, although I can imagine some kind of cumulant-based theory >>16134388 working out).
>>
>>16135415
>Unfortunately intuition can only take you so far on this, because the motivation for the formula comes from a body of theorems from mathematical analysis
y is this then casually taught to high schoolers?
>>
>>16135517
I've seen many high school curricula that dump the standard deviation formula onto the students and teach them to calculate it mechanically, but I've never seen any syllabus that even dares to address the question >>16134585 of why we use THIS formula for spread, rather than some other formula like mean absolute deviation. Do you know of any?
>>
>>16135530
>Do you know of any?
of course not which was y I posted that question in the first place. It had bugged me since high school. the only potential reason that i could think of was the rms>=am>=gm>=hm inequality, where rms gives u the maximum possible spread in a given data set. But then again, y shouldn't one try employing root mean 4th, 6th, etc.?
>>
>>16135534
>Y shouldn't one try employing root mean 4th,6th, etc..
Now that you mention this, I recall from my own high school experience that we were taught that RMS was used rather than mean absolute deviation because it can be differentiated, and so we can calculate the best-fit line for our science experiments. But this argument is flimsy because, as you've pointed out, |x|^p is also differentiable for any even integer p, so what makes p=2 so special?
The best I can come up with is that "only p=2 enables us to derive more advanced formulas for a bunch of things, but understanding the significance of these things requires higher mathematics". Some of these things were alluded to by this anon >>16134606, and to give a bit of flavor, by "geometry" he probably means the angle between two points (considered as vectors from the origin), which can only be defined through an inner product, and that only exists for p=2.
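Concretely, the angle comes out of the inner product that the p=2 norm (and only it) carries:
[eqn]\cos\theta = \frac{\langle x, y \rangle}{\|x\|_2 \, \|y\|_2}[/eqn]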
If you're serious about seeking a deeper understanding, /mg/ can probably recommend someth-
>/mg/ is dead
WHAT THE FUCK
>>
>>16135534
>>16135546
There's actually a more significant reason why RMSE is used (aside from the relationship between MSE and the L2 norm of the residual distribution).

The naive estimator/predictor for a parameter is the sample mean. If your samples are i.i.d. with finite mean and variance, the normalized sample mean approaches the standard unit Gaussian.

Gaussians are special in two ways (outside of their relationship to convergence of random series). 1) They are fully specified by their first two moments (meaning that your "mean cubed error" or "mean super cubed error" don't give you any additional information if your errors are Gaussian distributed). 2) Gaussians are the second order maximum entropy continuous distribution. This means that even if your errors are not exactly Gaussian distributed (which they won't be for finite sample sizes unless your process is already linear and Gaussian) assuming that your errors are Gaussian will serve as a "worst case" as compared to all other distributions with the same first two moments.

In principle you could use the same concept for a 4th order maximum entropy distribution (which would have mean, variance, skewness and kurtosis) but in general maximum entropy distributions of this form are not guaranteed to exist, while the second order maximum entropy, the Gaussian, does for any specified finite mean/variance.
>>
>>16135570
Point taken on the Gaussian being entropy maximizing, but are you sure this isn't just another instance of p=2 being "special" among all Lp? It feels like it to me.
>>
>>16135570
>In principle you could use the same concept for a 4th order maximum entropy distribution (which would have mean, variance, skewness and kurtosis) but in general maximum entropy distributions of this form are not guaranteed to exist
Also, I'd personally have no (philosophical) issues with a "space of distributions" characterized by having a quartic polynomial as their cumulant-generating function. If the nonexistence of these distributions is of the same nature as that of the Dirac delta, then I'd follow Jaynes and call for them to be admitted on grounds of practical utility, while leaving it to the pure mathematicians to sort out the formalization.
>>
File: pepe.jpg (58 KB, 976x850)
>>16135570
>>16135582
>>16135614
isn't there a simpler explanation boyos? what do i need to know to understand what u just said as I've no formal education in math or stats post high school
>>
>why do we use this
I don't usually come to this board but I'm going to go ahead and throw this out there; math is to distract otherwise intellectuals from real world problems and who causes them. You fell for it because you kept getting gold stars but these were also being used to ostracize you.
>>
>>16135637
What kind of math is used to dupe people in your opinion?
>>
>>16135673
>What kind of math is used to dupe people in your opinion?
Cryptography
>>
>>16135614
it's not an issue with definitions (e.g., Dirac delta). The problem is that if the skewness is too large relative to the kurtosis, there are finite moment maximum entropy distributions that can't normalize. For some finite 4th moment distributions, there's no finite coefficient that you can divide the distribution by to have finite area under the curve on the whole real line (let alone have area 1).

>>16135582
Hilbert spaces (L2) do seem to be "special" in some sense. One reason I can think of is that L2 is the only space where the conjugate/dual space is exactly the same. The conjugate/dual space of L2 is also L2 and that isn't true for any other Lp. Not sure if that's the only reason but it's one that sticks out to me.
>>
>>16135933
The nomenclature for Lp spaces was a mistake, we should have gone with the reciprocal 1/p instead.
Things like 0<=1/p<=1 and 1/p+1/q=1 work out so much nicer that way, and even the self-duality of "L^1/2" becomes less miraculous.
>>
File: PowerCurves.png (593 KB, 4958x2984)
I was fucking around with some lifting statistics. I already knew that men on average are stronger than women (duh), but I wanted to know by how much. I went on OpenPowerlifting, downloaded their IPF dataset and did some sorting. Plotting the average, minimum, and maximum squat, bench, and deadlift weights against weight classes produces some interesting curves. I find the shape consistency between the average, minimum, and maximum curves to be fascinating. What's also interesting is that below 50kg, men tend to be weaker than or on par with women. My guess is that it's more common for women to be <50kg, leaving only troglodytes in the <50kg male category.
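
For anyone who wants to reproduce it, the aggregation is roughly this (the column names are my guess at the OpenPowerlifting CSV layout, so adjust them to whatever the download actually uses):

import pandas as pd

df = pd.read_csv("openipf.csv", usecols=["Sex", "WeightClassKg", "Best3SquatKg"]).dropna()
summary = (df.groupby(["Sex", "WeightClassKg"])["Best3SquatKg"]
             .agg(["min", "mean", "max"]))
print(summary)    # same idea repeated for the bench and deadlift columns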
>>
>>16135972
You also have to remember that some number of those <50kg men may literally be children. I obviously don't know the exact data you pulled but a lot of the lower weight males who used to do the "weightlifting" sport/club at my local gym were like 13-15 years old and still very much in development.
>>
Will shitting on humanities in my college essay make them think im based
>>
>>16135981
Humanities is for pretentious snobs who like to LARP as elitists. This is especially true for "artists". Vocations are for people who want to get shit done. STEM is for people who want to actually advance society.
>>
>>16135978
Good point. The dataset also lists the ages of the athletes, so it'd be interesting to see the results adjusted for age differences. I was just feeling lazy in this particular instance and neglected age entirely.
>>
>>16135981
Nah, not really. It just kind of makes you look like a tool with a superiority complex.

Some of the most difficult problems to solve in STEM are the ones that interface with the "humanities." There's also a ton that modern STEM owes (especially in statistics and differential equations) to previous efforts at solving problems like population modeling. Coding theory and computer science/information theory also owe a large debt of gratitude to linguistics (which is arguably a humanities field).
>>
>>16114665
yes because the luton sandniggers count as white in uk stats.
>>
>>16134119
it's natural to imagine it exactly in this way. you are choosing between the car being behind one door or behind the two others, with the latter choice of course being more promising. but i don't think pic related is of any help to those that don't get it, as they still won't take one step back in their mind. they will imagine it as if you opened 98 doors and there are 2 left, which means 50:50.
>>
>>16136149
That is a 50:50 though. It's analogous to putting 1 red ball in a bag with 99 green balls, then pulling out 98 green balls in a row. Your odds of either the next pull or the one after being red is 50:50.

You got Dunning-Krugered.
>>
>>16113091
>has perfect teeth early in her pregnancy
>gets braces 1.5 years later
Why/
>>
>>16134119
Discussions about Monty Hall usually degenerate into people throwing around probability calculations and yelling at anyone who disagrees with them, but I want to ask about something else for a change.

The Monty Hall gameshow scenario is, by definition, contrived and artificial. By itself, this is not a problem: so are Newcomb's paradox and the Sleeping Beauty paradox. But for those two, I can at least imagine a world where the scenarios arise (a panoptic AI that anticipates your choices, a tipsy driver who forgot whether she made a turn), whereas I can't think of anything for Monty Hall, and I suspect that this complete lack of any "real-world analogue" is a major reason why people find it so unintuitive, and why the debates are so unproductive.

tl;dr When am I going to use this?
>>
>>16136149
>>16136709
>They think there isn't a goat in the car
Sorry bros looks like you've been filtered
>>
>>16136845
I remember seeing the Monty Hall problem used in some multi-user information theory example in one of the books I was studying from. No clue if it's actually really used anywhere outside of toy examples in information theory, game theory, or decision engineering.
>>
>>16136895
>No clue if it's actually really used anywhere outside of toy examples in information theory, game theory, or decision engineering.
That's exactly my problem with it. A good paradox deepens our understanding of the foundations of a subject, and simultaneously of the limits of its domain of applicability.
By this standard, the Monty Hall paradox is a spectacularly bad one, with its discussions producing much heat but little enlightenment.
>>
>>16136845
>The Monty Hall gameshow scenario
That's not the Monty Hall gameshow scenario. It's just a probability question involving goats, cars, and doors.

As for a practical application of the lesson of the paradox... you could stretch it to say that in any search where you can't test all samples in a single batch, where you can get inconclusive results, and where you have a known number of positives to find which can't be directly tested for, it is always optimal to change the sample group being tested as much as possible every test. I'm sure that has real world applications.
>>
>>16136910
>you could stretch it to say that in any search where you can't test all samples in a single batch, where you can get inconclusive results, and where you have a known amount of positives to find which can't be directly tested for, it is always optimal to change the sample group being tested as much as possible every test. I'm sure that has real world applications
I can see your scenario arising in the "real world" from the problem of optimal kriging, but it sounds like quite a stretch to link the lesson you describe to the Monty Hall puzzle. Could you elaborate on this?
>>
>>16136922
The Host can only reveal negative results (goats), but they don't reveal every door, ergo the information they give is inconclusive. The player's selection can be treated as what to exclude from a test being run.

It becomes more clearly analogous if you extend the Monty Hall scenario to a second round of door openings, where if you switched to the car, a goat door will be opened, leaving you with the car definitely behind the one unopened door, but if you switched to a goat, the car door will remained closed, and you should therefore switch again.

Honestly the only real stretch is converting "test reveals up to one negative result" to "test may reveal one or more negative results", something much more likely to occur in the real world, but the underlying logic still holds and it can be scaled up to larger sample populations and varying batch sizes.
>>
>>16136930
>The Host can only reveal negative results (goats), but they don't reveal every door, ergo the information they give is inconclusive. The player's selection can be treated as what to exclude from a test being run.
This does clarify the analogy with kriging (or, for a more contemporary example, drone strikes); it seems we can motivate the Monty Hall game by reframing it as a targeting exercise.
>It becomes more clearly analogous if you extend the Monty Hall scenario to a second round of door openings, where if you switched to the car, a goat door will be opened, leaving you with the car definitely behind the one unopened door, but if you switched to a goat, the car door will remained closed, and you should therefore switch again.
Not so sure about this though. Assuming the 3-door version of the problem, this is the point where the rules of the game could break down (specifically, the rule that exactly 1 door is always opened each round). Exploiting this possibility to extract information might make sense for a logic puzzle, but less so for a probability problem like Monty Hall.
We can try an N-door version, with 1 car and N-2 rounds, where in each round a door is opened and you're given the option of switching to another door (for a small cost of [math]\varepsilon \ll 1/N[/math], say). Would the optimal strategy be to wait until the very last round and change targets to the last unopened door? That would be my guess, but without working out the math in detail, I'm not 100% confident about it.
>>
>>16136953
>Would the optimal strategy be to wait until the very last round and change targets to the last unopened door?
There are various ways to reframe the problem in this context depending on the number of doors, the number of rounds, the number of goats, the number of cars, the number of player choices, the reliability of the host, whether you as a player know any or all of this, when and how these values decrease, etc.

I think fundamentally you can think of it like this: any time a door is tested and not revealed to be a goat, the likelihood of it being a car increases beyond the general increase from the elimination of possible options, with added complications depending on whether reveals are guaranteed or not. So any optimal strategy is going to involve having your pick or picks be whichever doors have been tested the most. As for a situation where every door has been tested an equal number of times, I don't know that there's going to be an advantage in choosing any particular door.

I wonder if this logic could be/is applied to maze solving or modeling dynamic growth in an environment with obstructions.
>>
>>16136978
There are infinitely many ways to play with the problem setup, but if the goal is to explain the source of puzzlement, I'd rather stick to the basics and look at how people like >>16136709 are thinking about the problem.
Indeed, under this anon's model, the 50:50 (i.e. no-switch) answer is the correct conclusion. If the host reveals doors independently of their contents (subject only to the constraint that the revealed door isn't the one chosen by the player), and the universe explodes whenever the car is revealed, then it is true that switching wins in only 50% of the surviving universes (and not-switching wins in the other 50%). These exploding universes quantify the player's information gain, in the standard scenario, from seeing a revealed goat: whether the number of doors is N=3 or N=100, the revelations cause a proportion of universes to explode as given by
[eqn]
P(\text{car chosen}) \cdot P(\text{car revealed} | \text{ car chosen}) + P(\text{goat chosen}) \cdot P( \text{car revealed} | \text{goat chosen}) \\
= \frac{1}{N} \cdot 0 + \frac{N-1}{N} \cdot \frac{N-2}{N-1} \\
= \frac{N-2}{N} [/eqn]
In other words, only 2/N of the universes survive, which is just enough to raise the surviving player's switch-winning probability (= P(car chosen)) from 1/N to 1/2.

So I don't really see where your increase in likelihood "beyond the general increase from the elimination of possible options" is coming from, at least in the case without "added complications depending on whether reveals are guaranteed or not" (taking the standard scenario of guaranteed reveals as the uncomplicated case). But in all likelihood, I'm probably just not understanding your model correctly.
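If anyone wants to sanity-check the two host models numerically, here's a minimal Monte Carlo sketch (3-door case; the trial count and seed are arbitrary):

import random

def trial(random_host, rng):
    car = rng.randrange(3)
    pick = rng.randrange(3)
    others = [d for d in range(3) if d != pick]
    if random_host:
        opened = rng.choice(others)          # host ignores contents
        if opened == car:
            return None                      # "exploded universe": discard this run
    else:
        opened = rng.choice([d for d in others if d != car])  # host always shows a goat
    switch_to = next(d for d in range(3) if d not in (pick, opened))
    return switch_to == car                  # True if switching wins

rng = random.Random(0)
for random_host in (False, True):
    runs = [trial(random_host, rng) for _ in range(100_000)]
    kept = [r for r in runs if r is not None]
    label = "random host" if random_host else "rules host"
    print(label, round(sum(kept) / len(kept), 3))

The rules host should land near 2/3 for switching, and the surviving random-host universes near 1/2, matching the calculation above.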
>>
>>16137002
And of course, since the experimentally verified probability of winning from switching is (N-1)/N and not 1/2, there must be something wrong with modeling the problem in this way, despite its superficial plausibility. But what?
>>
>>16137016
Thinking more about this, the difference between the two answers seems to boil down to whether the event of revealing a goat is part of the game rules (in which case it occurs surely) or not (in which case it occurs with probability < 1, giving the player a reason to update their prior P(car chosen)=1/N of winning from not-switching, fixing a typo in >>16137002).

Does this have any real-world consequences? It does warn modelers to be careful when setting up probability problems, but under this analysis, the Monty Hall paradox is epistemological rather than substantial.
>>
File: 00000913.png (171 KB, 1000x1000)
171 KB
171 KB PNG
anyone has that meme of a guy on stack overflow (or a similar website) saying that his boss reorganizes all data points to get a perfect correlation one and he is unable to convince him he is wrong?
>>
>>16137102
https://stats.stackexchange.com/questions/185507
>>
File: Fl6yw.png (17 KB, 650x721)
17 KB
17 KB PNG
>>16137104
lmao
>>
>>16137002
>beyond the general increase from the elimination of possible options
Let's take the original Monty Hall Problem for example.

If the Host were just acting randomly, then eliminating a door by revealing a goat leaves the unopened door (which was part of the 2 doors the Host was evaluating) with a 1/2 chance of containing the car. At that point there are 2 doors left and it's 1 of those 2 doors, so it's just a 50:50.

However, because the Host can only open goat doors, a door not being opened by the Host has a 2/3 chance of containing the car. 2/3 - 1/3 > 1/2 - 1/3. The increase in probability that the unchosen door contains the car in the latter scenario is beyond that of a simple elimination of options.
>>
>>16138336
that is insane.
>>
I went ahead and broke down the weightlifting statistics that I was messing around with by age. There are a few anomalies that are a result of a small n for specific demographics. For example, there are very few bodybuilders in this dataset that are above 60 years old and under 50kg.
>>
And one for squatting.
>>
>>16138532
The Lakers got 19 free throws in yesterday's game, Denver got 6
>>
>>16117604
Think of it as collecting data on where the mean is. Why would the mean follow the distribution it's governing? This means your mean for your distribution can be all over the goddamn place, especially if the distribution is not symmetric. Ask yourself: why would the location of the mean be anything BUT symmetric? Think about what that would mean. It's not as abnormal as you'd think.
>>
>>16123071
0, if anything it's fucking skyrocketing me to the top. The executives have NO idea how to interpret what is being told to them. I'm now 5x more powerful because I understand what is going on AND can do things so quickly it makes their heads spin.
>>
>>16132512
All time series processing is Bayesian in nature. Fundamentally it's all about using the current information to predict the next state, and then correcting the current pool of information once the next observation has been made.
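A minimal sketch of that predict-then-correct loop, in case it helps: a scalar Kalman filter tracking a random walk (the noise variances and data below are made-up illustration values):

import numpy as np

rng = np.random.default_rng(0)
q, r = 0.1, 1.0                    # process and measurement noise variances (assumed)
truth, obs = 0.0, []
for _ in range(50):                # simulate a random walk observed through noise
    truth += rng.normal(0, q**0.5)
    obs.append(truth + rng.normal(0, r**0.5))

m, p = 0.0, 10.0                   # prior mean and variance for the state
for z in obs:
    p = p + q                      # predict: push the current belief forward
    k = p / (p + r)                # correct: weight the new observation
    m = m + k * (z - m)
    p = (1 - k) * p
print(round(m, 2), round(truth, 2))   # posterior mean vs. the actual state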

>>16132447
Yeah, I'd play around with some code honestly. For the most part, it's literally just Bayes' equation applied, so if you understand that you're fine. Almost every statistical model has a Bayesian form which is almost exactly the same as the 'classical' form, but relies on prior information (i.e., an additional input) for the parameters of the model at some level.

If you understand MLEs, and the likelihood function, most Bayesian models will make sense to you (it's just modifying the likelihood function with more information).

The other nice thing about Bayesian modeling is that A LOT of methods out there, especially those developed by novices or engineers, are actually Bayesian in nature. Think of the Kalman Filter or Lasso Regression. Once you've interpreted them as Bayesian, a whole world of modeling and generalizations opens up.

Just also know it's not the be all end all either. It may have better probabilistic interpretations depending on what you're doing, but typically it's more computationally expensive, and most importantly, it's actually more complicated. Many 'frequentist models' have REALLY simple interpretations when you separate them from statistics. Like linear regression can just be viewed as a simple linear algebraic problem, nonlinear regression can be viewed as a simple optimization problem. While Bayesian methods are also all optimization problems, they FORCE you to understand and interpret things through probabilities.

In other words, you can formulate regression methods without even thinking about probability or statistics, but you need to understand probability and statistics to really use Bayesian methods.
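To make the Lasso/Kalman point concrete with the simplest case I can manage: ridge regression (the L2 cousin of Lasso) comes out identical to the MAP/posterior-mean estimate of a linear model with a Gaussian prior on the weights. A rough sketch, with all numbers made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.5                                   # observation noise std dev (assumed)
y = X @ w_true + sigma * rng.normal(size=100)

tau = 1.0                                     # prior std dev on each weight (assumed)
lam = sigma**2 / tau**2                       # equivalent ridge penalty

# "Frequentist" view: minimize ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Bayesian view: posterior mean under w ~ N(0, tau^2 I), y ~ N(Xw, sigma^2 I)
post_cov = np.linalg.inv(X.T @ X / sigma**2 + np.eye(3) / tau**2)
w_map = post_cov @ (X.T @ y) / sigma**2

print(np.allclose(w_ridge, w_map))            # True: same estimate, two stories

The penalty is just the noise-to-prior variance ratio, which is the sense in which the regularizer is a closeted prior.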
>>
>>16133736
What do you mean? They're fairly straightforward.

They're just the expected value of the powers. If you want to know why they're useful, ask yourself why Taylor expansion is useful. Polynomials are basically always useful.

Oh, it's also easier to use moments for deriving certain properties of a distribution or checking whether two distributions are the same. Basically another tool in the derivation/proof box.
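A quick numerical illustration, using Exponential(1) as an arbitrary example since its k-th moment is known to be k!:

import numpy as np
from math import factorial

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)
for k in range(1, 5):
    print(k, round(float(np.mean(x**k)), 2), factorial(k))   # sample moment vs. k!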
>>
>>16139669
Very good and meaty poast. I like how you think and I kind of understand now why bayesian methods are good. If applied correctly.
>>
can some anons explain the difference between factor analysis and cluster analysis?
>>
what book should I read to be able to write a PhD thesis like this?
https://www.cs.ox.ac.uk/people/yarin.gal/website/thesis/thesis.pdf
Assuming I'm not too dumb cause I think I understand most of what he's saying in the thesis.
>>
>>16139669
Bar-Shalom in his Estimation textbook has some footnote joke that goes something like "Lemma: There are two kinds of people in this world, people who agree and people who don't. Corollary: there are two types of statisticians, proud Bayesians who use priors and closeted Bayesians who use priors."

It's been a couple of years since I've gone through that book but it's got a lot of random jokes like that.
>>
>>16141431
Spend some time going through Bishop's Pattern Recognition and Machine Learning (assuming you have some probability background) and then go through either Bishop's Deep Learning: Foundations and Concepts or Prince's Understanding Deep Learning.
>>
Is business analytics data analytics for dummies? What's stopping data analysts from doing the job of a BA? Not knowing how markets and competition work? Obviously you do know or you wouldn't be working in a competitive field. Is it pure autism and you people aren't allowed to speak to others or you will scare them away? Would you say DA or BA is more future-proof?
Also, what's the deal with Kaggle? Do I need to be competing and winning on it or is my resume worthless?
>>
File: 1705613469051319.png (268 KB, 593x519)
268 KB
268 KB PNG
>>16113091
Hello frens, mathlet here. Can anyone give me ideas on how to statistically generate data? Let's say, for example, I want to generate different widths, heights, and lengths for boxes by category (small / medium / large), and I want the data to be varied and maybe reflect some real-life truths. Is there a way to do this? I tried generating according to a distribution I picked, but meh.
>>
>>16141958
Why would you want to generate data to analyse? You're just analysing the process of generating the data. And what do you mean by "meh"? Or are you talking about resampling?

if you just want some numbers to work with go to kaggle
>>
>>16113091
>be me
>taking my last statistics exam before the final
>need a B or above to pass with a C
>prof. forgot to print the p-value lookup table so she's looking it up while we work
>prof. can't tell the difference between an image and a video thumbnail and keeps starting youtube videos blaring ads and then closing them
>no clock or timer in the room so i'm rushing
>skipped 50% of the multiple choice questions
>finish the first free response pretty quickly
>completely lost on the second free response
>yup im fucked
>thinking about how I'm going to have to take this shit again
>hear sniffles coming from the other side of the room
>girl starts full out crying
>can't help but start laughing
>start to feel bad because of it
>not an asshole just absolutely fucked and the crying just put me over the edge


Well cya guys again this fall
>>
>>16143011
you sure do seem to like talking about yourself on social media
>>
>>16143011
How many of you are diagnosed aspies?
>>
>>16143027
There are a lot of assburgers in stats.
>>
>>16143536
delicious assburgers
>>
bump
>>
>>16145278
bump 2
>>
Does anyone have any good recs for non-stationary stochastic processes of bounded variation?

It seems like most of the meaningful work on continuous-time processes relies on them being WSS, but I don't think that's actually necessary provided the moments are finite and it's a BV process. Not sure though.
>>
File: file.jpg (65 KB, 512x668)
65 KB
65 KB JPG
>>16143011
>Well cya guys again this fall
Same, graduation is looking less and less likely; I'm too much of a brainlet for this.
>>
>>16146061
Stats major, math or some engineering major? Any particular area that's tripping you up?
>>
File: t_test_type_2_error.png (227 KB, 1000x880)
227 KB
227 KB PNG
>>16113091
I'm in statistics class, and there's an entire set of problems that I don't understand. It basically goes that I'm testing the mean of a population using the t-test. I have a given alpha (significance level) and I need to calculate the type-II error given an actual mean and standard deviation (everything is normal). I'm pretty sure n can be assumed in all these problems. The homework acts like I can use the picrel to determine this without software. Can I get any help?
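For comparison, this is roughly what the software answer looks like (a sketch with assumed numbers for the hypothesized mean, true mean, standard deviation, n, and alpha; one-sided test via the noncentral t):

import numpy as np
from scipy import stats

mu0, mu_true = 50.0, 52.0          # hypothesized and actual means (made-up values)
sigma, n, alpha = 5.0, 25, 0.05    # sd, sample size, significance level (made-up)

df = n - 1
ncp = (mu_true - mu0) / (sigma / np.sqrt(n))   # noncentrality parameter
t_crit = stats.t.ppf(1 - alpha, df)            # one-sided rejection cutoff

beta = stats.nct.cdf(t_crit, df, ncp)          # P(fail to reject | true mean)
print(round(beta, 3), round(1 - beta, 3))      # type-II error and power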
>>
>>16143011
That sucks
>>
File: 1690220138800076.png (183 KB, 1044x957)
183 KB
183 KB PNG
I'm trying to read Durrett's book and, what the fuck, it's so terse.
I like the direct exposition, but I'm getting filtered at page 3.
>>
>>16147579
Durrett's book is really not an introduction to measure theory. It claims it can be read by those who are not familiar with measure theory and are learning it for the first time, but I really don't think so.

Axler's Measure, Integration & Real Analysis and Richard Bass's Real Analysis for Graduate Students are both free and give a pretty good introduction to the topic. I'd say if you can get through the first 4-5 chapters of either of those books you'll have all of the measure theory you need to make the first part of Durrett's probability book easy.
>>
>>16147895
Why do you need measure theory?
>>
>>16148786
Because probability (at least the way modern mathematicians do it) is all about measures and Lebesgue integrals with respect to probability measures.

You don't really need it for a more "old fashioned" approach to probability. There's plenty of calculus based probability theory books that work for a large number of circumstances where your transformations of random variables are fairly simple and your random variables are standard distributions. If you want a good book for graduate level multivariable calculus based probability, Papoulis, Probability, Random Variables and Stochastic Processes is the standard most point to in engineering.

However, the measure theoretic approach gives you a few serious advantages (which is why mathematicians have more or less solely been using it since the 1970s or so).

Firstly, the measure theoretic approach is consistent regardless of whether your distributions are discrete/continuous/mixtures. By defining a distribution based on a general probability measure, you can have much more flexible definitions that allow a wider variety of problem solving approaches.

Secondly, the measure theoretic approach to conditional expectation is a lot more flexible than the typical "least-squares" approach you see in typical applied probability books. This is super helpful if you are trying to do any MMSE estimation of any quantity based on indirect observations.

Thirdly, functions of random variables are much simpler to handle via Radon-Nikodym derivatives and the change-of-measure procedure than via the Riemann integral formulation (which really only works if your transformation is one-to-one).

Lastly, the measure theoretic approach to convergence of random sequences/series is so much stronger than the Riemann integral approach and lends itself to a much more meaningful development of stochastic processes.
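As a minimal illustration of the first point, the same expectation formula covers the discrete and continuous cases, depending only on the measure:
[eqn]
\mathbb{E}[g(X)] = \int_\Omega g(X(\omega)) \, dP(\omega) =
\begin{cases} \sum_x g(x) \, p_X(x) & \text{(discrete)} \\ \int_{\mathbb{R}} g(x) \, f_X(x) \, dx & \text{(continuous)} \end{cases}
[/eqn]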
>>
>>16148873
Even though I really like your exposition and what you wrote, could you give a practical example? I see what you are talking about, but I need to square it away with something practical.
>>
>>16143011
Either way, you should've brought your own printouts. Undergraduate stats are piss-easy. Even graduate level stats are pretty easy and I use the same printouts you do when the need arises. Hell, I only use SPSS. I don't need anything else.

Pic somewhat related.
>>
>>16113091
You ever notice wmbf makes way smarter, peaceful and talented kids than bmwf which makes complete pale orcs?
>>16113097
>>16114442
>>16114286
>>16117432
>>16128701
>>
>>16150420
Why do you only use SPSS?
>>
>>16150706
I don't need anything else. The only people who use R are asians, specifically chinks and gooks, who use it just to flex on fresh grad students who have never used it before. They spend two weeks "teaching" R when you could just fire up SPSS and do the same thing in an hour. Less if you know what you are doing.

The only justification they can muster is that R is free, but SPSS is free also.
>>
>>16150706
>>16150755
To give an example, I took a class when I was still doing classwork (I'm dissertation now) and a korean new hire was teaching it. He spent AN ENTIRE CLASS going through how to install R and python and all that shit on Windows. Three hours, did nothing, just tech support. Sorry, but if you're wasting class time teaching morons how to install shit like python then you're wasting my time.

Never went back. He bumped into me in the hall and asked why I dropped his class, I told him it wasn't for me. I know R can do some things that SPSS can't, but if you're running statistical analysis for bullshit grad classes you do not need the full capabilities of R. SPSS will do it just fine. Pirate it and be done with it.
>>
>>16141688
On business analytics, sort of yes. The real thing with them is knowing how to explain things directly to the c-suite sharks, catering to their needs and thinking about what information best suits them. The best business analysts I have found are smart enough, but really are just a different form of an MBA.

If you're not cut out to be an MBA you probably won't make it in business analytics.

If you have done nothing, it's not a bad idea to do some Kaggle exercises, build a dashboard, give a presentation, or write a paper (honestly, do all three). Show skills beyond just analysis; technical writing and presentation are big in the field. Put it on a website too.

Outside that, once you're working, you don't need to do that shit anymore unless you're in FAGMAN world. With real experience all that matters is your resume.
>>
>>16141958
Okay, realize that you don't want to generate width, height, and length.

Likely you want to generate volume, largest surface area, and largest side length, and construct the box from those. Basically, box dimensions don't exist in isolation; typically they are designed together, so the variables for all three sides aren't actually independently random but are related to each other.

I'm not sure of the problem, but it is likely easiest to bucket volume into your sizes and then determine surface area and largest length. I'm not sure what problem you're going for there, but in shipping I'm sure there is a fixed maximum for surface area that you have to watch and probably some rules about volume. I doubt they have as many rules about lengths, although, on the flip side, airlines have strict rules on all three lengths for carry-ons. It completely depends on what you're modeling.
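A rough sketch of what that could look like (the volume buckets and aspect-ratio spreads below are made-up numbers, not real shipping rules):

import numpy as np

rng = np.random.default_rng(0)
buckets = {"small": (1e3, 8e3), "medium": (8e3, 4e4), "large": (4e4, 2e5)}  # cm^3, assumed

def sample_box(size, rng):
    lo, hi = buckets[size]
    volume = rng.uniform(lo, hi)
    # tie the sides together through aspect ratios instead of sampling them independently
    r1, r2 = rng.lognormal(0.0, 0.25, size=2)   # length/width and height/width ratios
    width = (volume / (r1 * r2)) ** (1 / 3)
    return r1 * width, width, r2 * width        # length, width, height

print([tuple(np.round(sample_box("medium", rng), 1)) for _ in range(3)])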
>>
>>16150706
>>16150762
He's a brainlet. Anything with a GUI just literally infuriates me at this point. It's so sluggish just to get to what I NEED!

I will say it's probably not horrible for learning, especially for those with no coding experience, but you do need to abandon it if you're being serious. The speed boost you get is like the speed boost from learning LaTeX for the first time.
>>
File: studentstatus.png (9 KB, 897x68)
9 KB
9 KB PNG
>>16150833
Being a brainlet got me this far, not even sorry. SPSS just wurkz. I have no need for anything R does "faster" because I can do it in my head to the point where it's close enough and thus there's no point to even turning on the cumpooter. All my data is automatically fed into SPSS at the end of the day and whatever I don't want to fuck with too much I can pull out as needed.

I don't think people realize just how easy SPSS is to use because they are either too stupid to set up automation or too scared to pirate.
>>
>>16150858
I'm sure you can get by on it, just like you can get by on Microsoft Word.

It's just slower. Almost every software of any note allows automation. I know people who do all their analysis in Excel with formulas and macros set up for automation; I still would never do that to myself because of how slow setting that up is.

Not only that, the benefit of coding is that it's quite narrative in flow. I can look at scripts and code and understand exactly what is going on (even if it's formatted by someone with little skill).

That's much more difficult to do with dedicated software, which typically is made to help and put guardrails up for those with low skill.
>>
>>16150889
It shows you aren't a researcher.
>>
>>16150762
>He spent AN ENTIRE CLASS going through how to install R and python and all that shit on Windows.

Installing python/software should be used as a filter.
>>
>>16150420
>Either way, you should've brought your own printouts.
No note sheets or printouts are allowed in this class by the prof (even though the syllabus stated we would be permitted 1 page of notes on exams).

I did learn afterwards, from another student, that certain Casio calculators have all the tables I'd need for this class.

>Undergraduate stats are piss-easy
Yeah, probably, but the median course grade in this class is still a failing grade. I'm above the median and still probably going to fail. Mostly because I'm retarded though.
>>
>>16150858
I work in industry and do all my stats shit in R. I could do it in Python if I wished, but I think R is prettier and gives better graphics, so the boomers in charge like it more.

What is the utility of SPSS outside of government and academia?
>>
>>16113091
>quantum physics has entered the chat

Stats and probabilities you say?
>>
>>16151812
If the syllabus stated you were allowed one note page, and the professor changed that last minute or without class consent, go to the department head. Then go to the dean. Then go to the Title IX office. Cause a stink.
>>16152004
Probably about the same, depending on what it is you're actually doing. R does "more" but what R does I do not need to do and SPSS plays nicer with my other software for automation. I have no idea why that other anon thinks data automation is a bad thing because it's "slow" since it saves me days of time by this point.

I would only use R for something small like pic related where I do not have access to the data itself, but even then I've reached a point where I can kinda eyeball it and put something together in my head. As a result, I don't need to use R.

I do not think you should pay for it because anyone worth mentioning or working for already has a SPSS contract. I just do not want my personal PCs on the list of approved accounts because FOIA applies, based on baby boomer paranoia rumors.
>>
>>16152030
I just knew you were in a soft field like psychology.
>>
>>16152096
Acktually, I'm in the business school. Either way, if you're doing statistics in the "hard" sciences you're basically a grunt or a cubicle dweller.
>>
>>16134585
The root is just to get the units right.
The reason sum of squares is used for error is because x^2 is positive and doing calculus to minimize it results in a nice easy linear equation.
It was first used for regression in the early 1800s for this reason, way before the statistical underpinnings were established.
The statistics people came about 100 years later.
Don't let the stats nerds retcon some post hoc justification.
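To spell out the "nice easy linear equation" bit with the simplest no-intercept example:
[eqn]
\frac{d}{d\beta} \sum_i (y_i - \beta x_i)^2 = -2 \sum_i x_i (y_i - \beta x_i) = 0 \quad \Rightarrow \quad \hat{\beta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}
[/eqn]
Swap the square for a cube or an absolute value and setting the derivative to zero no longer gives an equation that's linear in the parameters.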
>>
>>16152119
You are not even using Stata. You must be some nepobaby MBA.
>>
>>16134585
Why is RMS more valuable than, say, root-mean-cube? What makes Pythagorean distance so generalizable?
>>
>>16152168
There's an easy answer and a more general answer. This guy >>16152132 is somewhat right about the early rationalization for using MSE as your error criterion before the formalization of Lp spaces later on. The simple answer is that your MMSE criterion has a simple geometric interpretation, makes for easy linear solutions via least-squares, and is a good approximation of a wide variety of real scenarios.

The more complicated answer is that the L2 is also the only Lp space whose dual is exactly the same as itself. This means if your likelihood is fully characterized by its L2 representation (as Gaussians are) then maximizing your likelihood is the same as minimizing your MSE (in the unbiased case). Thus you can solve the harder problem of minimizing the mean-square error via the easier problem of maximizing the log-likelihood (in this case).

If you were to use mean-cubed-error, you'd have a circumstance where your answer could be either positive or negative, and you may have two complex roots. It also doesn't tell you all that much if you minimize the skewness of your estimator. You could have a totally non-skewed estimator that is complete junk (e.g., a uniform random variable centered at the right place but uses none of the information in the measurements). Being un-skewed has little meaning for general performance, while being unbiased (meaning your mean error is zero) and efficient (meaning your MSE is as low as is possible) has clear meaning for performance purposes.
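For the Gaussian case mentioned above, the connection is just that with i.i.d. [math] \mathcal{N}(f_\theta(x_i), \sigma^2) [/math] observations (writing [math] f_\theta [/math] for whatever model is being fit),
[eqn]
\log L(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n \big( y_i - f_\theta(x_i) \big)^2
[/eqn]
so maximizing the log-likelihood over [math] \theta [/math] is exactly minimizing the sum of squared errors.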
>>
>>16152216
I like the complicated answer. It feels rigorous.


