I've downloaded math Wikipedia for some file embedding & math text subject classification work. I'm doing a statistical analysis overview, right now and in the coming weeks. Q: Is there anything you'd be interested in knowing? It's circa 40k articles; I have all inter-links, the given math categories, texts, number of edits and such. E.g. I could tell you that the longest article is https://en.wikipedia.org/wiki/Magic_square (unless you include cryptography in a wide sense - then you get WWII articles also), while the shortest is pretty much https://en.wikipedia.org/wiki/Consistency_(knowledge_bases) One purpose is "prompt engineering"-engineering work. On the embedding side (just to match files) I'll start with /huggingface/sentence-transformers since I've used it before. So far I've spent most time on the categories and subject classification - and there's a lot to say and do there. But in principle I can check out anything. I'll eventually do more with the interlinks, but only in the coming months. Likely I'll eventually also look at the text itself. And I'll eventually summarize in a pdf, dataset, or just a video. So you can pitch me short (data or statistical) questions or even bigger ideas for what else I could do. Doesn't have to be a quick thing, I'll have to spend a good while on this.
https://en.wikipedia.org/wiki/Heyde_theorem
I have thoughts of statistical analysis of this board. Very curious which % of threads are about IQ.
>>16879921 I know there are various websites that continuously track e.g. /biz/ and do sentiment analysis, even for free I think. But that will not really be my data domain in this case. That said, here are all math Wikipedia articles with the substring "IQ" in them. There's none with " IQ".
>>16879862 PageRank the math wiki pages; it could be fun to see which pages within the network of Wikipedia pages effectively have a lot of impactful links pointing to them. https://www.mathworks.com/help/matlab/math/use-page-rank-algorithm-to-rank-websites.html
>>16879945 Ah, that could possibly be done. Albeit the link dataset is 10GB for all of Wikipedia. The math part is about 0.07%, so still not totally tiny, and I don't know how the pagerank algos converge. Is your idea to find out which math topics are central w.r.t. that metric? My fear is that certain known/basic pages are just linked to more than others. And for popularity, I already got the page views for free. But still, if there's a use, this might be done, I think.
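On the convergence worry: a minimal power-iteration sketch of PageRank, assuming the link data has been reduced to a dict mapping each page to the pages it links to. The function name, the toy graph and the damping value 0.85 are illustrative assumptions, not anything taken from the dataset. With damping below 1 the error shrinks roughly geometrically per iteration, so a graph of a few tens of thousands of pages should settle in well under a few hundred passes.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Toy link graph (hypothetical, not from the dataset).
toy_links = {
    "Group": ["Ring", "Set"],
    "Ring": ["Group", "Set"],
    "Field": ["Ring", "Set"],
    "Set": [],
}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page:6s} {score:.4f}")
```

In the toy graph the basic page "Set" comes out on top simply because everything links to it, which is exactly the "known/basic pages are just linked to more" effect feared above.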
>>16879951 Yes, what I'm thinking is that PageRank, besides just views, might indicate a different kind of relevancy of a topic. But yeah, you're right, it could be that the popular pages just get linked to more because they're better known. But that's kind of good to know as well, I think.
Trivia: Most unique footnotes in one Wikipedia article: 24-cell (180 footnotes on 18 March 2024 but the number is down to 175 as of 25 January 2025). https://en.wikipedia.org/wiki/24-cell Source: https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_records
The transformer embedding tool was easier to write than I thought and it works fantastically. E.g. pic related: a file association with the first sentence in the post >>16880309 at /mg/. >>16879982 nice list
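A rough sketch of what such a sentence-to-file association could look like. This is not the tool from the post: the `embed` function here is a toy word-count stand-in so the example runs without any model download, and the file names and texts are made up. With the sentence-transformers library one would presumably replace `embed` with a model's `encode` method and compare the resulting dense vectors the same way.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real sentence embedding: a word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_matches(query, docs, top_k=2):
    """Return the top_k (score, file name) pairs, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), name) for name, text in docs.items()]
    return sorted(scored, reverse=True)[:top_k]


# Hypothetical files and contents.
docs = {
    "group_theory.txt": "a group is a set with an associative binary operation",
    "ring_theory.txt": "a ring is a set with two binary operations",
    "topology.txt": "open sets and continuous maps between spaces",
}
matches = best_matches("binary operation on a group", docs)
for score, name in matches:
    print(f"{score:.3f} {name}")
```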
>>16879937 Which site tracks /biz/?
>>16880497 Been years since I saw this - you can probably google it. I know there are biz shitcoin mention counting sites and surely even more advanced things, like sentiment analysis. I mean desu /biz/ isn't very relevant anymore I think - it was better/more active 7 years ago. But yeah
>>16880523 What about views per byte? Let's see the most effective use of concise writing. Better yet if it could be adjusted for age in some way. Maybe something like a ratio of page views to the average viewcount of pages from the same month and year.
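The age adjustment suggested here can be sketched as follows, assuming each article record carries a creation (year, month), a view count and a byte size. The field names and numbers are hypothetical. Each article's views are divided by the mean views of its creation cohort, and that ratio is then divided by the article length.

```python
from collections import defaultdict


def cohort_adjusted_views_per_byte(articles):
    """Views relative to the creation-month cohort mean, per byte."""
    cohort_views = defaultdict(list)
    for a in articles:
        cohort_views[a["created"]].append(a["views"])
    cohort_mean = {k: sum(v) / len(v) for k, v in cohort_views.items()}

    scores = {}
    for a in articles:
        mean = cohort_mean[a["created"]]
        ratio = a["views"] / mean if mean else 0.0
        scores[a["title"]] = ratio / a["bytes"]
    return scores


# Hypothetical records: (year, month) of creation, total views, size in bytes.
articles = [
    {"title": "A", "created": (2010, 5), "views": 100, "bytes": 1000},
    {"title": "B", "created": (2010, 5), "views": 300, "bytes": 6000},
    {"title": "C", "created": (2020, 1), "views": 500, "bytes": 2000},
]
scores = cohort_adjusted_views_per_byte(articles)
for title, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(title, score)
```

One caveat of this design: an article that is alone in its cohort always gets ratio 1, so sparse months would probably need to be pooled into quarters or years.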
>>16879862 Can you list the most cited books and papers, by number of citations across the different articles?
>>16880810 Interesting idea. For now I just went for the naive views/(age*bytes) metric. So per time interval, but just divided by age. Worst offenders: those are simplex pages, and the worst is this: https://en.wikipedia.org/wiki/Cantellated_7-simplexes Second worst: https://en.wikipedia.org/wiki/Runcinated_7-orthoplexes They are all about 10 years old, there's plenty of them (hundreds?) and they barely have any views. With my "divide by age", there's an age effect where younger articles score better. By that naive metric, the most watched (at snapshot 2021) short one is that of the number 262. https://en.wikipedia.org/w/index.php?title=262_(number)&oldid=1125950117 But this seems to have something to do with it being the lowest number not having an entry for long, so it's more a fun fact gimmicky thing. The next real article at my snapshot time is GPT-3. This was young and has a very big number of views that outweighs its length. It stands out, but yeah, that's also an age effect. The next one that's fairer to consider is https://en.wikipedia.org/wiki/Dusting_attack which is crypto slash cryptography related. If I cut off younger stuff (only pre 2017), I get a better look. The best there is the 20-year-old "Positive_semidefinite", but this is a disambiguation page. Same goes for "Jacobian". The pattern repeats: I see a bunch of number articles, a bunch of crypto ones (relatively young age and good views) and disambiguation pages. If we count fallacies (arguably "logic"), then the winner is "Post_hoc_ergo_propter_hoc" (good views, very old). Next is one that's more logic: https://en.wikipedia.org/wiki/Balayage (13 years old at snapshot, fairly short and for some reason well watched). Finally, for math proper, we have the moderately long but well watched https://en.wikipedia.org/wiki/Less-than_sign closely followed by the less watched but younger and shorter https://en.wikipedia.org/wiki/Lists_of_shapes >>16880817 Could be in the metadata, I'll come back to you
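For reference, the naive metric and the pre-2017 cutoff described in this post can be sketched like this. The rows below are invented for illustration and are not values from the snapshot; the snapshot year 2021 is taken from the post, and the floor of one year on the age is an assumption to avoid dividing by zero.

```python
SNAPSHOT_YEAR = 2021


def naive_score(views, created_year, size_bytes, snapshot_year=SNAPSHOT_YEAR):
    """views / (age * bytes), with the age floored at one year."""
    age = max(snapshot_year - created_year, 1)
    return views / (age * size_bytes)


def ranked(rows, created_before=None):
    """Sort (title, created_year, bytes, views) rows by score, best first.

    If created_before is given, younger articles are dropped first.
    """
    if created_before is not None:
        rows = [r for r in rows if r[1] < created_before]
    return sorted(rows, key=lambda r: naive_score(r[3], r[1], r[2]), reverse=True)


# Hypothetical rows: title, creation year, size in bytes, total views.
rows = [
    ("old_long", 2008, 40000, 200000),
    ("old_short", 2005, 2000, 150000),
    ("young_short", 2020, 3000, 90000),
]
all_ranked = [r[0] for r in ranked(rows)]
pre_2017 = [r[0] for r in ranked(rows, created_before=2017)]
print(all_ranked)
print(pre_2017)
```

The young article wins the unfiltered ranking despite having the fewest views, which is the age effect mentioned above; with the cutoff applied, only the two older articles remain.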
>>16880937 Interesting finds. The recency bias is what I expected. You would really have to adjust for many other variables. It is what marketing calls "impact" instead of views. It's impossible to find how much of it an article had in reality; you would need feedback from the consumers to estimate the real impact it had. A growing internet population most likely also affects the viability of older articles, that's another curve you'd have to scale by.
I'm btw. working on hosting the search function in >>16880490 on an open website
>>16879862 Now that is good stuff
>>16883977 Mhm, yeah. I'm now thinking of both extending to a larger model for the embeddings (>>16880490), 8k context window, and also just going ahead and incorporating physics too.
Interesting thread OP. I've thought about this kind of stuff a bit. For example, an idea I had was running the frequent pattern growth algo (association rule mining) on wiki articles. Supermarkets use this to know which articles are bought frequently with others. Here, instead of a list of groceries, a list would be an article and all its links to other wiki articles. This way frequent itemsets could be found, and perhaps interesting out-of-the-ordinary ones. I abandoned the idea because my interest was in the out-of-the-ordinary frequent itemsets (bit of a contradiction). A cool thing for wiki articles is the site that shows view counts per day: you can find spikes for certain articles (interesting to find out why an article is read on a certain day every year), but the more interesting part is finding that same spike in another, smaller article linked to the main article. There's tons of cool things to do with this data.
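A small illustration of the basket idea, with each article's outgoing links treated as one basket. To be clear, this is not FP-growth itself: it is plain brute-force counting of co-occurring link pairs, which is enough to show what a frequent itemset of size two looks like here. The article names and link lists are toy data.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count how many baskets contain each pair of links; keep frequent ones."""
    counts = Counter()
    for links in baskets.values():
        # sorted(set(...)) makes each pair canonical and counted once per basket.
        for pair in combinations(sorted(set(links)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


# Toy baskets: article -> articles it links to (hypothetical).
baskets = {
    "Group": ["Set", "Ring", "Binary operation"],
    "Ring": ["Set", "Group", "Binary operation"],
    "Field": ["Set", "Ring", "Binary operation"],
    "Topology": ["Set", "Open set"],
}
pairs = frequent_pairs(baskets, min_support=2)
for pair, count in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(count, pair)
```

The number of pairs grows quadratically with basket size, so for link-heavy articles or larger itemsets a proper FP-growth implementation would be the better fit.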
>>16884381 Mhm, yeah, thanks for the input - what would be the goal of having these numbers for you? Or asked differently, if it's finding surprising items, what would you do with it? As I mentioned in the first post, I'm mostly looking at it atm for LLM prompt/context generation. It's evident that even the strong LLMs are sometimes bad at context association even if they SHOULD know better. Surprisal plays a role, although I don't know if those connected articles tell us something about the math corpus we didn't know before. Or whether it reveals something that gives you an edge. Btw., I pulled the articles from the open API, but I've also got a lot more metadata from this EU project. https://zenodo.org/records/6346900 It DOES already have the linktree (snapshot 2021). E.g. the link information alone has 10GB. And then there's 20 more with date data and categories etc. I could download it in under 3 hours onto a laptop. Likewise, as long as one, like me here, is interested in just a dataset of 10k or so articles, one might not need efficient variants of search algos - you can just brute force things. My pipeline, including article pulling and subset computation, has about 4 or so steps and all of them take less than 2 hours, i.e. you can run all parts of it overnight.
>>16884633 >if it's finding surprising items, what would you do with it? Well, my intuition is that some of these unexpected associations between different wiki articles could bring some valuable insight. Kind of a hidden link that could help in understanding something deeper. It's still a bit abstract in my mind but that's the general idea. Wiki gives you all the metadata if you're interested (views per time, at different granularities, for all wiki articles). For the second thing I explained, an example is Rick Rescorla (9/11 hero): his views spike around 9/11 (due to the news/media talking about him etc.) and people go check out the song he sang to energize the workers of the tower, "Men of Harlech", so there's a spike there too. You can check this on the site wikishark or any site that shows wiki article views vs. time. I think it's interesting to see the link of people discovering Rick and then discovering the song Men of Harlech. The page view data across time helps to understand the link. I wonder what other links wiki articles have between them like this, perhaps more interesting ones that one might not suspect.
>>16884721 >wikishark Interesting, I didn't know it. Will get to it (when it works). And you're right, such correlations of attention will be in there. I don't have the data downloaded in such time-based granularity. But also, for what I want to do - math stuff - I don't think I'd benefit too much from it. What I'm mostly trying to wrap my head around is how I can get information (on STEM stuff) out of the articles and their relations that isn't an evident connection to everyone. Like everyone knows rings and groups are conceptually and even definitionally related, and that some proof in group theory can often be used or even mirrored in ring theory. Just as an example. This is known and also captured in the articles. But I'd like to suss out more niche connections. So as to use this in my AI prompt generator (among other things) and make the thing come to more interesting conclusions or "ways of thinking". An (extreme and now known) example would be the connection between number theory and complex analysis before Riemann.
>>16885548 Not wikishark, my bad (that site never works), this site: https://pageviews.wmcloud.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0 You can see the spikes on 11 September for Rick Rescorla here, for example. Yeah, I see what you want to do. If you manage to find more niche connections please explain how you went about doing it.
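One simple way to look for the "same spike in a linked article" pattern is to correlate the daily view series of two articles. The sketch below assumes the daily counts have already been exported as equal-length lists; the numbers are invented and only mimic a one-day spike, they are not real pageview data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


# Invented daily views over one week; day 4 is the spike.
main_article = [100, 110, 95, 2000, 400, 120, 105]
linked_article = [10, 12, 9, 150, 40, 11, 10]
unrelated_article = [50, 48, 52, 49, 51, 50, 47]

r_linked = pearson(main_article, linked_article)
r_unrelated = pearson(main_article, unrelated_article)
print(f"linked: {r_linked:.3f}  unrelated: {r_unrelated:.3f}")
```

A high correlation only flags candidates. Two articles can spike together for unrelated reasons (e.g. a shared news day), so the link structure would still be needed to confirm that readers plausibly moved from one page to the other.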
>>16885563 Cool page. The views are higher than I'd expect, even for niche articles I wrote myself. Also, on basically all math pages the trend has been downwards in the last years.
>>16885594 AI explains this. As in, most LLMs literally use the Wikipedia definition for math related outputs.
>>16885611 I was explaining it in the sense that people go more to chatbots than to Wikipedia directly. StackExchange has been even more raked. I notice I didn't mention it, but I think I'll start the video with a breakdown of some common public datasets and the role of Wikipedia in them (pic related)
Are the references/further reading in the metadata?
>>16886947 In my data from that EU project I linked, it only has the counts. But I downloaded the math pages in under 2 hours, including texts. One can probably read them out, if desired. It's easy to pull the texts with the API, once you know which pages interest you.
>>16887038 It would be interesting to know about the most cited sources across all these articles. But I don't think most of them would have a uniform citation format. The same textbook/paper could have been cited in a hundred different ways.
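One hedged way around the formatting problem is to count by identifier where one exists and fall back to normalized text otherwise. The sketch below keys citations on an ISBN if the string contains one. The regex is a rough heuristic, not a validated ISBN parser (digits directly following an unhyphenated ISBN could be swallowed, for instance), and the citation strings and ISBNs are made-up placeholders.

```python
import re
from collections import Counter

# Heuristic: "ISBN", optional separator, then digits/X with hyphens or spaces.
ISBN_RE = re.compile(r"ISBN[\s:]*([0-9Xx][0-9Xx\- ]{8,16})", re.IGNORECASE)


def citation_key(citation):
    """Return an ISBN-based key if possible, else a normalized-text key."""
    match = ISBN_RE.search(citation)
    if match:
        return "isbn:" + re.sub(r"[^0-9Xx]", "", match.group(1)).upper()
    # Fallback: lowercase words and digits only, punctuation dropped.
    return "text:" + " ".join(re.findall(r"[a-z0-9]+", citation.lower()))


# Placeholder citations (not real books).
citations = [
    "Doe, J. (2001). Example Algebra. ISBN 978-0-00-000000-2.",
    "J. Doe, Example Algebra, 2nd printing, ISBN: 9780000000002",
    "Roe, A. Another Book. ISBN 978-0-00-000001-9",
]
counts = Counter(citation_key(c) for c in citations)
for key, n in counts.most_common():
    print(n, key)
```

DOIs could be handled the same way for papers. Citations with no identifier at all would still need fuzzy title matching, which this sketch does not attempt.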
>>16879862 Now make a model correlating all the papers to the schools' mean GRE and SAT scores lol and cluster gender with k-means or whatever clustering bullshit you want lmao