
Wikipedia talk:WikiProject AI Tools


Looking for participants in a GenAI factuality study


Hi. I’m working with a team from Columbia University, funded by a Wikimedia Foundation rapid grant. We are seeking Wikipedia editors who are willing to participate in a study on GenAI reliability, with a commitment of 10-20 hours between mid-December and mid-January 2026, and a symbolic stipend to compensate for your time.

The Research Project. Our goal is to find out if using a Wikipedia-inspired fact-checking process can increase the reliability of chatbots responding to queries related to Wikipedia’s content. The study uses open-source language models and frameworks, and our full results will be openly shared, with the aim of finding better methods for addressing AI hallucinations that are inspired by the well-established and highly successful practices of Wikimedia projects.

Please note that this project is a '''pure and contained experiment''' for analyzing how close large language models come to editor-level factuality. We don’t plan on implementing any live tools at the moment.

The Task. The task required from participants is to fact-check an AI-generated response to a general knowledge question. This will be done by checking whether each claim in a paragraph-long response is supported by the provided sources (each paragraph will be supported by up to 3 citations, and the text of each citation is up to a few paragraphs long).

Each participant will be asked to fact-check about 50 samples, with flexibility to do a bit more or less according to your availability. We recognize that this will be a demanding task, which is why we’re offering a stipend to those willing to make the time. The amount of the stipend is based on the number of samples fact-checked.
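For illustration only, a single sample could look roughly like the sketch below (the field names are hypothetical, not the actual study schema):

```python
# Hypothetical shape of one fact-checking sample (illustrative only; the actual
# study materials and schema may differ).
sample = {
    "question": "A general-knowledge question",
    "response": "A paragraph-long AI-generated answer.",
    "claims": [
        "First factual claim extracted from the response.",
        "Second factual claim extracted from the response.",
    ],
    # Up to 3 citations per paragraph; each citation text is up to a few paragraphs.
    "citations": [
        {"title": "Source A", "text": "Cited passage ..."},
        {"title": "Source B", "text": "Cited passage ..."},
    ],
}

# The evaluator records, for each claim, whether the citations support it.
labels = {claim: None for claim in sample["claims"]}  # True / False per claim
```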

Privacy & Security. If you choose to participate, we’re open to either crediting your efforts in our paper, or maintaining your full anonymity, whichever you prefer.

We adhere to the Wikimedia Foundation’s privacy policy. Participants may be asked to provide basic demographics for research purposes, which will be completely discarded after research concludes in early 2026.

Participation. All Wikipedia editors are eligible to participate. For methodological purposes, we may prioritize editors with expertise in specific subject matters, more Wikimedia project editing experience, or a focus and interest in fact-checking. If interested, please take a few minutes to submit the form! (Qualtrics external link). If you’re not comfortable filling out an external form, you may just send the answers to me directly using EmailUser.

Happy to share the research proposal or answer any questions! –Abbad (talk) 00:27, 26 November 2025 (UTC).[reply]

@عباد ديرانية Well that seems like a major waste of time. The page you linked to says We'll build an experimental AI assistant for readers that exclusively draws answer from Wikipedia pages, and integrates an explicit and novel fact-checking step into its architecture that's inspired by Wikipedia's own fact-checking process by editors. and This assistant is not intended for public use but only as a time-bound experiment, which will be used for rigorous testing and evaluation of this model's reliability compared to Wikipedia's baseline of reliable information and using open source large language models (LLMs) as fact-checkers that can provide a reliable paraphrasing of Wikipedia's content
  1. it won't be able to differentiate between its training data and the Wikipedia pages it's supposed to use as sources.
  2. current LLM technology can't reliably paraphrase or summarize content
  3. training models requires copyright infringement on a massive scale, or it will be inferior to alternatives, which already have an established install base and a trillion dollars; kinda difficult to compete with.[1][2]
  4. doesn't it make more sense to actually check the sources and verify if they support the claim made in the article, instead of having yet another chatbot which can do something any chatbot can do, but worse?
  5. $2300 is not enough to achieve something meaningful.
  6. sample size is tiny
  7. moderate agreement is a very very low bar
  8. We'll consider this a success if more than two thirds of respondents support further experimentation in the future. Makes no sense, of course 100% will support further experimentation; I do too, but not down this dead-end street. Having people support further experimentation does not mean this was a good idea.
  9. It will just be another lossy unreliable vague layer between users and reliable sources, like Wikipedia often is. We need less of that (e.g. by using the |quote= parameter), not more.
  10. This sounds like "I want to use AI, let me invent a use case" not "I have a problem, let me fix it with whatever the best tool for the job is".
  11. It is unclear what the results will be used for. The output will just be some numbers, which are meaningless by themselves.
  12. It is unclear what an explicit and novel fact-checking step into its architecture that's inspired by Wikipedia's own fact-checking process by editors means. Using MiniCheck isn't novel, and "We'll ask an AI model to check the work of an AI model" leads to diminishing returns. If MiniCheck can do verification, why can't the original model incorporate fact-checking? The root problem is that the base model generates facts, half-truths and nonsense. Instead of trying to sort fact from fiction, the goal should be to create a model that can verify its own output during generation, but that is far outside the scope of the WMF.
  13. A binary metric (true/false) is clearly inadequate when checking if the paraphrasing is any good. A good summary doesn't leave out important facts; yet the proposal only measures pure falsehoods instead of omissions of important stuff, distortions, cherry picking, loss of nuance, synthesis et cetera. Pure hallucinations are a minority of the mistakes an LLM makes, but according to the proposal they're the only ones being measured.
  14. We already had this same discussion, for example over at Wikipedia:Village_pump_(technical)/Archive_221#Simple_summaries:_editor_survey_and_2-week_mobile_study. So when the response was universally negative, and we already know why this can't work, why try again?
  15. Why ask for volunteers and WMF money when Wikipedia doesn't benefit from the results? Why ask Wikipedians, who have a lot of stuff to do, to volunteer to do stuff that doesn't help Wikipedia? It's not like the AI companies will improve their products based on the results, and one can't improve Wikipedia based on the outcome, so who benefits?
  16. The proposal says We'll build an experimental AI assistant and if that was true testing it would make sense. But it also says the plan is to just mash some pre-existing stuff together. If so, why ask volunteers to check how good or bad Llama and MiniCheck are? Shouldn't Meta Platforms employees test Llama? Shouldn't Mistral AI SAS employees test Mixtral? These are commercial companies who can surely hire some people to test their stuff, if they wanted. If there is no plan to add anything new that should improve performance, why bother testing? One datapoint is no datapoint. I already know the outcome: current AI tech is not as good as humans, especially not the nerdy type who edits Wikipedia, and attempts to quantify the difference are pointless because they are just a weighted random number generator one could build a narrative around. In order to make it slightly less meaningless you'd have to keep doing it with each new version and track performance over time, but that would only help AI companies, not Wikipedia.
You can't measure success by comparing this chatbot against commercially available chatbots. The correct baseline is Wikipedia itself, which anyone can access already and read what it actually says.
Showing that this chatbot produces fewer errors than commercial LLMs only proves that it is slightly less bad than commercial LLMs, not that it is a good approach to deliver Wikipedia content to users.
If any hallucinations or distortions are added by the chatbot, then it is worse than just reading Wikipedia yourself.
The interesting variable is how many hallucinations/misrepresentations/distortions are added compared to just reading the Wikipedia articles; how the chatbot compares to commercial LLMs is irrelevant to us.
I may be stupid but I don't get it. Polygnotus (talk) 00:49, 26 November 2025 (UTC)[reply]
@DSaroyan (WMF) and FElgueretly-WMF: Please explain why this is a good idea. Which technical experts has the Review team consulted? It would be nice to hear from them as well. It is also unclear to me how a Rapid grant can be awarded to a project that is ineligible: Applications to complete proposed research related to the Wikimedia movement are not eligible. Please review the Wikimedia Research Fund for these funding opportunities. --meta:Grants:Project/Rapid#Eligibility_requirements Thanks, Polygnotus (talk) 01:06, 26 November 2025 (UTC)[reply]
This was also posted over at Wikipedia:Village_pump_(miscellaneous)#Looking_for_participants_in_a_GenAI_factuality_study. Doubleposting is generally discouraged because it wastes people's time. Polygnotus (talk) 03:29, 26 November 2025 (UTC)[reply]
@Polygnotus I appreciate the thoughtful critique. To what I interpret as your main point - yes, any hallucinations are bad. However, LLMs are already prevalent in industry and academia, as you must know, and from our daily observations, their use almost completely lacks any sense of responsibility towards reliability. Honestly, Wikipedia itself, as a tertiary source, shouldn't even be the ideal baseline for factuality, but we recognize that research is an incremental endeavour, so our approach is to start by introducing a methodical way to improve over the status quo of LLM usage. Realistically, we can't even expect LLMs to improve without such experiments. Please note that because Wikipedia is our chatbot's source, it is effectively a baseline for this study as well.
In-line responses:
  1. Points 1-3: We examined the differentiation between retrieval and training data in depth when scoping our research, and we have two considerations: A. From our literature review, we're aware of methods that aim exactly to differentiate when an LLM's answer is grounded in the provided context versus training data. If our resources allow, we do aim to implement the methodology from this paper in drawing this differentiation. However, this is a challenging setup, and our team is 100% volunteer-based (or more like 90%; we had a little budget planned for some team members, but with fiscal sponsorship + paying evaluators + computing, we now expect only a couple hundred USD of surplus), so even with the humble grant we may not be able to go that far. B. The eventual purpose of this study is to evaluate the factuality of LLMs in practice. Whether they make errors due to their training data, architecture, or Wikipedia-grounded context, it's an error in the end.
  2. Re: Point 4: 100% agreed, and honestly my original idea was to build something exactly like the Source Verification tool using the MiniCheck model, which is open source, very lightweight, and has shown impressive accuracy in dozens of experiments that I did with it. My fellow researchers recommended a RAG approach because it has much more impact on the irresponsible use of chatbots in the industry, which is true. Also, because I discovered now that the Source Verification tool exists, I'm not sure if this approach is any different. I do still hope to run a methodical experiment, once we're done with this project, by: A. extracting the full text of some citations (e.g. a book), B. extracting instances where they're cited on Wikipedia pages, and C. running the full text + cited phrases through MiniCheck to see how accurate it is (a rough sketch of this pipeline follows after this list). I believe the results could be impressive.
  3. Re: Points 5 - 6: Indeed! That's why all the researchers are 100% volunteers. We're doing what we can with our budget, but we also understand that the community may not support pouring larger resources into experimental research at this point.
  4. Re: Point 7: This is almost exclusively the annotation baseline from other LLM research we ran across. I'll do more homework on this, but please feel free to advise if you're aware of alternatives.
  5. Re: Point 8: This is a goal to determine the success of the grant itself, so it needed to be experiment-tied, and a user-testing goal seemed appropriate. You're right, though, and I'm open to revising it. I'm hesitant to set a specific goal of factuality improvement because we won't know, obviously, until we conduct the experiment.
  6. Re: Point 9: While I don't disagree, lossy middle layers are not only a reality, but a necessity. As you mention, Wikipedia itself is a mediator of information, simply because most people lack the depth of knowledge and/or the time to digest information directly from secondary sources. LLMs, as far as we know, are here to stay, and this is a debate about that reality rather than about how it can be improved.
  7. Re: Points 10-11: This is clearly a huge use case, which is literally why we opted for it (over, as I mentioned above, what could be personally more interesting to me in terms of a tool to fact-check Wikipedia sources). For example, my company, which is not special in this in any way, pays for what is easily hundreds of millions of LLM queries a month, mainly to power chatbots. As of now, the vast majority of these chatbots on the internet barely make any attempt at truth-seeking that's analogous to what we're proposing. The results from our study have the basic purpose of proving or disproving that the approach we're trying can have an impact on factuality. If it does, that's an improvement on the status quo that will affect millions of users.
  8. Re: Point 13: Yes, strictly speaking, this is a factuality-centered study. Other aspects would fall under a summarizing task.
  9. Re: Points 14 - 16: This is very intentionally designed as an experiment on how existing tools like MiniCheck work. MiniCheck has already been developed, but how do we know if it's doing its job well? The fact that these LLMs have been developed by labs has little to do with who's using them, which extends to researchers, educators, non-profits, and even Wikimedians. However, the commercial labs obviously don't care that much about how factual their models are in an academic sense, and have done little work in this area (otherwise, we would have seen way fewer hallucinations). We're volunteering our time for this because we feel it's a critical under-researched area, and you're free to think it's worth or not worth your own time. Because this is such a small study, the impact won't be astronomical, but we believe it can be very significant for Wikipedia contributors, because our results will show how effective MiniCheck can be as a fact-checker. This will be evidence of whether or not it's usable for the Source Verification tool, rather than the simple fact that it exists. Did anyone else systematically test whether the fact-checking framework of that tool is consistent and usable?
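As a rough illustration of the standalone experiment described in point 2 above, the pipeline could look something like this sketch (the check_support callable is only a placeholder for a checker such as MiniCheck; this is not the study's actual code):

```python
# Rough sketch of the citation-check experiment from point 2 (illustrative only).
# `check_support` is a placeholder for a fact-checker such as MiniCheck and is
# assumed to return a support probability in [0, 1] for (source_chunk, claim).
from typing import Callable, List

def chunk_text(text: str, max_words: int = 400) -> List[str]:
    """Split a long source (e.g. a book) into chunks a checker can handle."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def verify_cited_claims(
    source_fulltext: str,
    cited_claims: List[str],
    check_support: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[dict]:
    """For each phrase cited to this source on Wikipedia, record whether any
    chunk of the source's full text appears to support it."""
    chunks = chunk_text(source_fulltext)
    results = []
    for claim in cited_claims:
        best = max((check_support(chunk, claim) for chunk in chunks), default=0.0)
        results.append({"claim": claim, "probability": best, "supported": best >= threshold})
    return results
```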
TBC - there are lots of good points here, I'll come back for the rest as soon as I have the chance! Answered --Abbad (talk) 21:24, 26 November 2025 (UTC).[reply]
@عباد ديرانية ReDeEP looks cool but if I were you I would completely ignore Mixtral and stick to LLaMA. I do not think ReDeEP will be able to fix the problem that the model will mix training data and Wikipedia content.
Please correct me if I am wrong, but if I am reading between the lines I think we mostly agree on the facts (although I would recommend using a different tactic).
While LLM factuality is interesting (and annoyingly under researched by the guys with the big bucks), most Wikipedians are always gonna be more interested in using MiniCheck to determine if a claim in a Wikipedia article is supported by the source (or not).
We Wikipedians are a very simple people of humble peasant farmers like myself who just want results; not an academic study.
So while you do your thing, can you please allow others to use MiniCheck as well? You already know exactly how I want to use it.
Adding "MiniCheck was correct" and "MiniCheck was wrong" buttons is not very complicated.
If we can show the masses practical results, it is much much easier to get them to volunteer/contribute/whatever.
That way we have both academic validation and real-world testing, which benefits both.
I do not agree that our results will show how effective MiniCheck can be as a fact-checker because that is not what is being tested (and you wouldn't need such a complex pipeline to test just that).
Testing whether a complex AI pipeline produces fewer (or filters out more) hallucinations than the base model is interesting, but not relevant to Wikipedia.
I think the study needs to benefit Wikipedia, not just use it as a testbed, before you should be able to get WMF money or Wikipedia volunteers. And I don't really see it doing that at the moment. Polygnotus (talk) 07:15, 27 November 2025 (UTC)[reply]
~500 responses total need evaluation.
At least 300 of those need ≥3 evaluators.
Lets say the remaining 200 get one evaluation each.
So at least 1100 evaluation tasks.
I don't agree that a simple true/false evaluation will lead to meaningful results (point 13 above), but let's assume it is fine.
Each participant will be asked to fact-check about 50 samples (according to your comment above) so you need about 22 people.
Your comment talks about a commitment of 10 - 20 hours in mid December, so 220-440 hours of volunteer time? Assuming an 8-hour work day we are talking 1.25-2.5 workdays per person, and between 27.5 and 55 8-hour days of work sequentially... I am not sure why evaluating 50 samples should take 10-20 hours (12-24 minutes per evaluation for a simple yes/no on a short bit of text??).
The budget talks about 10 evaluators doing 100 responses each in 5 hours, so 3 minutes per evaluation. That is 100 evaluations short and doubles the workload per person. So if the budget allows for 5 hours per evaluator, why ask for 10-20 hours? The budget is $1000 for 10 people doing 100 responses each, so that is $1 per evaluation.
The form says The rate will be 100 USD / 30 fact-checked samples, with payment prorated according to the completed samples. but there is only $1000 allocated in the budget, so that does not compute. You can only buy 300 evaluations for that money, but you need 1100 evaluations. That is $3.33 per evaluation.
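To spell the arithmetic out (all numbers taken from the proposal, the budget and the form quoted above):

```python
# Sanity check of the numbers quoted above (proposal, budget and form).
responses = 500                      # total responses to evaluate
triple_checked = 300                 # responses needing at least 3 evaluators
single_checked = responses - triple_checked

evaluations = triple_checked * 3 + single_checked * 1   # 1100 evaluation tasks
people_needed = evaluations / 50                        # ~22 people at 50 samples each

budget = 1000                        # USD allocated for evaluators
budget_rate = budget / (10 * 100)    # 10 evaluators x 100 responses -> $1 per evaluation
form_rate = 100 / 30                 # $100 per 30 samples -> ~$3.33 per evaluation

affordable = budget / form_rate      # ~300 evaluations at the promised rate
owed_for_50 = 50 * form_rate         # ~$166.67 owed to someone doing 50 samples

print(evaluations, people_needed, budget_rate, round(form_rate, 2),
      round(affordable), round(owed_for_50, 2))
# 1100 22.0 1.0 3.33 300 166.67
```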
Did an LLM come up with these numbers? Is the plan to pay people 33% of what was promised to them, or to run out of money after 300 evaluations? What will happen if someone did 50 samples and wants the $166.67 that was promised to them?
I find it extremely difficult to outsource items on my todolist, both irl and on Wikipedia. Finding 22 Wikipedians who are willing to spend a significant amount of time doing a very boring task that does not benefit Wikipedia is gonna be real hard. I don't think a symbolic stipend is gonna do much to motivate em.
In summary, the study as proposed won't work. But installing MiniCheck somewhere and giving me an API endpoint and credentials is a good idea. Polygnotus (talk) 08:09, 27 November 2025 (UTC)[reply]
@Polygnotus The discrepancies in the numbers are because we decided to increase the pay for evaluators as much as possible, at the cost of minimizing any share we take (practically none at this point). As you rightfully say, we realized that this is a difficult and boring task, and therefore thought it appropriate to increase the amount to at least 3 USD per sample, thus a total of roughly $100 / 30 samples. Indeed, this will reduce the total number of samples we can analyze, but that's better than unappreciated labor. We will increase the evaluation share to $1,600, and will thus be able to fund about 500 examples. I admit that the numbers got a bit jumbled (my fault; if LLMs were used I might have gotten them more in line!).
I find it a bit confusing that you agree this is a symbolic stipend and barely enough motivation, but also disagree that this hard evaluation will take 12-24 minutes. Anyhow, the hourly ranges are very rough, and I stretched them to be extra safe.
Re: MiniCheck, I'm more than happy to collaborate if you want. MiniCheck is available through HuggingFace, and we already have the subscription (which is a negligible 9 USD / month). It's pretty easy to grant access; if all you need is the API, I'll be in touch. The hard part is actually evaluating the results, and methodically checking if they work as well as we'd like them to.
We're already committed to a chatbot experiment for this round of funding, so we do need to proceed with our current methodology in principle. I'm quite happy, though, to work together on a study dedicated to MiniCheck as a standalone (as I transparently mentioned, it is what I'm personally interested in as well!). If I manage to squeeze in any other funding, I'm also happy to make that a proper study, if it's of interest to you. --Abbad (talk) 22:05, 27 November 2025 (UTC).[reply]
@عباد ديرانية Yeah it looks like the plan evolved over time, like all good plans do. The good news is that Wikipedians are usually pretty good with intrinsic motivation.
disagree that this hard evaluation will take 12-24 minutes. According to the budget it will take only 3 minutes. And the comment near the top of this page says paragraph-long response is supported by the provided sources (each paragraph will be supported by up to 3 citations, the text of each citation is up to a few paragraphs). and you only want a true/false. It seems very very unlikely that it would take me anywhere near 24 minutes on average to read 3x a few paragraphs and decide if they support a paragraph-long LLM response. 3 minutes on average sounds more realistic, although it may be too short. I think the number will be somewhere in between. I would probably ask Claude to find the relevant text in those sources, which would speed up the human part of the equation. Polygnotus (talk) 23:18, 27 November 2025 (UTC)[reply]
it is what I'm personally interested in as well!) Exactly, so you understand why I am far more excited about playing around with MiniCheck. One of my, probably many, flaws is that I am unfit for academia. Although I am very curious if and how well the ReDeEP approach actually works (or you know, SEReDeEP, if we wanna stay up to date). Polygnotus (talk) 23:18, 27 November 2025 (UTC)[reply]
This WikiProject is just getting started (8 members already) and this page doesn't get that many pageviews. Our conversation may be confusing to potential volunteers so I unhatted (incorrect terminology but whatever) the doublepost. You know where to find me for the MiniCheck stuff. Good luck! Polygnotus (talk) 01:41, 28 November 2025 (UTC)[reply]
This study appears to have the goal of encouraging the use of LLMs, based on 'fact-checking' using Wikipedia as a source. Given that Wikipedia makes it entirely clear that it does not consider itself a reliable source, the study is clearly ill thought out, or at best engaging in wishful thinking. And furthermore, any encouragement of this misleading LLM use can only make things worse for Wikipedia itself, as it faces a deluge of LLM-generated garbage, generated by a technology which routinely hallucinates (as has been demonstrated to be mathematically inherent in such software), engages in synthesis contrary to Wikipedia policy, and mangles source citations to the extent that even if they originate from something genuine (and meet Wikipedia sourcing policy, which LLM citations routinely don't), the amount of effort required to find the actual source is totally disproportionate to their utility. I would advise anyone contemplating engaging with this study to question whether it is in the interest of Wikipedia's contributors, and perhaps more importantly its readers, to do so. AndyTheGrump (talk) — Preceding undated comment added 03:57, November 26, 2025 (UTC)

Checking offline sources


Hi @Polygnotus, I've tried your AI Source Verification tool and it works really well for online sources. Of course in many content areas the majority of sources would be paper books, and so it'd be nice if the tool supported offline sources too in some way. Have you planned something to make it possible? The simplest approach would be to allow the user to paste the text (I actually built a toy standalone app using this approach) or upload a source. Any other ideas how it can be tackled? My assumption is that many editors would be able to access offline sources, whether using the Wikipedia Library, Google Books or some other digital library. Alaexis¿question? 11:20, 29 November 2025 (UTC)[reply]

@Alaexis Hiya! That is a good idea, and since you forgot to copyright it I will immediately steal it.
It might also partially solve the paywall problem.
I am currently playing around with User:Polygnotus/CitationVerification.
Where can we find this standalone app of yours? Polygnotus (talk) 11:40, 29 November 2025 (UTC)[reply]
No worries at all, happy to suggest improvements :) My app is here [3] - I've just added BYOK, hopefully it hasn't caused any issues. It's very much a beta version.
I think that an addon works much better for Wikipedia editors. I had in mind a different target audience - readers rather than editors - hence a standalone app.
One thing I couldn't find a good solution for is multiple references supporting a single claim. As far as I can see your tool also looks at each reference individually which will produce false positives if a source supports only a part of the claim. Alaexis¿question? 12:26, 29 November 2025 (UTC)[reply]
@Alaexis Ah that's really cool! I've just added BYOK, hopefully it hasn't caused any issues. It works fine over here. Perhaps you can add it to Wikipedia:WikiProject AI Tools?
It must be possible to deal with 2 refs together supporting 1 claim, but I haven't looked at it yet. Thanks for sharing! Polygnotus (talk) 12:40, 29 November 2025 (UTC)[reply]
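One possible way to handle several references jointly supporting a single claim is sketched below (illustrative only; score stands in for whichever checker a tool uses and is assumed to return a support probability for a (source text, claim) pair):

```python
# Sketch: check a claim against each reference individually *and* against the
# references combined, so that claims supported only by the references taken
# together are not flagged incorrectly. `score` is a placeholder checker.
from typing import Callable, List

def check_claim(
    claim: str,
    reference_texts: List[str],
    score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> dict:
    per_ref = [score(text, claim) for text in reference_texts]
    combined = score("\n\n".join(reference_texts), claim)
    return {
        "claim": claim,
        "per_reference": per_ref,
        "combined": combined,
        "supported": combined >= threshold or max(per_ref, default=0.0) >= threshold,
    }
```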

Repository of prompts?


The thread immediately above led me to inspect User:Polygnotus/Scripts/AI Source Verification.js to see what prompts they were giving to the APIs. Separately, I've been finessing my instructions for a "Wikipedia research assistant" for initial sanity checks, hosted by Kagi. Maybe this project could have a page for sharing or workshopping examples like this. ClaudineChionh (she/her · talk · email · global) 13:23, 29 November 2025 (UTC)[reply]

That's a good idea. The prompt I used for my citation checker app can be found here [4]. Alaexis¿question? 14:02, 29 November 2025 (UTC)[reply]
Personally I find it helpful to provide examples of request-response pairs though I'm not sure if it would work for your use case. Alaexis¿question? 14:03, 29 November 2025 (UTC)[reply]
Oh yes, my file is more like a default set of instructions as a starting point. I can dig into my chat history for more specific examples of prompts and responses. ClaudineChionh (she/her · talk · email · global) 00:10, 30 November 2025 (UTC)[reply]
I've been thinking about having an API platform that could cache AI outputs used to review specific revisions (in case multiple editors send the same AI query, e.g. when patrolling recent changes) and simplify the development workflow for new tools, and this could be a helpful use for it!
Beyond that, we don't have a guide yet, and "prompting tricks" would definitely be an essential part of it: feel free to start it! Chaotic Enby (talk · contribs) 20:35, 5 December 2025 (UTC)[reply]
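A minimal sketch of what such a cache could look like, keyed on the revision ID plus a hash of the prompt (names are illustrative; a real platform would use a persistent, server-side store):

```python
# Sketch: cache AI output per (revision, prompt) pair so that repeated queries
# about the same revision don't hit the model twice. Illustrative only.
import hashlib
from typing import Callable, Dict, Tuple

_cache: Dict[Tuple[int, str], str] = {}

def cached_review(rev_id: int, prompt: str, run_model: Callable[[str], str]) -> str:
    key = (rev_id, hashlib.sha256(prompt.encode("utf-8")).hexdigest())
    if key not in _cache:
        _cache[key] = run_model(prompt)   # only the first caller pays for the query
    return _cache[key]
```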

Generating the infobox from the article text?


Has there been any exploration into using AI tools to generate the infobox from the article's existing text? Whatisbetter (talk) 11:24, 2 December 2025 (UTC)[reply]

@Whatisbetter, nothing I'm aware of but it should be pretty straightforward. What kind of articles do you have in mind? Alaexis¿question? 22:35, 2 December 2025 (UTC)[reply]
There are absolutely no circumstances where it would be appropriate to use AI to "generate the infobox from the article's existing text". AI (or at least LLMs, which are presumably what is being referred to) cannot be trusted. They synthesize. They 'cite' things that don't remotely support the text they are cited for. They routinely hallucinate. Wikipedia content (including that in infoboxes) needs to be written by contributors who can ensure that it is correct per a valid source, and are prepared to take responsibility for doing so. If you want LLM-generated content, look elsewhere. AndyTheGrump (talk) 22:42, 2 December 2025 (UTC)[reply]
@AndyTheGrump is correct: current AI technology is unable to summarize a text, or to find the interesting bits. Crafting infoboxes will remain a human task for the foreseeable future. Polygnotus (talk) 04:05, 3 December 2025 (UTC)[reply]
@AndyTheGrump, you're right about the hallucinations and other issues. I do not suggest or condone any violations of the policy. However, I think that it's possible to use LLMs to generate a draft which would have to be checked by a human editor.
Also, a few approaches have been suggested to control hallucinations. Here's one that might work, though I haven't tried it myself: AI Driven Citation: Controlling Hallucinations With Concrete Sources. It's suggested by Gavin Mendel-Gleason, who is working with Peter Turchin. Alaexis¿question? 06:54, 3 December 2025 (UTC)[reply]
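For anyone who wants to prototype that draft-then-review workflow, it could look roughly like the sketch below (the client and model name are just examples of one possible setup; every field of the output would still have to be verified against sources by a human editor before any of it is used):

```python
# Sketch: ask an LLM for a *draft* infobox built strictly from supplied article
# text, for human review only. The model name is an example; adjust as needed.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def draft_infobox(article_text: str, template: str = "Infobox person") -> str:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # example model name
        max_tokens=1024,
        system=(
            "Fill in a draft {{" + template + "}} using ONLY facts stated in the "
            "provided article text. Leave any parameter you cannot find in the "
            "text empty. Output wikitext only."
        ),
        messages=[{"role": "user", "content": article_text}],
    )
    return message.content[0].text  # a draft, to be checked line by line by a human
```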
@Alaexis Have you tried MiniCheck? Polygnotus (talk) 06:57, 3 December 2025 (UTC)[reply]
@Polygnotus, not yet, how do I run it? Alaexis¿question? 15:14, 3 December 2025 (UTC)[reply]
@Alaexis
If you want to run MiniCheck on your own computer, then the answer is this, but this is a binary yes/no.
https://www.bespokelabs.ai/bespoke-minicheck gives out free API keys.
The answer to the question depends very much on how nerdy you are.
How familiar are you with Python? Do you want to run it on your own pc?
It is usually easier to just use their free API. I don't know what Operating System you use (*nix, MacOS, Windows) but usually if you Google "run python script" with the name of your operating system it should provide instructions.
The code on that page is pretty outdated btw. Polygnotus (talk) 15:23, 3 December 2025 (UTC)[reply]
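(For reference, a minimal local sketch, assuming the pip-installable minicheck package and the interface shown in its README; check the project page for the current signature and model names.)

```python
# Minimal local MiniCheck sketch. Assumes `pip install minicheck` and the
# interface documented in the project README; verify against the current docs.
from minicheck.minicheck import MiniCheck

doc = "Paris is the capital and most populous city of France."
claim = "The capital of France is Paris."

scorer = MiniCheck(model_name="flan-t5-large", cache_dir="./ckpts")
pred_label, raw_prob, _, _ = scorer.score(docs=[doc], claims=[claim])
print(pred_label, raw_prob)  # binary support label and raw probability
```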
Thanks @Polygnotus, I reviewed the docs and it seems pretty straightforward, though I'm not sure I'm going to use it right now - they claim that it's only marginally better than Claude Sonnet 3.5. Alaexis¿question? 10:18, 5 December 2025 (UTC)[reply]
@Alaexis In my extremely limited testing I can't really tell if its better or worse than Claude, but Claude is clearly better for our purposes because it can (I am intentionally using a word incorrectly here) explain its thinking. Polygnotus (talk) 10:31, 5 December 2025 (UTC)[reply]
I have found that populating an infobox from Wikidata works OK (when there is data available). Major manual editing and checking will be required, but it beats starting from scratch and actually reading the Wikidata format in the infobox instructions. I never tried to use the text as a starting point, but expect it to work too. The modern high-end engines (I mostly use Google Gemini 3) do quite a decent job in very unexpected areas; searching for information in a structured fashion is one of them. I haven't seen an outright hallucination for months. Ask a random question, you might get a random answer. Ask when the term Net load was coined, you will get a wrong answer, but it will point you to quite solid WP:RS, so a human could make the same mistake, too - I wouldn't count this type of error as a hallucination. Викидим (talk) 08:25, 3 December 2025 (UTC)[reply]
I take it you are aware that since WikiData isn't WP:RS, you can only use it indirectly, where it actually cites a valid source? Anyway, your 'Major manual editing and checking will be required' comment points to what is likely to be a major issue with AI-assisted infobox generation, given how Wikipedia currently operates in practice - far too many people will simply assume that the AI has got it right, and not check it. AndyTheGrump (talk) 11:27, 3 December 2025 (UTC)[reply]
To me, a Wikidata item is like an article in a foreign language - a source for translation that should be checked. AI helps to navigate the quite complex set of infobox templates, each with its own parameter quirks. AI does not do a good job populating these fields, but it is an OK way to get to a starting point. Викидим (talk) 17:19, 3 December 2025 (UTC)[reply]
Example of using AI to fix the text, involving adding an infobox: before / after / changes / history. Manual checking and reworks were necessary - but I would never have attempted this repair without AI assistance (it would be too much hassle). Викидим (talk) 21:31, 4 December 2025 (UTC)[reply]
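For anyone curious, pulling the raw Wikidata statements as a starting point can be as simple as the sketch below (Q42 is just an example item; every value still needs a source check before it goes into an infobox):

```python
# Sketch: fetch a Wikidata item's labels and statements as raw material for a
# manually checked infobox draft. Q42 (Douglas Adams) is only an example.
import json
import urllib.request

def fetch_entity(qid: str) -> dict:
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    req = urllib.request.Request(url, headers={"User-Agent": "infobox-draft-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["entities"][qid]

entity = fetch_entity("Q42")
print(entity["labels"]["en"]["value"])       # item label
print(sorted(entity["claims"].keys())[:10])  # a few property IDs (P31, P569, ...)
```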

 You are invited to join the discussion at Wikipedia:Village pump (idea lab) § Scope of AI tool use, which is within the scope of this WikiProject. Cf. the previous discussion about whether generating an infobox from an article text would be acceptable. Chaotic Enby (talk · contribs) 20:37, 5 December 2025 (UTC)[reply]