Tech Leader Pro podcast 2024 week 18, The new knowledge acquisition bottleneck

05/05/2024 8 min


Episode Synopsis


The new knowledge acquisition bottleneck is accessing silos of private content that can be used to train modern AI models.

Notes:


Decades ago when I was working on expert systems, which were an early kind of AI, one of the main issues facing us was the "knowledge acquisition bottleneck".
Expert systems were rule-based engines, and those rules were defined based upon input from domain experts.
The challenge was getting that knowledge out of the heads of those experts and into the software as repeatable rules, hence this was labeled the "knowledge acquisition bottleneck".
According to Wikipedia: "Knowledge acquisition is the process used to define the rules and ontologies required for a knowledge-based system. The phrase was first used in conjunction with expert systems to describe the initial tasks associated with developing an expert system, namely finding and interviewing domain experts and capturing their knowledge via rules, objects, and frame-based ontologies.
Expert systems were one of the first successful applications of artificial intelligence technology to real world business problems. Researchers at Stanford and other AI laboratories worked with doctors and other highly skilled experts to develop systems that could automate complex tasks such as medical diagnosis." Ref: https://en.wikipedia.org/wiki/Knowledge_acquisition
So in simple terms, if you wanted to build an AI that could represent an expert like a doctor, you would begin by interviewing many doctors on how they do their work and then capture that as rules in the system.
It was laborious, time-consuming, error-prone, and often the experts in question would simply refuse to cooperate.
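To make the idea of capturing expert knowledge as rules a little more concrete, here is a minimal sketch of a forward-chaining rule engine in Python. It is not taken from any particular expert system: the `Rule` class, the `forward_chain` function, and the two example rules are all hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hand-written rule: if all conditions are known, conclude something."""
    conditions: frozenset
    conclusion: str


def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new conclusions can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.conditions <= known and rule.conclusion not in known:
                known.add(rule.conclusion)
                changed = True
    return known


# Hypothetical rules, standing in for what an interview with an expert
# might produce. They are illustrative only.
RULES = [
    Rule(frozenset({"fever", "cough"}), "possible_flu"),
    Rule(frozenset({"possible_flu", "short_of_breath"}), "refer_to_doctor"),
]

if __name__ == "__main__":
    print(forward_chain({"fever", "cough", "short_of_breath"}, RULES))
```

Every line in `RULES` has to come from a person, which is exactly where the bottleneck sat: the engine is simple, the rules are the expensive part.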
Modern AI systems are no longer knowledge-based expert systems, but instead are neural networks that aim to mimic the behavior of the brain.
Such systems learn by being trained on large quantities of clean data, which shapes their model, their representation of the world.
Interviewing experts is no longer required, but vast quantities of data are required in order to train the AI.
In a previous episode, I shared my view that the web is already being mined for data for such models, but that may eventually be exhausted.
I believe a new market will emerge that will sell access to private, offline repos of data, which will become invaluable for training AI.
This is the new oil.
According to an excellent article on this topic from 2018 entitled "Did We Just Replace the ‘Knowledge Bottleneck’ With a ‘Data Bottleneck’?", the author mentions that "Some studies indicate that data scientists spend almost 80% of their time on preparing data, and even after that tedious and time consuming process is done, unexpected results are usually blamed by the data ‘scientist’ on the inadequacy of the data, and another long iteration of data collection, data cleaning, transformation, massaging, etc. goes on." Ref: https://cacm.acm.org/blogcacm/did-we-just-replace-the-knowledge-bottleneck-with-a-data-bottleneck/
Apart from the quality of the data being ingested, we also need to worry about the reliability of the source: bad data passed to a model on purpose can act as a kind of supply-chain attack on an AI, if not detected and filtered out early enough.
In comparison, the expert systems of old were easier to trust, as the models were built by human hands, not by bulk training.
In reality this space is not solved yet, but we are getting closer with each new generation of the technology.
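One small piece of the source-reliability problem mentioned above can be sketched in code: only accepting training data from sources you already know, and only when the content matches a digest you recorded earlier. This is a minimal sketch under stated assumptions, not a complete defence. The `filter_trusted` function, the manifest format, and the source names are all hypothetical, and a check like this only catches tampering or unknown sources, not a trusted source that supplies bad data on purpose.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def filter_trusted(batches, manifest):
    """Keep only batches whose source is known and whose digest matches.

    batches:  iterable of (source_name, payload_bytes) pairs
    manifest: dict mapping source_name to the expected SHA-256 hex digest
              (a hypothetical format, assumed for this sketch)
    """
    accepted = []
    for source, payload in batches:
        expected = manifest.get(source)
        if expected is not None and sha256_hex(payload) == expected:
            accepted.append((source, payload))
    return accepted


if __name__ == "__main__":
    good = b"clean training text"
    manifest = {"partner-archive": sha256_hex(good)}  # hypothetical source name
    batches = [
        ("partner-archive", good),                # known source, digest matches
        ("partner-archive", b"tampered text"),    # known source, digest differs
        ("unknown-scrape", b"anything at all"),   # source not in the manifest
    ]
    print(filter_trusted(batches, manifest))
```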
Just this week, the FT has sold access to their content to OpenAI to help train ChatGPT. OpenAI said in their press release: "Through the partnership, ChatGPT users will be able to see select attributed summaries, quotes and rich links to FT journalism in response to relevant queries." Ref: https://openai.com/index/content-partnership-with-financial-times
Weirdly, since I read this, OpenAI have apparently removed the announcement from their site.
I expect many more such deals to happen.
What I am working on this week:


Search indexer improvements for greppr.org.

Media I am enjoying this week:


Escape from Tarkov PvE



Notes and subscription links are here: https://techleader.pro/a/643-Tech-Leader-Pro-podcast-2024-week-18,-The-new-knowledge-acquisition-bottleneck