The National University of Sciences and Technology (NUST), the National Information Technology Board (NITB) and telecom network operator Jazz have signed a Memorandum of Understanding (MOU) to develop Pakistan’s first indigenous Large Language Model (LLM), with a focus on Urdu and with datasets for Pashto and Punjabi as well. The project aims to give individuals, businesses, and organizations advanced AI tools in their native languages. The envisioned LLM is expected to drive innovation in Generative AI applications, boosting productivity and accessibility in critical sectors like healthcare, education, and agriculture.
GPT-4 Accuracy Scores. Source: The Economist
Generative AI tools such as ChatGPT are powered by large language models, or LLMs. To be useful in a given language, these models need to be trained on vast amounts of data in that language. Unfortunately, Urdu accounts for less than 0.1% of the content on the Internet. This will be a challenge for the developers of Urdu LLMs.
Online Content of Various Languages. Source: W3Techs
The lack of Urdu content available for training ChatGPT reduces the accuracy of its results for Urdu-language users. For example, GPT-4's accuracy score in question-answer tests in Urdu is just over 70%, compared with 85% in English, according to data from OpenAI. Other South Asian languages, including Hindi, Bengali, Punjabi, Marathi and Telugu, suffer from the same problem.
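To put the two accuracy figures above in perspective, here is a minimal sketch of the arithmetic. It assumes "just over 70%" can be rounded to 70 for illustration; the function and variable names are hypothetical and not taken from OpenAI or The Economist.

```python
def error_rate(accuracy_pct: float) -> float:
    """Return the share of wrong answers, in percent, for a given accuracy."""
    return 100.0 - accuracy_pct


# Figures quoted in the post. "Just over 70%" is rounded to 70 here,
# which is an assumption made only for illustration.
urdu_accuracy = 70.0
english_accuracy = 85.0

urdu_errors = error_rate(urdu_accuracy)
english_errors = error_rate(english_accuracy)

# Gap in percentage points, and the ratio between the two error rates.
gap_points = english_accuracy - urdu_accuracy
error_ratio = urdu_errors / english_errors

print(f"Accuracy gap: {gap_points:.0f} percentage points")
print(f"Wrong answers: {urdu_errors:.0f}% in Urdu vs {english_errors:.0f}% in English")
print(f"Urdu error rate is about {error_ratio:.1f}x the English rate")
```

Under these rounded figures, a 15-point accuracy gap means an Urdu user gets roughly twice as many wrong answers as an English user.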
It's not just a South Asian problem. These challenges exist across the developing world. Non-European languages are generally poorly represented online, which is a major obstacle for non-European nations trying to develop their own generative artificial intelligence (AI) models, since these models rely on vast amounts of training data. Generative AI can also produce biased results due to a number of factors, including the data it's trained on, the algorithms used, and how it's deployed.
Without local-language models, the use of AI in developing nations such as Pakistan will remain limited to the small number of people proficient in English. Broadening the adoption of AI applications will require LLMs trained on local-language content. Failing to develop them could cost Pakistan the opportunity to take full advantage of the AI Revolution.
Riaz Haq
UNODC Pakistan provided Law Enforcement with Cutting-Edge Training on Crime Analytics and AI Models to Counter Terrorism
https://www.unodc.org/copak/en/Stories/SP4/unodc-pakistan-provided-...
28 September 2024, Islamabad - UNODC Pakistan organized a comprehensive workshop aimed at building the capacity of National Counter Terrorism Authority analysts in using advanced crime analytics and artificial intelligence (AI) to combat terrorism. The workshop covered a wide range of critical topics, equipping participants with the skills and knowledge needed to analyze data and counter terrorism through innovative AI techniques. In total, 25 analysts, including 7 women, participated in the training session.
The participants were introduced to the fundamentals of intelligence gathering, the intelligence cycle, and the development of intelligence products. Practical discussions were held around strategic intelligence and its pivotal role in decision-making. Participants also reviewed products developed in earlier training sessions on i2 Analyst's Notebook and Power BI, enabling them to grasp how past learnings integrate with the current focus on terrorism prevention. The workshop covered data analysis, beginning with an introduction to various data forms and their relevance in crime intelligence. Sessions covered both qualitative and quantitative data, with participants learning how to distinguish between structured and unstructured data and their real-world applications in intelligence work.
The hands-on segment included Textalyser, an online tool for analyzing qualitative data, especially for sentiment analysis, which allowed participants to experiment with real-world examples. Participants were engaged through thought-provoking case studies, including analyses of social media sentiment and notable cases such as the Al Qaeda network and the Sialkot lynching. These examples highlighted the practical value of AI tools like Voyant in unraveling criminal networks and understanding public sentiment related to terrorist activities.
The workshop also devoted hands-on sessions to low-code and no-code AI platforms, empowering participants to leverage AI without the need for extensive programming knowledge. Practical exercises included case studies using Google's Teachable Machine for image classification and Google Cloud AutoML for predictive crime analytics, both of which offer powerful tools for identifying criminal patterns and behaviors in complex datasets.
The workshop concluded with a closing session that recapped the key learnings and allowed participants to discuss the next steps in their professional development.
Nov 8, 2024
Riaz Haq
Generalists vs. Specialists: Evaluating Large Language Models for Urdu
https://arxiv.org/html/2407.04459v1
In this paper, we compare general-purpose pretrained models, (OpenAI's) GPT-4-Turbo and (Meta/Facebook) Llama-3-8b-Instruct, with special-purpose models fine-tuned on specific tasks: XLM-Roberta-large, mT5-large, and Llama-3-8b-Instruct. We focus on seven classification and six generation tasks to evaluate the performance of these models on the Urdu language. Urdu has 70 million native speakers, yet it remains underrepresented in Natural Language Processing (NLP). Despite the frequent advancements in Large Language Models (LLMs), their performance in low-resource languages, including Urdu, still needs to be explored. We also conduct a human evaluation for the generation tasks and compare the results with the evaluations performed by GPT-4-Turbo and Llama-3-8b-Instruct. We find that special-purpose models consistently outperform general-purpose models across various tasks. We also find that the evaluation done by GPT-4-Turbo for generation tasks aligns more closely with human evaluation than the evaluation by Llama-3-8b-Instruct does. This paper contributes to the NLP community by providing insights into the effectiveness of general and specific-purpose LLMs for low-resource languages.
Nov 8, 2024
Riaz Haq
Labelers training AI say they're overworked, underpaid and exploited by big American tech companies - CBS News
https://www.cbsnews.com/news/labelers-training-ai-say-theyre-overwo...
Naftali Wambalo: I did labeling for videos and images.
Naftali and digital workers like him spent eight hours a day in front of a screen studying photos and videos, drawing boxes around objects and labeling them, teaching the AI algorithms to recognize them.
Naftali Wambalo: You'd label, let's say, furniture in a house. And you say "This is a TV. This is a microwave." So you are teaching the AI to identify these items. And then there was one for faces of people. The color of the face. "If it looks like this, this is white. If it looks like this, it's Black. This is Asian." You're teaching the AI to identify them automatically.
Humans tag cars and pedestrians to teach autonomous vehicles not to hit them. Humans circle abnormalities to teach AI to recognize diseases. Even as AI is getting smarter, humans in the loop will always be needed because there will always be new devices and inventions that'll need labeling.
Lesley Stahl: You find these humans in the loop not only here in Kenya but in other countries thousands of miles from Silicon Valley. In India, the Philippines, Venezuela - often countries with large low wage populations - well educated but unemployed.
Nerima Wako-Ojiwa: Honestly, it's like modern-day slavery. Because it's cheap labor–
Lesley Stahl: Whoa. What do you –
Nerima Wako-Ojiwa: It's cheap labor.
Like modern day slavery, says Nerima Wako-Ojiwa, a Kenyan civil rights activist, because big American tech companies come here and advertise the jobs as a ticket to the future. But really, she says, it's exploitation.
Nerima Wako-Ojiwa: What we're seeing is an inequality.
Lesley Stahl: It sounds so good. An AI job! Is there any job security?
Nerima Wako-Ojiwa: The contracts that we see are very short-term. And I've seen people who have contracts that are monthly, some of them weekly, some of them days. Which is ridiculous.
She calls the workspaces AI sweatshops with computers instead of sewing machines.
Nerima Wako-Ojiwa: I think that we're so concerned with "creating opportunities," but we're not asking, "Are they good opportunities?"
Because every year a million young people enter the job market, the government has been courting tech giants like Microsoft, Google, Apple, and Intel to come here, promoting Kenya's reputation as the Silicon Savannah: tech savvy and digitally connected.
Nerima Wako-Ojiwa: The president has been really pushing for opportunities in AI –
Lesley Stahl: President?
Nerima Wako-Ojiwa: Yes.
--------------
Fasica: I was basically reviewing content which are very graphic, very disturbing contents. I was watching dismembered bodies or drone attack victims. You name it. You know, whenever I talk about this, I still have flashbacks.
Lesley Stahl: Are any of you a different person than they were before you had this job?
Fasica: Yeah. I find it hard now to even have conversations with people. It's just that I find it easier to cry than to speak.
Nathan: You continue isolating you-- yourself from people. You don't want to socialize with others. It's you and it's you alone.
Lesley Stahl: Are you a different person?
Naftali Wambalo: Yeah. I'm a different person. I used to enjoy my marriage, especially when it comes to bedroom fireworks. But after the job I hate sex.
Lesley Stahl: You hated sex?
---------
These three and nearly 200 other digital workers are suing SAMA and Meta over "unreasonable working conditions" that caused psychiatric problems.
Nov 24, 2024