source : au.finance.yahoo.com
(Bloomberg) -- In her one-room home on a quiet street in Agara, a small village three hours southwest of Bangalore, surrounded by rice paddies and groundnut fields, Preethi P. sits on a stool next to a sewing machine. She normally spends hours mending or sewing clothes, work that earns her less than $1 a day on average. On this day, however, she reads a sentence in her native Kannada from an app on a phone. She pauses for a moment and then reads another.
Preethi, who goes by a single name as is common in the region, is one of 70 workers hired in Agara and surrounding villages by a startup called Karya to collect text, voice and image data in the Indian vernacular. She is part of a vast, invisible global workforce – active in countries like India, Kenya and the Philippines – that collects and labels the data that AI chatbots and virtual assistants rely on to generate relevant responses. However, unlike many other data contractors, Preethi is paid well for her efforts, at least by local standards.
After working with Karya for three days, Preethi earned 4,500 rupees ($54), more than four times the amount the 22-year-old high school graduate usually earns in an entire month as a tailor. The money is enough, she said, to pay off that month’s installment on a loan taken out to partially repair her home’s crumbling mud walls, which have been carefully patched with colorful saris. “All I need is a telephone and internet.”
Karya was founded in 2021, before the rise of ChatGPT, but this year’s frenzy around generative AI has only increased tech companies’ insatiable demand for data. According to Nasscom, the trade body for the tech industry, India alone is expected to have nearly a million data annotation workers by 2030. Karya differentiates itself from other data providers by offering its contractors – many of them women and mostly in rural communities – as much as 20 times the prevailing minimum wage, promising to produce better quality Indian-language data that tech companies will pay more to obtain.
“Every year, major tech companies spend billions of dollars collecting training data for their AI and machine learning models,” Manu Chopra, the 27-year-old Stanford-educated computer engineer behind the startup, told Bloomberg in an interview. “Poor compensation for such work is a failure of the industry.”
If meager wages are an industry failure, it is a failure for which Silicon Valley bears some responsibility. For years, tech companies have outsourced tasks such as data labeling and content moderation to cheaper foreign contractors. But now some of Silicon Valley’s most prominent names are turning to Karya to tackle one of the biggest challenges for their AI products: finding high-quality data to build tools that better serve billions of potential non-English-speaking users. These partnerships could represent a powerful shift in the economics of the data industry and in Silicon Valley’s relationship with data providers.
Microsoft Corp. has used Karya to collect local voice data for its AI products. The Bill & Melinda Gates Foundation is working with Karya to reduce gender bias in data fed into large language models, the technology that underpins AI chatbots. And Alphabet Inc.’s Google relies on Karya and other local partners to collect voice data in 85 Indian districts. Google plans to expand to every district, covering the majority language or dialect spoken in each, and to build a generative AI model for 125 Indian languages.
Many AI services have been developed disproportionately with English-language internet data, such as articles, books and social media posts. As a result, these AI models poorly represent the diversity of languages for internet users in other countries, who are more likely to access AI-powered smartphones and apps than to learn English. India alone is home to nearly a billion such potential users, as the government pushes to roll out AI tools in every field from healthcare to education and financial services.
“India is the first non-Western country we’re doing this in, and we’re testing Bard in nine Indian languages,” said Manish Gupta, head of Google Research in India, referring to the company’s AI chatbot. “More than seventy Indian languages, each spoken by more than a million people, had zero digital corpus. The problem is so big.”
Gupta ticked off a list of issues that AI companies need to address to serve Indian internet users: non-English datasets are of woefully low quality; barely any conversational data exists in Hindi and other Indian languages; and digitized content from books and newspapers in Indian languages is very limited.
When used for South Asian languages, some major language models appear to make up words and struggle with basic grammar. There are also concerns that these AI services may reflect a distorted view of other cultures. It’s critical to have a broad representation of training data, including non-English language data, so that AI systems “don’t perpetuate harmful stereotypes, produce hate speech, or deliver misinformation,” says Mehran Sahami, professor of computer science at Stanford University.
Karya, a social impact startup headquartered in Bangalore and backed by grants, can widen the pool of languages represented in part by specifically targeting workers in rural areas who might not otherwise be contracted for such tasks. Karya’s app can work without internet access and provides voice support for people with limited literacy. In India, over 32,000 crowdsourced workers have logged into the app, completing 40 million paid digital tasks such as image recognition, contour alignment, video and voice annotation.
For Chopra, the goal is not only to improve data provision, but also to combat poverty. Karya’s founder grew up in an impoverished neighborhood called Shakur Basti in West Delhi. He won a scholarship to study at an elite school, where he was bullied because his classmates said he “smelled poor.” Chopra ended up at Stanford to study computer science, but realized he hated the “how to make a billion dollars” mentality he encountered there.
After graduating in 2017, he began working on his long-held interest: using technology to tackle poverty. “It only takes $1,500 in savings for an Indian to qualify for entry into the middle class,” Chopra said. “But it could take the poor 200 years to reach that level of savings.”
Microsoft, he discovered, had paid a hefty sum to collect voice data, albeit of poor quality, to fuel its AI systems and research. For example, in 2017, although there were 1 million hours of digitized voice data available in Marathi, a language spoken in Mumbai and the Western India region, only 165 hours were available for purchase. His startup has since collected 10,000 hours of Marathi speech data for Microsoft’s AI services, read by men and women from five different regions.
“Tech companies want the data, with accents and all,” says Chopra. “You’re coughing, that’s what they want in the speech – it represents natural language.” Saikat Guha, a researcher at Microsoft Research India who focuses on the ethics of data collection, said he has also used Karya’s content for a project to help people with visual disabilities find work. “The quality of the data is much better than any other source I have used,” says Guha. “If you pay employees fairly, they invest more in their work, and the end result is better data.”
Meanwhile, more than 30,000 young, school-educated women are working with Karya to collect ‘gender-intentional’ data sets – for example that the doctor or boss is not always a he – in six Indian languages for the Bill & Melinda Gates Foundation. It is the largest effort in this area in Indian languages and will serve as a corpus to build datasets to reduce gender bias in LLMs. Karya doesn’t stop at India. The company said it is in talks to sell its platform as a service to organizations in Africa and South America that will do similar work.
For now, women in Yelandur, another village southwest of Bangalore, are eagerly awaiting Karya’s next project: transcribing a Kannada audio recording. Among them is Shambhavi S., 25, who earned a few thousand rupees from a previous assignment while working in the quiet of her home after feeding her in-laws dinner and putting her children to bed.
“I don’t know what artificial intelligence is, I’ve never heard of it,” Shambhavi said. “I want to earn and educate my children so that they can learn how to use it.”
©2023 Bloomberg LP