Leaked Yandex Code Breaks Open the Creepy Black Box of Online Advertising
Leaked Yandex Code Breaks Open the Creepy Black Box of Online Advertising
“They have been trying to maintain this image of a more independent and Western-oriented company that from time to time protested some repressive laws and orders, helping attract foreign investments and business deals,” says Natalia Krapiva, tech-legal counsel at digital rights nonprofit Access Now. “But in practice, Yandex has been losing its independence and caving in to the Russian government demands. The future of the company is uncertain, but it’s likely that the Russia-based part of the company will lose the remaining shreds of independence.”
The Yandex leak is huge. The 45 GB of source code covers almost all of Yandex’s major services, offering a glimpse into the work of its thousands of software engineers. The code appears to date from around July 2022, according to timestamps included within the data, and it mostly uses popular programming languages. It is written in English and Russian, but also includes racist slurs. (When it was leaked in January, Yandex said this was “deeply offensive and completely unacceptable,” and it detailed some ways that parts of the code broke its own company policies.)
McCrea manually inspected two parts of the code: Yandex Metrica and Crypta. Metrica is the firm’s equivalent of Google Analytics, software that places code on participating websites and in apps, through AppMetrica, that can track visitors, including down to every mouse movement. Last year, AppMetrica, which is embedded in more than 40,000 apps in 50 countries, caused national security concerns with US lawmakers after the Financial Times reported the scale of data it was sending back to Russia.
This data, McCrea says, is pulled into Crypta. The tool analyzes people’s online behavior to ultimately show them ads for things they’re interested in. More than 300 “factors” are analyzed, according to the company’s website, and machine learning algorithms group people based on their interests. “Every app or service that Yandex has, which is supposed to be over 90, is funneling data into Crypta for these advertising segments in one form or another,” McCrea says.
Some data collected by Yandex is handed over when people use its services, such as sharing their location to show where they are on a map. Other information is gathered automatically. Broadly, the company can gather information about someone’s device, location, search history, home location, work location, music listening and movie viewing history, email data, and more.
The source code shows AppMetrica collecting data on people’s precise location, including their altitude, direction, and the speed they may be traveling. McCrea questions how useful this is for advertising. It also grabs the names of the Wi-Fi networks people are connecting to. This is fed into Crypta, with the Wi-Fi network name being linked to a person’s overall Yandex ID, the researcher says. At times, its systems attempt to link multiple different IDs together.
“The amount of data that Yandex has through the Metrica is so huge, it's just impossible to even imagine it,” says Grigory Bakunov, a former Yandex engineer and deputy CTO who left the company in 2019. “It's enough to build any grouping, or segmentation of the audience.” The segments created by Crypta appear to be highly specific and show how powerful data about our online lives is when it is aggregated. There are advertising segments for people who use Yandex’s Alice smart speaker, “film lovers” can be grouped by their favorite genre, there are laptop users, people who “searched Radisson on maps,” and mobile gamers who show a long-term interest.