Apple has released a technical paper detailing the models it has developed to power Apple Intelligence, the suite of generative AI features that will roll out to iOS, macOS, and iPadOS in the coming months.
In the document, Apple pushes back against accusations that it took an ethically questionable approach to training some of its models, reiterating that it did not use private user data and instead relied on a combination of publicly available and licensed data for Apple Intelligence.
“[The] pre-training dataset consists of… data that we have licensed from publishers, publicly available or open-source datasets, and publicly available information crawled by our web crawler, Applebot,” Apple wrote in the document. “Given our commitment to protecting user privacy, we note that no private Apple user data is included in the data mix.”
In July, Proof News reported that Apple had used a dataset called The Pile, which contains captions from hundreds of thousands of YouTube videos, to train a family of models designed for on-device processing. Many YouTube creators whose captions were swept up by The Pile were unaware of this and did not consent to it; Apple later issued a statement saying it had no plans to use these models to power AI features in its products.
The technical paper, which lifts the veil on the models Apple first previewed at WWDC 2024 in June, known as Apple Foundation Models (AFM), emphasizes that the training data for the AFM models was obtained “responsibly”, or at least responsibly by Apple’s definition.
The training data for the AFM models includes publicly available web data as well as licensed data from undisclosed publishers. According to The New York Times, Apple approached several publishers in late 2023, including NBC, Condé Nast, and IAC, about multi-year deals worth at least $50 million to train its models on the publishers’ news archives. Apple’s AFM models were also trained on open-source code hosted on GitHub, including Swift, Python, C, Objective-C, C++, JavaScript, Java, and Go code.
Training models on code without permission, even open-source code, is a point of contention among developers. Some developers argue that certain open-source codebases are unlicensed or don’t allow AI training in their terms of use. But Apple says it “license-filtered” the code to try to include only repositories with minimal usage restrictions, such as those under an MIT, ISC, or Apache license.
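Apple’s paper doesn’t describe how that filtering works in practice, but a license filter of this kind typically boils down to checking each repository’s license identifier against an allowlist. The Python sketch below is purely illustrative: the repository records, metadata fields, and helper function are assumptions, with only the MIT, ISC, and Apache licenses taken from Apple’s description.

    # Illustrative sketch of license filtering; Apple's actual pipeline
    # is not described in the paper. The repo records are hypothetical.
    PERMISSIVE = {"mit", "isc", "apache-2.0"}  # the licenses Apple cites

    repos = [
        {"name": "example/tools", "license": "mit"},
        {"name": "example/kernel", "license": "gpl-3.0"},  # copyleft: excluded
        {"name": "example/scratchpad", "license": None},   # unlicensed: excluded
    ]

    def allowed_for_training(repo):
        """Keep only repos under a known permissive license."""
        license_id = repo.get("license")
        return license_id is not None and license_id.lower() in PERMISSIVE

    training_repos = [r for r in repos if allowed_for_training(r)]
    print([r["name"] for r in training_repos])  # ['example/tools']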
To improve the AFM models’ mathematical skills, Apple specifically included math questions and answers from web pages, math forums, blogs, tutorials, and seminars in the training set, according to the paper. The company also leveraged “high-quality, publicly available” datasets (which the paper doesn’t name) with “licenses that allow use for training … models,” filtered to remove sensitive information.
In total, the entire AFM model training dataset weighs in at about 6.3 trillion tokens. (Tokens are bite-sized chunks of data that are typically easier for generative AI models to ingest.) For comparison, that’s less than half the number of tokens (15 trillion) that Meta used to train its flagship text generation model, Llama 3.1 405B.
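Apple hasn’t published the tokenizer the AFM models use, so the snippet below leans on OpenAI’s open-source tiktoken library (pip install tiktoken) purely to illustrate what a token is; the sample sentence and choice of encoding are stand-ins, not details from Apple’s paper.

    # What "6.3 trillion tokens" is counting: a tokenizer splits text
    # into sub-word chunks. This uses tiktoken's GPT-4-era encoding as
    # a stand-in, since Apple's own tokenizer isn't public.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    text = "Apple Intelligence rolls out to iOS, macOS, and iPadOS."
    tokens = enc.encode(text)

    print(len(text.split()), "words ->", len(tokens), "tokens")
    print([enc.decode([t]) for t in tokens])  # the bite-sized chunks themselves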
Apple sourced additional data, including human feedback and synthetic data, to fine-tune the AFM models and attempt to mitigate unwanted behaviors, such as spewing toxic content.
“Our models were created with the goal of helping users perform everyday activities on their Apple products, based on Apple’s core values and rooted in our principles of responsible AI at every step,” the company says.
The paper contains no smoking guns or shocking revelations, and that’s by careful design. Papers like these are rarely very revealing, both because of competitive pressures and because disclosing too much could land companies in legal trouble.
Some companies that train models by scraping public web data claim the practice is protected by the fair use doctrine. But that claim is the subject of heated debate and a growing number of lawsuits.
Apple says in its document that it allows webmasters to prevent its crawler from scraping their data. But that leaves individual creators in a bind. What can an artist do if, say, their portfolio is hosted on a site that refuses to block Apple’s data scraping?
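The mechanism itself is ordinary robots.txt: Apple documents that Applebot honors the protocol, and in 2024 it added a separate “Applebot-Extended” token that lets a site keep its content out of Apple’s AI training without vanishing from Apple’s search features. The Python sketch below checks such rules with the standard-library parser; the example site, URL, and blanket Disallow rule are hypothetical.

    # A minimal sketch of the opt-out Apple describes, verified with
    # Python's standard-library robots.txt parser. "Applebot-Extended"
    # is Apple's documented token for opting out of AI training; the
    # URL and rules below are hypothetical.
    from urllib.robotparser import RobotFileParser

    robots_lines = [
        "User-agent: Applebot-Extended",
        "Disallow: /",
    ]

    parser = RobotFileParser()
    parser.parse(robots_lines)

    print(parser.can_fetch("Applebot-Extended", "https://example.com/portfolio"))  # False
    print(parser.can_fetch("SomeOtherBot", "https://example.com/portfolio"))       # True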
Legal battles will decide the fate of generative AI models and how they are trained. For now, Apple is trying to position itself as an ethical player while avoiding unwanted legal scrutiny.