IBM Compiles Dataset To Teach Software How Software Is Made: 14m Code Samples, Half Of Which Actually Work
IBM has assembled a massive silo of source code for teaching machine-learning programs about programming.
Dubbed Project CodeNet, the set contains, we're told, 14 million code samples totaling 500 million lines in more than 55 programming languages, from Java, C, and Go to COBOL, Pascal, and FORTRAN. Truth be told, more than three-quarters of it all is in C++ and Python.
This source code wasn't taken from production or in-development applications. It was collected from entries submitted to two programming contests organized in Japan: Aizu and AtCoder. In these contests, competitors are challenged to write the code needed to turn a given set of inputs into a set of desired outputs. About half of the samples work as expected, and the rest are labeled as wrong solutions, non-building, or buggy.
Ideally, you would train an AI tool to recognize the good programs and reject the bad ones, for example. For seven million of the samples, the input and required output are included.
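To make the labeling concrete, here is a minimal sketch of how a contest-style checker might sort a submission into those buckets given an input and its required output. It is an illustration only, not CodeNet's or the contest sites' actual tooling: the `judge` function and the label strings are hypothetical, it handles only Python submissions, and it runs the code in-process, which would be unsafe for untrusted entries.

```python
import contextlib
import io


def judge(source: str, stdin_text: str, expected: str) -> str:
    """Classify a Python submission. Label names are hypothetical."""
    # "Non-building": the source does not even compile.
    try:
        code = compile(source, "<submission>", "exec")
    except SyntaxError:
        return "compile_error"

    # Feed the sample input line by line through a stand-in for input().
    lines = iter(stdin_text.splitlines())
    env = {"input": lambda prompt="": next(lines)}

    # "Buggy": the program crashes while running.
    captured = io.StringIO()
    try:
        with contextlib.redirect_stdout(captured):
            exec(code, env)  # unsafe for untrusted code; fine for a sketch
    except Exception:
        return "runtime_error"

    # Compare whitespace-separated tokens so trailing newlines don't matter.
    if captured.getvalue().split() == expected.split():
        return "accepted"
    return "wrong_answer"


if __name__ == "__main__":
    solution = "a, b = map(int, input().split())\nprint(a + b)"
    print(judge(solution, "2 3\n", "5\n"))
```

A tool trained on the dataset would aim to predict labels like these from the source text alone, without running anything.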
Big Blue wants CodeNet to follow in the footsteps of ImageNet, the database of pictures and labels for training computer-vision applications, and become the leading dataset for teaching software to understand the blueprints of software – what code actually looks like, and how it compares to other code. It's hoped CodeNet can be used to train development tools that can, for instance, search application and library source for desired routines, or perhaps translate from one language to another, or recognize faulty or correct implementations.
"IBM believes Project CodeNet will serve as a valuable benchmark dataset for source-to-source translation and transitioning legacy codebases to modern code languages, helping businesses speed up their application of AI," the biz gushed in announcing the project as part of its Think online conference this week.
The IBM and MIT-IBM Watson AI Lab team behind the dataset has produced a paper [PDF] describing their work, and put all the collated material on the project's GitHub page.
"This dataset is not only unique in its scale, but also in the diversity of coding tasks it can help benchmark: from code similarity and classification for advances in code recommendation algorithms, and code translation between a large variety of programming languages, to advances in code performance improvement techniques," the boffins concluded in their report. ®