[N] GitHub Releases Dataset Of Six Million Methods From Open Source Projects For CodeSearchNet Challenge
Introducing The GitHub CodeSearchNet Challenge
Searching for code to reuse, call into, or to see how others handle a problem is one of the most common tasks in a software developer’s day. However, search engines for code are often frustrating and never fully understand what we want, unlike regular web search engines. We started using modern machine learning techniques to improve code search but quickly realized that we were unable to measure our progress. Unlike natural language processing, which has the GLUE benchmark, code search has no standard dataset suitable for evaluation.
We collected a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. We used our TreeSitter infrastructure for this effort, and we’re also releasing our data preprocessing pipeline for others to use as a starting point in applying machine learning to code. While this data is not directly related to code search, its pairing of code with related natural language description is suitable to train models for this task. Its substantial size also makes it possible to apply high-capacity models based on modern Transformer architectures.
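To make the idea of pairing functions with their documentation concrete, here is a minimal sketch for Python source only. It swaps in Python's standard `ast` module in place of the TreeSitter-based pipeline mentioned above, so it is an illustration of the pairing step, not the released preprocessing code. The function name `extract_documented_functions` and the sample source are hypothetical.

```python
import ast


def extract_documented_functions(source: str):
    """Return (name, docstring, code) triples for each function that has a docstring."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # skip functions without documentation
                # get_source_segment requires Python 3.8 or newer
                code = ast.get_source_segment(source, node)
                pairs.append((node.name, doc, code))
    return pairs


# Hypothetical sample input: one documented and one undocumented function.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x * 2
'''

pairs = extract_documented_functions(SAMPLE)
for name, doc, _code in pairs:
    print(name, "->", doc)
```

Only `add` survives the filter here, mirroring how just a portion of the collected methods come with associated documentation.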
Our fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3, including:
Six million methods overall
Two million of which have associated documentation (docstrings, JavaDoc, and more)
Metadata that indicates the original location (repository or line number, for example) where the data was found
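As a rough sketch of how such records might be consumed, the snippet below splits methods into documented and undocumented groups. It assumes each record is one JSON object per line with `code`, `docstring`, `repo`, and `path` fields. That layout, the field names, and the sample records are all assumptions made for illustration, so check the downloaded files for the real schema.

```python
import json


def split_by_documentation(lines):
    """Partition JSON-lines records into (documented, undocumented) lists."""
    documented, undocumented = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumed field name: "docstring" (may be missing, null, or empty).
        if (record.get("docstring") or "").strip():
            documented.append(record)
        else:
            undocumented.append(record)
    return documented, undocumented


# Hypothetical records standing in for lines read from a corpus file.
SAMPLE_LINES = [
    '{"repo": "example/a", "path": "a.py", "code": "def f(): pass", "docstring": "Do f."}',
    '{"repo": "example/b", "path": "b.py", "code": "def g(): pass", "docstring": ""}',
    '{"repo": "example/c", "path": "c.py", "code": "def h(): pass"}',
]

documented, undocumented = split_by_documentation(SAMPLE_LINES)
print(len(documented), "documented;", len(undocumented), "undocumented")
```

The documented subset is the part that pairs code with natural language, while the metadata fields let each method be traced back to where it was found.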
submitted by /u/SpecificTwo