[N] GitHub Releases Dataset Of Six Million Methods From Open Source Projects For CodeSearchNet Challenge
Searching for code to reuse, call into, or to see how others handle a problem is one of the most common tasks in a software developer’s day. However, unlike regular web search engines, search engines for code are often frustrating and never fully understand what we want. We started using modern machine learning techniques to improve code search but quickly realized that we were unable to measure our progress. Unlike natural language processing, which has the GLUE benchmark, code search has no standard dataset suitable for evaluation.
Our fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3, including:
Six million methods overall
Two million of which have associated documentation (docstrings, JavaDoc, and more)
Metadata that indicates the original location (repository or line number, for example) where the data was found
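As a rough illustration of how such a corpus might be consumed, here is a minimal Python sketch that streams records and keeps only the methods that have documentation. It assumes the data is stored as gzip-compressed JSON Lines files and that each record carries a `docstring` field; neither the file layout nor the field name is stated in the announcement above, so treat both as assumptions and adjust them to the actual download. The function names `iter_records` and `documented` are hypothetical helpers, not part of any official tooling.

```python
import gzip
import json
from typing import Iterable, Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file.

    Assumption: the corpus files are .jsonl.gz, one method per line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def documented(records: Iterable[dict], doc_field: str = "docstring") -> Iterator[dict]:
    """Keep only records whose documentation field is present and non-blank.

    Assumption: `doc_field` names the field that holds the docstring or JavaDoc.
    """
    for record in records:
        if (record.get(doc_field) or "").strip():
            yield record
```

Because both helpers are generators, a file can be filtered without loading it fully into memory, for example `for method in documented(iter_records(path)): ...`, where `path` points at one downloaded file.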