[N] Interview with Hamel Husain on semantic code search research at GitHub
“We hope that the community can use this dataset to improve developer tools generally, which may include semantic code search. We hope that the state of the art with regards to representation learning of code is advanced because researchers and practitioners now have a common dataset and a forum in which to discuss results. We also hope that the uniqueness of the dataset will inspire the community to uncover new approaches and techniques for code and natural language understanding.”
That’s a quote from one of the authors of CodeSearchNet, a collection of datasets, tools, and benchmarks for representation learning of code. This research on semantic code search has been posted here before as news, but I thought some people might be interested in the details of what goes into a project like this at a big company. I interviewed Hamel Husain, a machine learning engineer at GitHub, about how the project started and evolved into a wider open source effort to involve the ML research community. I hope there are useful takeaways for people here.
Here’s a link to the interview: https://sourcesort.com/interview/hamel-husain-on-semantic-code-search
And here’s a link to the original paper on arXiv: https://arxiv.org/abs/1909.09436