Neural Code Search: ML-based code search using natural language queries

Engineers work best when they can easily find code examples to guide them on particular coding tasks. For some questions — for example, “How to programmatically close or hide the Android soft keyboard?” — information is readily available from popular resources like Stack Overflow. But questions specific to proprietary code or APIs (or code written in less common programming languages) need a different solution, since they are not typically discussed in those forums.

To address this need, we’ve developed a code search tool that applies natural language processing (NLP) and information retrieval (IR) techniques directly to source code text. This tool, called Neural Code Search (NCS), accepts natural language queries and returns relevant code fragments retrieved directly from the code corpus. Our premise is that with the availability of large codebases, code fragments related to a developer's query are likely to be discoverable somewhere within them. In this blog, we introduce two models that accomplish this task:

NCS: an unsupervised model that combines NLP and IR techniques.

UNIF: an extension of NCS that uses supervised learning to improve performance when labeled training data is available.

Leveraging open source Facebook AI tools (including fastText, FAISS, and PyTorch), NCS and UNIF both represent natural language queries and code snippets as vectors, and then train a network such that the vector representations of semantically similar code snippets and queries are close together in the vector space. With these models, we are able to find code snippets directly from a code corpus in order to efficiently answer engineers’ questions. To evaluate NCS and UNIF, we used our newly created data set of public queries from Stack Overflow. Our models can correctly answer questions from this data set, such as “Close/hide soft keyboard” and “How to create a dialog without title”.

NCS's performance demonstrates that a relatively simple approach can work well in the domain of source code. UNIF’s performance shows that a simple supervised learning approach can offer significant additional benefits when labeled data is available. In conjunction with other Facebook-built systems, like Aroma and Getafix, this project provides our engineers with an extensive and growing ML-powered toolkit to help them write and manage code more effectively.

The NCS model captures program semantics (in this case, the intent of the code snippet) by using embeddings — continuous vector representations that, when computed appropriately, have the desirable property of putting semantically similar entities close to each other in the vector space. In the example below, there are two distinct method bodies that both pertain to closing or hiding the Android soft keyboard (the first question above). Since they share similar semantic meanings, even if they do not share the exact same lines of code, they are represented by points in the vector space that are close to each other.

(This graphic shows publicly available code from GitHub, shared under a GNU General Public License.)

We use this concept to build the NCS model. At a high level, each code fragment during model generation is embedded in a vector space at a method-level granularity. Once the model is built, a given query is mapped to the same vector space, and vector distance is used to estimate the relevance of code fragments to the query. This section will describe the model generation and search retrieval pipeline, as shown in the diagram below, in more detail.

To generate a model, NCS must extract words, build word embeddings, and then build document embeddings. (Here “document” refers to a method body.)

(The sample data shown here is from publicly available code shared on GitHub under the Apache 2.0 license.)

To produce a vector that represents a method body, we treat the source code as text and extract tokens from the following syntactic categories: method names, method invocations, enums, string literals, and comments. We then tokenize the text based on standard English conventions (e.g., whitespace and punctuation) and punctuation relevant to code (e.g., snake case and camel case). For example, for the method body “pxToDp” in the graphic above, the source code can be treated as the collection of words: “converts pixel in dp px to dp get resources get display metrics”.

For each method body in our corpus, we can tokenize the source code in this manner and learn an embedding for each word. After this step, the list of words we extracted for each method body resembles a natural language document.
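
As an illustration of this tokenization step, here is a minimal sketch in Python; the function name and regular expressions are our own, not part of NCS itself:

```python
import re

def tokenize(text):
    """Split source-code text into lowercase word tokens: split on whitespace
    and punctuation, then break snake_case and camelCase identifiers apart."""
    tokens = []
    for chunk in re.split(r"[^A-Za-z0-9_]+", text):      # whitespace & punctuation
        for part in chunk.split("_"):                     # snake_case
            # camelCase / PascalCase -> separate words, keep digit runs
            tokens.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [t.lower() for t in tokens if t]

# tokenize("pxToDp getResources().getDisplayMetrics()")
# -> ['px', 'to', 'dp', 'get', 'resources', 'get', 'display', 'metrics']
```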

We use fastText to build word embeddings for all the words in the vocabulary corpus. FastText computes vector representations using a two-layer dense neural network that can be trained unsupervised on a large corpus. In particular, we use the skip-gram model, where the embedding for a target token is used to predict embeddings of context tokens within a fixed window size. In our example above, given the embedding for the token “dp” and a window size of 2, the skip-gram model would learn to predict the embeddings for the tokens “pixel,” “in,” “px,” and “to.” The objective is to learn an embedding matrix T ∈ R^(|V| × d), where |V| is the size of the vocabulary, d is the word embedding dimension, and the k-th row in T is the embedding for the k-th word in V.
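
As a rough sketch of this step, the snippet below trains skip-gram fastText embeddings using Facebook’s open source fastText Python bindings; the file name and hyperparameter values are illustrative assumptions, not the settings used for NCS:

```python
import fasttext

# "corpus.txt" is assumed to hold one tokenized method body per line, e.g.
# "converts pixel in dp px to dp get resources get display metrics"
model = fasttext.train_unsupervised(
    "corpus.txt",
    model="skipgram",   # skip-gram objective, as described above
    dim=100,            # word embedding dimension d (illustrative value)
    ws=5,               # context window size
    minCount=1,
)

vec = model.get_word_vector("keyboard")   # 100-dimensional embedding for one token
```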

In this matrix, two vector representations are close together if the corresponding words often occur in similar contexts. We use the converse of this statement to help define semantic relation: words with vectors that are closer together should have related meanings. This is called the distributional hypothesis in the NLP literature, and we believe the same concept holds for source text.

The next step is to express the overall intent of a method body using the words that are present in the method body. To do so, we take a weighted average of the word embedding vectors for the set of words in the method body. We call this a document embedding.

The document embedding is computed as a TF–IDF-weighted combination of the normalized word vectors:

e_d = Σ_{w ∈ d} tf–idf(w, d, C) · u(v_w)

Here, d is the set of words representing a method body, v_w is the fastText word embedding for the word w, C is the corpus containing all documents, and u is a normalizing function.

We use the term frequency–inverse document frequency (TF–IDF) weighting function, which assigns a weight to a given word in a given document. Its goal is to highlight the most representative words in the document: a word receives a higher weight if it appears frequently in the document, but it is penalized if it appears in too many documents in the corpus.
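
A minimal sketch of building a document embedding this way is shown below; it assumes a `word_vec` function (for example, wrapping the fastText model above) and uses one common TF–IDF variant, which may differ in detail from the exact weighting NCS uses:

```python
import numpy as np
from collections import Counter
from math import log

def document_embedding(doc_tokens, all_docs, word_vec, dim=100):
    """Build a method-body embedding as a TF-IDF-weighted combination of
    normalized word vectors. `word_vec(w)` is assumed to return the fastText
    embedding for w; `all_docs` is the list of all tokenized method bodies."""
    n_docs = len(all_docs)
    counts = Counter(doc_tokens)
    emb = np.zeros(dim)
    for w, count in counts.items():
        tf = count / len(doc_tokens)                 # term frequency in this document
        df = sum(1 for d in all_docs if w in d)      # number of documents containing w
        idf = log(n_docs / (1 + df))                 # penalize words found everywhere
        v = np.asarray(word_vec(w), dtype=float)
        v /= (np.linalg.norm(v) + 1e-8)              # u: normalize the word vector
        emb += tf * idf * v                          # weight by tf-idf
    return emb
```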

At the end of this step, we have an index mapping each method body in the corpus to its document vector representation, and the model generation is complete.

The search query is expressed as natural language sentences, such as “Close/hide soft keyboard” or “How to create a dialog without title”. We tokenize the query in the same fashion as the source code and, using the same fastText embedding matrix T, we simply average the vector representations of the words to create a document embedding for the query sentence; out-of-vocabulary words are dropped. We then use a standard similarity search algorithm, FAISS, to find the document vectors with the closest cosine distance to the query and return the top results (plus some post-processed ranking, which is explained further in the NCS paper).
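
The retrieval step can be sketched with FAISS roughly as follows; the function names, and the choice of an exact inner-product index over L2-normalized vectors (so that inner product equals cosine similarity), are our own illustrative choices:

```python
import numpy as np
import faiss

def build_index(doc_embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """Index method-body embeddings for cosine-similarity search."""
    vecs = np.asarray(doc_embeddings, dtype="float32").copy()
    faiss.normalize_L2(vecs)                     # unit length: inner product == cosine
    index = faiss.IndexFlatIP(vecs.shape[1])     # exact (flat) inner-product index
    index.add(vecs)
    return index

def search(index: faiss.IndexFlatIP, query_vec: np.ndarray, k: int = 10):
    """Return (scores, ids) of the k method bodies closest to the query embedding."""
    q = np.asarray(query_vec, dtype="float32").reshape(1, -1).copy()
    faiss.normalize_L2(q)
    return index.search(q, k)
```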

We evaluated NCS’s performance using Stack Overflow questions, with the title as the query and a code snippet from the answers as the desired code answer. Given a query, we measure whether our model was able to retrieve a correct answer from a collection of GitHub repositories in the top 1, 5, and 10 results (labeled Answered@1, Answered@5, and Answered@10 in the table below, respectively). We also report mean reciprocal rank (MRR) to measure at which rank NCS was able to answer the question correctly. We explain the experimental setup in more detail below. Out of the 287 questions in the Stack Overflow evaluation data set we created, NCS answered 175 questions correctly in the top 10 results; this equates to over 60 percent of the entire data set. We also compared NCS’s performance with that of a traditional IR technique, BM25. As shown in this table, NCS outperformed BM25.
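
For reference, the two metrics can be computed as in the following sketch, where `ranks` is assumed to hold the 1-based rank of the first correct answer for each query (or None if no returned result was correct):

```python
def answered_at_k(ranks, k):
    """Fraction of queries whose first correct result appears within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank of the first correct answer (0 when unanswered)."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# e.g. ranks = [1, 3, None, 2] -> answered_at_k(ranks, 5) == 0.75
```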

An example of a question that NCS answers well is “Start Android Market from app”, where the top result returned by NCS is:

(The snippet is from publicly available code on GitHub, shared here under the MIT license.)

The crux of NCS is its use of word embeddings. Since NCS is an unsupervised model, it has several significant advantages: It can learn directly over the search corpus, and it can be trained quickly and easily. NCS assumes that the words in the query come from the same domain as the words extracted from source code, since the queries and the code snippets are both mapped to the same vector space. However, this is not always the case. For example, none of the words in the query “Get free space on internal memory” appear in the code snippet below. What we want is to map the query words “free space” to the word “available” in the code.

(The snippet was taken from publicly available code on Stack Overflow shared here under the CC-By-SA 3.0 license.)

From a data set of 14,005 Stack Overflow posts, we analyzed the overlap between the words in the queries and the words in the source code. We found that out of 13,972 unique words in the queries, fewer than half (6,072 words) also existed in the source code domain. This means that if a query contains words that do not appear in source code, our model will not be able to retrieve the correct method, since we drop out-of-vocabulary words. This observation motivated us to explore supervised learning to map words in the query to words in source code.
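
The overlap analysis itself is straightforward; a minimal sketch, assuming the queries and method bodies have already been tokenized into lists of words:

```python
def vocab_overlap(query_docs, code_docs):
    """Count unique query words and how many of them also appear in the code vocabulary."""
    query_vocab = {w for doc in query_docs for w in doc}
    code_vocab = {w for doc in code_docs for w in doc}
    return len(query_vocab), len(query_vocab & code_vocab)

# The post reports 6,072 of 13,972 unique query words appearing in the code vocabulary.
```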

We decided to experiment with UNIF, a supervised, minimal extension of the NCS technique, in order to bridge the gap between natural language words and source code words. In this model, we use supervised learning to modify the initial word embedding matrix T and produce two embedding matrices, T_c and T_q, for code and query tokens, respectively. We also replace the TF-IDF weighting of the code token embeddings with a learned attention-based weighting scheme.

We trained UNIF on a collection of (c, q) data points, where c and q represent code and query tokens, respectively. (See the section below for details about this data set.) The model architecture can be described as follows. Let T_q ∈ R^(|V_q| × d) and T_c ∈ R^(|V_c| × d) be two embedding matrices mapping each word from the natural language description and the code tokens, respectively, to a vector of length d (V_q is the query vocabulary, and V_c is the code vocabulary). The two matrices are initialized with the same weights, T, and modified separately during training (the supervised counterpart to the fastText training in NCS). To combine each bag of code token vectors into a document vector, we use an attention mechanism to compute a weighted average. The attention weight vector a_c ∈ R^d is a d-dimensional vector learned during training, and acts as a counterpart to TF-IDF. Given a bag of code word embedding vectors {e_1, …, e_n}, the attention weight a_i for each e_i is computed as a softmax over the dot products with a_c:

a_i = exp(a_c · e_i) / Σ_k exp(a_c · e_k)

The document vector is then computed as the sum of the word embedding vectors weighted by the attention weights:

e_d = Σ_i a_i · e_i

To create a query document vector, e_q, we compute a simple average of the query word embeddings, similar to the approach in NCS. Our training process learns the parameters T_q, T_c, and a_c using classic back-propagation.
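
A minimal PyTorch sketch of the attention-based pooling described above is shown below; the class and variable names are ours, and the full UNIF model also learns the two embedding matrices T_q and T_c during the same back-propagation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Combine a bag of code-token embeddings into one document vector using a
    learned attention vector a_c (UNIF's counterpart to TF-IDF weighting)."""

    def __init__(self, dim: int):
        super().__init__()
        self.a_c = nn.Parameter(torch.randn(dim))          # learned during training

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: (num_tokens, dim) rows taken from the code embedding matrix T_c
        scores = token_embs @ self.a_c                      # dot product with a_c
        weights = F.softmax(scores, dim=0)                  # attention weights a_i
        return (weights.unsqueeze(1) * token_embs).sum(dim=0)   # weighted sum e_d
```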

Retrieval works in the same fashion as it does for NCS. Given a query, we represent it as a document vector using the approach explained above, and use FAISS to find the document vectors with the closest cosine distance to the query. (In principle, UNIF would also benefit from post-processed ranking, as NCS does.)

We compared NCS and UNIF on the Stack Overflow evaluation data set, checking whether each model correctly answered the query in the top 1, 5, and 10 results, along with the MRR score. The table below shows that UNIF significantly improves the number of questions answered over NCS.

This highlights the impressive search performance that a supervised technique can deliver when given access to an ideal training corpus. For example, with the search query “How to exit from the application and show the home screen?”, NCS and UNIF returned:

(The first snippet is from publicly available code shared to GitHub here under the Apache 2.0 license. The second was shared to GitHub here under the MIT license.)

Another example is the query “How to get the ActionBar height?”, for which NCS and UNIF returned:

(The first snippet is from publicly available code shared to GitHub here under the Apache 2.0 license. The second was shared to GitHub here also under the Apache 2.0 license.)

Additional data on UNIF’s performance is available in this paper.

One of the keys to creating a successful machine learning tool is obtaining a high-quality training data set. For our models, we leveraged the availability of the large open source codebase from GitHub. Furthermore, having a high-quality evaluation data set is equally important for assessing the quality of the model. When exploring a relatively new realm of research, such as code search, the lack of available evaluation data sets can limit our ability to assess the performance across various code search tools. So to help benchmark performance in this area, we curated a data set of 287 publicly available data points from Stack Overflow, where each data point consists of a natural language query and a “golden” code snippet answer.

We trained our unsupervised model, NCS, directly over the search corpus by picking the 26,109 most popular Android projects on GitHub. This also became the search corpus from which NCS returns code snippets.

To incorporate supervision for our UNIF model, we needed aligned pairs of data points to learn the mapping. We trained UNIF on a collection of (c, q) data points, where q is the natural language description or query, and c is the corresponding code snippet. We obtained this data set by extracting Stack Overflow question titles and code snippets from data publicly released by Stack Exchange under a CC-BY-SA 3.0 license. After filtering the questions using a variety of heuristics (the question must have an Android tag, the code snippet must contain a method call, and it must not contain XML tags, for example), we ended up with 451,000 training data points. This data set is disjoint from the evaluation queries. (This reflects a best-case availability of a training data set; as we note in the paper, docstring-based training did not give good results.)
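
A rough sketch of this kind of heuristic filtering is shown below; the specific checks and their implementation are illustrative assumptions rather than the exact rules used to build the 451,000-pair data set:

```python
import re

def keep_training_pair(question_tags, title, code_snippet):
    """Heuristic filter for (query, code) training pairs; a sketch of the kinds
    of checks described above, not the exact rules used for the data set."""
    if "android" not in {t.lower() for t in question_tags}:
        return False                              # question must be tagged Android
    if re.search(r"</\w+>", code_snippet):
        return False                              # drop snippets that look like XML markup
    if not re.search(r"\w+\s*\(", code_snippet):
        return False                              # snippet must contain at least one call
    return bool(title.strip())                    # and a non-empty natural language title
```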

We evaluated the effectiveness of NCS using Stack Overflow questions. Stack Overflow is a useful resource for evaluation because it contains a large number of natural language queries, along with upvoted answers that can be treated as acceptable responses. Given a Stack Overflow question title as the query, NCS retrieves a list of methods from GitHub. In our work creating and refining NCS, we considered a search successful if at least one of the top n results matched the method described in the Stack Overflow answer code snippet. (For our evaluation, we calculated this for the top 1, top 5, and top 10 results.)

We chose the Stack Overflow questions using a script with the following criteria: 1) the question included “Android” and “Java” tags, 2) there was an upvoted code answer, and 3) the ground truth code snippet had at least one match in our corpus of GitHub Android repos. With some manual processing to make sure the questions were acceptable, we obtained a data set of 287 questions.

We found that manually assessing the correctness of search results is difficult to do in a reproducible fashion, as judgments can vary from person to person. We decided instead to implement an automated evaluation pipeline using Aroma. Aroma computes a similarity score between a search result and the ground truth code snippet, and we consider a query correctly answered if that score passes a threshold value. With this pipeline, we can evaluate the models in a reproducible fashion. We use code answers found on Stack Overflow as the ground truth for evaluation.

We used the above evaluation procedure not only to compare UNIF and NCS, but also to compare UNIF against some of the other code search solutions in the literature. (Details on the comparisons are available in the paper.)

With the widespread availability of large code repositories in production today, machine learning can extract patterns and insights that can boost engineer productivity. At Facebook, these machine learning tools include code-to-code recommendation with Aroma and automatic bug fixing with Getafix. NCS and UNIF are examples of code search models that can bridge the gap between natural language queries and finding relevant code snippets. With tools such as these, engineers will be able to easily find and use relevant code snippets, even when working with proprietary source code or writing in less common programming languages. In the future, we hope to explore other deep learning models in the realm of synthesis to further boost engineers’ productivity.
