Neural Code Search: ML-based code search using natural language queries

Engineers work best when they can easily find code examples to guide them on particular coding tasks. For some questions — for example, “How to programmatically close or hide the Android soft keyboard?” — information is readily available from popular resources like Stack Overflow. But questions specific to proprietary code or APIs (or code written in less common programming languages) need a different solution, since they are not typically discussed in those forums.

To address this need, we’ve developed a code search tool that applies natural language processing (NLP) and information retrieval (IR) techniques directly to source code text. This tool, called Neural Code Search (NCS), accepts natural language queries and returns relevant code fragments retrieved directly from the code corpus. Our premise is that with the availability of large codebases, code fragments related to a developer's query are likely to be discoverable somewhere within them. In this blog, we introduce two models that accomplish this task:

NCS: an unsupervised model that combines NLP and IR techniques.

UNIF: an extension of NCS that uses supervised learning to improve performance when labeled training data is available.

Leveraging open source Facebook AI tools (including fastText, FAISS, and PyTorch), NCS and UNIF both represent natural language queries and code snippets as vectors, and then train a network such that the vector representations of semantically similar code snippets and queries are close together in the vector space. With these models, we are able to find code snippets directly from a code corpus in order to efficiently answer engineers’ questions. To evaluate NCS and UNIF, we used our newly created data set of public queries from Stack Overflow. Our models can correctly answer questions from this data set, such as “Close/hide soft keyboard” and “How to create a dialog without title”.

NCS's performance demonstrates that a relatively simple approach can work well in the domain of source code. UNIF’s performance shows that a simple supervised learning approach can offer significant additional benefits when labeled data is available. In conjunction with other Facebook-built systems, like Aroma and Getafix, this project provides our engineers with an extensive and growing ML-powered toolkit to help them write and manage code more effectively.

The NCS model captures program semantics (in this case, the intent of the code snippet) by using embeddings — continuous vector representations that, when computed appropriately, have the desirable property of putting semantically similar entities close to each other in the vector space. In the example below, there are two distinct method bodies that both pertain to closing or hiding the Android soft keyboard (the first question above). Since they share similar semantic meanings, even if they do not share the exact same lines of code, they are represented by points in the vector space that are close to each other.

(This graphic shows publicly available code from GitHub, shared under a GNU General Public License.)

We use this concept to build the NCS model. At a high level, each code fragment during model generation is embedded in a vector space at a method-level granularity. Once the model is built, a given query is mapped to the same vector space, and vector distance is used to estimate the relevance of code fragments to the query. This section will describe the model generation and search retrieval pipeline, as shown in the diagram below, in more detail.

To generate a model, NCS must extract words, build word embeddings, and then build document embeddings. (Here “document” refers to a method body.)

(The sample data shown here is from publicly available code shared on GitHub under the Apache 2.0 license.)

To produce a vector that represents a method body, we treat the source code as text and extract tokens from the following syntactic categories: method names, method invocations, enums, string literals, and comments. We then tokenize the text based on standard English conventions (e.g., whitespace and punctuation) and punctuation relevant to code (e.g., snake case and camel case). For example, for the method body “pxToDp” in the graphic above, the source code can be treated as the collection of words: “converts pixel in dp px to dp get resources get display metrics”.

For each method body in our corpus, we can tokenize the source code in this manner and learn an embedding for each word. After this step, the list of words we extracted for each method body resembles a natural language document.
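
As an illustration of this tokenization step, here is a minimal sketch in Python; the function name and regular expressions are our own, not part of NCS itself:

```python
import re

def tokenize(text):
    """Split source-code text into lowercase word tokens: split on whitespace
    and punctuation, then break snake_case and camelCase identifiers apart."""
    tokens = []
    for chunk in re.split(r"[^A-Za-z0-9_]+", text):      # whitespace & punctuation
        for part in chunk.split("_"):                     # snake_case
            # camelCase / PascalCase -> separate words, keep digit runs
            tokens.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [t.lower() for t in tokens if t]

# tokenize("pxToDp getResources().getDisplayMetrics()")
# -> ['px', 'to', 'dp', 'get', 'resources', 'get', 'display', 'metrics']
```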

We use fastText to build word embeddings for all the words in the vocabulary corpus. FastText computes vector representations using a two-layer dense neural network that can be trained unsupervised on a large corpus. In particular, we use the skip-gram model, where the embedding for a target token is used to predict embeddings of context tokens within a fixed window size. In our example above, given the embedding for the token “dp” and a window size of 2, the skip-gram model would learn to predict the embeddings for the tokens “pixel,” “in,” “px,” and “to.” The objective is to learn an embedding matrix T ∈ R^(|V| × d), where |V| is the size of the vocabulary, d is the word embedding dimension, and the k-th row in T is the embedding for the k-th word in V.
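
As a rough sketch of this step, the snippet below trains skip-gram fastText embeddings using Facebook’s open source fastText Python bindings; the file name and hyperparameter values are illustrative assumptions, not the settings used for NCS:

```python
import fasttext

# "corpus.txt" is assumed to hold one tokenized method body per line, e.g.
# "converts pixel in dp px to dp get resources get display metrics"
model = fasttext.train_unsupervised(
    "corpus.txt",
    model="skipgram",   # skip-gram objective, as described above
    dim=100,            # word embedding dimension d (illustrative value)
    ws=5,               # context window size
    minCount=1,
)

vec = model.get_word_vector("keyboard")   # 100-dimensional embedding for one token
```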

In this matrix, two vector representations are close together if the corresponding words often occur in similar contexts. We use the converse of this statement to help define semantic relation: words with vectors that are closer together should have related meanings. This is called the distributional hypothesis in the NLP literature, and we believe the same concept holds for source text.

The next step is to express the overall intent of a method body using the words that are present in the method body. To do so, we take a weighted average of the word embedding vectors for the set of words in the method body. We call this a document embedding.

The document embedding is computed as a TF–IDF-weighted combination of the normalized word vectors:

e_d = Σ_{w ∈ d} tf–idf(w, d, C) · u(v_w)

Here, d is the set of words representing a method body, v_w is the fastText word embedding for the word w, C is the corpus containing all documents, and u is a normalizing function.

We use the term frequency–inverse document frequency (TF–IDF) weighting function, which assigns a weight to a given word in a given document. Its goal is to highlight the most representative words in the document: a word receives a higher weight if it appears frequently in the document, but it is penalized if it appears in too many documents in the corpus.
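
A minimal sketch of building a document embedding this way is shown below; it assumes a `word_vec` function (for example, wrapping the fastText model above) and uses one common TF–IDF variant, which may differ in detail from the exact weighting NCS uses:

```python
import numpy as np
from collections import Counter
from math import log

def document_embedding(doc_tokens, all_docs, word_vec, dim=100):
    """Build a method-body embedding as a TF-IDF-weighted combination of
    normalized word vectors. `word_vec(w)` is assumed to return the fastText
    embedding for w; `all_docs` is the list of all tokenized method bodies."""
    n_docs = len(all_docs)
    counts = Counter(doc_tokens)
    emb = np.zeros(dim)
    for w, count in counts.items():
        tf = count / len(doc_tokens)                 # term frequency in this document
        df = sum(1 for d in all_docs if w in d)      # number of documents containing w
        idf = log(n_docs / (1 + df))                 # penalize words found everywhere
        v = np.asarray(word_vec(w), dtype=float)
        v /= (np.linalg.norm(v) + 1e-8)              # u: normalize the word vector
        emb += tf * idf * v                          # weight by tf-idf
    return emb
```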

At the end of this step, we have an index mapping each method body in the corpus to its document vector representation, and the model generation is complete.

The search query is expressed as natural language sentences, such as “Close/hide soft keyboard” or “How to create a dialog without title”. We tokenize the query in the same fashion as the source code and, using the same fastText embedding matrix T, we simply average the vector representations of the words to create a document embedding for the query sentence; out-of-vocabulary words are dropped. We then use a standard similarity search algorithm, FAISS, to find the document vectors with the closest cosine distance to the query and return the top results (plus some post-processed ranking, which is explained further in the NCS paper).
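
The retrieval step can be sketched with FAISS roughly as follows; the function names, and the choice of an exact inner-product index over L2-normalized vectors (so that inner product equals cosine similarity), are our own illustrative choices:

```python
import numpy as np
import faiss

def build_index(doc_embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """Index method-body embeddings for cosine-similarity search."""
    vecs = np.asarray(doc_embeddings, dtype="float32").copy()
    faiss.normalize_L2(vecs)                     # unit length: inner product == cosine
    index = faiss.IndexFlatIP(vecs.shape[1])     # exact (flat) inner-product index
    index.add(vecs)
    return index

def search(index: faiss.IndexFlatIP, query_vec: np.ndarray, k: int = 10):
    """Return (scores, ids) of the k method bodies closest to the query embedding."""
    q = np.asarray(query_vec, dtype="float32").reshape(1, -1).copy()
    faiss.normalize_L2(q)
    return index.search(q, k)
```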

We evaluated NCS’s performance using Stack Overflow questions, with the title as the query and a code snippet from the answers as the desired code answer. Given a query, we measure whether our model was able to retrieve a correct answer from a collection of GitHub repositories in the top 1, 5, and 10 results (labeled Answered@1, Answered@5, and Answered@10 in the table below, respectively). We also report mean reciprocal rank (MRR) to measure at which rank NCS was able to answer the question correctly. We explain the experimental setup in more detail below. Out of the 287 questions in the Stack Overflow evaluation data set we created, NCS answered 175 questions correctly in the top 10 results; this equates to over 60 percent of the entire data set. We also compared NCS’s performance with that of a traditional IR technique, BM25. As shown in this table, NCS outperformed BM25.
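
For reference, the two metrics can be computed as in the following sketch, where `ranks` is assumed to hold the 1-based rank of the first correct answer for each query (or None if no returned result was correct):

```python
def answered_at_k(ranks, k):
    """Fraction of queries whose first correct result appears within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank of the first correct answer (0 when unanswered)."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# e.g. ranks = [1, 3, None, 2] -> answered_at_k(ranks, 5) == 0.75
```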

An example of a question that NCS answers well is “Start Android Market from app”, where the top result returned by NCS is:

(The snippet is from publicly available code on GitHub, shared here under the MIT license.)

The crux of NCS is its use of word embeddings. Since NCS is an unsupervised model, it has several significant advantages: It can learn directly over the search corpus, and it can be trained quickly and easily. NCS assumes that the words in the query come from the same domain as the words extracted from source code, since the queries and the code snippets are both mapped to the same vector space. However, this is not always the case. For example, none of the words in the query “Get free space on internal memory” appear in the code snippet below. What we want is to map the query words “free space” to the word “available” in the code.

(The snippet was taken from publicly available code on Stack Overflow shared here under the CC-By-SA 3.0 license.)

From a data set of 14,005 Stack Overflow posts, we analyzed the overlap between the words in the queries and the words in the source code. We found that out of 13,972 unique words in the queries, fewer than half (6,072 words) also existed in the source code domain. This means that if a query contains words that do not appear in source code, our model will not be able to retrieve the correct method, since we drop out-of-vocabulary words. This observation motivated us to explore supervised learning to map words in the query to words in source code.
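
The overlap analysis itself is straightforward; a minimal sketch, assuming the queries and method bodies have already been tokenized into lists of words:

```python
def vocab_overlap(query_docs, code_docs):
    """Count unique query words and how many of them also appear in the code vocabulary."""
    query_vocab = {w for doc in query_docs for w in doc}
    code_vocab = {w for doc in code_docs for w in doc}
    return len(query_vocab), len(query_vocab & code_vocab)

# The post reports 6,072 of 13,972 unique query words appearing in the code vocabulary.
```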

We decided to experiment with UNIF, a supervised, minimal extension of the NCS technique, in order to bridge the gap between natural language words and source code words. In this model, we use supervised learning to modify the initial word embedding matrix T and produce two embedding matrices, T_c and T_q, for code and query tokens, respectively. We also replace the TF-IDF weighting of the code token embeddings with a learned attention-based weighting scheme.

We trained UNIF on a collection of (c, q) data points, where c and q represent code and query tokens, respectively. (See the section below for details about this data set.) The model architecture can be described as follows. Let T_q ∈ R^(|V_q| × d) and T_c ∈ R^(|V_c| × d) be two embedding matrices mapping each word from the natural language description and the code tokens, respectively, to a vector of length d (V_q is the query vocabulary, and V_c is the code vocabulary). The two matrices are initialized with the same weights, T, and modified separately during training (the supervised counterpart to the fastText training in NCS). To combine each bag of code token vectors into a document vector, we use an attention mechanism to compute a weighted average. The attention weight vector a_c ∈ R^d is a d-dimensional vector learned during training, and acts as a counterpart to TF-IDF. Given a bag of code word embedding vectors {e_1, …, e_n}, the attention weight a_i for each e_i is computed as a softmax over the dot products with a_c:

a_i = exp(a_c · e_i) / Σ_k exp(a_c · e_k)

The document vector is then computed as the sum of the word embedding vectors weighted by the attention weights:

e_d = Σ_i a_i · e_i

To create a query document vector, e_q, we compute a simple average of the query word embeddings, similar to the approach in NCS. Our training process learns the parameters T_q, T_c, and a_c using classic back-propagation.
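
A minimal PyTorch sketch of the attention-based pooling described above is shown below; the class and variable names are ours, and the full UNIF model also learns the two embedding matrices T_q and T_c during the same back-propagation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Combine a bag of code-token embeddings into one document vector using a
    learned attention vector a_c (UNIF's counterpart to TF-IDF weighting)."""

    def __init__(self, dim: int):
        super().__init__()
        self.a_c = nn.Parameter(torch.randn(dim))          # learned during training

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: (num_tokens, dim) rows taken from the code embedding matrix T_c
        scores = token_embs @ self.a_c                      # dot product with a_c
        weights = F.softmax(scores, dim=0)                  # attention weights a_i
        return (weights.unsqueeze(1) * token_embs).sum(dim=0)   # weighted sum e_d
```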

Retrieval works in the same fashion as it does for NCS. Given a query, we represent it as a document vector using the approach explained above, and use FAISS to find the document vectors with the closest cosine distance to the query. (In principle, UNIF would also benefit from post-processed ranking, as NCS does.)

We compared NCS and UNIF on the Stack Overflow evaluation data set, checking whether each model correctly answered the query in the top 1, 5, and 10 results, along with the MRR score. The table below shows that UNIF significantly improves the number of questions answered over NCS.

This highlights the impressive search performance that a supervised technique can deliver when given access to an ideal training corpus. For example, with the search query “How to exit from the application and show the home screen?”, NCS and UNIF returned:

(The first snippet is from publicly available code shared to GitHub here under the Apache 2.0 license. The second was shared to GitHub here under the MIT license.)

Another example is the query “How to get the ActionBar height?”, for which NCS and UNIF returned:

(The first snippet is from publicly available code shared to GitHub here under the Apache 2.0 license. The second was shared to GitHub here also under the Apache 2.0 license.)

Additional data on UNIF’s performance is available in this paper.

One of the keys to creating a successful machine learning tool is obtaining a high-quality training data set. For our models, we leveraged the availability of the large open source codebase from GitHub. Furthermore, having a high-quality evaluation data set is equally important for assessing the quality of the model. When exploring a relatively new realm of research, such as code search, the lack of available evaluation data sets can limit our ability to assess the performance across various code search tools. So to help benchmark performance in this area, we curated a data set of 287 publicly available data points from Stack Overflow, where each data point consists of a natural language query and a “golden” code snippet answer.

We trained our unsupervised model, NCS, directly over the search corpus by picking the 26,109 most popular Android projects on GitHub. This also became the search corpus from which NCS returns code snippets.

To incorporate supervision for our UNIF model, we needed aligned pairs of data points to learn the mapping. We trained UNIF on a collection of (c, q) data points, where q is the natural language description or query, and c is the corresponding code snippet. We obtained this data set by extracting Stack Overflow question titles and code snippets from data publicly released by Stack Exchange under a CC-BY-SA 3.0 license. After filtering the questions using a variety of heuristics (the question must have an Android tag, the code snippet must contain a method call, and it must not contain XML tags, for example), we ended up with 451,000 training data points. This data set is disjoint from the evaluation queries. (This reflects a best-case availability of a training data set; as we note in the paper, docstring-based training did not give good results.)
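
A rough sketch of this kind of heuristic filtering is shown below; the specific checks and their implementation are illustrative assumptions rather than the exact rules used to build the 451,000-pair data set:

```python
import re

def keep_training_pair(question_tags, title, code_snippet):
    """Heuristic filter for (query, code) training pairs; a sketch of the kinds
    of checks described above, not the exact rules used for the data set."""
    if "android" not in {t.lower() for t in question_tags}:
        return False                              # question must be tagged Android
    if re.search(r"</\w+>", code_snippet):
        return False                              # drop snippets that look like XML markup
    if not re.search(r"\w+\s*\(", code_snippet):
        return False                              # snippet must contain at least one call
    return bool(title.strip())                    # and a non-empty natural language title
```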

We evaluated the effectiveness of NCS using Stack Overflow questions. Stack Overflow is a useful resource for evaluation because it contains a large number of natural language queries, along with upvoted answers that can be treated as acceptable responses. Given a Stack Overflow question title as the query, NCS retrieves a list of methods from GitHub. In our work creating and refining NCS, we considered a search successful if at least one of the top n results matched the method described in the Stack Overflow answer code snippet. (For our evaluation, we calculated this for the top 1, top 5, and top 10 results.)

We chose the Stack Overflow questions using a script with the following criteria: 1) the question included “Android” and “Java” tags, 2) there was an upvoted code answer, and 3) the ground truth code snippet had at least one match in our corpus of GitHub Android repos. With some manual processing to make sure the questions were acceptable, we obtained a data set of 287 questions.

We found that manually assessing the correctness of search results is difficult to do in a reproducible fashion, as judgments can vary from person to person. We decided instead to implement an automated evaluation pipeline using Aroma. Aroma computes a similarity score between a search result and the ground truth code snippet, and we consider a query correctly answered if that score passes a threshold value. With this pipeline, we can evaluate the models in a reproducible fashion. We use code answers found on Stack Overflow as the ground truth for evaluation.

We used the above evaluation procedure not only to compare UNIF and NCS, but also to compare UNIF against some of the other code search solutions in the literature. (Details on the comparisons are available in the paper.)

With the widespread availability of large code repositories in production today, machine learning can extract patterns and insights that can boost engineer productivity. At Facebook, these machine learning tools include code-to-code recommendation with Aroma and automatic bug fixing with Getafix. NCS and UNIF are examples of code search models that can bridge the gap between natural language queries and finding relevant code snippets. With tools such as these, engineers will be able to easily find and use relevant code snippets, even when working with proprietary source code or writing in less common programming languages. In the future, we hope to explore other deep learning models in the realm of synthesis to further boost engineers’ productivity.
