Spontaneous Code
Recommendation based on
Open Source Code Repository
Hidehiko Masuhara
Tokyo Tech
joint work with Takuya Watanabe,
Naoya Murakami, Tomoyuki Aotani
Do you program with Google?
What are you looking for
with Google?
We can’t remember
all the APIs and usages
• Too many big class libraries
• Non-trivial usages:
what should we write, e.g.,
after creating a JFrame?
to format a Date in HH:MM style?
to match a regexp to a string?
to download from an FTP server?
JavaDoc and Google
aren’t sufficient
• API documents and IDE assistance:
do not explain non-trivial usages
• Keyword-based web search tools:
can show examples, but interrupt your thinking:
how to format a Date? → think of keywords → type in keywords → browse results
As you write in the editor, similar programs are displayed and serve as examples.
Selene: a spontaneous
Let’s make a window
displaying the current time
12:45
Key Ideas behind Selene
• Entire editing text as a query
no need to think about keywords
• Textual search
fast enough with a large repository
language independent
• A large code repository (2 million files)
Architectural overview
[Figure: a Client (Eclipse plugin frontend) and a Server (keyword search), connected by three steps]
1. query extraction (keyword extraction)
2. find similar files
3. snippet selection & ranking
Snippet selection
editing program (with cursor pos.) → similar files → similar fragments → snippets to display
code after a similar fragment = what to do next
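The selection step above can be sketched as follows: slide a window over a similar file, find the fragment closest to the lines just above the cursor, and return the lines that follow it. This is a minimal sketch, not Selene's implementation. The names `tokens` and `select_snippet`, the Jaccard token overlap used as the similarity measure, and the default sizes are all assumptions made for illustration.

```python
import re


def tokens(lines):
    """Return the set of identifier-like tokens found in a list of source lines."""
    return set(re.findall(r"[A-Za-z_]\w*", "\n".join(lines)))


def select_snippet(context, candidate, window=3, length=4):
    """Return the lines that follow the fragment of `candidate` most similar to `context`.

    `context` holds the lines above the cursor; `candidate` is one similar file.
    Similarity is the Jaccard overlap of token sets (an assumption, not Selene's measure).
    """
    query = tokens(context)
    best_score, best_end = 0.0, None
    for start in range(0, max(1, len(candidate) - window + 1)):
        frag = tokens(candidate[start:start + window])
        union = query | frag
        score = len(query & frag) / len(union) if union else 0.0
        if score > best_score:
            best_score, best_end = score, start + window
    if best_end is None:
        return []  # nothing in the candidate resembles the context
    # The code after the similar fragment is "what to do next".
    return candidate[best_end:best_end + length]
```

For a context that creates a JFrame and sets its size, a candidate file that does the same and then calls `setDefaultCloseOperation` would yield that call as the recommended continuation.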
Research challenges
• Huge design & parameter space
Query extraction: entire text / lines around cursor / term weighting / ...
Snippet selection: similarity algorithms / # of displayed snippets & lines
• Code clone problem
Query extraction &
snippet selection
Basic algorithm & variability
• Query extraction
extracts keywords in the editing text
weighting by distance from cursor (how much?)
• Snippet ranking
against the lines above the cursor (how many?)
compute vector similarity / LCS (which?)
weighting by inverse term freq. (how much?)
• Snippet display
# of snippets x # of lines (how much?)
comments (show/remove?)
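The query extraction step can be illustrated with a small sketch: every identifier in the text being edited becomes a keyword, and occurrences closer to the cursor count for more. The slides do not give the weighting formula, so the exponential decay `decay ** distance`, the function name `extract_query`, and the default value are assumptions.

```python
import re
from collections import defaultdict


def extract_query(lines, cursor_line, decay=0.5):
    """Return {token: weight} for the text being edited.

    Each occurrence of a token adds decay ** distance, where distance is the
    number of lines between the occurrence and the cursor line.
    The exponential form is an assumption made for this sketch.
    """
    weights = defaultdict(float)
    for i, line in enumerate(lines):
        w = decay ** abs(i - cursor_line)
        for tok in re.findall(r"[A-Za-z_]\w*", line):
            weights[tok] += w
    return dict(weights)
```

The "how much?" question on the slide corresponds to the choice of `decay` here: a smaller value makes the query depend almost entirely on the lines next to the cursor.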
Parameter optimization thru
mechanical evaluation
• Evaluation method:
randomly chosen program → extract query → Selene → recommended snippets → precision / recall
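As a rough illustration of the two measures, the sketch below compares the tokens of the recommended snippets with the tokens that were expected. The slides do not say what the recommendations are compared against or at what granularity, so the token-set comparison and the name `precision_recall` are assumptions.

```python
def precision_recall(recommended_tokens, expected_tokens):
    """Return (precision, recall) of recommended tokens against expected tokens.

    Both arguments are iterables of tokens; duplicates are ignored.
    Comparing token sets is an assumption made for this sketch.
    """
    recommended, expected = set(recommended_tokens), set(expected_tokens)
    hits = len(recommended & expected)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall
```

Running this over many randomly chosen programs and averaging the results gives one number per parameter setting, which is what makes a mechanical parameter search possible.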
How many snippets?
• total 120 lines
for all snippets
• more snippets,
fewer lines/snippet
• Best:
6 snippets x 18 lines
[Plot for the snippets window: recall (%) against # snippets (1 to 9)]
overall recall ratio 17.6% → 25.3%
Optimal parameters
• Query extraction
extracts keywords in the editing text
weighting by distance from cursor: 0.5 → 0.11
• Snippet ranking
for each fragment in a window: 10 lines
compute vector similarity / LCS: LCS
weighting by inverse term freq.: 0.2 → 0.2
• Snippet display
# of snippets x # of lines: 6x18
comments: remove comments
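The ranking choice above (LCS with inverse term frequency weighting) can be sketched as a weighted longest common subsequence over token sequences, reading LCS as longest common subsequence. The weight `1 / frequency`, the `term_freq` table, and the function names are assumptions; the slides give only the parameter values, not the formula.

```python
def weighted_lcs(a, b, weight):
    """Return the total weight of the heaviest common subsequence of token lists a and b.

    `weight` maps a token to a non-negative number.
    """
    prev = [0.0] * (len(b) + 1)
    for x in a:
        cur = [0.0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + weight(x))
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rank_snippets(context, candidates, term_freq):
    """Sort candidate token lists by weighted LCS against the context, best first.

    Rare tokens weigh more: weight = 1 / frequency (an assumed form of
    inverse term frequency). Tokens missing from `term_freq` get weight 1.
    """
    def weight(tok):
        return 1.0 / term_freq.get(tok, 1)

    return sorted(candidates,
                  key=lambda c: weighted_lcs(context, c, weight),
                  reverse=True)
```

With this weighting, a fragment that shares a rare call such as `setVisible` with the context outranks one that shares only ubiquitous tokens such as `new`.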
Problem: Code Clones!
How to remove clones
• Offline vs online
Offline: 2 M files (or billion lines)
Online: 50 files (or 100-10,000 lines)
• Clone detection algorithms
Matrix based (cf. CCFinder)
Clustering (e.g., k-means)
Freshness count
requires a
Freshness count algorithm
“known tokens”: panel, JFrame, setDefaultCloseOperation, WindowConstants

similarity   tokens
74.0   setDefaultCloseOperation, WindowConstants, setSize, JPanel, TableLayout
64.0   setDefaultCloseOperation, WindowConstants, setSize, JPanel, TableLayout, setPaintBorderLines
63.0   setDefaultCloseOperation, TabelPanel, JPanel, TableLayout, setPaintBorderLines, setSize
60.0   JPanel, contentPane, BorderLayout, setBackground, Color, WHITE, fieldPanel, TableLayout
Freshness count: remove known tokens; count # of remaining tokens
similarity 74.0: freshness 3
similarity 64.0: freshness 4
similarity 63.0: freshness 5
similarity 60.0: freshness 8
show the snippet with max. (similarity + freshness)
74.0 + 3 = 77.0
64.0 + 4 = 68.0
63.0 + 5 = 68.0
60.0 + 8 = 68.0
add shown tokens to “known”
“known tokens”: panel, JFrame, setDefaultCloseOperation, WindowConstants, setSize, JPanel, TableLayout
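The freshness count steps can be put together as a greedy loop: count the tokens of each remaining snippet that are not yet known, show the snippet with the largest similarity + freshness, add its tokens to the known set, and repeat. This is a sketch of the steps as shown on the slides, not the actual implementation. The name `select_by_freshness`, the `count` parameter, and the tie-breaking rule (the earlier snippet in the input wins) are assumptions.

```python
def select_by_freshness(snippets, known_tokens, count):
    """Greedily choose up to `count` snippets, favouring ones that bring new tokens.

    `snippets` is a list of (similarity, tokens) pairs.
    `known_tokens` are the tokens the programmer has already seen.
    Ties are broken in favour of the earlier snippet (an assumption).
    """
    known = set(known_tokens)
    remaining = list(snippets)
    shown = []

    def score(snippet):
        similarity, toks = snippet
        freshness = len(set(toks) - known)  # tokens left after removing known ones
        return similarity + freshness

    while remaining and len(shown) < count:
        best = max(remaining, key=score)
        remaining.remove(best)
        shown.append(best)
        known |= set(best[1])  # shown tokens become known
    return shown
```

With the data from the slides, the 74.0 snippet is shown first (74.0 + 3 = 77.0). Once its tokens are known, near clones of it gain little freshness, so a less similar snippet with more unseen tokens can overtake them in the next round.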