walterShen's Collections: Code LMs Evaluation
- Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code (arXiv:2311.07989, 26 upvotes)
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv:2310.06770, 9 upvotes)
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution (arXiv:2401.03065, 11 upvotes)
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming (arXiv:2402.14261, 10 upvotes)
- CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code (arXiv:2302.05527, 1 upvote)
- Copilot Refinement: Addressing Code Smells in Copilot-Generated Python Code (arXiv:2401.14176)
- CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model (arXiv:2310.06266, 2 upvotes)
- TACO: Topics in Algorithmic COde generation dataset (arXiv:2312.14852, 4 upvotes)
- CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (arXiv:2310.11248, 4 upvotes)
- DevEval: Evaluating Code Generation in Practical Software Projects (arXiv:2401.06401)
- CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models (arXiv:2309.01940, 2 upvotes)
- Improving Natural Language Capability of Code Large Language Model (arXiv:2401.14242, 1 upvote)
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation (arXiv:2305.01210, 3 upvotes)
- A Static Evaluation of Code Completion by Large Language Models (arXiv:2306.03203, 3 upvotes)
- RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems (arXiv:2306.03091, 1 upvote)
- MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation (arXiv:2208.08227, 1 upvote)
- Large Language Models Are State-of-the-Art Evaluators of Code Generation (arXiv:2304.14317, 2 upvotes)
- Textbooks Are All You Need II: phi-1.5 technical report (arXiv:2309.05463, 89 upvotes)
- Textbooks Are All You Need (arXiv:2306.11644, 154 upvotes)
- Evaluating Large Language Models Trained on Code (arXiv:2107.03374, 8 upvotes)
- Design2Code: How Far Are We From Automating Front-End Engineering? (arXiv:2403.03163, 98 upvotes)
- Large Language Models Meet NL2Code: A Survey (arXiv:2212.09420, 1 upvote)
- Large Language Models for Software Engineering: A Systematic Literature Review (arXiv:2308.10620, 1 upvote)
- Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey (arXiv:2310.17903, 1 upvote)
- A Survey on Pretrained Language Models for Neural Code Intelligence (arXiv:2212.10079, 1 upvote)
- An Empirical Comparison of Pre-Trained Models of Source Code (arXiv:2302.04026, 1 upvote)
- Towards an Understanding of Large Language Models in Software Engineering Tasks (arXiv:2308.11396, 1 upvote)
- StarCoder 2 and The Stack v2: The Next Generation (arXiv:2402.19173, 152 upvotes)
- DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence (arXiv:2401.14196, 70 upvotes)
- Unsupervised Evaluation of Code LLMs with Round-Trip Correctness (arXiv:2402.08699, 1 upvote)
- Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities (arXiv:2311.16169, 1 upvote)
- Magicoder: Source Code Is All You Need (arXiv:2312.02120, 82 upvotes)
- On the Effectiveness of Large Language Models in Domain-Specific Code Generation (arXiv:2312.01639, 2 upvotes)
- CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation (arXiv:2311.08588)
- Fusion-Eval: Integrating Evaluators with LLMs (arXiv:2311.09204, 6 upvotes)
- CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models (arXiv:2302.00288)
- Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain (arXiv:2310.14053)
- Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation (arXiv:2308.10335)
- CodeScore: Evaluating Code Generation by Learning Code Execution (arXiv:2301.09043)
- InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback (arXiv:2306.14898)
- Language Models for Code Completion: A Practical Evaluation (arXiv:2402.16197, 1 upvote)
- DevBench: A Comprehensive Benchmark for Software Development (arXiv:2403.08604, 2 upvotes)
- CodeEditorBench: Evaluating Code Editing Capability of Large Language Models (arXiv:2404.03543, 18 upvotes)
- CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Authoring (arXiv:2305.12050, 2 upvotes)
- Stable Code Technical Report (arXiv:2404.01226, 1 upvote)
- CodeShell Technical Report (arXiv:2403.15747, 1 upvote)
- A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond (arXiv:2403.14734, 21 upvotes)
- PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations (arXiv:2307.02762)
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (arXiv:2405.01535, 124 upvotes)
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models (arXiv:2404.18796, 71 upvotes)