Transform for Code Profiling #646

Open: wants to merge 10 commits into base: dev

Conversation

pankajskku
Member

This transform extracts the base syntactic concepts from multi-language source code and represents them in a unified, language-agnostic form that can be used for multi-language data profiling. Programming languages expose similar syntactic building blocks to represent programming intent, such as package/library imports, functions, classes, loops, conditionals, and comments, but each language expresses these concepts through its own grammar, with distinct keywords and syntactic forms.
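
To illustrate the idea in the description above, here is a minimal sketch of mapping language-specific syntax onto shared concept names. It is not the code in this PR: the rule table `CONCEPT_PATTERNS`, the function `extract_concepts`, and the line-based regular expressions are all hypothetical simplifications chosen for illustration. A real extractor would more likely work on parse trees than on regexes.

```python
import re
from collections import Counter

# Hypothetical rule table (an assumption for illustration, not the PR's code):
# for each language, map a language-agnostic concept name to a line-level regex.
CONCEPT_PATTERNS = {
    "python": {
        "import": r"^\s*(?:import|from)\s+\w+",
        "function": r"^\s*def\s+\w+\s*\(",
        "class": r"^\s*class\s+\w+",
        "loop": r"^\s*(?:for|while)\b",
        "conditional": r"^\s*if\b",
        "comment": r"^\s*#",
    },
    "java": {
        "import": r"^\s*import\s+[\w.]+",
        "function": r"^\s*(?:public|private|protected)\s+(?:static\s+)?[\w<>\[\]]+\s+\w+\s*\(",
        "class": r"^\s*(?:public\s+)?class\s+\w+",
        "loop": r"^\s*(?:for|while)\s*\(",
        "conditional": r"^\s*if\s*\(",
        "comment": r"^\s*//",
    },
}


def extract_concepts(source, language):
    """Count occurrences of each concept in `source`, keyed by concept name."""
    patterns = CONCEPT_PATTERNS[language]
    counts = Counter()
    for line in source.splitlines():
        for concept, pattern in patterns.items():
            if re.search(pattern, line):
                counts[concept] += 1
    return dict(counts)


# Two small samples that express the same intent in different grammars.
PY_SAMPLE = (
    "import os\n"
    "\n"
    "# say hi\n"
    "class Greeter:\n"
    "    def greet(self, names):\n"
    "        for n in names:\n"
    "            if n:\n"
    "                print(n)\n"
)

JAVA_SAMPLE = (
    "import java.util.List;\n"
    "\n"
    "// say hi\n"
    "public class Greeter {\n"
    "    public static void greet(List<String> names) {\n"
    "        for (String n : names) {\n"
    "            if (n != null) {\n"
    "                System.out.println(n);\n"
    "            }\n"
    "        }\n"
    "    }\n"
    "}\n"
)

if __name__ == "__main__":
    print(extract_concepts(PY_SAMPLE, "python"))
    print(extract_concepts(JAVA_SAMPLE, "java"))
```

Both samples yield one occurrence each of import, comment, class, function, loop, and conditional, so the two profiles are directly comparable even though the keywords and punctuation differ. Supporting another language or concept in this sketch only means adding entries to the table.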

Why are these changes needed?

Data profiling, in the context of machine learning, is the process of examining and analyzing data to create
useful statistics. These statistics are used both as an aid for better comprehension of the properties of data as
well as for a variety of downstream data processing tasks such as data valuation (assessing the value of data
relative to the business objectives at hand) and data curation (filtering and prioritizing training data based on
derived thresholds). In the Large Language Model (LLM) setting, training data is typically unstructured,
comprising natural language text, images, and code. In this work, we specifically focus on code LLMs,
where the quality of the code training data substantially affects model accuracy on LLM-based coding tasks
such as code generation and summarization. The ability to characterize code data in
terms of programming language concepts therefore aids both in deriving insights about code training/evaluation
data and in the downstream curation of code training data. We address the problem of profiling
multilingual code datasets by extracting an extensible, user-defined set of syntactic concepts
over arbitrary programming languages.

Related issue number (if any).


Signed-off-by: Pankaj Thorat <thorat.pankaj9@gmail.com>
@pankajskku pankajskku force-pushed the dev-pankaj branch 7 times, most recently from eea6e72 to 47b9dcd Compare September 30, 2024 21:05
daw3rd

This comment was marked as resolved.

@pankajskku pankajskku force-pushed the dev-pankaj branch 21 times, most recently from cc43bb7 to 6294b2d Compare October 2, 2024 10:21
@pankajskku pankajskku force-pushed the dev-pankaj branch 2 times, most recently from 1aeb5cc to 6af8404 Compare October 5, 2024 14:50
Signed-off-by: Pankaj Thorat <thorat.pankaj9@gmail.com>
Signed-off-by: Pankaj Thorat <thorat.pankaj9@gmail.com>
@pankajskku pankajskku force-pushed the dev-pankaj branch 3 times, most recently from dba3567 to 627b4db Compare October 10, 2024 08:22
@pankajskku
Member Author

@daw3rd Please let me know your opinion on the updated PR.

@touma-I touma-I self-requested a review October 12, 2024 15:29
Collaborator

@touma-I touma-I left a comment

@pankajskku Please slack me when you have a chance. Internal ID: touma@us.ibm.com

@pankajskku pankajskku force-pushed the dev-pankaj branch 2 times, most recently from 46d3e2d to 7e183d0 Compare October 15, 2024 09:09
Signed-off-by: Pankaj Thorat <thorat.pankaj9@gmail.com>
@pankajskku pankajskku changed the title from "Transform for Syntactic Construct Extractor" to "Transform for Code Profiling" Oct 15, 2024
Signed-off-by: Pankaj Thorat <thorat.pankaj9@gmail.com>
3 participants