Transform for Code Profiling #646
Open: pankajskku wants to merge 10 commits into IBM:dev from pankajskku:dev-pankaj (+11,138 −0)
This transform extracts the base syntactic concepts from multi-language source code and represents them in a unified, language-agnostic representation that can be further used for multi-language data profiling. While programming languages expose similar syntactic building blocks to represent programming intent (importing packages/libraries, functions, classes, loops, conditionals, comments, and others), these concepts are expressed through language-specific grammars, defined by distinct keywords and syntactic forms.
Signed-off-by: Pankaj Thorat <thorat.pankaj9@gmail.com>
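To make the idea of a language-agnostic representation concrete, here is a minimal sketch, not the transform's actual implementation. It assumes a hypothetical `PATTERNS` table of regular expressions and a hypothetical `extract_concepts` helper; a real extractor would more likely rely on a proper parser than on regexes. The point is only that different keywords in different languages map onto the same concept names.

```python
import re

# Hypothetical, illustrative patterns: each language maps the same unified
# concept names ("import", "function", "class") to its own surface syntax.
PATTERNS = {
    "python": {
        "import": re.compile(r"^[ \t]*(?:import|from)\s+([\w.]+)", re.MULTILINE),
        "function": re.compile(r"^[ \t]*def\s+(\w+)", re.MULTILINE),
        "class": re.compile(r"^[ \t]*class\s+(\w+)", re.MULTILINE),
    },
    "javascript": {
        "import": re.compile(
            r"^[ \t]*import\s+.*?from\s+['\"]([^'\"]+)['\"]", re.MULTILINE
        ),
        "function": re.compile(r"^[ \t]*function\s+(\w+)", re.MULTILINE),
        "class": re.compile(r"^[ \t]*class\s+(\w+)", re.MULTILINE),
    },
}


def extract_concepts(source: str, language: str) -> dict:
    """Return {concept_name: [captured identifiers]} for one source file."""
    patterns = PATTERNS[language]
    return {
        concept: [m.group(1) for m in pattern.finditer(source)]
        for concept, pattern in patterns.items()
    }


py_src = "import os\nfrom sys import path\n\nclass A:\n    def f(self):\n        pass\n"
js_src = "import fs from 'fs';\nclass A {}\nfunction f() {}\n"

py_concepts = extract_concepts(py_src, "python")
js_concepts = extract_concepts(js_src, "javascript")
```

Both calls return dictionaries with the same keys, so downstream profiling code does not need to know which language a file was written in.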
pankajskku force-pushed the dev-pankaj branch 7 times, most recently from eea6e72 to 47b9dcd (September 30, 2024 21:05)
pankajskku force-pushed the dev-pankaj branch 21 times, most recently from cc43bb7 to 6294b2d (October 2, 2024 10:21)
daw3rd requested changes on Oct 4, 2024
transforms/code/syntactic_concept_extractor/python/src/syntactic_concept_extractor_transform.py (outdated)
pankajskku force-pushed the dev-pankaj branch 2 times, most recently from 1aeb5cc to 6af8404 (October 5, 2024 14:50)
pankajskku force-pushed the dev-pankaj branch 3 times, most recently from dba3567 to 627b4db (October 10, 2024 08:22)
@daw3rd Please let me know your opinion on the updated PR.
touma-I requested changes on Oct 12, 2024
@pankajskku Please slack me when you have a chance. Internal ID: touma@us.ibm.com
pankajskku force-pushed the dev-pankaj branch 2 times, most recently from 46d3e2d to 7e183d0 (October 15, 2024 09:09)
pankajskku force-pushed the dev-pankaj branch from 7e183d0 to 4f0bdd4 (October 15, 2024 09:11)
pankajskku changed the title from "Transform for Syntactic Construct Extractor" to "Transform for Code Profiling" on Oct 15, 2024
pankajskku force-pushed the dev-pankaj branch from 143c054 to 39158a5 (October 15, 2024 13:30)
Why are these changes needed?
Data profiling, in the context of machine learning, is the process of examining and analyzing data to create
useful statistics. These statistics serve both as an aid to understanding the properties of the data and as
input to downstream data processing tasks such as data valuation (assessing the value of data relative to the
business objectives at hand) and data curation (filtering and prioritizing training data based on derived
thresholds). In the Large Language Model (LLM) setting, training data is typically unstructured, comprising
natural language text, images, and code. In this work, we focus specifically on code LLMs, where the quality of
the code training data substantially affects model accuracy on LLM-based coding tasks such as code generation
and summarization. The ability to characterize code data in terms of programming language concepts therefore
helps both in deriving insights about code training/evaluation data and in the downstream curation of code
training data. We address the problem of profiling multilingual code datasets by extracting an extensible,
user-defined set of syntactic concepts over arbitrary programming languages.
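
As a rough illustration of how extracted concepts could feed profiling and curation, the sketch below aggregates concept counts per language and filters records by a threshold. The record shape and the `profile` and `curate` helpers are assumptions made for this example, not part of the PR.

```python
from collections import Counter, defaultdict

# Hypothetical record shape: (language, {concept_name: [identifiers]}).
records = [
    ("python", {"import": ["os", "sys"], "function": ["f"]}),
    ("python", {"function": ["g"]}),
    ("javascript", {"import": ["fs"]}),
]


def profile(records):
    """Count how often each concept occurs, grouped by language."""
    stats = defaultdict(Counter)
    for language, concepts in records:
        for concept, names in concepts.items():
            stats[language][concept] += len(names)
    return {language: dict(counts) for language, counts in stats.items()}


def curate(records, concept, min_count):
    """Keep only records with at least min_count occurrences of concept."""
    return [r for r in records if len(r[1].get(concept, [])) >= min_count]


stats = profile(records)
with_imports = curate(records, "import", 1)
```

Here `stats` plays the role of the profiling statistics, and `curate` stands in for a threshold-based filtering step of the kind described above.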
Related issue number (if any).