
Conversation

@windyriki

Problem 1

Currently, the built-in feature bins are computed correctly for the output but are not saved in their respective programs, so they are logged in the program as:
"complexity": 0.0,
"diversity": 0.0,

Solution 1

In addition to appending the feature bins to coords, also store them on the program so they persist.
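
A minimal sketch of the idea (method and field names taken from the database excerpts reviewed below; the surrounding loop is abbreviated):

    # In the coordinate calculation, write the bin back onto the program before appending.
    bin_idx = self._calculate_complexity_bin(len(program.code))
    program.complexity = bin_idx  # persisted when the program is saved
    coords.append(bin_idx)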

Problem 2

Although llm_feedback_weight is documented as being reflected in the final score, the final score uses a fixed weighting (0.3) and does not consider llm_feedback_weight at all.

Solution 2

Removed the fixed weighting and used llm_feedback_weight to weight the LLM feedback in the final score.
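
With w = llm_feedback_weight, the recomputed score becomes (per the evaluator diff reviewed below):

    # Sketch: configurable weighting instead of the hard-coded 0.7/0.3 split.
    w = self.config.llm_feedback_weight
    eval_result.metrics["combined_score"] = accuracy * (1 - w) + llm_average * w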

Files changed

  • openevolve/database.py - Storing built-in feature bins in the program
  • openevolve/evaluator.py - Using llm_feedback_weight in the final score

Copilot AI review requested due to automatic review settings February 9, 2026 10:52

Copilot AI left a comment


Pull request overview

This PR fixes two inconsistencies in OpenEvolve’s evolution bookkeeping: (1) built-in MAP-Elites feature bins (complexity/diversity) were being computed for coordinates but not persisted on the Program object, and (2) the final combined_score weighting ignored llm_feedback_weight and used a fixed constant.

Changes:

  • Persist built-in MAP-Elites complexity and diversity bin indices onto the Program before saving.
  • Update combined_score recomputation to weight LLM feedback using self.config.llm_feedback_weight instead of a fixed 0.3.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File | Description
openevolve/database.py | Stores computed built-in feature bin indices (complexity, diversity) into the Program so they persist to disk.
openevolve/evaluator.py | Uses llm_feedback_weight when recomputing combined_score after LLM evaluation.

Comment on lines 212 to 214
    eval_result.metrics["combined_score"] = (
-       accuracy * 0.7 + llm_average * 0.3
+       accuracy * (1-self.config.llm_feedback_weight) + llm_average * self.config.llm_feedback_weight
    )

Copilot AI Feb 9, 2026


combined_score now depends on llm_feedback_weight, but there’s no guard ensuring the weight is within [0.0, 1.0]. If a user misconfigures this, the score can become negative or exceed expected bounds; consider clamping or raising a clear config error before using it here.
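
One way to fail fast, as a sketch; the exact placement (config validation vs. this call site) is a design choice, and the message text is illustrative:

    w = self.config.llm_feedback_weight
    if not 0.0 <= w <= 1.0:
        raise ValueError(f"llm_feedback_weight must be in [0.0, 1.0], got {w}")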

Comment on lines +213 to 214
+       accuracy * (1-self.config.llm_feedback_weight) + llm_average * self.config.llm_feedback_weight
    )

Copilot AI Feb 9, 2026


This line exceeds the configured Black line length (100) and is missing spaces around operators (e.g., 1 - weight). Reformatting will improve readability and avoid formatting/lint churn in future diffs.
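
A reformatting that addresses both points, sketched with an illustrative local variable name:

    weight = self.config.llm_feedback_weight
    eval_result.metrics["combined_score"] = accuracy * (1 - weight) + llm_average * weight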

@windyriki (Author)


@copilot open a new pull request to apply changes based on this feedback

Comment on lines 208 to 214
if "combined_score" in eval_result.metrics:
# Original combined_score is just accuracy
accuracy = eval_result.metrics["combined_score"]
# Combine with LLM average (70% accuracy, 30% LLM quality)
eval_result.metrics["combined_score"] = (
accuracy * 0.7 + llm_average * 0.3
accuracy * (1-self.config.llm_feedback_weight) + llm_average * self.config.llm_feedback_weight
)

Copilot AI Feb 9, 2026


Test coverage: the combined_score weighting behavior changed to depend on llm_feedback_weight, but there doesn’t appear to be a unit test asserting the new formula. Adding a focused test (including edge weights like 0.0/1.0) would prevent regressions.
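
A focused test could pin the formula down. This is only a sketch: it mirrors the new expression in a local helper rather than driving the real evaluator, whose fixture setup is project-specific:

    import pytest

    def combine_score(accuracy: float, llm_average: float, weight: float) -> float:
        """Stand-in mirroring the PR's recomputation of combined_score."""
        return accuracy * (1 - weight) + llm_average * weight

    @pytest.mark.parametrize(
        "weight, expected",
        [
            (0.0, 0.8),                    # weight 0.0: LLM feedback ignored entirely
            (1.0, 0.4),                    # weight 1.0: accuracy ignored entirely
            (0.3, 0.8 * 0.7 + 0.4 * 0.3),  # previous hard-coded behavior as a special case
        ],
    )
    def test_combined_score_respects_llm_feedback_weight(weight, expected):
        assert combine_score(accuracy=0.8, llm_average=0.4, weight=weight) == pytest.approx(expected)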

Comment on lines 861 to 865
    # Use code length as complexity measure
    complexity = len(program.code)
    bin_idx = self._calculate_complexity_bin(complexity)
+   program.complexity = bin_idx  # Store complexity bin in program
    coords.append(bin_idx)

Copilot AI Feb 9, 2026


Assigning the bin index into Program.complexity/Program.diversity is semantically ambiguous (the dataclass defines these as derived feature values, currently typed as float). Consider either casting to float for consistency, or introducing explicit fields like complexity_bin/diversity_bin to avoid confusing bins with raw feature values.
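
One shape the explicit-fields option could take, assuming the Program dataclass keeps its existing float-typed feature fields; the bin field names below are illustrative, not the project's current API, and other Program fields are omitted:

    from dataclasses import dataclass

    @dataclass
    class Program:
        code: str = ""
        complexity: float = 0.0   # raw derived feature value, as currently documented
        diversity: float = 0.0
        complexity_bin: int = 0   # MAP-Elites bin index, kept separate from the raw value
        diversity_bin: int = 0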

@windyriki (Author)


@copilot open a new pull request to apply changes based on this feedback

Comment on lines 867 to 874
    # Use cached diversity calculation with reference set
    if len(self.programs) < 2:
        bin_idx = 0
    else:
        diversity = self._get_cached_diversity(program)
        bin_idx = self._calculate_diversity_bin(diversity)
+       program.diversity = bin_idx  # Store diversity bin in program
    coords.append(bin_idx)

Copilot AI Feb 9, 2026


In the cold-start branch (len(self.programs) < 2) diversity’s bin_idx is forced to 0 but program.diversity is not updated, so saved programs may still show the default value rather than the computed bin. Set program.diversity in this branch as well for consistency with the complexity handling.
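
A sketch of the suggested fix, with the assignment hoisted out of the else branch so both paths record the bin:

    elif dim == "diversity":
        # Use cached diversity calculation with reference set
        if len(self.programs) < 2:
            bin_idx = 0
        else:
            diversity = self._get_cached_diversity(program)
            bin_idx = self._calculate_diversity_bin(diversity)
        program.diversity = bin_idx  # now recorded in the cold-start branch too
        coords.append(bin_idx)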

@windyriki (Author)


@copilot open a new pull request to apply changes based on this feedback

Comment on lines 860 to 874
elif dim == "complexity":
# Use code length as complexity measure
complexity = len(program.code)
bin_idx = self._calculate_complexity_bin(complexity)
program.complexity = bin_idx # Store complexity bin in program
coords.append(bin_idx)
elif dim == "diversity":
# Use cached diversity calculation with reference set
if len(self.programs) < 2:
bin_idx = 0
else:
diversity = self._get_cached_diversity(program)
bin_idx = self._calculate_diversity_bin(diversity)
program.diversity = bin_idx # Store diversity bin in program
coords.append(bin_idx)

Copilot AI Feb 9, 2026


Test coverage: this change is intended to persist built-in feature bins into the saved Program, but there’s no test asserting that Program.complexity/diversity are updated after coordinate calculation/add(). Add a unit test that loads/saves a program and verifies these fields are non-default when built-in dimensions are used.
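
A rough shape for such a test; the setup helpers below are hypothetical placeholders to adapt to the project's actual config/database API, while the bin-calculation helpers come from the excerpts above:

    def test_builtin_feature_bins_persist_on_program():
        # Hypothetical factories; adapt to the real ProgramDatabase/Program constructors.
        db = make_database(feature_dimensions=["complexity", "diversity"])
        first = make_program(code="x = 1\n")
        second = make_program(code="y = 2\n" * 50)
        db.add(first)
        db.add(second)  # with two programs, diversity is past the cold-start branch
        # After add(), the bin indices should be written back onto the Program.
        assert second.complexity == db._calculate_complexity_bin(len(second.code))
        # The diversity field should likewise hold a bin index, not the dataclass default.
        assert isinstance(second.diversity, int)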

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

