Skip to content

add datasets - teaching#89

Open
xadupre wants to merge 3 commits intomainfrom
sess1
Open

add datasets - teaching#89
xadupre wants to merge 3 commits intomainfrom
sess1

Conversation

@xadupre
Copy link
Collaborator

@xadupre xadupre commented Feb 15, 2026

No description provided.

columns="Session",
values=cible,
)
# .dropna(axis=0) # fails
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot why it does not work?

Copy link

Copilot AI commented Feb 26, 2026

@xadupre I've opened a new pull request, #90, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds teaching materials for a machine learning course, including a new example script demonstrating Parcoursup data analysis (2021-2025) and several documentation updates.

Changes:

  • Added a new Python example file for analyzing Parcoursup enrollment data using scikit-learn
  • Fixed French spelling errors ("pair" → "paire", "condidatures" → "candidatures")
  • Updated teaching roadmap documentation with additional datasets and formatting improvements
  • Added RUF001 linter exception for the new example file

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
pyproject.toml Added linter exception (RUF001) for the new plot_template_data.py file
_doc/practice/index_python.rst Fixed French spelling: "pair" → "paire"
_doc/practice/index_algo.rst Fixed French spelling: "pair" → "paire"
_doc/examples/ml/plot_template_data.py New example script for Parcoursup data analysis with ML pipeline (contains multiple bugs)
_doc/articles/2026/2026-03-15-route2026-ml.rst Added skrub module, fixed typo, reformatted dataset list, and added new dataset link

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -0,0 +1,124 @@
"""
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing the '# coding: utf-8' header that is present in all other example files in the _doc/examples directory. This header is particularly important for this file since it contains French text with special characters.

Copilot uses AI. Check for mistakes.
columns = set(df.columns)
assert set(keys) & set(columns) == set(
keys
), f"Missing columns {set(keys) - set(keys) & set(columns)} in {sorted(df.columns)}"
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion logic is incorrect. The expression set(keys) - set(keys) & set(columns) will always result in an empty set due to operator precedence. The & operator has higher precedence than -, so it evaluates as set(keys) - (set(keys) & set(columns)), which removes all common keys from the set of keys. This should be set(keys) - (set(keys) & set(columns)) or more simply set(keys) - set(columns) to show the missing columns.

Suggested change
), f"Missing columns {set(keys) - set(keys) & set(columns)} in {sorted(df.columns)}"
), f"Missing columns {set(keys) - set(columns)} in {sorted(df.columns)}"

Copilot uses AI. Check for mistakes.
return mean_absolute_error(piv[2025], piv[2024])


def split_train_test(table, cuble):
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameter name has a typo: 'cuble' should be 'cible' to match the variable name used throughout the codebase and in the function body.

Suggested change
def split_train_test(table, cuble):
def split_train_test(table, cible):

Copilot uses AI. Check for mistakes.
Comment on lines +90 to +91
test_X = X[train_test].drop(drop, axis=1)
test_y = y[train_test]
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both test_X and test_y are incorrectly using the training data filter. Lines 90-91 should use ~train_test instead of train_test to select the test set (Session == 2025), otherwise the test set will be identical to the training set.

Suggested change
test_X = X[train_test].drop(drop, axis=1)
test_y = y[train_test]
test_X = X[~train_test].drop(drop, axis=1)
test_y = y[~train_test]

Copilot uses AI. Check for mistakes.


def make_pipeline(table, cible):
vars = [c for c in table.columns if c != "cible"]
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition filters by the literal string "cible" instead of using the parameter variable. This should be c != cible to correctly filter out the target column.

Suggested change
vars = [c for c in table.columns if c != "cible"]
vars = [c for c in table.columns if c != cible]

Copilot uses AI. Check for mistakes.
Comment on lines +97 to +109
num_cols = ["Capacité de l’établissement par formation"]
cat_cols = [c for c in vars if c not in num_cols]

model = Pipeline(
[
(
"preprocessing",
ColumnTransformer(
[
("num", StandardScaler(), num_cols),
("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols),
]
),
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The column "Capacité de l'établissement par formation" is not present in the 'keys' list defined in select_variables_and_clean, and therefore won't exist in the table. This will cause a KeyError when the pipeline attempts to apply StandardScaler to this non-existent column. Either add this column to the keys list in select_variables_and_clean, or use a column that is actually present in the filtered table.

Suggested change
num_cols = ["Capacité de l’établissement par formation"]
cat_cols = [c for c in vars if c not in num_cols]
model = Pipeline(
[
(
"preprocessing",
ColumnTransformer(
[
("num", StandardScaler(), num_cols),
("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols),
]
),
# Candidate numeric feature; include it only if it exists in the table to avoid KeyError.
numeric_feature = "Capacité de l’établissement par formation"
num_cols = [numeric_feature] if numeric_feature in table.columns else []
cat_cols = [c for c in vars if c not in num_cols]
transformers = []
if num_cols:
transformers.append(("num", StandardScaler(), num_cols))
if cat_cols:
transformers.append(
("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols)
)
model = Pipeline(
[
(
"preprocessing",
ColumnTransformer(transformers),

Copilot uses AI. Check for mistakes.
Comment on lines +75 to +77
# .dropna(axis=0) # fails
.sort_index()
)
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the question about why dropna fails: The pivot_table operation creates a DataFrame with multi-column index (2024 and 2025 as columns). When there are missing values in either year, dropna(axis=0) would remove those rows. The likely issue is that after removing rows with NaN values, the resulting DataFrame might be empty or nearly empty, causing the mean_absolute_error calculation on line 78 to fail. The pivot table requires matching keys between 2024 and 2025 data, and if many formations only appear in one year, many NaN values would be created. Consider using fillna(0) or only comparing formations that exist in both years by using an inner join approach in the pivot_table.

Suggested change
# .dropna(axis=0) # fails
.sort_index()
)
.sort_index()
)
# Keep only rows where both 2024 and 2025 have non-missing values
piv = piv.dropna(axis=0, how="any")
if piv.empty:
raise ValueError(
"Not enough overlapping data between 2024 and 2025 to compute oracle."
)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants