```python
        columns="Session",
        values=cible,
    )
    # .dropna(axis=0)  # fails
```
Pull request overview
This PR adds teaching materials for a machine learning course, including a new example script demonstrating Parcoursup data analysis (2021-2025) and several documentation updates.
Changes:
- Added a new Python example file for analyzing Parcoursup enrollment data using scikit-learn
- Fixed French spelling errors ("pair" → "paire", "condidatures" → "candidatures")
- Updated teaching roadmap documentation with additional datasets and formatting improvements
- Added RUF001 linter exception for the new example file
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| pyproject.toml | Added linter exception (RUF001) for the new plot_template_data.py file |
| _doc/practice/index_python.rst | Fixed French spelling: "pair" → "paire" |
| _doc/practice/index_algo.rst | Fixed French spelling: "pair" → "paire" |
| _doc/examples/ml/plot_template_data.py | New example script for Parcoursup data analysis with ML pipeline (contains multiple bugs) |
| _doc/articles/2026/2026-03-15-route2026-ml.rst | Added skrub module, fixed typo, reformatted dataset list, and added new dataset link |
```diff
@@ -0,0 +1,124 @@
+"""
```
Missing the '# coding: utf-8' header that is present in all other example files in the _doc/examples directory. This header is particularly important for this file since it contains French text with special characters.
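A minimal sketch of the expected header (the docstring text and variable below are invented for illustration):

```python
# coding: utf-8
"""Parcoursup (2021-2025) enrollment data analysis example."""

# With the encoding declared, French text with accents round-trips safely:
cible = "Capacité de l’établissement par formation"
assert "é" in cible
```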
```python
columns = set(df.columns)
assert set(keys) & set(columns) == set(
    keys
), f"Missing columns {set(keys) - set(keys) & set(columns)} in {sorted(df.columns)}"
```
The assertion's error message is incorrect. The expression set(keys) - set(keys) & set(columns) always evaluates to an empty set due to operator precedence: binary - binds more tightly than &, so it is parsed as (set(keys) - set(keys)) & set(columns), i.e. set() & set(columns), which is empty regardless of the inputs. The message should use set(keys) - (set(keys) & set(columns)), or more simply set(keys) - set(columns), to show the missing columns.
Suggested change:

```diff
-), f"Missing columns {set(keys) - set(keys) & set(columns)} in {sorted(df.columns)}"
+), f"Missing columns {set(keys) - set(columns)} in {sorted(df.columns)}"
```
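The precedence behavior is easy to verify with throwaway sets (values invented):

```python
keys = {"a", "b", "c"}
columns = {"a", "b"}

# Binary '-' binds more tightly than '&', so this parses as
# (keys - keys) & columns == set() & columns, which is always empty:
buggy = keys - keys & columns
assert buggy == set()

# The intended value: keys that are required but absent from columns.
fixed = keys - columns
assert fixed == {"c"}
```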
```python
    return mean_absolute_error(piv[2025], piv[2024])


def split_train_test(table, cuble):
```
Parameter name has a typo: 'cuble' should be 'cible' to match the variable name used throughout the codebase and in the function body.
Suggested change:

```diff
-def split_train_test(table, cuble):
+def split_train_test(table, cible):
```
```python
test_X = X[train_test].drop(drop, axis=1)
test_y = y[train_test]
```
Both test_X and test_y are incorrectly using the training data filter. Lines 90-91 should use ~train_test instead of train_test to select the test set (Session == 2025), otherwise the test set will be identical to the training set.
Suggested change:

```diff
-test_X = X[train_test].drop(drop, axis=1)
-test_y = y[train_test]
+test_X = X[~train_test].drop(drop, axis=1)
+test_y = y[~train_test]
```
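The effect of negating the boolean mask can be illustrated with a tiny invented DataFrame (only the "Session" column name comes from the script, the values are made up):

```python
import pandas as pd

df = pd.DataFrame({"Session": [2024, 2024, 2025], "y": [10, 20, 30]})
train_mask = df["Session"] < 2025

train = df[train_mask]   # training rows: sessions before 2025
test = df[~train_mask]   # complement: Session == 2025

assert list(train["Session"]) == [2024, 2024]
assert list(test["Session"]) == [2025]
```

Without the `~`, `test` would be the same rows as `train`, and the evaluation would measure training error only.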
```python
def make_pipeline(table, cible):
    vars = [c for c in table.columns if c != "cible"]
```
The condition filters by the literal string "cible" instead of using the parameter variable. This should be c != cible to correctly filter out the target column.
Suggested change:

```diff
-vars = [c for c in table.columns if c != "cible"]
+vars = [c for c in table.columns if c != cible]
```
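The difference between the literal string and the parameter can be shown with hypothetical column names:

```python
cible = "Effectif"
columns = ["Formation", "Effectif"]

buggy = [c for c in columns if c != "cible"]  # literal string: removes nothing here
fixed = [c for c in columns if c != cible]    # parameter: drops the target column

assert buggy == ["Formation", "Effectif"]
assert fixed == ["Formation"]
```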
```python
num_cols = ["Capacité de l’établissement par formation"]
cat_cols = [c for c in vars if c not in num_cols]

model = Pipeline(
    [
        (
            "preprocessing",
            ColumnTransformer(
                [
                    ("num", StandardScaler(), num_cols),
                    ("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols),
                ]
            ),
```
The column "Capacité de l'établissement par formation" is not present in the 'keys' list defined in select_variables_and_clean, and therefore won't exist in the table. This will cause a KeyError when the pipeline attempts to apply StandardScaler to this non-existent column. Either add this column to the keys list in select_variables_and_clean, or use a column that is actually present in the filtered table.
Suggested change:

```diff
-num_cols = ["Capacité de l’établissement par formation"]
-cat_cols = [c for c in vars if c not in num_cols]
-model = Pipeline(
-    [
-        (
-            "preprocessing",
-            ColumnTransformer(
-                [
-                    ("num", StandardScaler(), num_cols),
-                    ("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols),
-                ]
-            ),
+# Candidate numeric feature; include it only if it exists in the table to avoid KeyError.
+numeric_feature = "Capacité de l’établissement par formation"
+num_cols = [numeric_feature] if numeric_feature in table.columns else []
+cat_cols = [c for c in vars if c not in num_cols]
+transformers = []
+if num_cols:
+    transformers.append(("num", StandardScaler(), num_cols))
+if cat_cols:
+    transformers.append(
+        ("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols)
+    )
+model = Pipeline(
+    [
+        (
+            "preprocessing",
+            ColumnTransformer(transformers),
```
```python
    # .dropna(axis=0)  # fails
    .sort_index()
)
```
Regarding the question about why dropna fails: the pivot_table operation produces a DataFrame with one column per session (2024 and 2025). A formation that appears in only one of the two years gets a NaN in the other year's column, so if many formations exist in a single year, the pivot fills with NaNs. dropna(axis=0) then removes those rows, and the result may be empty or nearly empty, which makes the mean_absolute_error call on line 78 fail. Consider using fillna(0), or compare only the formations present in both years (an inner-join-style filter on the pivot).
Suggested change:

```diff
-    # .dropna(axis=0) # fails
-    .sort_index()
-)
+    .sort_index()
+)
+# Keep only rows where both 2024 and 2025 have non-missing values
+piv = piv.dropna(axis=0, how="any")
+if piv.empty:
+    raise ValueError(
+        "Not enough overlapping data between 2024 and 2025 to compute oracle."
+    )
```
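Why the pivot fills with NaNs can be sketched on invented data (column names from the script, values made up):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Session": [2024, 2025, 2024, 2025],
        "Formation": ["A", "A", "B", "C"],  # "B" lacks a 2025 row, "C" lacks 2024
        "y": [10.0, 12.0, 5.0, 7.0],
    }
)
piv = df.pivot_table(index="Formation", columns="Session", values="y")

# Formations observed in only one year get NaN in the other year's column.
assert piv.isna().sum().sum() == 2

# Keeping only formations present in both years makes the columns comparable.
both = piv.dropna(axis=0, how="any")
assert list(both.index) == ["A"]
```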