```python
        columns="Session",
        values=cible,
    )
    # .dropna(axis=0)  # fails
```
Pull request overview
This PR adds teaching materials for a machine learning course, including a new example script demonstrating Parcoursup data analysis (2021-2025) and several documentation updates.
Changes:
- Added a new Python example file for analyzing Parcoursup enrollment data using scikit-learn
- Fixed French spelling errors ("pair" → "paire", "condidatures" → "candidatures")
- Updated teaching roadmap documentation with additional datasets and formatting improvements
- Added RUF001 linter exception for the new example file
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| pyproject.toml | Added linter exception (RUF001) for the new plot_template_data.py file |
| _doc/practice/index_python.rst | Fixed French spelling: "pair" → "paire" |
| _doc/practice/index_algo.rst | Fixed French spelling: "pair" → "paire" |
| _doc/examples/ml/plot_template_data.py | New example script for Parcoursup data analysis with ML pipeline (contains multiple bugs) |
| _doc/articles/2026/2026-03-15-route2026-ml.rst | Added skrub module, fixed typo, reformatted dataset list, and added new dataset link |
```diff
@@ -0,0 +1,124 @@
+"""
```
Missing the '# coding: utf-8' header that is present in all other example files in the _doc/examples directory. This header is particularly important for this file since it contains French text with special characters.
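A minimal sketch of the expected header (the docstring text and variable below are invented for illustration):

```python
# coding: utf-8
"""Parcoursup (2021-2025) enrollment data analysis example."""

# With the encoding declared, French text with accents round-trips safely:
cible = "Capacité de l’établissement par formation"
assert "é" in cible
```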
```python
columns = set(df.columns)
assert set(keys) & set(columns) == set(
    keys
), f"Missing columns {set(keys) - set(keys) & set(columns)} in {sorted(df.columns)}"
```
The assertion's error message is incorrect. The expression set(keys) - set(keys) & set(columns) always evaluates to an empty set due to operator precedence: binary - binds more tightly than &, so it is parsed as (set(keys) - set(keys)) & set(columns), i.e. set() & set(columns), which is empty regardless of the inputs. The message should use set(keys) - (set(keys) & set(columns)), or more simply set(keys) - set(columns), to show the missing columns.
Suggested change:

```diff
-), f"Missing columns {set(keys) - set(keys) & set(columns)} in {sorted(df.columns)}"
+), f"Missing columns {set(keys) - set(columns)} in {sorted(df.columns)}"
```
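The precedence behavior is easy to verify with throwaway sets (values invented):

```python
keys = {"a", "b", "c"}
columns = {"a", "b"}

# Binary '-' binds more tightly than '&', so this parses as
# (keys - keys) & columns == set() & columns, which is always empty:
buggy = keys - keys & columns
assert buggy == set()

# The intended value: keys that are required but absent from columns.
fixed = keys - columns
assert fixed == {"c"}
```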
```python
    return mean_absolute_error(piv[2025], piv[2024])


def split_train_test(table, cuble):
```
Parameter name has a typo: 'cuble' should be 'cible' to match the variable name used throughout the codebase and in the function body.
Suggested change:

```diff
-def split_train_test(table, cuble):
+def split_train_test(table, cible):
```
```python
test_X = X[train_test].drop(drop, axis=1)
test_y = y[train_test]
```
Both test_X and test_y are incorrectly using the training data filter. Lines 90-91 should use ~train_test instead of train_test to select the test set (Session == 2025), otherwise the test set will be identical to the training set.
Suggested change:

```diff
-test_X = X[train_test].drop(drop, axis=1)
-test_y = y[train_test]
+test_X = X[~train_test].drop(drop, axis=1)
+test_y = y[~train_test]
```
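The effect of negating the boolean mask can be illustrated with a tiny invented DataFrame (only the "Session" column name comes from the script, the values are made up):

```python
import pandas as pd

df = pd.DataFrame({"Session": [2024, 2024, 2025], "y": [10, 20, 30]})
train_mask = df["Session"] < 2025

train = df[train_mask]   # training rows: sessions before 2025
test = df[~train_mask]   # complement: Session == 2025

assert list(train["Session"]) == [2024, 2024]
assert list(test["Session"]) == [2025]
```

Without the `~`, `test` would be the same rows as `train`, and the evaluation would measure training error only.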
```python
def make_pipeline(table, cible):
    vars = [c for c in table.columns if c != "cible"]
```
The condition filters by the literal string "cible" instead of using the parameter variable. This should be c != cible to correctly filter out the target column.
Suggested change:

```diff
-vars = [c for c in table.columns if c != "cible"]
+vars = [c for c in table.columns if c != cible]
```
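The difference between the literal string and the parameter can be shown with hypothetical column names:

```python
cible = "Effectif"
columns = ["Formation", "Effectif"]

buggy = [c for c in columns if c != "cible"]  # literal string: removes nothing here
fixed = [c for c in columns if c != cible]    # parameter: drops the target column

assert buggy == ["Formation", "Effectif"]
assert fixed == ["Formation"]
```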
```python
num_cols = ["Capacité de l’établissement par formation"]
cat_cols = [c for c in vars if c not in num_cols]

model = Pipeline(
    [
        (
            "preprocessing",
            ColumnTransformer(
                [
                    ("num", StandardScaler(), num_cols),
                    ("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols),
                ]
            ),
```
The column "Capacité de l'établissement par formation" is not present in the 'keys' list defined in select_variables_and_clean, and therefore won't exist in the table. This will cause a KeyError when the pipeline attempts to apply StandardScaler to this non-existent column. Either add this column to the keys list in select_variables_and_clean, or use a column that is actually present in the filtered table.
Suggested change:

```diff
-num_cols = ["Capacité de l’établissement par formation"]
-cat_cols = [c for c in vars if c not in num_cols]
-model = Pipeline(
-    [
-        (
-            "preprocessing",
-            ColumnTransformer(
-                [
-                    ("num", StandardScaler(), num_cols),
-                    ("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols),
-                ]
-            ),
+# Candidate numeric feature; include it only if it exists in the table to avoid KeyError.
+numeric_feature = "Capacité de l’établissement par formation"
+num_cols = [numeric_feature] if numeric_feature in table.columns else []
+cat_cols = [c for c in vars if c not in num_cols]
+transformers = []
+if num_cols:
+    transformers.append(("num", StandardScaler(), num_cols))
+if cat_cols:
+    transformers.append(
+        ("cats", OneHotEncoder(handle_unknown="ignore"), cat_cols)
+    )
+model = Pipeline(
+    [
+        (
+            "preprocessing",
+            ColumnTransformer(transformers),
```
```python
    # .dropna(axis=0)  # fails
    .sort_index()
)
```
Regarding the question about why dropna fails: the pivot_table operation produces a DataFrame with one column per session (2024 and 2025). A formation that appears in only one of the two years gets a NaN in the other year's column, so if many formations exist in a single year, the pivot fills with NaNs. dropna(axis=0) then removes those rows, and the result may be empty or nearly empty, which makes the mean_absolute_error call on line 78 fail. Consider using fillna(0), or compare only the formations present in both years (an inner-join-style filter on the pivot).
Suggested change:

```diff
-    # .dropna(axis=0) # fails
-    .sort_index()
-)
+    .sort_index()
+)
+# Keep only rows where both 2024 and 2025 have non-missing values
+piv = piv.dropna(axis=0, how="any")
+if piv.empty:
+    raise ValueError(
+        "Not enough overlapping data between 2024 and 2025 to compute oracle."
+    )
```
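Why the pivot fills with NaNs can be sketched on invented data (column names from the script, values made up):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Session": [2024, 2025, 2024, 2025],
        "Formation": ["A", "A", "B", "C"],  # "B" lacks a 2025 row, "C" lacks 2024
        "y": [10.0, 12.0, 5.0, 7.0],
    }
)
piv = df.pivot_table(index="Formation", columns="Session", values="y")

# Formations observed in only one year get NaN in the other year's column.
assert piv.isna().sum().sum() == 2

# Keeping only formations present in both years makes the columns comparable.
both = piv.dropna(axis=0, how="any")
assert list(both.index) == ["A"]
```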