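[23 Python Special Lecture] Lecture 3. Flexible Data Wrangling (An Easy Example of Method Chaining)

2024-01-06 4 min read

1. Overview: Data and Executable Code

[Assignment]

The following data is made up for practice: it stands in for the 2023 mock exam scores of students at High School A. The subjects are English, math, and history. For the first term of 2023 (term '2023-1'), compute each student's mean three-subject total score and print the names and mean total scores of the students ranked 1st to 3rd.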

import pandas as pd
import numpy as np

# Create the given dataset
df = pd.DataFrame({
    'student': ['철수', '영희', '철수', '영희', '철수', '영희', np.nan, '민수', '민수', '민수', '민지', '민지', '준호', '준호', '민지', '민수', '영희'],
    'english': [82.0, 90.0, 78.0, 92.0, 85.0, np.nan, 85.0, 88.0, 79.0, 81.0, np.nan, 91.0, 75.0, 90.0, 100.0, 99.0, 80.0],
    'math': [90.0, np.nan, 88.0, 92.0, np.nan, np.nan, 84.0, 82.0, 80.0, 78.0, 77.0, 79.0, 83.0, 85.0, 95.0, 95.0, 70.0],
    'history': [80.0, 85.0, 87.0, 89.0, 91.0, 93.0, 76.0, 78.0, 80.0, np.nan, 75.0, 77.0, 72.0, 74.0, 60.0, 90.0, 100.0],
    'term': ['2023-1', '2023-1', '2023-1', '2023-1', '2023-2', '2023-2', '2023-1', '2023-1', '2023-2', '2023-2', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1']
})


# Process the data with a pandas method chain
result = df.dropna(subset=['student', 'english', 'math', 'history']) \
           .query("term == '2023-1'") \
           .assign(total_score = lambda x: x['english'] + x['math'] + x['history']) \
           .groupby('student') \
           .agg(mean_score = ('total_score', 'mean')) \
           .sort_values(by='mean_score', ascending=False) \
           .head(3)
         
result

	mean_score
student
민수	266.0
영희	261.5
철수	252.5

2. Practice Data for Method Chaining

# Create the given dataset
df = pd.DataFrame({
    'student': ['철수', '영희', '철수', '영희', '철수', '영희', np.nan, '민수', '민수', '민수', '민지', '민지', '준호', '준호', '민지', '민수', '영희'],
    'english': [82.0, 90.0, 78.0, 92.0, 85.0, np.nan, 85.0, 88.0, 79.0, 81.0, np.nan, 91.0, 75.0, 90.0, 100.0, 99.0, 80.0],
    'math': [90.0, np.nan, 88.0, 92.0, np.nan, np.nan, 84.0, 82.0, 80.0, 78.0, 77.0, 79.0, 83.0, 85.0, 95.0, 95.0, 70.0],
    'history': [80.0, 85.0, 87.0, 89.0, 91.0, 93.0, 76.0, 78.0, 80.0, np.nan, 75.0, 77.0, 72.0, 74.0, 60.0, 90.0, 100.0],
    'term': ['2023-1', '2023-1', '2023-1', '2023-1', '2023-2', '2023-2', '2023-1', '2023-1', '2023-2', '2023-2', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1']
})
df

	student	english	math	history	term
0	철수	82.0	90.0	80.0	2023-1
1	영희	90.0	NaN	85.0	2023-1
2	철수	78.0	88.0	87.0	2023-1
3	영희	92.0	92.0	89.0	2023-1
4	철수	85.0	NaN	91.0	2023-2
5	영희	NaN	NaN	93.0	2023-2
6	NaN	85.0	84.0	76.0	2023-1
7	민수	88.0	82.0	78.0	2023-1
8	민수	79.0	80.0	80.0	2023-2
9	민수	81.0	78.0	NaN	2023-2
10	민지	NaN	77.0	75.0	2023-1
11	민지	91.0	79.0	77.0	2023-1
12	준호	75.0	83.0	72.0	2023-1
13	준호	90.0	85.0	74.0	2023-1
14	민지	100.0	95.0	60.0	2023-1
15	민수	99.0	95.0	90.0	2023-1
16	영희	80.0	70.0	100.0	2023-1

3. Method Chaining Code

# Process the data with a pandas method chain
result = df.dropna(subset=['student', 'english', 'math', 'history']) \
           .query("term == '2023-1'") \
           .assign(total_score = lambda x: x['english'] + x['math'] + x['history']) \
           .groupby('student') \
           .agg(mean_score = ('total_score', 'mean')) \
           .sort_values(by='mean_score', ascending=False) \
           .head(3)
         
result

	mean_score
student
민수	266.0
영희	261.5
철수	252.5
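
The chain above can also be wrapped in parentheses instead of ending each line with a backslash, and `DataFrame.pipe` can record what happens at each step without breaking the chain. The following is only a sketch under assumptions: `scores`, `shapes`, and `log_rows` are illustrative names that do not appear in the original lecture, and the helper simply notes the row count and returns the frame unchanged.

```python
import numpy as np
import pandas as pd

# Same practice data as in the lecture (variable name `scores` is illustrative)
scores = pd.DataFrame({
    'student': ['철수', '영희', '철수', '영희', '철수', '영희', np.nan, '민수', '민수', '민수', '민지', '민지', '준호', '준호', '민지', '민수', '영희'],
    'english': [82.0, 90.0, 78.0, 92.0, 85.0, np.nan, 85.0, 88.0, 79.0, 81.0, np.nan, 91.0, 75.0, 90.0, 100.0, 99.0, 80.0],
    'math': [90.0, np.nan, 88.0, 92.0, np.nan, np.nan, 84.0, 82.0, 80.0, 78.0, 77.0, 79.0, 83.0, 85.0, 95.0, 95.0, 70.0],
    'history': [80.0, 85.0, 87.0, 89.0, 91.0, 93.0, 76.0, 78.0, 80.0, np.nan, 75.0, 77.0, 72.0, 74.0, 60.0, 90.0, 100.0],
    'term': ['2023-1', '2023-1', '2023-1', '2023-1', '2023-2', '2023-2', '2023-1', '2023-1', '2023-2', '2023-2', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1']
})

shapes = []  # collected (label, row count) pairs


def log_rows(frame, label):
    """Hypothetical helper: record the row count, then pass the frame through."""
    shapes.append((label, len(frame)))
    return frame


result = (
    scores
    .pipe(log_rows, 'raw')
    .dropna(subset=['student', 'english', 'math', 'history'])
    .pipe(log_rows, 'after dropna')
    .query("term == '2023-1'")
    .pipe(log_rows, 'after query')
    .assign(total_score=lambda x: x['english'] + x['math'] + x['history'])
    .groupby('student')
    .agg(mean_score=('total_score', 'mean'))
    .sort_values(by='mean_score', ascending=False)
    .head(3)
)

print(shapes)
print(result)
```

With this data the recorded row counts go from 17 to 11 to 10, which matches the intermediate tables shown step by step in section 4.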

4. Step-by-Step Review of the Code

4.1. Dropping missing values

result = df.dropna(subset=['student', 'english', 'math', 'history'])
result

	student	english	math	history	term
0	철수	82.0	90.0	80.0	2023-1
2	철수	78.0	88.0	87.0	2023-1
3	영희	92.0	92.0	89.0	2023-1
7	민수	88.0	82.0	78.0	2023-1
8	민수	79.0	80.0	80.0	2023-2
11	민지	91.0	79.0	77.0	2023-1
12	준호	75.0	83.0	72.0	2023-1
13	준호	90.0	85.0	74.0	2023-1
14	민지	100.0	95.0	60.0	2023-1
15	민수	99.0	95.0	90.0	2023-1
16	영희	80.0	70.0	100.0	2023-1
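
To see why six of the 17 rows disappear at this step, it can help to count the missing values per column first. This is a minimal sketch on the same data; the names `scores`, `cols`, `missing_per_column`, and `complete` are illustrative and not part of the original lecture.

```python
import numpy as np
import pandas as pd

# Same practice data as in the lecture (variable name `scores` is illustrative)
scores = pd.DataFrame({
    'student': ['철수', '영희', '철수', '영희', '철수', '영희', np.nan, '민수', '민수', '민수', '민지', '민지', '준호', '준호', '민지', '민수', '영희'],
    'english': [82.0, 90.0, 78.0, 92.0, 85.0, np.nan, 85.0, 88.0, 79.0, 81.0, np.nan, 91.0, 75.0, 90.0, 100.0, 99.0, 80.0],
    'math': [90.0, np.nan, 88.0, 92.0, np.nan, np.nan, 84.0, 82.0, 80.0, 78.0, 77.0, 79.0, 83.0, 85.0, 95.0, 95.0, 70.0],
    'history': [80.0, 85.0, 87.0, 89.0, 91.0, 93.0, 76.0, 78.0, 80.0, np.nan, 75.0, 77.0, 72.0, 74.0, 60.0, 90.0, 100.0],
    'term': ['2023-1', '2023-1', '2023-1', '2023-1', '2023-2', '2023-2', '2023-1', '2023-1', '2023-2', '2023-2', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1']
})

cols = ['student', 'english', 'math', 'history']

# Number of NaN cells in each column that dropna will check
missing_per_column = scores[cols].isna().sum()

# A row is dropped if ANY of the listed columns is NaN (the default how='any')
complete = scores.dropna(subset=cols)

print(missing_per_column)
print(len(scores), '->', len(complete))
```

There are seven missing cells in total but only six dropped rows, because row 5 is missing both `english` and `math`.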

4.2. Extracting term 2023-1

result = df.dropna(subset=['student', 'english', 'math', 'history']) \
           .query("term == '2023-1'")
result

	student	english	math	history	term
0	철수	82.0	90.0	80.0	2023-1
2	철수	78.0	88.0	87.0	2023-1
3	영희	92.0	92.0	89.0	2023-1
7	민수	88.0	82.0	78.0	2023-1
11	민지	91.0	79.0	77.0	2023-1
12	준호	75.0	83.0	72.0	2023-1
13	준호	90.0	85.0	74.0	2023-1
14	민지	100.0	95.0	60.0	2023-1
15	민수	99.0	95.0	90.0	2023-1
16	영희	80.0	70.0	100.0	2023-1
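
For readers more used to boolean indexing, the `query` call in this step selects the same rows as a boolean mask. The sketch below uses a tiny made-up frame (`toy`, with placeholder student labels) rather than the lecture data, just to show the equivalence; the `.loc` form with a lambda is one way to keep a mask inside a method chain.

```python
import pandas as pd

# Tiny illustrative frame, not the lecture data
toy = pd.DataFrame({
    'student': ['A', 'A', 'B'],
    'term': ['2023-1', '2023-2', '2023-1'],
})

# Three ways to keep only the 2023-1 rows
by_query = toy.query("term == '2023-1'")             # string expression
by_mask = toy[toy['term'] == '2023-1']               # classic boolean mask
by_loc = toy.loc[lambda d: d['term'] == '2023-1']    # chain-friendly mask

print(by_query)
print(by_query.equals(by_mask), by_query.equals(by_loc))
```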

4.3. Adding a derived variable for the sum of scores (= total score)

result = df.dropna(subset=['student', 'english', 'math', 'history']) \
           .query("term == '2023-1'") \
           .assign(total_score = lambda x: x['english'] + x['math'] + x['history'])
result

	student	english	math	history	term	total_score
0	철수	82.0	90.0	80.0	2023-1	252.0
2	철수	78.0	88.0	87.0	2023-1	253.0
3	영희	92.0	92.0	89.0	2023-1	273.0
7	민수	88.0	82.0	78.0	2023-1	248.0
11	민지	91.0	79.0	77.0	2023-1	247.0
12	준호	75.0	83.0	72.0	2023-1	230.0
13	준호	90.0	85.0	74.0	2023-1	249.0
14	민지	100.0	95.0	60.0	2023-1	255.0
15	민수	99.0	95.0	90.0	2023-1	284.0
16	영희	80.0	70.0	100.0	2023-1	250.0
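
One detail worth noting about this step: adding the columns with `+` propagates NaN, while `DataFrame.sum(axis=1)` skips NaN by default. In the lecture's chain this makes no difference because `dropna` runs first, but on raw data the two give different totals. A minimal sketch on a two-row made-up frame (`toy`, `subjects`, `total_plus`, and `total_sum` are illustrative names):

```python
import numpy as np
import pandas as pd

# Two illustrative rows: one complete, one with a missing math score
toy = pd.DataFrame({
    'english': [82.0, 90.0],
    'math': [90.0, np.nan],
    'history': [80.0, 85.0],
})
subjects = ['english', 'math', 'history']

out = toy.assign(
    # `+` yields NaN as soon as one operand is NaN
    total_plus=lambda x: x['english'] + x['math'] + x['history'],
    # sum(axis=1) ignores NaN by default (skipna=True)
    total_sum=lambda x: x[subjects].sum(axis=1),
)

print(out)
```

For the second row `total_plus` is NaN while `total_sum` is 175.0, so a student with a missing score would silently get a low total if the rows were not dropped beforehand.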

4.4. Grouping by student

result = df.dropna(subset=['student', 'english', 'math', 'history']) \
           .query("term == '2023-1'") \
           .assign(total_score = lambda x: x['english'] + x['math'] + x['history']) \
           .groupby('student')
result.groups

{'민수': [7, 15], '민지': [11, 14], '영희': [3, 16], '준호': [12, 13], '철수': [0, 2]}
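
At this point `result` is a GroupBy object rather than a DataFrame, which is why the step inspects `.groups` instead of printing a table. As a rough sketch, a group can also be looked at with `size`, `get_group`, or plain iteration. The four rows below are copied from the 4.3 output for two students; the names `toy`, `grouped`, `sizes`, and `minsu` are illustrative.

```python
import pandas as pd

# Four rows taken from the 4.3 output, keeping their original index labels
toy = pd.DataFrame(
    {
        'student': ['민수', '영희', '민수', '영희'],
        'total_score': [248.0, 273.0, 284.0, 250.0],
    },
    index=[7, 3, 15, 16],
)

grouped = toy.groupby('student')

sizes = grouped.size()              # number of rows per student
minsu = grouped.get_group('민수')   # the sub-frame for one student

# Iterating yields (group key, sub-frame) pairs
for name, frame in grouped:
    print(name, frame['total_score'].tolist())
```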

4.5. Mean total score per student

result = df.dropna(subset=['student', 'english', 'math', 'history']) \
           .query("term == '2023-1'") \
           .assign(total_score = lambda x: x['english'] + x['math'] + x['history']) \
           .groupby('student') \
           .agg(mean_score = ('total_score', 'mean'))
result

	mean_score
student
민수	266.0
민지	251.0
영희	261.5
준호	239.5
철수	252.5
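
The `agg(mean_score=('total_score', 'mean'))` call uses pandas named aggregation: the keyword is the output column name, and the tuple is (input column, aggregation function). The sketch below, on four rows copied from the 4.3 output, compares it with the shorter column-selection form and shows that several named aggregations can be requested at once. The extra columns `best_score` and `n_exams` are illustrative additions, not part of the lecture.

```python
import pandas as pd

# Four rows taken from the 4.3 output
toy = pd.DataFrame({
    'student': ['민수', '영희', '민수', '영희'],
    'total_score': [248.0, 273.0, 284.0, 250.0],
})

# Named aggregation: returns a DataFrame with a column called mean_score
named = toy.groupby('student').agg(mean_score=('total_score', 'mean'))

# Shorter form: returns a Series that keeps the name total_score
short = toy.groupby('student')['total_score'].mean()

# Several named aggregations in one call (extra columns are illustrative)
summary = toy.groupby('student').agg(
    mean_score=('total_score', 'mean'),
    best_score=('total_score', 'max'),
    n_exams=('total_score', 'count'),
)

print(named)
print(short)
print(summary)
```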

4.6. Sorting by mean total score

result = df.dropna(subset=['student', 'english', 'math', 'history']) \
           .query("term == '2023-1'") \
           .assign(total_score = lambda x: x['english'] + x['math'] + x['history']) \
           .groupby('student') \
           .agg(mean_score = ('total_score', 'mean')) \
           .sort_values(by='mean_score', ascending=False)
result

	mean_score
student
민수	266.0
영희	261.5
철수	252.5
민지	251.0
준호	239.5
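
Since the assignment asks for the students ranked 1st to 3rd, an explicit rank column can make the ordering visible, and it behaves predictably if two students ever tie. This is only a sketch: the `rank` column and the `method='min'` tie rule are assumptions added here, and the frame `means` is rebuilt by hand from the 4.5 output.

```python
import pandas as pd

# Mean scores copied from the 4.5 output
means = pd.DataFrame(
    {'mean_score': [266.0, 251.0, 261.5, 239.5, 252.5]},
    index=pd.Index(['민수', '민지', '영희', '준호', '철수'], name='student'),
)

ranked = (
    means
    .sort_values(by='mean_score', ascending=False)
    # method='min' gives tied students the same (best) rank
    .assign(rank=lambda d: d['mean_score']
            .rank(ascending=False, method='min')
            .astype(int))
)

print(ranked)
```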

4.7. Printing the top three students

result = df.dropna(subset=['student', 'english', 'math', 'history']) \
           .query("term == '2023-1'") \
           .assign(total_score = lambda x: x['english'] + x['math'] + x['history']) \
           .groupby('student') \
           .agg(mean_score = ('total_score', 'mean')) \
           .sort_values(by='mean_score', ascending=False) \
           .head(3)
         
result

	mean_score
student
민수	266.0
영희	261.5
철수	252.5
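
As a possible variation on the last two steps, `nlargest` combines the sort and the `head(3)`, and `reset_index` turns the student names from the index into an ordinary column, which can be convenient when the names and scores need to be printed or exported together. This is a sketch under assumptions, not the lecture's own solution; `scores` and `top3` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same practice data as in the lecture (variable name `scores` is illustrative)
scores = pd.DataFrame({
    'student': ['철수', '영희', '철수', '영희', '철수', '영희', np.nan, '민수', '민수', '민수', '민지', '민지', '준호', '준호', '민지', '민수', '영희'],
    'english': [82.0, 90.0, 78.0, 92.0, 85.0, np.nan, 85.0, 88.0, 79.0, 81.0, np.nan, 91.0, 75.0, 90.0, 100.0, 99.0, 80.0],
    'math': [90.0, np.nan, 88.0, 92.0, np.nan, np.nan, 84.0, 82.0, 80.0, 78.0, 77.0, 79.0, 83.0, 85.0, 95.0, 95.0, 70.0],
    'history': [80.0, 85.0, 87.0, 89.0, 91.0, 93.0, 76.0, 78.0, 80.0, np.nan, 75.0, 77.0, 72.0, 74.0, 60.0, 90.0, 100.0],
    'term': ['2023-1', '2023-1', '2023-1', '2023-1', '2023-2', '2023-2', '2023-1', '2023-1', '2023-2', '2023-2', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1', '2023-1']
})

top3 = (
    scores
    .dropna(subset=['student', 'english', 'math', 'history'])
    .query("term == '2023-1'")
    .assign(total_score=lambda x: x['english'] + x['math'] + x['history'])
    .groupby('student')
    .agg(mean_score=('total_score', 'mean'))
    .nlargest(3, 'mean_score')   # replaces sort_values(...).head(3)
    .reset_index()               # student becomes a regular column
)

# Print one "name score" line per student
for row in top3.itertuples(index=False):
    print(row.student, row.mean_score)
```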
