[Likelion AI School, 7th Cohort] EDA2

2022-10-11, 2 min read

EDA

Pandas

Tidy Data
Clean data, that is, data that is easy to analyze.
Data arranged so that each variable is a column and each observation is a row. - Hadley Wickham

Making tidy data with melt
pd.melt(df, id_vars, value_vars, var_name, value_name)
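As a rough illustration of the melt call above, the sketch below reshapes a small wide table into tidy form. The DataFrame and its column names (city, the year columns, sales) are made up for the example and are not from the class material.

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per year.
wide = pd.DataFrame({
    "city": ["Seoul", "Busan"],
    "2021": [10, 20],
    "2022": [11, 21],
})

# Unpivot the year columns so that each row is a single observation.
tidy = pd.melt(
    wide,
    id_vars=["city"],              # columns kept as identifiers
    value_vars=["2021", "2022"],   # columns to unpivot
    var_name="year",               # name of the new "variable" column
    value_name="sales",            # name of the new "value" column
)
print(tidy)
```

The result has one row per (city, year) pair, which is the shape most plotting and groupby code expects.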

Subset Observations - rows

df.nlargest(n, "value")
df.nsmallest(n, "value")
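A small sketch of the two calls, assuming a toy DataFrame with a "value" column (the data is invented for illustration):

```python
import pandas as pd

# Hypothetical data; only the "value" column is used for ranking.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "value": [3, 9, 1, 7]})

top2 = df.nlargest(2, "value")      # rows with the 2 largest values
bottom2 = df.nsmallest(2, "value")  # rows with the 2 smallest values

print(top2)
print(bottom2)
```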

Subset Variables - columns

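Select a single column by name

df["colname"]
df.colname * works only when the name is a valid Python identifier (no special characters or spaces) and does not clash with an existing DataFrame attribute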
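The difference between the two access styles can be sketched as follows. The column names here are hypothetical and chosen to show the cases where attribute access breaks down.

```python
import pandas as pd

# Hypothetical columns: a plain name, a name with a space,
# and a name that collides with a DataFrame method.
df = pd.DataFrame({"price": [1, 2], "unit price": [3, 4], "count": [5, 6]})

by_key = df["price"]       # bracket access always works
by_attr = df.price         # attribute access works for simple names

spaced = df["unit price"]  # df.unit price would be a syntax error
counts = df["count"]       # df.count is the DataFrame.count method, not the column

print(by_key.equals(by_attr))
print(callable(df.count))
```

Bracket access is the safer default; attribute access is only a convenience for simple names.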

Reshaping Data

Pandas Crosstab

normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
Normalize by dividing all values by the sum of values, i.e. count for the cell / total count.

If passed 'all' or True, will normalize over all values.
If passed 'index' will normalize over each row.
If passed 'columns' will normalize over each column.
If margins is True, will also normalize margin values.
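To make the normalize options concrete, here is a minimal sketch on an invented two-column DataFrame (the sex and smoker columns are assumptions for the example):

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M"],
    "smoker": ["yes", "no", "no", "no", "yes"],
})

counts = pd.crosstab(df["sex"], df["smoker"])                          # raw frequencies
share_all = pd.crosstab(df["sex"], df["smoker"], normalize="all")      # whole table sums to 1
share_row = pd.crosstab(df["sex"], df["smoker"], normalize="index")    # each row sums to 1
share_col = pd.crosstab(df["sex"], df["smoker"], normalize="columns")  # each column sums to 1

print(counts)
print(share_row)
```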

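Dropping columns

df.drop(labels=["col1", "col2"], axis=1)
df.drop(columns=["col1", "col2"])
With labels, axis must be set explicitly (axis=1 or axis="columns"), because the default axis=0 looks for the labels in the row index.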
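A short sketch of both spellings, plus what happens when axis is left out. The column names follow the snippet above; the data itself is invented.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})

a = df.drop(labels=["col1", "col2"], axis=1)  # labels + explicit axis
b = df.drop(columns=["col1", "col2"])         # columns keyword, no axis needed

# Without axis, pandas searches the row index and fails to find the labels.
missing_axis_error = None
try:
    df.drop(labels=["col1", "col2"])
except KeyError as exc:
    missing_axis_error = exc

print(a.equals(b))
print(type(missing_axis_error).__name__)
```

Note that drop returns a new DataFrame; df itself still has all three columns.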

Renaming columns

df.columns = colname_list  (a list with one name per column, in order)
df = df.rename(columns={"old_name": "new_name"})
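Both approaches can be sketched like this, assuming a toy DataFrame with columns "a" and "b" (names chosen only for the example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# rename changes only the columns listed in the mapping and returns a copy.
renamed = df.rename(columns={"a": "alpha"})

# Assigning to .columns replaces every name at once; the list length
# must match the number of columns.
relabeled = df.copy()
relabeled.columns = ["x", "y"]

print(renamed.columns.tolist())
print(relabeled.columns.tolist())
```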

Changing data types

pd.to_numeric
- errors : {'ignore', 'raise', 'coerce'}, default 'raise'
  If 'raise', then invalid parsing will raise an exception.
  If 'coerce', then invalid parsing will be set as NaN.
  If 'ignore', then invalid parsing will return the input (deprecated since pandas 2.2).
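The sketch below contrasts the default 'raise' behavior with 'coerce' on an invented Series that contains one unparseable string:

```python
import pandas as pd

# Hypothetical input with one value that is not a number.
s = pd.Series(["1", "2.5", "abc"])

# errors="coerce": unparseable entries become NaN.
coerced = pd.to_numeric(s, errors="coerce")

# errors="raise" (the default): the same input raises ValueError.
raised = None
try:
    pd.to_numeric(s)
except ValueError as exc:
    raised = exc

print(coerced)
print(type(raised).__name__)
```

A common follow-up is to inspect coerced.isna() to see which rows failed to parse before deciding how to handle them.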

Series

Handling

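Method	Series	DataFrame	How the target is matched
replace	O	O	Replaces only when the whole value matches (with regex=True a partial match is enough)
str.replace	O	X	Replaces any matching part of the string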
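The matching behavior in the table can be checked with a small sketch. The Series values are invented for the example.

```python
import pandas as pd

s = pd.Series(["apple", "apple pie", "pie"])

# replace: only values that are exactly "apple" are swapped.
whole = s.replace("apple", "X")

# replace with regex=True: the pattern may match part of a value.
with_regex = s.replace("apple", "X", regex=True)

# str.replace: substring replacement, available on Series only.
partial = s.str.replace("apple", "X", regex=False)

print(whole.tolist())       # ['X', 'apple pie', 'pie']
print(with_regex.tolist())  # ['X', 'X pie', 'pie']
print(partial.tolist())     # ['X', 'X pie', 'pie']
```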

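Notes

Regular expressions
The metacharacters are worth memorizing.
Pass regex=True to have replace treat the pattern as a regular expression.
Series Accessor
The .str accessor is available only on a Series that holds string values; on other dtypes it raises an AttributeError.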

str.split(pat, expand=True)
expand : bool, default False
Expand the split strings into separate columns.

If True, return DataFrame/MultiIndex expanding dimensionality.
If False, return Series/Index, containing lists of strings.
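A minimal sketch of the expand parameter, assuming a hypothetical Series of date-like strings split on "-":

```python
import pandas as pd

s = pd.Series(["2022-10-11", "2023-01-02"])

as_lists = s.str.split("-")                  # Series whose elements are lists
as_columns = s.str.split("-", expand=True)   # DataFrame, one column per piece

print(as_lists)
print(as_columns)
```

With expand=True the pieces are still strings, so a conversion such as astype(int) is usually the next step when the parts are numeric.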

Pay attention to what kind of object each function is applied to and what shape it returns.

Finding values (Filtering)

Series.isin, DataFrame.isin
Series.str.contains
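The filtering helpers listed above might be used as in the following sketch. The DataFrame and its city and note columns are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Seoul", "Busan", "Daegu"],
    "note": ["capital city", "port city", "inland"],
})

# Series.isin: keep rows whose value is in a given list.
in_list = df[df["city"].isin(["Seoul", "Daegu"])]

# Series.str.contains: keep rows whose string contains a substring.
has_city = df[df["note"].str.contains("city")]

# DataFrame.isin: element-wise boolean table with the same shape as df.
mask = df.isin(["Seoul", "inland"])

print(in_list)
print(has_city)
print(mask)
```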

Today's odds and ends

Visualization

Seaborn

heatmap
pairplot: draws a scatter plot for every pair of columns, and a histogram of the column in each diagonal cell, where a column meets itself.

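Adjusting the legend position
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))

loc : which part of the legend is placed at the anchor; on its own it positions the legend inside the axes
bbox_to_anchor : the anchor point (or box) in axes coordinates; values outside the 0 to 1 range move the legend outside the axes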
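As a sketch of how the two arguments combine, the example below pins the legend's upper-left corner just outside the right edge of the axes. The plotted lines and the exact anchor values are arbitrary choices for illustration, and the Agg backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], label="y = x^2")
ax.plot([0, 1, 2], [0, 1, 2], label="y = x")

# The legend's upper-left corner is placed at x=1.02, y=1.0 in axes
# coordinates, i.e. slightly to the right of the plotting area.
legend = plt.legend(loc="upper left", bbox_to_anchor=(1.02, 1.0))

fig.canvas.draw()
```

When saving such a figure, passing bbox_inches="tight" to savefig helps keep the outside legend from being cut off.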

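annot : whether to write the data value in each cell (heatmap)
fmt : format string for the annotations; the default ".2g" can switch to scientific notation, so pass e.g. ".2f" for fixed decimals
cmap : colormap; list the available names with print(plt.colormaps())
palette : colors used for the categories in categorical plots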
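A minimal heatmap sketch using annot, fmt, and cmap, assuming seaborn is installed. The DataFrame is invented, and "coolwarm" is just one of the colormap names that matplotlib ships.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric data; the heatmap shows its correlation matrix.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 1]})
corr = df.corr()

# annot=True writes each value into its cell, fmt controls how it is printed.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")

plt.gcf().canvas.draw()
```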

pointplot : draws point estimates with error bars; the ci parameter has been replaced by errorbar

hist => kde(density) => violin
scatter => strip => swarm

Plotly

plotly.express
px.histogram : works much like seaborn's barplot
histfunc : similar to seaborn's estimator parameter
histfunc: str (default 'count' if no arguments are provided, else 'sum')
One of 'count', 'sum', 'avg', 'min', or 'max'. Function used to aggregate values for summarization (note: can be normalized with histnorm).
The arguments to this function are the values of y(x) if orientation is 'v'('h').

color
barmode
facet_row, facet_col
marginal

option

Show all columns when displaying a DataFrame
pd.options.display.max_columns = None
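The setting can be sketched on a wide DataFrame as below. The 30-column frame is made up, and pd.option_context is shown as an alternative for when the change should only apply temporarily.

```python
import pandas as pd

# Hypothetical wide frame with more columns than the default display limit.
wide = pd.DataFrame([list(range(30))])

# Temporary change: the option is restored when the block exits.
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide)

# Global change for the rest of the session.
pd.options.display.max_columns = None
print(pd.get_option("display.max_columns"))
```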

Sources

- Tidy Data
https://vita.had.co.nz/papers/tidy-data.pdf
- Pandas Cheat Sheet
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- Pandas Crosstab
https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
- Regular expressions
https://ko.wikipedia.org/wiki/%EC%A0%95%EA%B7%9C_%ED%91%9C%ED%98%84%EC%8B%9D
- Pandas Style
https://pandas.pydata.org/docs/reference/style.html
- pairplot
https://velog.io/@addison/%EB%8D%B0%EC%9D%B4%ED%84%B0-%EB%B6%84%EC%84%9D-3-7-%ED%83%90%EC%83%89%EC%A0%81-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EB%B6%84%EC%84%9D-%EC%83%81%EA%B4%80%EA%B4%80%EA%B3%84-%EB%B6%84%EC%84%9D
- Seaborn
https://seaborn.pydata.org/tutorial/function_overview.html#
- Legend position
https://dailyheumsi.tistory.com/97

Post notice

This post covers class material from the Likelion (멋쟁이 사자처럼) AI School.
