IT박스

목록의 팬더 열, 각 목록 요소에 대한 행을 만듭니다.

itboxs 2020. 7. 10. 08:12
반응형

목록의 팬더 열, 각 목록 요소에 대한 행을 만듭니다.


일부 셀에 여러 값 목록이 포함 된 데이터 프레임이 있습니다. 셀에 여러 값을 저장하는 대신 목록의 각 항목이 다른 모든 열의 동일한 값을 가진 자체 행을 갖도록 데이터 프레임을 확장하고 싶습니다. 그래서 내가 가지고 있다면 :

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'trial_num': [1, 2, 3, 1, 2, 3],
     'subject': [1, 1, 1, 2, 2, 2],
     'samples': [list(np.random.randn(3).round(2)) for i in range(6)]
    }
)

df
Out[10]: 
                 samples  subject  trial_num
0    [0.57, -0.83, 1.44]        1          1
1    [-0.01, 1.13, 0.36]        1          2
2   [1.18, -1.46, -0.94]        1          3
3  [-0.08, -4.22, -2.05]        2          1
4     [0.72, 0.79, 0.53]        2          2
5    [0.4, -0.32, -0.13]        2          3

긴 형식으로 변환하는 방법 (예 :

   subject  trial_num  sample  sample_num
0        1          1    0.57           0
1        1          1   -0.83           1
2        1          1    1.44           2
3        1          2   -0.01           0
4        1          2    1.13           1
5        1          2    0.36           2
6        1          3    1.18           0
# etc.

인덱스는 중요하지 않으므로 기존 열을 인덱스로 설정해도되며 최종 순서는 중요하지 않습니다.


lst_col = 'samples'

r = pd.DataFrame({
      col:np.repeat(df[col].values, df[lst_col].str.len())
      for col in df.columns.drop(lst_col)}
    ).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]

결과:

In [103]: r
Out[103]:
    samples  subject  trial_num
0      0.10        1          1
1     -0.20        1          1
2      0.05        1          1
3      0.25        1          2
4      1.32        1          2
5     -0.17        1          2
6      0.64        1          3
7     -0.22        1          3
8     -0.71        1          3
9     -0.03        2          1
10    -0.65        2          1
11     0.76        2          1
12     1.77        2          2
13     0.89        2          2
14     0.65        2          2
15    -0.98        2          3
16     0.65        2          3
17    -0.30        2          3

추신 : 여기에 좀 더 일반적인 해결책이 있습니다.


업데이트 : 일부 설명 : IMO이 코드를 이해하는 가장 쉬운 방법은 단계별로 실행하는 것입니다.

다음 줄에서 한 열에 값을 반복합니다. N여기서 N-는 해당 목록의 길이입니다.

In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)

스칼라 값을 포함하는 모든 열에 대해 일반화 할 수 있습니다.

In [11]: pd.DataFrame({
    ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
    ...:           for col in df.columns.drop(lst_col)}
    ...:         )
Out[11]:
    trial_num  subject
0           1        1
1           1        1
2           1        1
3           2        1
4           2        1
5           2        1
6           3        1
..        ...      ...
11          1        2
12          2        2
13          2        2
14          2        2
15          3        2
16          3        2
17          3        2

[18 rows x 2 columns]

를 사용하여 열 ( ) np.concatenate()의 모든 값을 평평하게 하고 1D 벡터를 얻을 수 있습니다.listsamples

In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32,  0.82, -0.59, -0.34,  0.25,  2.09,  0.12,  0.83, -0.88,  0.68,  0.55, -0.56,  0.65, -0.04,  0.36, -0.31])

이 모든 것을 하나로 모으기 :

In [13]: pd.DataFrame({
    ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
    ...:           for col in df.columns.drop(lst_col)}
    ...:         ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
    trial_num  subject  samples
0           1        1    -1.04
1           1        1    -0.58
2           1        1    -1.32
3           2        1     0.82
4           2        1    -0.59
5           2        1    -0.34
6           3        1     0.25
..        ...      ...      ...
11          1        2     0.68
12          2        2     0.55
13          2        2    -0.56
14          2        2     0.65
15          3        2    -0.04
16          3        2     0.36
17          3        2    -0.31

[18 rows x 3 columns]

를 사용 pd.DataFrame()[df.columns]하면 원래 순서대로 열을 선택할 수 있습니다.


예상보다 조금 길다 :

>>> df
                samples  subject  trial_num
0  [-0.07, -2.9, -2.44]        1          1
1   [-1.52, -0.35, 0.1]        1          2
2  [-0.17, 0.57, -0.65]        1          3
3  [-0.82, -1.06, 0.47]        2          1
4   [0.79, 1.35, -0.09]        2          2
5   [1.17, 1.14, -1.79]        2          3
>>>
>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'sample'
>>>
>>> df.drop('samples', axis=1).join(s)
   subject  trial_num  sample
0        1          1   -0.07
0        1          1   -2.90
0        1          1   -2.44
1        1          2   -1.52
1        1          2   -0.35
1        1          2    0.10
2        1          3   -0.17
2        1          3    0.57
2        1          3   -0.65
3        2          1   -0.82
3        2          1   -1.06
3        2          1    0.47
4        2          2    0.79
4        2          2    1.35
4        2          2   -0.09
5        2          3    1.17
5        2          3    1.14
5        2          3   -1.79

순차 색인을 원하는 경우 reset_index(drop=True)결과에 적용 할 수 있습니다.

업데이트 :

>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
>>> res = res.reset_index()
>>> res.columns = ['subject','trial_num','sample_num','sample']
>>> res
    subject  trial_num  sample_num  sample
0         1          1           0    1.89
1         1          1           1   -2.92
2         1          1           2    0.34
3         1          2           0    0.85
4         1          2           1    0.24
5         1          2           2    0.72
6         1          3           0   -0.96
7         1          3           1   -2.72
8         1          3           2   -0.11
9         2          1           0   -1.33
10        2          1           1    3.13
11        2          1           2   -0.65
12        2          2           0    0.10
13        2          2           1    0.65
14        2          2           2    0.15
15        2          3           0    0.64
16        2          3           1   -0.10
17        2          3           2   -0.76

팬더> = 0.25

Series 및 DataFrame 메서드는 .explode()목록을 별도의 행으로 분해하는 메서드를 정의합니다 . 목록 같은 열 분해 에 대한 문서 섹션을 참조하십시오 .

df = pd.DataFrame({
    'var1': [['a', 'b', 'c'], ['d', 'e',], [], np.nan], 
    'var2': [1, 2, 3, 4]
})
df
        var1  var2
0  [a, b, c]     1
1     [d, e]     2
2         []     3
3        NaN     4

df.explode('var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
2  NaN     3  # empty list converted to NaN
3  NaN     4  # NaN entry preserved as-is

# to reset the index to be monotonically increasing...
df.explode('var1').reset_index(drop=True)

  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5  NaN     3
6  NaN     4

이것은 또한리스트와 스칼라의 혼합 열과 빈리스트와 NaN을 적절히 처리합니다 (이것은 repeat기반 솔루션 의 단점입니다 ).

However, you should note that explode only works on a single column (for now).

P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.


you can also use pd.concat and pd.melt for this:

>>> objs = [df, pd.DataFrame(df['samples'].tolist())]
>>> pd.concat(objs, axis=1).drop('samples', axis=1)
   subject  trial_num     0     1     2
0        1          1 -0.49 -1.00  0.44
1        1          2 -0.28  1.48  2.01
2        1          3 -0.52 -1.84  0.02
3        2          1  1.23 -1.36 -1.06
4        2          2  0.54  0.18  0.51
5        2          3 -2.18 -0.13 -1.35
>>> pd.melt(_, var_name='sample_num', value_name='sample', 
...         value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
    subject  trial_num sample_num  sample
0         1          1          0   -0.49
1         1          2          0   -0.28
2         1          3          0   -0.52
3         2          1          0    1.23
4         2          2          0    0.54
5         2          3          0   -2.18
6         1          1          1   -1.00
7         1          2          1    1.48
8         1          3          1   -1.84
9         2          1          1   -1.36
10        2          2          1    0.18
11        2          3          1   -0.13
12        1          1          2    0.44
13        1          2          2    2.01
14        1          3          2    0.02
15        2          1          2   -1.06
16        2          2          2    0.51
17        2          3          2   -1.35

last, if you need you can sort base on the first the first three columns.


Trying to work through Roman Pekar's solution step-by-step to understand it better, I came up with my own solution, which uses melt to avoid some of the confusing stacking and index resetting. I can't say that it's obviously a clearer solution though:

items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
# Keep original df index as a column so it's retained after melt
items_as_cols['orig_index'] = items_as_cols.index

melted_items = pd.melt(items_as_cols, id_vars='orig_index', 
                       var_name='sample_num', value_name='sample')
melted_items.set_index('orig_index', inplace=True)

df.merge(melted_items, left_index=True, right_index=True)

Output (obviously we can drop the original samples column now):

                 samples  subject  trial_num sample_num  sample
0    [1.84, 1.05, -0.66]        1          1          0    1.84
0    [1.84, 1.05, -0.66]        1          1          1    1.05
0    [1.84, 1.05, -0.66]        1          1          2   -0.66
1    [-0.24, -0.9, 0.65]        1          2          0   -0.24
1    [-0.24, -0.9, 0.65]        1          2          1   -0.90
1    [-0.24, -0.9, 0.65]        1          2          2    0.65
2    [1.15, -0.87, -1.1]        1          3          0    1.15
2    [1.15, -0.87, -1.1]        1          3          1   -0.87
2    [1.15, -0.87, -1.1]        1          3          2   -1.10
3   [-0.8, -0.62, -0.68]        2          1          0   -0.80
3   [-0.8, -0.62, -0.68]        2          1          1   -0.62
3   [-0.8, -0.62, -0.68]        2          1          2   -0.68
4    [0.91, -0.47, 1.43]        2          2          0    0.91
4    [0.91, -0.47, 1.43]        2          2          1   -0.47
4    [0.91, -0.47, 1.43]        2          2          2    1.43
5  [-1.14, -0.24, -0.91]        2          3          0   -1.14
5  [-1.14, -0.24, -0.91]        2          3          1   -0.24
5  [-1.14, -0.24, -0.91]        2          3          2   -0.91

For those looking for a version of Roman Pekar's answer that avoids manual column naming:

column_to_explode = 'samples'
res = (df
       .set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
       .apply(pd.Series)
       .stack()
       .reset_index())
res = res.rename(columns={
          res.columns[-2]:'exploded_{}_index'.format(column_to_explode),
          res.columns[-1]: '{}_exploded'.format(column_to_explode)})

I found the easiest way was to:

  1. Convert the samples column into a DataFrame
  2. Joining with the original df
  3. Melting

Shown here:

    df.samples.apply(lambda x: pd.Series(x)).join(df).\
melt(['subject','trial_num'],[0,1,2],var_name='sample')

        subject  trial_num sample  value
    0         1          1      0  -0.24
    1         1          2      0   0.14
    2         1          3      0  -0.67
    3         2          1      0  -1.52
    4         2          2      0  -0.00
    5         2          3      0  -1.73
    6         1          1      1  -0.70
    7         1          2      1  -0.70
    8         1          3      1  -0.29
    9         2          1      1  -0.70
    10        2          2      1  -0.72
    11        2          3      1   1.30
    12        1          1      2  -0.55
    13        1          2      2   0.10
    14        1          3      2  -0.44
    15        2          1      2   0.13
    16        2          2      2  -1.44
    17        2          3      2   0.73

It's worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.


Very late answer but I want to add this:

A fast solution using vanilla Python that also takes care of the sample_num column in OP's example. On my own large dataset with over 10 million rows and a result with 28 million rows this only takes about 38 seconds. The accepted solution completely breaks down with that amount of data and leads to a memory error on my system that has 128GB of RAM.

df = df.reset_index(drop=True)
lstcol = df.lstcol.values
lstcollist = []
indexlist = []
countlist = []
for ii in range(len(lstcol)):
    lstcollist.extend(lstcol[ii])
    indexlist.extend([ii]*len(lstcol[ii]))
    countlist.extend([jj for jj in range(len(lstcol[ii]))])
df = pd.merge(df.drop("lstcol",axis=1),pd.DataFrame({"lstcol":lstcollist,"lstcol_num":countlist},
index=indexlist),left_index=True,right_index=True).reset_index(drop=True)

import pandas as pd
df = pd.DataFrame([{'Product': 'Coke', 'Prices': [100,123,101,105,99,94,98]},{'Product': 'Pepsi', 'Prices': [101,104,104,101,99,99,99]}])
print(df)
df = df.assign(Prices=df.Prices.str.split(',')).explode('Prices')
print(df)

Try this in pandas >=0.25 version

참고URL : https://stackoverflow.com/questions/27263805/pandas-column-of-lists-create-a-row-for-each-list-element

반응형