Remove duplicate dicts from a list in Python

itboxs 2020. 8. 3. 08:37

I have a list of dicts, and I'd like to remove the dicts that have identical key and value pairs.

For this list: [{'a': 123}, {'b': 123}, {'a': 123}]

I'd like to return this: [{'a': 123}, {'b': 123}]

Another example:

For this list: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

I'd like to return this: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Try this:

[dict(t) for t in {tuple(d.items()) for d in l}]

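The strategy is to convert the list of dictionaries to a list of tuples, where each tuple contains the items of one dictionary. Since tuples can be hashed, you can remove duplicates using set (a set comprehension is used here; the alternative for older Python versions would be set(tuple(d.items()) for d in l)) and, after that, re-create the dictionaries from the tuples with dict.

where:

l is the original list
d is one of the dictionaries in the list
t is one of the tuples created from a dictionary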
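
As a quick check, the one-liner can be wrapped in a small function and run against the second example from the question. This is only a sketch: the name `dedupe_unordered` is made up here, and the order of the result is not guaranteed because the data passes through a set.

```python
def dedupe_unordered(dicts):
    """Drop duplicate dicts; the order of the result is arbitrary."""
    # Each dict becomes a hashable tuple of (key, value) pairs.
    unique_tuples = {tuple(d.items()) for d in dicts}
    # Rebuild a dict from every distinct tuple.
    return [dict(t) for t in unique_tuples]


l = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
result = dedupe_unordered(l)
print(len(result))  # 2
```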

Edit: If you want to preserve ordering, the one-liner above won't work, since set doesn't keep order. However, with a few lines of code, you can do that as well:

l = [{'a': 123, 'b': 1234},
     {'a': 3222, 'b': 1234},
     {'a': 123, 'b': 1234}]

seen = set()
new_l = []
for d in l:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)

print(new_l)

Example output:

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

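Note: As pointed out by @alexis, two dictionaries with the same keys and values might not produce the same tuple. That can happen if they went through a different history of adding and removing keys. If that's the case for your problem, consider sorting d.items() as he suggests.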
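
A minimal sketch of that suggestion, assuming every key can be compared with every other key so that `sorted` works. The function name `dedupe_sorted_items` is invented for this example.

```python
def dedupe_sorted_items(dicts):
    """Order-preserving de-duplication that ignores key insertion order."""
    seen = set()
    result = []
    for d in dicts:
        # Sorting the items makes equal dicts produce equal tuples.
        t = tuple(sorted(d.items()))
        if t not in seen:
            seen.add(t)
            result.append(d)
    return result


l = [{'a': 1, 'b': 2}, {'b': 2, 'a': 1}, {'a': 3, 'b': 2}]
print(dedupe_sorted_items(l))  # [{'a': 1, 'b': 2}, {'a': 3, 'b': 2}]
```

The second dict is dropped even though its keys were inserted in the opposite order.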

Another one-liner based on list comprehensions:

>>> d = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> [i for n, i in enumerate(d) if i not in d[n + 1:]]
[{'b': 123}, {'a': 123}]

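Here, since we can use dict comparison, we only keep the elements that are not in the rest of the initial list (this notion is only accessible through the index n, hence the use of enumerate). Note that this keeps the last occurrence of each duplicate, which is why {'b': 123} comes first in the output.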
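
If the first occurrence should be kept instead, one possible variation (not part of the original answer) is to look at the elements before index n rather than after it. It is still quadratic in the list length.

```python
d = [{'a': 123}, {'b': 123}, {'a': 123}]

# Keep an element only if it has not appeared earlier in the list.
first_seen = [i for n, i in enumerate(d) if i not in d[:n]]
print(first_seen)  # [{'a': 123}, {'b': 123}]
```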

If you are working with nested dictionaries, such as deserialized JSON objects, the other answers won't work. In that case you could use:

import json
set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
X = [json.loads(t) for t in set_of_jsons]
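
The set-based version above does not keep the original order. A possible order-preserving sketch of the same idea is shown below; it assumes every dict is JSON-serializable, and `dedupe_json` is a hypothetical helper name.

```python
import json


def dedupe_json(dicts):
    """Order-preserving de-duplication keyed on a canonical JSON string."""
    seen = set()
    result = []
    for d in dicts:
        # sort_keys=True gives equal nested dicts the same serialization.
        key = json.dumps(d, sort_keys=True)
        if key not in seen:
            seen.add(key)
            result.append(d)
    return result


X = [{'a': {'b': 1, 'c': 2}}, {'a': {'c': 2, 'b': 1}}, {'a': {'b': 3}}]
print(dedupe_json(X))  # [{'a': {'b': 1, 'c': 2}}, {'a': {'b': 3}}]
```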

Sometimes old-style loops are still useful. This code is a little longer than jcollado's, but very easy to read:

a = [{'a': 123}, {'b': 123}, {'a': 123}]
b = []
for i in range(0, len(a)):
    if a[i] not in a[i+1:]:
        b.append(a[i])

If you want to preserve the order, you can do:

from collections import OrderedDict
print(list(OrderedDict((frozenset(item.items()), item) for item in data).values()))
# [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

If the order doesn't matter, you can do:

print(list({frozenset(item.items()): item for item in data}.values()))
# Python 3.7+: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}] (older versions may return any order)
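
For reference, here is a self-contained sketch of the frozenset-key idea written as a function. It assumes Python 3.7 or later (where a plain dict keeps insertion order) and hashable values; the name `dedupe_frozenset` is hypothetical.

```python
def dedupe_frozenset(data):
    """Keep the first occurrence of each dict (assumes Python 3.7+)."""
    unique = {}
    for item in data:
        # setdefault stores the item only if this key has not been seen yet.
        unique.setdefault(frozenset(item.items()), item)
    return list(unique.values())


data = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
print(dedupe_frozenset(data))
# [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]
```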

If you don't mind using a third-party package, you could use iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen
>>> l = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> list(unique_everseen(l))
[{'a': 123}, {'b': 123}]

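It preserves the order of the original list and can also handle unhashable items such as dictionaries by falling back on a slower algorithm: O(n*m), where n is the number of elements in the original list and m is the number of unique elements in it, instead of O(n). If both keys and values are hashable, you can use the key argument of that function to create hashable items for the "uniqueness test", so that it runs in O(n).

In the case of a dictionary (which compares independently of order), you need to map it to another data structure that compares the same way, for example frozenset: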

>>> list(unique_everseen(l, key=lambda item: frozenset(item.items())))
[{'a': 123}, {'b': 123}]
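
If installing a package is not an option, the keyed behaviour can be approximated with the standard library alone. The generator below is a simplified stand-in, not the `iteration_utilities` implementation: it has no fallback for unhashable keys, and the name `unique_everseen_sketch` is made up here.

```python
def unique_everseen_sketch(iterable, key=None):
    """Yield unique elements in first-seen order (keys must be hashable)."""
    seen = set()
    for element in iterable:
        # Use the element itself when no key function is given.
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element


l = [{'a': 123}, {'b': 123}, {'a': 123}]
unique = list(unique_everseen_sketch(l, key=lambda item: frozenset(item.items())))
print(unique)  # [{'a': 123}, {'b': 123}]
```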

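Note that you shouldn't use a simple tuple approach (without sorting), because equal dictionaries don't necessarily have the same order (even in Python 3.7, where insertion order, not sorted order, is guaranteed):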

>>> d1 = {1: 1, 9: 9}
>>> d2 = {9: 9, 1: 1}
>>> d1 == d2
True
>>> tuple(d1.items()) == tuple(d2.items())
False

And even sorting the tuple might not work if the keys aren't sortable:

>>> d3 = {1: 1, 'a': 'a'}
>>> tuple(sorted(d3.items()))
TypeError: '<' not supported between instances of 'str' and 'int'
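
One way around both problems is to build an unordered, hashable key instead of a sorted tuple. The sketch below goes one step further and recurses into nested dicts and lists; `freeze` and `dedupe_frozen` are hypothetical helpers, and the approach assumes the innermost values are themselves hashable.

```python
def freeze(obj):
    """Recursively convert dicts, lists and sets into hashable values."""
    if isinstance(obj, dict):
        # A frozenset of pairs ignores key order and needs no sorting.
        return frozenset((k, freeze(v)) for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return tuple(freeze(v) for v in obj)
    if isinstance(obj, set):
        return frozenset(freeze(v) for v in obj)
    return obj


def dedupe_frozen(dicts):
    """Order-preserving de-duplication keyed on the frozen form."""
    seen = set()
    result = []
    for d in dicts:
        key = freeze(d)
        if key not in seen:
            seen.add(key)
            result.append(d)
    return result


items = [{1: 1, 'a': 'a'}, {'a': 'a', 1: 1}, {'x': [1, {'y': 2}]}, {'x': [1, {'y': 2}]}]
print(dedupe_frozen(items))  # [{1: 1, 'a': 'a'}, {'x': [1, {'y': 2}]}]
```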

Benchmark

I thought it might be useful to see how the performance of these approaches compares, so I ran a small benchmark. The benchmark graphs show time versus list size for a list containing no duplicates (that was chosen arbitrarily; the runtime doesn't change significantly if some or many duplicates are added). It's a log-log plot, so the complete range is covered.

The absolute times:

[Figure: absolute times versus list size]

The timings relative to the fastest approach:

[Figure: timings relative to the fastest approach versus list size]

The second approach from thefourtheye is the fastest here. The unique_everseen approach with the key function is in second place; however, it is the fastest approach that preserves order. The other approaches from jcollado and thefourtheye are almost as fast. The approach using unique_everseen without key and the solutions from Emmanuel and Scorpil are very slow for longer lists and scale much worse: O(n*n) instead of O(n). stpk's approach with json isn't O(n*n), but it's much slower than the similar O(n) approaches.

The code to reproduce the benchmarks:

from simple_benchmark import benchmark
import json
from collections import OrderedDict
from iteration_utilities import unique_everseen

def jcollado_1(l):
    return [dict(t) for t in {tuple(d.items()) for d in l}]

def jcollado_2(l):
    seen = set()
    new_l = []
    for d in l:
        t = tuple(d.items())
        if t not in seen:
            seen.add(t)
            new_l.append(d)
    return new_l

def Emmanuel(d):
    return [i for n, i in enumerate(d) if i not in d[n + 1:]]

def Scorpil(a):
    b = []
    for i in range(0, len(a)):
        if a[i] not in a[i+1:]:
            b.append(a[i])
    return b

def stpk(X):
    set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
    return [json.loads(t) for t in set_of_jsons]

def thefourtheye_1(data):
    return OrderedDict((frozenset(item.items()),item) for item in data).values()

def thefourtheye_2(data):
    return {frozenset(item.items()):item for item in data}.values()

def iu_1(l):
    return list(unique_everseen(l))

def iu_2(l):
    return list(unique_everseen(l, key=lambda inner_dict: frozenset(inner_dict.items())))

funcs = (jcollado_1, Emmanuel, stpk, Scorpil, thefourtheye_1, thefourtheye_2, iu_1, jcollado_2, iu_2)
arguments = {2**i: [{'a': j} for j in range(2**i)] for i in range(2, 12)}
b = benchmark(funcs, arguments, 'list size')

%matplotlib widget
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = '8, 6'

b.plot(relative_to=thefourtheye_2)

For completeness here is the timing for a list containing only duplicates:

# this is the only change for the benchmark
arguments = {2**i: [{'a': 1} for j in range(2**i)] for i in range(2, 12)}

The timings don't change significantly except for unique_everseen without a key function, which in this case is the fastest solution. However, that's just the best case (so not representative) for that function with unhashable values, because its runtime depends on the number of unique values in the list: O(n*m), where m is just 1 in this case, so it runs in O(n).

Disclaimer: I'm the author of iteration_utilities.

If you are using Pandas in your workflow, one option is to feed a list of dictionaries directly to the pd.DataFrame constructor. Then use drop_duplicates and to_dict methods for the required result.

import pandas as pd

d = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

d_unique = pd.DataFrame(d).drop_duplicates().to_dict('records')

print(d_unique)

[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Not a universal answer, but if your list happens to be sorted by some key, like this:

l = [{'a': {'b': 31}, 't': 1},
     {'a': {'b': 31}, 't': 1},
     {'a': {'b': 145}, 't': 2},
     {'a': {'b': 25231}, 't': 2},
     {'a': {'b': 25231}, 't': 2},
     {'a': {'b': 25231}, 't': 2},
     {'a': {'b': 112}, 't': 3}]

then the solution is as simple as:

import itertools
result = [a[0] for a in itertools.groupby(l)]

Result:

[{'a': {'b': 31}, 't': 1},
{'a': {'b': 145}, 't': 2},
{'a': {'b': 25231}, 't': 2},
{'a': {'b': 112}, 't': 3}]

It works with nested dictionaries and preserves order.
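
The sortedness requirement matters because itertools.groupby only merges runs of adjacent equal items. A short illustration with a made-up, unsorted list:

```python
import itertools

unsorted = [{'t': 1}, {'t': 2}, {'t': 1}]

# The two {'t': 1} dicts are not adjacent, so both survive.
result = [key for key, _ in itertools.groupby(unsorted)]
print(result)  # [{'t': 1}, {'t': 2}, {'t': 1}]
```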

You can use a set, but you need to turn the dicts into a hashable type.

seq = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
unique = set()
for d in seq:
    t = tuple(d.items())
    unique.add(t)

unique now equals (the order of elements in a set is arbitrary):

{(('a', 3222), ('b', 1234)), (('a', 123), ('b', 1234))}

To get dicts back:

[dict(x) for x in unique]

Reference URL: https://stackoverflow.com/questions/9427163/remove-duplicate-dict-in-list-in-python
