Extracting Patent Data from the USPTO PatentsView API


뱃놀이가자 2024. 1. 5. 22:31


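Patent analysis is one of the methods most commonly used in technology management.
In the past, patent information could be pulled from the USPTO with little effort, but these days you have to request API access first and then retrieve the data through the API.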

https://patentsview-support.atlassian.net/servicedesk/customer/portal/1/group/1/create/18


Through that link you can request your own personal API key for USPTO PatentsView.
After that, it is easiest to think of the rest of the process as something close to web crawling.
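
Once the key has been issued, one way to keep it out of the script is to read it from an environment variable and build the request header from it. The sketch below assumes the key is sent in the `X-Api-Key` header, as in the code later in this post; the variable name `PATENTSVIEW_API_KEY` and the helper `build_headers` are hypothetical names chosen for illustration.

```python
import os


def build_headers(env_var: str = "PATENTSVIEW_API_KEY") -> dict:
    """Return request headers carrying the API key read from the environment.

    The environment variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return {"X-Api-Key": api_key}
```

The returned dictionary can then be passed as `headers=build_headers()` to `requests.get`.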

You write a query expression in JSON, send it to the API, and get the data back.
GitHub has a variety of PatentsView API wrappers and related code to refer to; the one I referred to is linked below.
https://docs.ropensci.org/patentsview/index.html

An R Client to the PatentsView API

Provides functions to simplify the PatentsView API (<https://patentsview.org/apis/purpose>) query language, send GET and POST requests to the API's seven endpoints, and parse the data that comes back.


An R package exists, but it is fairly error-prone (not recommended).

Here is the code I wrote:

import requests
import json
import pandas as pd
import urllib.parse

api_key = "YOUR_API_KEY"  # use the key issued to you

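def fetch_page(query_url, page, api_key):
    # User-defined options. 'page' must be passed through, otherwise every call
    # returns the same page and the loop below never ends. per_page is an example value.
    options = {"page": page, "per_page": 100}
    # 'fields' is the module-level list defined below; URL-encode both JSON strings
    url = (query_url
           + '&f=' + urllib.parse.quote(json.dumps(fields))
           + '&o=' + urllib.parse.quote(json.dumps(options)))
    response = requests.get(url, headers={"X-Api-Key": api_key})
    if response.status_code == 200:
        return response.json()
    else:
        return None
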

# Build the query
query = {
    # your own query goes here
}
fields = []  # your own list of fields to return
query_string = json.dumps(query)
encoded_query = urllib.parse.quote(query_string)

base_url = 'https://api.patentsview.org/patents/query?q='
query_url = base_url + encoded_query

# Initialize an empty list to store DataFrames
dataframes = []

# Initialize a variable for the current page
current_page = 1

# Loop to fetch each page
while True:
    data = fetch_page(query_url, current_page, api_key)
    if data and data.get('patents'):
        # Convert the data to a DataFrame
        df = pd.DataFrame(data.get('patents'))
        dataframes.append(df)
        
        # Increment the page number for the next request
        current_page += 1
    else:
        # Exit the loop if there are no more results
        break

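# Concatenate all DataFrames into a single DataFrame
# (pd.concat raises ValueError on an empty list, so fall back to an empty DataFrame)
all_patents = pd.concat(dataframes, ignore_index=True) if dataframes else pd.DataFrame()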

# Save the complete DataFrame to a CSV file
csv_file_path = 'all_patents.csv'
all_patents.to_csv(csv_file_path, index=False)
print(f"Data successfully saved to {csv_file_path}")
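
The paging loop above depends on one stop condition: keep requesting pages until one comes back with no records. A minimal sketch of that logic, separated from the network call so it can be checked without touching the API, might look like this. `collect_pages`, `fake_fetch`, and the sample records are all hypothetical stand-ins, not part of PatentsView.

```python
from typing import Callable, Optional


def collect_pages(fetch: Callable[[int], Optional[dict]],
                  key: str = "patents",
                  max_pages: int = 1000) -> list:
    """Call fetch(page) for page = 1, 2, ... until a page has no records."""
    records = []
    for page in range(1, max_pages + 1):
        data = fetch(page)
        # Stop on a failed request (None) or on an empty/missing record list
        if not data or not data.get(key):
            break
        records.extend(data[key])
    return records


# Stand-in for the real request: two pages of made-up records, then nothing.
FAKE_PAGES = {
    1: {"patents": [{"patent_number": "A1"}, {"patent_number": "A2"}]},
    2: {"patents": [{"patent_number": "A3"}]},
}


def fake_fetch(page: int) -> Optional[dict]:
    return FAKE_PAGES.get(page, {"patents": None})


rows = collect_pages(fake_fetch)
print(len(rows))  # 3
```

With the real API, `fetch` could be something like `lambda p: fetch_page(query_url, p, api_key)`. The `max_pages` cap is a safety net added in this sketch so that a response that never empties cannot loop forever.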

https://patentsview.org/apis/api-query-language


The query syntax is explained very well at that link.
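
As a rough illustration of what a query expression can look like, the sketch below combines a date condition and a text condition and URL-encodes the result. The operators (`_and`, `_gte`, `_text_any`) and field names (`patent_date`, `patent_abstract`, and so on) are written from my recollection of the query language documentation and should be checked against the page linked above; the search terms are made up.

```python
import json
import urllib.parse

# Hypothetical example: patents dated on or after 2020-01-01 whose abstract
# contains any of the listed words.
query = {
    "_and": [
        {"_gte": {"patent_date": "2020-01-01"}},
        {"_text_any": {"patent_abstract": "battery electrode"}},
    ]
}

# Fields to return for each matching patent (assumed field names)
fields = ["patent_number", "patent_title", "patent_date"]

# JSON contains quotes, braces and spaces, so encode it before putting it in a URL
encoded_query = urllib.parse.quote(json.dumps(query))
encoded_fields = urllib.parse.quote(json.dumps(fields))

print(encoded_query)
```

The encoded strings can be appended to the base URL as the `q=` and `f=` parameters, in the same way as the script above.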

The results seem to have come out fine.
