Oxylabs Real-Time Crawler

How to Use Real-Time Crawler for Baidu [Part 4]: Oxylabs Real-Time Crawler for Baidu

Want to know how to use Oxylabs Real-Time Crawler for Baidu? This is the official Oxylabs introduction to the product.

Quick Start

Real-Time Crawler is built for heavy-duty data retrieval operations. You can use Real-Time Crawler to access various Baidu pages. It enables effortless web data extraction from search engines without any delays or errors.

Real-Time Crawler for Baidu uses basic HTTP authentication, which requires sending a username and password.

This is by far the fastest way to start using Real-Time Crawler for Baidu. You will send the query adidas to baidu_search using the Realtime integration method. Don't forget to replace USERNAME and PASSWORD with your proxy user credentials.

curl --user "USERNAME:PASSWORD" 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "baidu_search", "domain": "com", "query": "adidas"}'
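For readers who prefer Python over cURL, the same Quick Start request could be built roughly as follows. This is a minimal sketch using only the standard library; the function name build_realtime_request is illustrative and not part of any Oxylabs client.

```python
import base64
import json
import urllib.request

REALTIME_URL = "https://realtime.oxylabs.io/v1/queries"


def build_realtime_request(username, password, query):
    """Build (but do not send) the Quick Start request for baidu_search."""
    payload = {"source": "baidu_search", "domain": "com", "query": query}
    # Basic HTTP authentication: base64 of "username:password".
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        REALTIME_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Sending needs valid credentials and network access, for example:
#   request = build_realtime_request("USERNAME", "PASSWORD", "adidas")
#   with urllib.request.urlopen(request) as response:
#       data = json.load(response)
```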

If you have any questions not covered by this documentation, please contact your account manager or our support staff at [email protected].


Integration Methods

Real-Time Crawler for Baidu supports three integration methods which have their unique benefits:

  • Push-Pull. With this method you are not required to maintain an active connection with our endpoint to retrieve the data. Upon making a request, our system can automatically ping the user's server when the job is done (see Callback). This method saves computing resources and can be scaled easily.
  • Realtime. This method requires the user to maintain an active connection with our endpoint in order to receive the results when the job is completed. It can be implemented in one service, whereas Push-Pull is a two-step process.
  • SuperAPI. This method is very similar to Realtime, but instead of posting data to our endpoint, the user can use HTML Crawler as a proxy. To retrieve the data, the user must set up a proxy endpoint and make a GET request to the desired URL. Additional parameters must be added using headers.

Our recommended data extraction method is Push-Pull.


Push-Pull

This is the simplest yet most reliable data delivery method, and the one we recommend. In the Push-Pull scenario you send us a query, we return a job id, and once the job is done you can use that id to retrieve content from the /results endpoint. You can check job completion status yourself, or you can set up a simple listener that is able to accept POST queries.

This way we will send you a callback message once the job is ready to be retrieved. In this particular example the results will be automatically uploaded to your S3 bucket named YOUR_BUCKET_NAME.


Single Query

The following endpoint will handle single queries for one keyword or URL. The API will return a confirmation message containing job information, including job id. You can check job completion status using that id, or you can ask us to ping your callback endpoint once the scraping task is finished by adding callback_url in the query.

POST https://data.oxylabs.io/v1/queries

You need to post query parameters as data in JSON body.

curl --user user:pass1 'https://data.oxylabs.io/v1/queries' -H "Content-Type: application/json" \
 -d '{"source": "baidu_search", "domain": "com", "query": "adidas", "callback_url": "https://your.callback.url", "storage_type": "s3", "storage_url": "YOUR_BUCKET_NAME"}'

The API will respond with query information in JSON format, by printing it in response body, similar to this:

{
  "callback_url": "https://your.callback.url",
  "client_id": 5,
  "created_at": "2019-10-01 00:00:01",
  "domain": "com",
  "geo_location": null,
  "id": "12345678900987654321",
  "limit": 10,
  "locale": null,
  "pages": 1,
  "render": null,
  "query": "adidas",
  "source": "baidu_search",
  "start_page": 1,
  "status": "pending",
  "storage_type": "s3",
  "storage_url": "YOUR_BUCKET_NAME/12345678900987654321.json",
  "subdomain": "www",
  "updated_at": "2019-10-01 00:00:01",
  "user_agent_type": "desktop",
  "_links": [
    {
      "rel": "self",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
      "method": "GET"
    },
    {
      "rel": "results",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
      "method": "GET"
    }
  ]
}
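The two URLs you need later (job status and results) are listed under _links in the response above. A small helper can pick them out by their rel value. This is only a sketch; link_for is a hypothetical name, and the input is assumed to be the parsed JSON shown above.

```python
def link_for(job, rel):
    """Return the href of the _links entry with the given rel, or None."""
    for link in job.get("_links", []):
        if link.get("rel") == rel:
            return link.get("href")
    return None


# Trimmed copy of the sample response above.
job = {
    "id": "12345678900987654321",
    "status": "pending",
    "_links": [
        {"rel": "self", "href": "http://data.oxylabs.io/v1/queries/12345678900987654321", "method": "GET"},
        {"rel": "results", "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results", "method": "GET"},
    ],
}
status_url = link_for(job, "self")
results_url = link_for(job, "results")
```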

Check Job Status

If your query had callback_url, we will send you a message containing link to content once the scraping task is done. However, if there was no callback_url in the query, you will need to check job status yourself. For that you need to use the URL in href under rel:self in the response message you received after submitting your query to our API. It should look similar to this: http://data.oxylabs.io/v1/queries/12345678900987654321.

GET https://data.oxylabs.io/v1/queries/{id}

Querying this link will return job information, including its status. There are 3 possible status values:

pending: The job is still in the queue and has not been completed.
done: The job is completed; you may retrieve the result by querying the URL in href under rel:results: http://data.oxylabs.io/v1/queries/12345678900987654321/results
faulted: There was an issue with the job and we couldn't complete it, most likely due to a server error on the target site's side.

curl --user user:pass1 'http://data.oxylabs.io/v1/queries/12345678900987654321'

The API will respond with query information in JSON format, by printing it in response body. Notice that job status has been changed to done. You can now retrieve content by querying http://data.oxylabs.io/v1/queries/12345678900987654321/results.

You can also see that the task has been updated_at 2019-10-01 00:00:15 – the query took 14 seconds to complete.

{
  "client_id": 5,
  "created_at": "2019-10-01 00:00:01",
  "domain": "com",
  "geo_location": null,
  "id": "12345678900987654321",
  "limit": 10,
  "locale": null,
  "pages": 1,
  "render": null,
  "query": "adidas",
  "source": "baidu_search",
  "start_page": 1,
  "status": "done",
  "subdomain": "www",
  "updated_at": "2019-10-01 00:00:15",
  "user_agent_type": "desktop",
  "_links": [
    {
      "rel": "self",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
      "method": "GET"
    },
    {
      "rel": "results",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
      "method": "GET"
    }
  ]
}
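If you do not use a callback, checking the status amounts to a polling loop over the three status values. The sketch below assumes a caller-supplied fetch_status function (hypothetical) that performs the GET request and returns the parsed JSON; the interval and attempt limit are arbitrary illustration values, not documented recommendations.

```python
import time


def wait_for_job(fetch_status, job_id, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll a job until it is done; raise if it faults or never finishes.

    fetch_status(job_id) must return the parsed JSON of GET /v1/queries/{id}.
    """
    for _ in range(max_attempts):
        job = fetch_status(job_id)
        status = job.get("status")
        if status == "done":
            return job
        if status == "faulted":
            raise RuntimeError(f"job {job_id} faulted")
        # Any other status (normally "pending"): wait and check again.
        sleep(interval)
    raise TimeoutError(f"job {job_id} not done after {max_attempts} checks")
```

Injecting fetch_status and sleep keeps the loop independent of any particular HTTP client and easy to exercise without network access.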

Retrieve Job Content

Once you know the job is ready to be retrieved, either by checking its status or by receiving a callback from us, you can GET it using the URL in href under rel:results in either our initial response or the callback message. It should look similar to this: http://data.oxylabs.io/v1/queries/12345678900987654321/results.

GET https://data.oxylabs.io/v1/queries/{id}/results

The results can be retrieved automatically, without periodically checking job status, by setting up a Callback service. You need to specify the IP or domain of the server where the Callback service is running. When our system completes a job, it will send a message to the provided IP or domain, and the Callback service will download the results as described in the Callback implementation example.

curl --user user:pass1 'http://data.oxylabs.io/v1/queries/12345678900987654321/results'

The API will return job content:

{
  "results": [
    {
      "content": "<!doctype html>
        CONTENT      
      ",
      "created_at": "2019-10-01 00:00:01",
      "updated_at": "2019-10-01 00:00:15",
      "page": 1,
      "url": "https://www.baidu.com/search?q=adidas&hl=en&gl=US",
      "job_id": "12345678900987654321",
      "status_code": 200
    }
  ]
}
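Once the results JSON is parsed, the HTML of each page sits in the content field. The following sketch collects the content of successfully scraped pages keyed by page number; successful_pages is an illustrative name, and treating only status_code 200 as success is an assumption.

```python
def successful_pages(payload):
    """Map page number -> HTML content for results with status_code 200."""
    pages = {}
    for result in payload.get("results", []):
        if result.get("status_code") == 200:
            pages[result["page"]] = result["content"]
    return pages
```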

Callback

A callback is a POST request we send to your machine, informing you that the data extraction task is completed and providing a URL to download the scraped content. This means that you no longer need to check job status manually. Once the data is ready, we will let you know, and all you need to do is retrieve it.

# Please see code samples in Python and PHP.

Sample callback output

{  
   "created_at":"2019-10-01 00:00:01",
   "updated_at":"2019-10-01 00:00:15",
   "locale":null,
   "client_id":163,
   "user_agent_type":"desktop",
   "source":"baidu_search",
   "pages":1,
   "subdomain":"www",
   "status":"done",
   "start_page":1,
   "render":null,
   "priority":0,
   "ttl":0,
   "origin":"api",
   "persist":true,
   "id":"12345678900987654321",
   "callback_url":"http://your.callback.url/",
   "query":"adidas",
   "domain":"com",
   "limit":10,
   "geo_location":null,
   {...}
   "_links":[
      {  
         "href":"https://data.oxylabs.io/v1/queries/12345678900987654321",
         "method":"GET",
         "rel":"self"
      },
      {  
         "href":"https://data.oxylabs.io/v1/queries/12345678900987654321/results",
         "method":"GET",
         "rel":"results"
      }
   ],
}
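A listener for such callbacks can be quite small. The sketch below uses Python's built-in http.server and assumes the POST body is the job JSON shown above; the handler class, the port, and the function names are illustrative and are not the official sample code.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def results_url_from_callback(body):
    """Return the results URL from a callback body, or None if the job is not done."""
    job = json.loads(body)
    if job.get("status") != "done":
        return None
    for link in job.get("_links", []):
        if link.get("rel") == "results":
            return link.get("href")
    return None


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the sender.
        length = int(self.headers.get("Content-Length", 0))
        url = results_url_from_callback(self.rfile.read(length))
        print("results ready at:", url)
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Start the listener; blocks until interrupted."""
    HTTPServer(("", port), CallbackHandler).serve_forever()
```

Calling serve() starts the listener; the URL it prints can then be fetched with your Real-Time Crawler credentials as described under Retrieve Job Content.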

Batch Query

Real-Time Crawler also supports submitting multiple keywords at once, up to 1,000 keywords per batch. The following endpoint will submit multiple keywords to the extraction queue.

POST https://data.oxylabs.io/v1/queries/batch

You need to post query parameters as data in JSON body.

The system will handle every keyword as a separate request. If you provided callback URL, you will get a separate call for each keyword. Otherwise, our initial response will contain job ids for all keywords. For example, if you sent 50 keywords, we will return 50 unique job ids.

Important! query is the only parameter that can have multiple values. All other parameters are the same for that batch query.

curl --user user:pass1 'https://data.oxylabs.io/v1/queries/batch' -H 'Content-Type: application/json' \
 -d '@keywords.json'

keywords.json content:

{  
   "query":[  
      "adidas",
      "nike",
      "reebok"
   ],
   "source": "baidu_search",
   "domain": "com",
   "callback_url": "https://your.callback.url"
}

The API will respond with query information in JSON format, by printing it in response body, similar to this:

{
  "queries": [
    {
      "callback_url": "https://your.callback.url",
      {...}
      "created_at": "2019-10-01 00:00:01",
      "domain": "com",
      "id": "12345678900987654321",
      {...}
      "query": "adidas",
      "source": "baidu_search",
      {...}
          "rel": "results",
          "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
          "method": "GET"
        }
      ]
    },
    {
      "callback_url": "https://your.callback.url",
      {...}
      "created_at": "2019-10-01 00:00:01",
      "domain": "com",
      "id": "12345678901234567890",
      {...}
      "query": "nike",
      "source": "baidu_search",
      {...}
          "rel": "results",
          "href": "http://data.oxylabs.io/v1/queries/12345678901234567890/results",
          "method": "GET"
        }
      ]
    },
    {
      "callback_url": "https://your.callback.url",
      {...}
      "created_at": "2019-10-01 00:00:01",
      "domain": "com",
      "id": "01234567899876543210",
      {...}
      "query": "reebok",
      "source": "baidu_search",
      {...}
          "rel": "results",
          "href": "http://data.oxylabs.io/v1/queries/01234567899876543210/results",
          "method": "GET"
        }
      ]
    }
  ]
}
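Because a batch accepts at most 1,000 keywords, longer keyword lists have to be split into several batch requests. A possible way to prepare the request bodies is sketched below; batch_payloads is a hypothetical helper, and only query varies between the resulting payloads.

```python
def batch_payloads(keywords, source="baidu_search", domain="com",
                   callback_url=None, max_keywords=1000):
    """Split keywords into batch request bodies of at most max_keywords each."""
    payloads = []
    for start in range(0, len(keywords), max_keywords):
        payload = {
            "query": keywords[start:start + max_keywords],
            "source": source,
            "domain": domain,
        }
        if callback_url:
            payload["callback_url"] = callback_url
        payloads.append(payload)
    return payloads
```

Each returned dictionary can be serialized to JSON and posted to the batch endpoint in the same way as keywords.json.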

Get Notifier IP Address List

You may want to whitelist the IPs sending you callback messages or get the list of these IPs for other purposes. This can be done by sending a GET request to this endpoint: https://data.oxylabs.io/v1/info/callbacker_ips.

curl --user user:pass1 'https://data.oxylabs.io/v1/info/callbacker_ips'

The API will return the list of IPs making callback requests to your system:

{
    "ips": [
        "x.x.x.x",
        "y.y.y.y"
    ]
}
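With that list in hand, a callback listener can reject requests from unknown senders. The following is a minimal sketch using the standard ipaddress module; the function name is illustrative, and the addresses in the usage example are made-up documentation addresses, not real notifier IPs.

```python
import ipaddress


def is_known_notifier(remote_addr, ips_response):
    """Check whether remote_addr appears in a callbacker_ips response."""
    allowed = {ipaddress.ip_address(ip) for ip in ips_response.get("ips", [])}
    return ipaddress.ip_address(remote_addr) in allowed


# Hypothetical response content for illustration only.
example = {"ips": ["192.0.2.10", "198.51.100.7"]}
accepted = is_known_notifier("192.0.2.10", example)
rejected = is_known_notifier("203.0.113.5", example)
```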

Upload to Storage

By default RTC job results are stored in our databases. This means that you will need to query our results endpoint and retrieve content yourself. Custom storage feature allows you to store results in your own cloud storage. The advantage of this feature is that you don't have to make extra requests in order to fetch results – everything goes directly to your storage bucket.

We support Amazon S3 and Google Cloud Storage. If you would like to use a different type of storage, please contact your account manager to discuss the feature delivery timeline.

Amazon S3

To get your job results uploaded to your Amazon S3 bucket, please set up access permissions for our service. To do that, go to https://s3.console.aws.amazon.com/ > S3 > Storage > Bucket Name (if you don't have one, create a new one) > Permissions > Bucket Policy

You can find bucket policy in this JSON or in code sample area on the right. Don't forget to change bucket name under YOUR_BUCKET_NAME. This policy allows us to write to your bucket, give access to uploaded files to you, and know bucket location.

Google Cloud Storage

To get your job results uploaded to your Google Cloud Storage bucket, please set up special permissions for our service. To do that, please create a custom role with the storage.objects.create permission and assign it to the Oxylabs service account email [email protected].

Usage

To use this feature, please specify two additional parameters in your requests: storage_type and storage_url. Learn more here.

The upload path looks like this: YOUR_BUCKET_NAME/job_ID.json. You will find job ID in response body that you receive from us after submitting a request. In this example job ID is 12345678900987654321.
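As an illustration, the two storage parameters and the resulting upload path could be handled like this. It is a sketch under the assumptions stated in this document (storage_type is s3 or gcs, and the path is the bucket name followed by the job ID and .json); both helper names are hypothetical.

```python
SUPPORTED_STORAGE = {"s3", "gcs"}


def with_storage(payload, storage_type, bucket):
    """Return a copy of a query payload with the two storage parameters added."""
    if storage_type not in SUPPORTED_STORAGE:
        raise ValueError(f"unsupported storage_type: {storage_type!r}")
    return {**payload, "storage_type": storage_type, "storage_url": bucket}


def upload_path(bucket, job_id):
    """Expected location of the uploaded results: BUCKET/JOB_ID.json."""
    return f"{bucket}/{job_id}.json"
```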


Realtime

The data submission is the same as in the Push-Pull method, but in the Realtime case we return the content on the open connection. You send us a query, the connection remains open, we retrieve the content and bring it to you. The endpoint that handles this is:

POST https://realtime.oxylabs.io/v1/queries

There is a timeout limit of 150 seconds for open connections, therefore in rare cases of heavy load we may not be able to ensure the data gets to you.

You need to post query parameters as data in JSON body. Please see example for more details.

curl --user user:pass1 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" \
 -d '{"source": "baidu_search", "domain": "com", "query": "adidas"}'

Example response body that will be returned on open connection:

{
  "results": [
    {
      "content": "
      CONTENT
      ",
      "created_at": "2019-10-01 00:00:01",
      "updated_at": "2019-10-01 00:00:15",
      "id": null,
      "page": 1,
      "url": "https://www.baidu.com/search?q=adidas&hl=en&gl=US",
      "job_id": "12345678900987654321",
      "status_code": 200
    }
  ]
}

SuperAPI

If you have ever used regular proxies for data scraping, integrating the SuperAPI delivery method will be a breeze. All that needs to be done is to use our entry node as a proxy, authorize with Real-Time Crawler credentials, and ignore certificates. In cURL it's -k or --insecure. Your data will reach you on the open connection.

GET realtime.oxylabs.io:60000

SuperAPI only supports a handful of parameters since it only works with Direct data source where full URL is provided. These parameters should be sent as headers. This is a list of accepted parameters:

X-OxySERPs-User-Agent-Type: There is no way to indicate a specific User-Agent, but you can let us know which browser and platform to use. A list of supported User-Agents can be found here.

If you need help setting up SuperAPI, drop a line at [email protected].

curl -k -x realtime.oxylabs.io:60000 -U user:pass1 -H "X-OxySERPs-User-Agent-Type: desktop_chrome" "https://www.baidu.com/search?q=adidas"
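In Python, the same proxy-style setup might look like the sketch below, which uses only the standard library. It assumes the entry node behaves as a regular HTTP proxy with credentials embedded in the proxy URL; superapi_opener and superapi_request are illustrative names. Certificate verification is switched off to mirror cURL's -k flag.

```python
import ssl
import urllib.request
from urllib.parse import quote

ENTRY_NODE = "realtime.oxylabs.io:60000"


def superapi_opener(username, password):
    """Build an opener that routes traffic through the SuperAPI entry node."""
    # Percent-encode credentials so special characters survive in the URL.
    proxy = f"http://{quote(username, safe='')}:{quote(password, safe='')}@{ENTRY_NODE}"
    context = ssl.create_default_context()
    context.check_hostname = False       # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE  # ignore certificates, like curl -k
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy}),
        urllib.request.HTTPSHandler(context=context),
    )


def superapi_request(url, user_agent_type="desktop_chrome"):
    """Build a GET request carrying the User-Agent type as a header."""
    return urllib.request.Request(
        url, headers={"X-OxySERPs-User-Agent-Type": user_agent_type}
    )


# Sending needs valid credentials and network access, for example:
#   opener = superapi_opener("user", "pass1")
#   html = opener.open(superapi_request("https://www.baidu.com/s?wd=adidas")).read()
```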

Data Sources

There are multiple ways to retrieve data from Baidu using Real-Time Crawler. You can give us a full URL via Direct, or you can specify parameters via specifically built data sources, such as Search, Shopping Product or Images.

If you are unsure which way to choose, drop us a line at [email protected] or contact your account manager.


Direct

baidu source is designed to retrieve content of direct URLs of various Baidu pages. This means that instead of sending multiple parameters, you can provide us with a direct URL to required Baidu page. We do not strip any parameters or alter your URLs in any other way.

Query parameters

Parameter | Description | Default Value
source | Data source | baidu
url | Direct URL (link) to Baidu page |
user_agent_type | Device type and browser. The full list can be found here. | desktop
callback_url | URL to your callback endpoint |
storage_type | Storage service provider. We support Amazon S3 and Google Cloud Storage. The storage_type parameter values for these storage providers are, correspondingly, s3 and gcs. The full implementation can be found on the Upload to Storage page. This feature only works via Push-Pull (Callback) method. |
storage_url | Your storage bucket name. Only works via Push-Pull (Callback) method. |
   – required parameter

In this example the API will retrieve Baidu search for keyword adidas in Push-Pull method:

curl --user user:pass1 'https://data.oxylabs.io/v1/queries' -H "Content-Type: application/json" \
 -d '{"source": "baidu", "url": "http://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=adidas"}'

Here is the same example in Realtime:

curl --user user:pass1 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" \
 -d '{"source": "baidu", "url": "http://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=adidas"}'

And via SuperAPI:

curl -k -x realtime.oxylabs.io:60000 -U user:pass1 "http://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=adidas"

Search

baidu_search source is designed to retrieve Baidu Search results (SERP) in HTML format.

Query parameters

Parameter | Description | Default Value
source | Data source | baidu_search
domain | Domain localization | com
query | UTF-encoded keyword |
start_page | Starting page number | 1
pages | Number of pages to retrieve | 1
limit | Number of results to retrieve in each page | 10
user_agent_type | Device type and browser. The full list can be found here. | desktop
callback_url | URL to your callback endpoint |
storage_type | Storage service provider. We support Amazon S3 and Google Cloud Storage. The storage_type parameter values for these storage providers are, correspondingly, s3 and gcs. The full implementation can be found on the Upload to Storage page. This feature only works via Push-Pull (Callback) method. |
storage_url | Your storage bucket name. Only works via Push-Pull (Callback) method. |
   – required parameter

Real-Time Crawler makes request to baidu.com to retrieve search results pages from number 11 to number 20 for keyword adidas. Real-Time Crawler will post the download URL to raw HTML page output to your.callback.url once the data retrieval task is successfully finished.

curl --user user:pass1 'https://data.oxylabs.io/v1/queries' -H "Content-Type: application/json" \
 -d '{"source": "baidu_search", "domain": "com", "query": "adidas", "start_page": 11, "pages": 10, "callback_url": "https://your.callback.url"}'

And here is the same example in Realtime:

curl --user user:pass1 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" \
 -d '{"source": "baidu_search", "domain": "com", "query": "adidas", "start_page": 11, "pages": 10, "callback_url": "https://your.callback.url"}'
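The relationship between start_page and pages is easy to get wrong, so here is a small sketch that builds a baidu_search payload with the default values from the table and reports which result pages it covers. Both function names are hypothetical, and the positivity check is an assumption rather than documented API behavior.

```python
def baidu_search_payload(query, domain="com", start_page=1, pages=1, limit=10,
                         user_agent_type="desktop", callback_url=None):
    """Build a baidu_search request body using the documented defaults."""
    if start_page < 1 or pages < 1 or limit < 1:
        raise ValueError("start_page, pages and limit must be positive")
    payload = {
        "source": "baidu_search",
        "domain": domain,
        "query": query,
        "start_page": start_page,
        "pages": pages,
        "limit": limit,
        "user_agent_type": user_agent_type,
    }
    if callback_url:
        payload["callback_url"] = callback_url
    return payload


def requested_pages(payload):
    """Page numbers covered by a payload: start_page .. start_page + pages - 1."""
    first = payload["start_page"]
    return range(first, first + payload["pages"])
```

With start_page set to 11 and pages set to 10, requested_pages covers pages 11 through 20, matching the example above.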

Parameter Values

User-Agent

Download the full list of user_agent_type values in JSON here.

[
  {
    "user_agent_type": "desktop",
    "description": "Random desktop browser User-Agent"
  },
  {
    "user_agent_type": "desktop_firefox",
    "description": "Random User-Agent of one of the latest versions of desktop Firefox"
  },
  {
    "user_agent_type": "desktop_chrome",
    "description": "Random User-Agent of one of the latest versions of desktop Chrome"
  },
  {
    "user_agent_type": "desktop_opera",
    "description": "Random User-Agent of one of the latest versions of desktop Opera"
  },
  {
    "user_agent_type": "desktop_edge",
    "description": "Random User-Agent of one of the latest versions of desktop Edge"
  },
  {
    "user_agent_type": "desktop_safari",
    "description": "Random User-Agent of one of the latest versions of desktop Safari"
  },
  {
    "user_agent_type": "mobile",
    "description": "Random mobile browser User-Agent"
  },
  {
    "user_agent_type": "mobile_android",
    "description": "Random User-Agent of one of the latest versions of Android browser"
  },
  {
    "user_agent_type": "mobile_ios",
    "description": "Random User-Agent of one of the latest versions of iPhone browser"
  },
  {
    "user_agent_type": "tablet",
    "description": "Random tablet browser User-Agent"
  },
  {
    "user_agent_type": "tablet_android",
    "description": "Random User-Agent of one of the latest versions of Android tablet"
  },
  {
    "user_agent_type": "tablet_ios",
    "description": "Random User-Agent of one of the latest versions of iPad tablet"
  }
]

Account Status

Usage Statistics

You can find your usage statistics by querying the following endpoint:

GET https://data.oxylabs.io/v1/stats

By default the API will return all-time usage statistics. Adding ?group_by=month will return monthly stats, while ?group_by=day will return daily numbers.

curl --user user:pass1 'https://data.oxylabs.io/v1/stats'

Sample output:

{
    "data": {
        "sources": [
            {
                "realtime_results_count": "90",
                "results_count": "10",
                "title": "baidu"
            },
            {
                "realtime_results_count": "19",
                "results_count": "87",
                "title": "baidu_search"
            }
        ]
    },
    "meta": {
        "group_by": null
    }
}
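Note that the counts in this output are strings, so they need converting before any arithmetic. A minimal sketch for summing them across sources (count_totals is an illustrative name):

```python
def count_totals(stats):
    """Sum the per-source counters of a /v1/stats response."""
    totals = {"realtime_results_count": 0, "results_count": 0}
    for source in stats["data"]["sources"]:
        for key in totals:
            # Counters arrive as strings such as "90".
            totals[key] += int(source[key])
    return totals
```

For the sample output above, this gives 109 for realtime_results_count and 97 for results_count.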

Limits

The following endpoint will give your monthly commitment information as well as how much has already been used:

GET https://data.oxylabs.io/v1/stats/limits

curl --user user:pass1 'https://data.oxylabs.io/v1/stats/limits'

Sample output:

{
    "monthly_requests_commitment": 4500000,
    "used_requests": 985000
}
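From these two numbers you can derive how much of the commitment is left. A short sketch (commitment_usage is a hypothetical helper name):

```python
def commitment_usage(limits):
    """Return (remaining_requests, used_fraction) from a /v1/stats/limits response."""
    commitment = limits["monthly_requests_commitment"]
    used = limits["used_requests"]
    # Guard against a zero commitment to avoid division by zero.
    fraction = used / commitment if commitment else 0.0
    return commitment - used, fraction
```

For the sample output above, 3,515,000 requests remain and roughly 21.9% of the commitment has been used.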

Response Codes

Code | Status | Description
204 | No Content | You are trying to retrieve a job that has not been completed yet.
400 | Multiple error messages | Bad request structure, could be a misspelled parameter or invalid value. Response body will have a more specific error message.
401 | 'Authorization header not provided' / 'Invalid authorization header' / 'Client not found' | Missing authorization header or incorrect login credentials.
403 | Forbidden | Your account does not have access to this resource.
404 | Not Found | Job ID you are looking for is no longer available.
429 | Too many requests | Exceeded rate limit. Please contact your account manager to increase limits.
500 | Unknown Error | Service unavailable.
524 | Timeout | Service unavailable.
612 | Undefined Internal Error | Something went wrong and we failed the job you submitted. You can try again at no extra cost, as we don't charge you for faulted jobs. If that doesn't work, give us a shout.
613 | Faulted After Too Many Retries | We tried scraping the job you submitted, but gave up after reaching our retry limit. You can try again at no extra cost, as we don't charge you for faulted jobs. If that doesn't work, give us a shout.
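
Client code often needs to turn these codes into a decision. The grouping below is one possible reading of the table, not an official policy: 204, 429, 500 and 524 are treated as "try again later", 612 and 613 as "submit the job again", and the 4xx codes as problems to fix on the caller's side. The function name and action labels are hypothetical.

```python
RETRY_LATER = {204, 429, 500, 524}   # not ready yet, rate limited, or service unavailable
RESUBMIT = {612, 613}                # faulted jobs can be submitted again
FIX_REQUEST = {400, 401, 403, 404}   # caller-side problem: request, credentials, access, job id


def suggested_action(code):
    """Map a response code to a coarse next step (assumed grouping)."""
    if code == 200:
        return "ok"
    if code in RETRY_LATER:
        return "retry_later"
    if code in RESUBMIT:
        return "resubmit"
    if code in FIX_REQUEST:
        return "fix_request"
    return "unknown"
```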

Cloud storage upload response codes:

Code | Status | Description
10001 | Unexpected Exception | Something terribly wrong happened. We probably know about this already and are fixing it. Let us know anyway.
13000 | Upload Success | All good!
13001 | Upload Failed | We couldn't upload job results to your bucket.
13102 | No Such Path | We couldn't find a bucket with such a name. Please double check.
13103 | Access Denied | Bucket doesn't have the required permissions. To find out how to give us the required access, see here.

Disclaimer: This part of the content is mainly from the merchant. If the merchant does not want it to be displayed on my website, please contact us to have your content removed.

Last updated on May 16, 2022
