Sample Twitter data

{
  "_id": "516799596153307136",
  "lang": "en",
  "plt": -5.799,
  "uid": "67763278",
  "tlt": -5.822,
  "cc": "BR",
  "f": "tw201492918305",
  "p": "a4ddc3856053f7e1",
  "flrs": 1014,
  "acr": {
    "$date": 1250900341000
  },
  "t": "@barrosmirella questão de ideias e conceitos. Você se definiu homofóbica nessa frase. Ngm precisa aceitar e/ou apoiar a homossexualidade*",
  "cr": {
    "$date": 1412049600000
  },
  "pln": -35.221,
  "tln": -35.229,
  "flng": 273
}

  • mongoimport --db test --collection tweets_collection --file tweets_collection.json

  • db.all_tweets.ensureIndex({ t: "text" })

  • Append "," to end of every line: sed 's/$/,/' all_tweets.json > all_tweets1.json

  • Converting mongo output to JSON: certain things need to be edited out manually.

  • Removing substring: sed -r "s/^\{u'[$]date': u'//" data.csv, sed -r "s/^'}//" data.csv

  • Make sure you drop the collection first.

  • Use the following to upload data:

cd data/Tagged/
sh ../
cd ../Untagged/
sh ../
  • Ensure index (ensureIndex is deprecated; newer MongoDB versions use createIndex)
use twitter
db.tweets_collection.ensureIndex({ t: "text" })

290726 tweets unlocated in total

  • Dumping data from mongo: mongodump --db test --collection tweets_collection -o /home/kaustubh/

Data stored in .bson format in ~/test/

  • Convert bson to json: bsondump tweets_collection.bson > tweets_collection.json
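
As an alternative to appending "," to every line by hand, the converted file can be turned into a single JSON array in Python. This is a minimal sketch that assumes the file holds one JSON document per line; `jsonl_to_array` is a hypothetical helper name and the two sample lines are made-up stand-ins for tweets_collection.json.

```python
import json

def jsonl_to_array(lines):
    """Parse one JSON document per line, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

# Stand-in lines in place of tweets_collection.json (illustrative data only).
sample_lines = [
    '{"_id": "1", "t": "flood in the city", "cr": {"$date": 1412049600000}}',
    '',
    '{"_id": "2", "t": "dengue cases rising", "cr": {"$date": 1412136000000}}',
]

docs = jsonl_to_array(sample_lines)

# Serialise the whole list as one valid JSON array, keeping non-ASCII text.
array_text = json.dumps(docs, ensure_ascii=False, indent=2)
print(array_text)
```

With a real file, an open file handle can be passed in place of `sample_lines`, since iterating a file yields its lines.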


40,524 tweets. 14,299 untagged.

Tweets from 2017-09-12 04:05:05.000Z to 2017-10-13 07:20:43.000Z

Min date query:

   [ { $group: {
         _id: {},
         minDate: { $min: "$cr" }
   } } ]
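
The same range can be cross-checked offline on exported documents, where `cr` shows up as {"$date": milliseconds since the Unix epoch}. This is a rough sketch only: treating `cr` as the tweet creation time is an assumption read off the sample data, and `created_at` / `date_range` are hypothetical helper names.

```python
from datetime import datetime, timezone

def created_at(doc):
    """Read `cr` ({"$date": ms since epoch}) as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(doc["cr"]["$date"] / 1000, tz=timezone.utc)

def date_range(docs):
    """Return (earliest, latest) `cr` time over the given documents."""
    times = [created_at(d) for d in docs]
    return min(times), max(times)

# Illustrative documents; the first `cr` value is copied from the sample tweet.
docs = [
    {"_id": "a", "cr": {"$date": 1412049600000}},
    {"_id": "b", "cr": {"$date": 1412136000000}},
]

earliest, latest = date_range(docs)
print(earliest.isoformat(), latest.isoformat())
```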

Analysis to do:

  • Day-wise analysis: flood count, dengue count, and min, max, average per day, for untagged and tagged separately. python ../ in folders

  • Containing both words: db.getCollection('all_tweets').find({$text: {$search: "\"flood\" \"dengue\""} }).count()

  • Containing flood, not dengue: db.getCollection('all_tweets').find({$text: {$search: "flood -dengue"}}).count()

  • Containing flood / dengue: db.getCollection('all_tweets').find({$text: {$search: "flood,dengue"} }).count() => 265688

  • Locations mentioned, frequency, floods, dengue separately [Done together for now]

  • Locations: db.getCollection('all_tweets').distinct('loc') => 962
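
For the day-wise analysis item in the list above, one possible shape for the script is sketched below. It assumes exported documents with the tweet text in `t` and the creation time in `cr` (as in the sample), and it uses plain substring matching, which is not the same as MongoDB's stemmed $text search, so counts may differ. The function names and the three tweets are illustrative only. Tagged and untagged sets would be run through it separately.

```python
from collections import defaultdict
from datetime import datetime, timezone

KEYWORDS = ("flood", "dengue")

def day_of(tweet):
    """UTC calendar day (YYYY-MM-DD) of a tweet's `cr` timestamp."""
    ts = tweet["cr"]["$date"] / 1000
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

def daily_counts(tweets):
    """Per-day number of tweets whose text contains each keyword."""
    counts = defaultdict(lambda: {word: 0 for word in KEYWORDS})
    for tweet in tweets:
        text = tweet["t"].lower()
        day = day_of(tweet)
        for word in KEYWORDS:
            if word in text:
                counts[day][word] += 1
    return dict(counts)

def summarize(counts, word):
    """(min, max, average) of the per-day counts for one keyword."""
    values = [per_day[word] for per_day in counts.values()]
    return min(values), max(values), sum(values) / len(values)

# Made-up tweets, not real data.
tweets = [
    {"t": "Flood warning issued", "cr": {"$date": 1412049600000}},
    {"t": "Dengue and flood updates", "cr": {"$date": 1412053200000}},
    {"t": "dengue cases today", "cr": {"$date": 1412136000000}},
]

counts = daily_counts(tweets)
print(counts)
print(summarize(counts, "flood"), summarize(counts, "dengue"))
```

Days on which neither keyword appears do not show up in `counts`, so the minimum only covers days with at least one matching tweet.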

# Redundant
[
    { $match: {
            loc: { $not: {$size: 0} }
    } },
    { $unwind: "$loc" },
    { $group: {
            _id: {$toLower: '$loc'},
            count: { $sum: 1 }
    } },
    { $match: {
            count: { $gte: 1 }
    } },
    { $sort : { count : -1} },
    { $limit : 100 }
]

/Might need to copy and paste from the mongo shell, not Robo 3T./ Saved in location_counts.json. Run

   {$group : { _id : '$loc', count : {$sum : 1}}}
), then

  • Location frequency 'wordcloud', urban vs. rural, state, heatmap.

  • All tagged: locations generated by us, plus actual geotagged. Actual geotagged: 1099

  • db.getCollection('all_tweets').find({"f": {$eq: "1e222211"} }).count() => assigned location

  • db.getCollection('all_tweets').find({"f": {$nin: [ "1e222211", ""] } }).count() => Actual geotagged

  • % of tweets identified: test case of 1000 manually checked tweets => how many mistaken?

  • Why are some tweets untagged when they should have been tagged?

  • Improve? Stemming.

  • w2v for locations?

  • Improve coverage? P, R, F, ROC, AUC?

  • Random sample

   [ { $sample: { size: 200 } } ]
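
Once a random sample has been checked by hand, the P, R and F figures mentioned above could be computed along these lines. This is a sketch under assumptions: each checked tweet is recorded as a (predicted, actual) pair, where `predicted` means a location was assigned and `actual` means the manual check says the tweet should have one. That pair layout and the function name are hypothetical, and the counts in the example are made up.

```python
def precision_recall_f1(pairs):
    """pairs: list of (predicted, actual) booleans, one per checked tweet."""
    tp = sum(1 for p, a in pairs if p and a)        # tagged, and should be
    fp = sum(1 for p, a in pairs if p and not a)    # tagged, but should not be
    fn = sum(1 for p, a in pairs if not p and a)    # untagged, but should be
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up check results: 6 correct tags, 2 wrong tags, 2 missed, 10 true negatives.
checked = (
    [(True, True)] * 6
    + [(True, False)] * 2
    + [(False, True)] * 2
    + [(False, False)] * 10
)

precision, recall, f1 = precision_recall_f1(checked)
print(precision, recall, f1)
```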

** Retweets ?

** Remove non-English languages (Spanish, etc.)

  • Get wordclouds data, and frequency of how many locations appear 3 times.

  • Print only tweet text from mongo: db.Jan22_tweets.find({}, {t: 1, _id:0})

  • Export only tweets: mongoexport -d test -c Jan22_tweets -f t -o tweets_Jan22.txt

  • Sort hashtags used:

  • Does location exist: db.getCollection('Jan22_tweets').find({"p": {$exists: true, "$ne": ""} ,$text: {$search: "dengue"}})

  • No location: db.getCollection('Jan22_tweets').find({"p": {$eq :""} ,$text: {$search: "dengue"}}).count()

Randomly sampling some tweets into a file

mongo | tee out.txt
> DBQuery.shellBatchSize = 40000
> db.Jan30_tweets.aggregate([
   { $sample: { size: 40000 } }
])
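
If the tweets have already been exported one per line (for example with the mongoexport command earlier in these notes), the sampling can also be done outside the shell. This is a minimal sketch, not the method used above: `sample_lines` is a hypothetical helper and the `exported` list stands in for the lines of an export file.

```python
import random

def sample_lines(lines, k, seed=None):
    """Return up to k lines chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    pool = list(lines)
    return rng.sample(pool, min(k, len(pool)))

# Stand-in for the lines of an exported tweets file (illustrative data only).
exported = [f"tweet {i}" for i in range(100)]

picked = sample_lines(exported, 10, seed=0)
print(len(picked))
```

Capping `k` at the pool size avoids the ValueError that `random.sample` raises when asked for more items than exist.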